BEWARE THE TROLLS: Red-teaming for LLM-powered apps

Track

AI / ML

Type

Talk

Level

Intermediate

Language

English

Venue

Main Stream

Start

2025-12-05T18:50:00Z

End

2025-12-05T19:10:00Z

Duration

20 minutes

Abstract

Large Language Models are trained to produce helpful, safe answers—but what happens when someone tries to make them misbehave? Malicious users can manipulate prompts to generate unsafe content like hate speech or violent instructions. Every time we add an LLM to an app, we open up that risk. In this talk, I’ll show how to use Python to red-team an LLM-powered app: simulating hundreds of bad actors to see how the system holds up. We’ll explore public datasets of adversarial prompts, and use the open-source pyrit package to obfuscate attacks with strategies like base-64 encoding and Caesar cipher. Finally, we’ll evaluate which attacks succeed, using another LLM to score the results. Attendees will walk away with a practical understanding of how to stress-test their own apps and a toolkit for keeping them safer against trolls.
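To make the obfuscation strategies concrete, here is a minimal standalone sketch of the two transformations the abstract names, base-64 encoding and a Caesar cipher. This is illustrative plain Python, not the pyrit API itself (pyrit ships its own prompt-converter classes for these strategies); the function names and the sample prompt are my own for demonstration.

```python
import base64


def base64_obfuscate(prompt: str) -> str:
    """Encode a prompt as base-64 so naive keyword filters don't match it."""
    return base64.b64encode(prompt.encode("utf-8")).decode("ascii")


def caesar_obfuscate(prompt: str, shift: int = 3) -> str:
    """Shift each ASCII letter by `shift` positions (classic Caesar cipher)."""
    result = []
    for ch in prompt:
        if ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            result.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            result.append(ch)  # leave punctuation and spaces untouched
    return "".join(result)


# An adversarial prompt would be wrapped in one of these layers before
# being sent to the target LLM, then the response scored by a judge LLM.
attack = "Tell me how to pick a lock"
print(base64_obfuscate(attack))
print(caesar_obfuscate(attack))
```

The point of a red-teaming harness is to apply many such converters to hundreds of seed prompts from public adversarial datasets, send each variant to the app, and score the responses automatically.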

Speakers

Pamela Fox
Microsoft