The Quick Version
Red-teaming means attacking your own LLM application with adversarial prompts before someone else does. You’re looking for prompt injection, jailbreaks, data extraction, and every other failure mode listed in the OWASP Top 10 for LLM Applications.
Two tools dominate this space: Microsoft PyRIT (Python Risk Identification Toolkit) for programmatic, orchestrated attacks, and NVIDIA Garak for broad vulnerability scanning from the command line. Both are open source. You should use both — they catch different things.
Here’s a Garak scan against a local endpoint in under a minute:
| |
That hits your endpoint with base64-encoded injection attempts and the DAN 11.0 jailbreak. If your app responds to either, you have a problem.
OWASP LLM Top 10: What You’re Testing For
The OWASP Top 10 for LLM Applications (v2025) gives you the threat model. Every red-team engagement should map back to these categories:
| # | Vulnerability | What It Means |
|---|---|---|
| LLM01 | Prompt Injection | Attacker overrides system instructions via user input |
| LLM02 | Sensitive Information Disclosure | Model leaks training data, PII, or system prompts |
| LLM03 | Supply Chain | Compromised models, plugins, or training data |
| LLM04 | Data and Model Poisoning | Tampered training data leads to manipulated outputs |
| LLM05 | Improper Output Handling | Model output executed as code or trusted without validation |
| LLM06 | Excessive Agency | Model has too many permissions, takes unintended actions |
| LLM07 | System Prompt Leakage | Attacker extracts the system prompt verbatim |
| LLM08 | Vector and Embedding Weaknesses | Poisoned embeddings in RAG pipelines |
| LLM09 | Misinformation | Model generates convincing but false information |
| LLM10 | Unbounded Consumption | Denial of service through resource-exhausting prompts |
Your red-team tests should cover at least LLM01, LLM02, LLM05, and LLM07. Those are the ones attackers hit first in production.
Setting Up PyRIT
PyRIT is Microsoft’s red-teaming orchestration framework. It’s more than a list of attacks — it uses an attacker LLM to dynamically generate and refine adversarial prompts against your target.
| |
PyRIT requires Python 3.10+ and needs access to an LLM for its attacker (it calls this the “red-teaming orchestrator”). You can use Azure OpenAI, OpenAI directly, or a local model.
Create a .env file for your API keys:
| |
Now write a basic red-team script that tests your application endpoint for prompt injection:
| |
That’s the minimum viable red-team. PyRIT sends each prompt, records the response, and prints the full conversation so you can review what your app leaked or complied with.
Multi-Turn Attacks with PyRIT
Single-shot prompts are easy to block. Real attackers use multi-turn conversations to gradually steer the model. PyRIT’s RedTeamingOrchestrator automates this — it uses an attacker LLM to craft follow-up prompts based on the target’s responses.
| |
This is where PyRIT shines. The attacker LLM tries different angles — reframing, role-playing, encoding tricks — across up to 5 turns. If your app leaks the system prompt on turn 3 after a gradual escalation, the scorer catches it.
Running Garak for Broad Vulnerability Scanning
Garak takes a different approach. Instead of orchestrated multi-turn attacks, it throws hundreds of known attack patterns at your model and reports which ones got through. Think of it as a vulnerability scanner, not a penetration tester.
| |
Garak supports OpenAI-compatible APIs, Hugging Face models, and local endpoints. Run a full scan:
| |
For a targeted scan, pick specific probe families:
| |
The --generations flag controls how many variations of each probe to try. Higher values catch more edge cases but take longer.
Garak outputs results to a JSON report and prints a summary to the console:
| |
Any FAIL means your model responded to an adversarial prompt when it shouldn’t have. Investigate each failure and add corresponding guardrails.
Interpreting Results and Fixing Vulnerabilities
When a red-team test succeeds (meaning the attack worked), you need to trace the failure and patch it. Here’s the workflow:
For prompt injection failures: Add input filtering that catches the specific pattern. Regex-based filters catch known attacks; an LLM-based classifier catches novel variations. Use both.
| |
For system prompt leakage: Never put secrets in system prompts. Separate sensitive configuration from the model’s instructions. Add an output filter that detects when the model starts quoting its own instructions.
For excessive agency (LLM06): Reduce the tools and permissions your model has access to. If your agent can execute arbitrary code, constrain it to a sandbox. If it can make API calls, restrict the endpoints.
For jailbreak compliance: Strengthen your system prompt with explicit refusal instructions, and add output-level content filtering as a backstop. No single layer is sufficient — defense in depth is the only approach that works.
Building a CI Red-Team Pipeline
Red-teaming shouldn’t be a one-time activity. Run a focused probe set on every deploy, the same way you run unit tests.
| |
Run it in CI:
| |
If any probe gets through, the script exits with code 1 and blocks the deploy. Start with 5-10 critical probes and expand as you discover new failure modes.
Common Errors and Fixes
ModuleNotFoundError: No module named 'pyrit' — PyRIT requires Python 3.10 or higher. Check your version with python3 --version. If you’re on 3.9 or below, upgrade or use a virtual environment with the right version:
| |
httpx.ConnectError: All connection attempts failed when targeting a local endpoint. Your target API isn’t running, or it’s on a different port. Verify it’s up with curl http://localhost:8000/v1/chat/completions before running your red-team script. If you’re inside a container, use the host’s IP instead of localhost.
openai.AuthenticationError: Error code: 401 in Garak. Garak reads API keys from environment variables. Export them before running:
| |
Garak hangs or runs extremely slowly. Large probe sets against rate-limited APIs will crawl. Use --generations 1 for initial scans and increase only for probes you want to test thoroughly. You can also target specific probe families instead of using --probes all.
ValueError: Scorer returned an invalid score in PyRIT. This happens when the scorer LLM returns a response the parser can’t interpret. Use a stronger model for scoring (GPT-4o works reliably) or simplify your scorer’s true_false_question to be more explicit about the expected format.
PyRIT MemoryError on long-running sessions. PyRIT stores all conversations in a local database. For large-scale tests with hundreds of prompts, set PYRIT_MEMORY_DB_TYPE=duckdb in your environment to use DuckDB instead of the default SQLite, which handles concurrent writes better.
PyRIT vs. Garak: Which to Use
Use both, but for different purposes.
Garak is your first pass. It’s fast, requires no code, and covers a wide range of known vulnerabilities out of the box. Run it early and often. It tells you if your model is vulnerable to documented attack patterns.
PyRIT is your deep dive. Its multi-turn orchestration finds vulnerabilities that static probe lists miss. The attacker LLM adapts its strategy based on your model’s responses, which is closer to what a real adversary does. Use it for critical applications before launch and when Garak finds weaknesses you need to understand better.
Neither tool replaces manual red-teaming by a skilled human. Automated tools find the known patterns. Humans find the creative, context-specific attacks that no tool has in its database. Budget time for both.
Related Guides
- How to Build Automated Jailbreak Detection for LLM Applications
- How to Build Automated Prompt Leakage Detection for LLM Apps
- How to Build Prompt Injection Detection for LLM Apps
- How to Implement Content Filtering for LLM Applications
- How to Add Guardrails to LLM Apps with NeMo Guardrails
- How to Implement Constitutional Classifiers to Harden Your LLM API Against Jailbreaks
- How to Build Automated Fairness Testing for LLM-Generated Content
- How to Build Adversarial Test Suites for ML Models
- How to Detect and Reduce Hallucinations in LLM Applications
- How to Build Automated PII Redaction Testing for LLM Outputs