AI safety evaluations aren’t optional anymore. Before deploying any LLM, you need to know if it will refuse harmful requests, resist jailbreaks, avoid toxic outputs, and maintain factual accuracy under adversarial conditions.
The fastest way to run comprehensive safety benchmarks is with Inspect AI for modern test suites and lm-evaluation-harness for established academic benchmarks. Use Inspect AI when you want flexible custom evals with built-in safety tests. Use lm-evaluation-harness when you need standardized results comparable to published research.
Here’s how to run both, plus custom eval patterns that catch real-world safety failures.
Running Safety Evals with Inspect AI
Inspect AI is the current best-in-class framework for safety evaluations. It has built-in tasks for toxicity, refusal testing, and jailbreak resistance.
Install and run toxicity benchmarks on GPT-4:
| |
This runs the ToxicityBench dataset, which tests whether the model generates toxic content when prompted with edge cases. The output shows pass rates, failure examples, and toxicity scores.
To test refusal behavior (does the model refuse harmful requests?):
| |
The --log-dir flag saves detailed logs with every prompt, response, and grading decision. Check safety-logs/ for JSON files with full transcripts.
Custom Safety Eval with Inspect AI
Build your own eval suite to test domain-specific safety requirements. This example tests if a medical chatbot refuses to diagnose conditions:
| |
Create medical_refusal_prompts.json with test cases:
| |
The model_graded_fact scorer uses another LLM (default GPT-4) to judge if responses properly refuse diagnosis requests. This is faster and more consistent than human grading for large eval sets.
Running Academic Benchmarks with lm-evaluation-harness
For standardized benchmarks cited in research papers, use lm-evaluation-harness. It supports TruthfulQA (factual accuracy), BBQ (bias), and BOLD (toxicity).
Install and run TruthfulQA on a local Llama model:
| |
This tests factual accuracy by asking questions where models often hallucinate (e.g., “What happens if you crack your knuckles?”). The truthfulqa_mc2 variant uses multiple-choice format for easier grading.
Run bias benchmarks with BBQ (Bias Benchmark for QA):
| |
BBQ tests for social biases across race, gender, religion, and disability. The bbq_lite_json task runs a subset of 500 questions (full BBQ has 58,000+).
Testing Jailbreak Resistance
Jailbreak resistance requires adversarial prompts designed to bypass safety guardrails. Use the JailbreakBench dataset with Inspect AI:
| |
This runs 100 adversarial prompts (DAN, roleplay attacks, encoding tricks) and scores how often the model complies with harmful requests. A good refusal rate is >95%.
For custom jailbreak tests, build a dataset of domain-specific attacks:
| |
The match scorer checks if responses contain refusal language. Adjust the regex pattern to match your model’s refusal phrasing.
Toxicity Detection at Scale
For production monitoring, run toxicity classification on model outputs. Use Perspective API or local classifiers:
| |
Perspective API is free for <1M requests/month. For offline evaluation, use open-source toxicity classifiers like detoxify:
| |
| |
Common Errors and Fixes
“API key not found” when running Inspect AI evals
Set the environment variable for your provider:
| |
Or pass it directly in the eval call:
| |
lm-evaluation-harness fails with “CUDA out of memory” on local models
Reduce batch size or enable tensor parallelism:
| |
For 70B models, use 4 GPUs with tensor_parallel_size=4. For 8B models, batch_size=8 works on a single A100.
Perspective API returns “rate limit exceeded”
The free tier limits to 1 QPS (query per second). Add rate limiting:
| |
Or batch requests using the languages field to check multiple texts (still counts as 1 request).
TruthfulQA scores look too low compared to published results
Ensure you’re using the same task variant. truthfulqa_mc1 is harder (single correct answer) than truthfulqa_mc2 (multiple correct answers). Published GPT-4 scores usually cite mc2.
Also check if the paper used few-shot examples:
| |
Custom Inspect AI scorers always return 0.0
Debug by printing scorer inputs:
| |
Common issue: the target field in your dataset doesn’t match what the scorer expects. For model-graded scorers, the target is often ignored—the grading prompt is what matters.
Related Guides
- How to Build Automated Output Safety Classifiers for LLM Apps
- How to Build Automated Fairness Testing for LLM-Generated Content
- How to Build Hallucination Scoring and Grounding Verification for LLMs
- How to Build Automated Prompt Leakage Detection for LLM Apps
- How to Build Automated PII Redaction Testing for LLM Outputs
- How to Build Automated Age and Content Gating for AI Applications
- How to Build Automated Data Retention and Deletion for AI Systems
- How to Build Automated Jailbreak Detection for LLM Applications
- How to Build Automated Bias Audits for LLM Outputs
- How to Build Automated Hate Speech Detection with Guardrails