Your LLM app will hallucinate. Not might – will. The model will confidently cite papers that don’t exist, invent API endpoints, and fabricate statistics with decimal-point precision. The question is whether you catch it before your users do.
Here’s a quick check you can run right now against any LLM output using an NLI (natural language inference) model:
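Here's a minimal sketch of that check, assuming the sentence-transformers library and the cross-encoder/nli-deberta-v3-base model recommended later in this guide. The source/claim pair is illustrative, and the live call is kept in a `demo()` function because it downloads model weights on first run:

```python
# Quick NLI check: premise = source text, hypothesis = the model's claim.
# Requires: pip install sentence-transformers
import math

# Label order from the cross-encoder/nli-deberta-v3-base model card.
LABELS = ["contradiction", "entailment", "neutral"]

def classify(logits):
    """Softmax the raw NLI logits and return (label, probability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]

def demo():
    """Live check; downloads the model on first call."""
    from sentence_transformers import CrossEncoder
    model = CrossEncoder("cross-encoder/nli-deberta-v3-base")
    source = "The trial enrolled 120 patients over six months."
    claim = "The trial enrolled 500 patients."
    logits = model.predict([(source, claim)])[0]
    # A clear numeric mismatch like this should come back as a contradiction.
    return classify(logits)
```

Run `demo()` to see the verdict for the mismatched patient count.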
That NLI model just caught a factual inconsistency in under a second. This is the foundation of automated hallucination detection – and we’re going to build on it.
Types of Hallucinations
Not all hallucinations are the same, and your detection strategy depends on which type you’re dealing with.
Intrinsic hallucinations contradict the source material. The model has the right context but generates something that conflicts with it. These are the easiest to catch because you have a ground truth to compare against. RAG apps suffer from this constantly – the retrieved documents say one thing, the model says another.
Extrinsic hallucinations add information that isn’t in the source at all. The model fills gaps with plausible-sounding fabrications. That invented citation? Extrinsic. That non-existent function parameter? Extrinsic. These are harder to detect because you’d need external knowledge to verify them.
Faithfulness failures happen when the model ignores its own context window. You stuff a 10-page document into the prompt, ask a question, and the model answers from its parametric memory instead of the provided text. This is technically an intrinsic hallucination, but it’s worth calling out separately because RAG pipelines are specifically designed to prevent it – and they still fail at it regularly.
Self-Consistency Detection
The cheapest detection method is asking the same question multiple times and checking whether the answers agree. If the model is confident about a fact, it’ll give you the same answer every time. If it’s hallucinating, the answers will drift.
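A sketch of the method, assuming an OpenAI-style client; the `gpt-4o-mini` model name and the 5-sample count are arbitrary choices, and `sample_answers()` is defined but not called here because it hits a live API. Normalized exact match is enough for short factoids; for longer answers, compare embeddings instead:

```python
# Self-consistency check: sample the same question several times at
# temperature 1.0 and measure how often the answers agree.
import re
from collections import Counter

def normalize(answer: str) -> str:
    """Lowercase and strip punctuation so trivial rewordings of a factoid match."""
    return re.sub(r"[^a-z0-9 ]", "", answer.lower()).strip()

def consistency(answers: list[str]) -> float:
    """Fraction of samples agreeing with the most common normalized answer."""
    counts = Counter(normalize(a) for a in answers)
    return counts.most_common(1)[0][1] / len(answers)

def sample_answers(question: str, n: int = 5) -> list[str]:
    """Draw n independent samples (OpenAI-style client is an assumption)."""
    from openai import OpenAI
    client = OpenAI()
    answers = []
    for _ in range(n):
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": question}],
            temperature=1.0,  # sample diversely on purpose
        )
        answers.append(resp.choices[0].message.content)
    return answers

# consistency(sample_answers("In what year was the transistor invented?"))
# Close to 1.0: stable fact. Well below 1.0: treat as a possible hallucination.
```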
Self-consistency works well for factoid questions but struggles with open-ended generation. If you ask “summarize this document” five times, you’ll get five different summaries that are all correct. Use this method for extractive QA and fact verification, not for creative tasks.
NLI-Based Hallucination Detection
Natural language inference is the most reliable automated method for catching intrinsic hallucinations. The idea is simple: treat the source document as the premise and each sentence of the LLM output as a hypothesis. If the NLI model labels it as a contradiction, you’ve found a hallucination.
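Here's that premise/hypothesis loop as code, using the sentence splitter and 0.7 threshold as illustrative defaults. `check_output()` takes any object with a CrossEncoder-style `predict()` method, so the live model stays out of the hot path until you wire it in:

```python
# Sentence-level NLI screen: treat the source as the premise and every
# sentence of the LLM output as a hypothesis; flag contradictions.
import math
import re

LABELS = ["contradiction", "entailment", "neutral"]  # model card label order

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def split_sentences(text: str) -> list[str]:
    """Naive splitter; use a real sentence tokenizer in production."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def check_output(model, source: str, output: str, threshold: float = 0.7):
    """Return (sentence, contradiction_probability) pairs above the threshold."""
    sentences = split_sentences(output)
    all_logits = model.predict([(source, s) for s in sentences])
    flagged = []
    for sentence, logits in zip(sentences, all_logits):
        p = softmax(list(logits))[LABELS.index("contradiction")]
        if p >= threshold:
            flagged.append((sentence, p))
    return flagged

# Live usage (downloads the model on first run):
#   from sentence_transformers import CrossEncoder
#   model = CrossEncoder("cross-encoder/nli-deberta-v3-base")
#   check_output(model, source_text, llm_output)
```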
The cross-encoder/nli-deberta-v3-base model is my top recommendation here. It runs fast enough for real-time checking (under 50ms per sentence on a decent GPU), handles domain-specific text reasonably well, and its three-way classification (entailment, neutral, contradiction) gives you a clean signal. For production, set the threshold at 0.7 or above to avoid false positives.
Guardrails and Validators
Once you can detect hallucinations, you need to block them before they reach users. Guardrails libraries let you wrap your LLM calls with validators that reject bad outputs automatically.
The Guardrails AI framework (guardrails-ai) provides a clean way to enforce factual consistency:
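A sketch of that wrapping, assuming Guardrails' `Guard().use()` API after installing the validator from the hub. The `source_text` metadata key and the example texts are assumptions on my part, so check the validator's hub page for the exact interface it expects:

```python
# Guardrails AI sketch: block outputs the FactualConsistency validator rejects.
# Setup (assumed): pip install guardrails-ai
#                  guardrails hub install hub://guardrails/factual_consistency
from guardrails import Guard
from guardrails.hub import FactualConsistency

guard = Guard().use(
    FactualConsistency,
    on_fail="exception",  # raise instead of silently passing bad output through
)

source = "The warranty covers parts and labor for 24 months."
llm_output = "The warranty covers parts and labor for 5 years."

try:
    # Metadata key is an assumption; the validator needs the source to compare against.
    guard.validate(llm_output, metadata={"source_text": source})
    print("Output is consistent with the source.")
except Exception as exc:
    print(f"Blocked hallucinated output: {exc}")
```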
You can also stack multiple validators. Pair FactualConsistency with RestrictToTopic to catch both hallucinated facts and off-topic drift. The on_fail parameter controls what happens when a validator trips – "exception" for hard stops, "reask" to retry with a corrective prompt, or "noop" to log and pass through.
Grounding with RAG
Retrieval-augmented generation is the single best architectural decision for reducing hallucinations. Instead of relying on the model’s parametric memory, you retrieve relevant documents and force the model to answer from them.
But RAG alone doesn’t eliminate hallucinations. You need to combine it with the detection methods above. Here’s the pattern that works:
- Retrieve relevant chunks from your vector store
- Generate an answer grounded in those chunks
- Verify the answer against the retrieved chunks using NLI
- Cite specific chunks so users can verify claims themselves
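The four steps above can be sketched like this; `retrieve`, `generate`, and `verify` are placeholders for your vector store, LLM client, and NLI checker:

```python
# Retrieve -> generate -> verify -> cite, as one pipeline.

def build_prompt(question: str, chunks: list[str]) -> str:
    """Grounded prompt: numbered chunks for citation, plus an escape hatch."""
    context = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(chunks))
    return (
        "Answer using ONLY the context below, and cite chunk numbers like [1]. "
        'If the context does not contain the answer, say "I don\'t know."\n\n'
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

def answer_with_verification(question, retrieve, generate, verify):
    chunks = retrieve(question)                        # 1. retrieve chunks
    answer = generate(build_prompt(question, chunks))  # 2. generate, grounded
    flagged = verify("\n".join(chunks), answer)        # 3. NLI check vs. chunks
    if flagged:
        return "I don't know."  # refuse rather than ship a contradicted answer
    return answer               # 4. [n] citations let users check the chunks
```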
Two things matter more than anything else for RAG groundedness. First, keep temperature low – 0.0 to 0.2 for factual tasks. Higher temperatures encourage the model to get creative, which is a polite way of saying “make things up.” Second, always include the escape hatch: tell the model it’s okay to say “I don’t know.” Models hallucinate most aggressively when they feel cornered into producing an answer.
Measuring Hallucination Rates
You can’t improve what you don’t measure. Set up a hallucination rate metric and track it over time.
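One way to set that up, with the result-record field names (`hallucinated`, `query_type`) as assumptions about how you log your test suite runs:

```python
# Hallucination-rate metric over a labeled test suite, with a per-query-type
# breakdown so you can see where the failures concentrate.
from collections import defaultdict

def hallucination_rate(results: list[dict]) -> float:
    """Fraction of test cases where the verifier flagged a hallucination."""
    if not results:
        return 0.0
    return sum(r["hallucinated"] for r in results) / len(results)

def rate_by_query_type(results: list[dict]) -> dict[str, float]:
    """Break the rate down by query type (factoid, synthesis, multi_hop, ...)."""
    buckets = defaultdict(list)
    for r in results:
        buckets[r["query_type"]].append(r)
    return {qt: hallucination_rate(rs) for qt, rs in buckets.items()}

results = [  # illustrative records from one test-suite run
    {"query_type": "factoid", "hallucinated": False},
    {"query_type": "factoid", "hallucinated": False},
    {"query_type": "synthesis", "hallucinated": True},
    {"query_type": "multi_hop", "hallucinated": True},
]
print(f"overall: {hallucination_rate(results):.0%}")
print(rate_by_query_type(results))
```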
Track this weekly. A good target for production RAG apps is under 5% hallucination rate on your test suite. If you’re above 10%, your retrieval pipeline probably needs work before you throw more detection at it.
The most useful breakdown is by query type. Factoid questions (“What is X?”) should have near-zero hallucination rates. Synthesis questions (“Compare X and Y”) will be higher. Multi-hop reasoning questions (“What caused X, and how did that affect Y?”) will be highest. If your factoid hallucination rate is high, fix your retrieval. If only multi-hop questions hallucinate, add chain-of-thought prompting with source citations.
Common Errors
NLI model flags everything as contradiction. You’re probably sending text that’s too long. DeBERTa-based NLI models work best with inputs under 512 tokens. Split your source into paragraphs and check each one individually against the claim.
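A sketch of that fix: window the source below the length limit and accept a claim if any window entails it. `nli_probs` is any callable returning (contradiction, entailment, neutral) probabilities for a premise/hypothesis pair; the 1500-character budget is a rough assumption:

```python
# Paragraph-windowed NLI check to stay under the 512-token input limit.

def split_paragraphs(text: str, max_chars: int = 1500) -> list[str]:
    """Blank-line split; ~1500 chars keeps each window comfortably short."""
    paras = [p.strip() for p in text.split("\n\n") if p.strip()]
    return [p[:max_chars] for p in paras]

def claim_supported(nli_probs, source: str, claim: str, threshold: float = 0.7) -> bool:
    """True if at least one paragraph entails the claim. Feeding the whole
    document at once truncates the premise and inflates contradiction scores."""
    return any(nli_probs(p, claim)[1] >= threshold for p in split_paragraphs(source))
```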
Self-consistency check shows inconsistency on every question. Make sure you're sampling at temperature=1.0 but comparing the answers semantically rather than with exact string matching. Two answers can be worded completely differently and still agree factually.
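Semantic comparison can be as simple as pairwise cosine similarity over answer embeddings. `embed` here is any callable mapping a string to a vector, e.g. a sentence-transformers model's `.encode` method; the 0.85 threshold is a starting-point assumption:

```python
# Semantic agreement for self-consistency: count the fraction of answer
# pairs whose embedding cosine similarity clears a threshold.
import math
from itertools import combinations

def cosine(u, v) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def semantic_agreement(embed, answers: list[str], threshold: float = 0.85) -> float:
    """Fraction of answer pairs that agree semantically, so differently
    worded answers to the same factoid still count as a match."""
    vectors = [embed(a) for a in answers]
    pairs = list(combinations(vectors, 2))
    return sum(1 for u, v in pairs if cosine(u, v) >= threshold) / len(pairs)
```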
Guardrails validator throws ModuleNotFoundError. Guardrails Hub validators are installed separately. Run guardrails hub install hub://guardrails/factual_consistency before importing.
RAG pipeline still hallucinating despite good retrieval. Check whether the model is actually using the retrieved context. Add an instruction like “Quote the specific text you’re basing your answer on” and verify the quotes exist in the source. If the model invents quotes, you have a faithfulness problem – lower the temperature and consider a smaller, more steerable model.
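Verifying those quotes is a one-liner worth automating; this sketch assumes the model wraps its evidence in double quotes as instructed:

```python
# Quote verification: pull quoted spans out of the answer and confirm each
# appears verbatim in the retrieved source.
import re

def quotes_exist(answer: str, source: str) -> bool:
    """True only if every "double-quoted" span in the answer occurs in the source."""
    quotes = re.findall(r'"([^"]+)"', answer)
    return all(q in source for q in quotes)
```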
Hallucination rate metric is suspiciously low. Your test suite probably doesn’t have enough adversarial cases. Add questions where the answer isn’t in the source material, questions that require reasoning across multiple chunks, and questions with common misconceptions that the model might parrot from training data.
Related Guides
- How to Implement Constitutional AI Principles in LLM Applications
- How to Build Automated Jailbreak Detection for LLM Applications
- How to Implement Content Filtering for LLM Applications
- How to Build Automated Prompt Leakage Detection for LLM Apps
- How to Red-Team LLM Applications
- How to Implement Constitutional Classifiers to Harden Your LLM API Against Jailbreaks
- How to Implement AI Output Citations and Source Attribution
- How to Build Automated Output Safety Classifiers for LLM Apps
- How to Build Automated Fairness Testing for LLM-Generated Content
- How to Build Automated PII Redaction Testing for LLM Outputs