Your LLM app will hallucinate. Not might – will. The model will confidently cite papers that don’t exist, invent API endpoints, and fabricate statistics with decimal-point precision. The question is whether you catch it before your users do.

Here’s a quick check you can run right now against any LLM output using an NLI (natural language inference) model:

from transformers import pipeline

nli = pipeline("text-classification", model="cross-encoder/nli-deberta-v3-base")

source = "Python 3.12 was released in October 2023."
claim = "Python 3.12 came out in March 2024."

result = nli([{"text": source, "text_pair": claim}])  # premise/hypothesis pair
print(result)
# [{'label': 'contradiction', 'score': 0.9847}]

That NLI model just caught a factual inconsistency in under a second. This is the foundation of automated hallucination detection – and we’re going to build on it.

Types of Hallucinations

Not all hallucinations are the same, and your detection strategy depends on which type you’re dealing with.

Intrinsic hallucinations contradict the source material. The model has the right context but generates something that conflicts with it. These are the easiest to catch because you have a ground truth to compare against. RAG apps suffer from this constantly – the retrieved documents say one thing, the model says another.

Extrinsic hallucinations add information that isn’t in the source at all. The model fills gaps with plausible-sounding fabrications. That invented citation? Extrinsic. That non-existent function parameter? Extrinsic. These are harder to detect because you’d need external knowledge to verify them.
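You can't fully verify extrinsic fabrications without outside knowledge, but a cheap first pass catches some of them: flag specific-looking tokens in the output (numbers, capitalized names) that never appear in the source. This is a minimal stdlib sketch, not a real verifier -- the regex and the plain substring check are deliberately crude assumptions:

```python
import re

def flag_unsupported_specifics(source: str, generated: str) -> list[str]:
    """Rough heuristic: flag numbers and Capitalized Names in the output
    that never appear in the source. Catches some extrinsic fabrications;
    it cannot verify facts against the outside world."""
    specifics = re.findall(r"\d[\d,.%]*|[A-Z][a-zA-Z]+(?:\s+[A-Z][a-zA-Z]+)*", generated)
    return [s for s in specifics if s not in source]

source = "The endpoint returns up to 50 results per page."
generated = "The endpoint returns up to 50 results per page, added in version 2.3 by the Platform Team."
print(flag_unsupported_specifics(source, generated))
# → ['2.3', 'Platform Team']
```

Treat the flagged list as candidates for human review or a downstream check, not as confirmed hallucinations -- paraphrased numbers ("fifty" vs "50") will slip through.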

Faithfulness failures happen when the model ignores its own context window. You stuff a 10-page document into the prompt, ask a question, and the model answers from its parametric memory instead of the provided text. This is technically an intrinsic hallucination, but it’s worth calling out separately because RAG pipelines are specifically designed to prevent it – and they still fail at it regularly.

Self-Consistency Detection

The cheapest detection method is asking the same question multiple times and checking whether the answers agree. If the model has reliably learned a fact, it tends to give the same answer every time. If it's hallucinating, the answers will drift.

import openai
from collections import Counter

client = openai.OpenAI()

def check_self_consistency(question: str, context: str, n_samples: int = 5) -> dict:
    """Sample multiple answers and check agreement."""
    answers = []
    for _ in range(n_samples):
        response = client.chat.completions.create(
            model="gpt-4o",
            temperature=1.0,  # High temperature to expose uncertainty
            messages=[
                {"role": "system", "content": f"Answer based ONLY on this context:\n{context}"},
                {"role": "user", "content": question},
            ],
            max_tokens=150,
        )
        answers.append(response.choices[0].message.content.strip())

    # Use an LLM to cluster semantically equivalent answers
    cluster_prompt = f"""Group these answers by semantic meaning. Return the number of distinct answer groups.
Answers: {answers}
Respond with just a number."""

    cluster_response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[{"role": "user", "content": cluster_prompt}],
    )
    raw = cluster_response.choices[0].message.content.strip()
    # Fall back to naive exact-match counting if the model doesn't return a bare number
    n_groups = int(raw) if raw.isdigit() else len(set(answers))

    return {
        "answers": answers,
        "distinct_groups": n_groups,
        "is_consistent": n_groups == 1,
        "confidence": 1.0 / n_groups,  # Simple consistency score
    }

result = check_self_consistency(
    question="What is the maximum context window?",
    context="The model supports a 128K token context window for input.",
)
print(f"Consistent: {result['is_consistent']}, Confidence: {result['confidence']}")

Self-consistency works well for factoid questions but struggles with open-ended generation. If you ask “summarize this document” five times, you’ll get five different summaries that are all correct. Use this method for extractive QA and fact verification, not for creative tasks.
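If the extra LLM call for clustering is a cost concern, a stdlib fallback is greedy fuzzy grouping with difflib. This is only a sketch: it works for short factoid answers, not paraphrase-heavy text, and the 0.8 similarity threshold is an assumption you'd tune on your own data:

```python
from difflib import SequenceMatcher

def cluster_answers(answers: list[str], threshold: float = 0.8) -> int:
    """Greedy fuzzy grouping: count distinct answer clusters by string
    similarity. A crude stand-in for semantic clustering -- suitable for
    short factoid answers only."""
    reps: list[str] = []  # one representative per cluster
    for a in answers:
        a_norm = a.lower().strip()
        if not any(SequenceMatcher(None, a_norm, r).ratio() >= threshold for r in reps):
            reps.append(a_norm)
    return len(reps)

print(cluster_answers(["128K tokens", "128k tokens.", "The context window is 1M tokens"]))
# → 2
```

Swap this in for the LLM clustering step when latency or cost matters more than handling rewordings.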

NLI-Based Hallucination Detection

Natural language inference is the most reliable automated method for catching intrinsic hallucinations. The idea is simple: treat the source document as the premise and each sentence of the LLM output as a hypothesis. If the NLI model labels it as a contradiction, you’ve found a hallucination.

from transformers import pipeline
import re

nli = pipeline(
    "text-classification",
    model="cross-encoder/nli-deberta-v3-base",
    device=0,  # Use GPU if available, remove for CPU
)

def detect_hallucinations(source: str, generated: str, threshold: float = 0.7) -> list:
    """Check each generated sentence against the source for contradictions."""
    # Split on sentence-ending punctuation followed by whitespace,
    # so decimals like "25.5" survive intact
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', generated) if s.strip()]
    flagged = []

    for sentence in sentences:
        result = nli([{"text": source, "text_pair": sentence}])[0]
        label = result["label"]
        score = result["score"]

        # Label casing varies between NLI checkpoints; compare case-insensitively
        if label.lower() == "contradiction" and score > threshold:
            flagged.append({
                "sentence": sentence,
                "label": label,
                "score": round(score, 4),
            })

    return flagged

source = """Tesla reported Q3 2025 revenue of $25.5 billion,
up 8% year-over-year. Net income was $1.85 billion.
The company delivered 435,000 vehicles in the quarter."""

generated = """Tesla's Q3 2025 revenue reached $25.5 billion,
a 12% increase from the previous year. They delivered
roughly 500,000 vehicles and posted a net loss."""

issues = detect_hallucinations(source, generated)
for issue in issues:
    print(f"HALLUCINATION: \"{issue['sentence']}\" (confidence: {issue['score']})")
# HALLUCINATION: "Tesla's Q3 2025 revenue reached $25.5 billion, a 12% increase from the previous year." (confidence: 0.9312)
# HALLUCINATION: "They delivered roughly 500,000 vehicles and posted a net loss." (confidence: 0.9654)

The cross-encoder/nli-deberta-v3-base model is my top recommendation here. It runs fast enough for real-time checking (under 50ms per sentence on a decent GPU), handles domain-specific text reasonably well, and its three-way classification (entailment, neutral, contradiction) gives you a clean signal. One gotcha: this checkpoint returns lowercase labels ('contradiction', not 'CONTRADICTION'), so compare case-insensitively. For production, set the threshold at 0.7 or above to avoid false positives.

Guardrails and Validators

Once you can detect hallucinations, you need to block them before they reach users. Guardrails libraries let you wrap your LLM calls with validators that reject bad outputs automatically.

The Guardrails AI framework (guardrails-ai) provides a clean way to enforce factual consistency:

from guardrails import Guard
from guardrails.hub import FactualConsistency

guard = Guard().use(
    FactualConsistency,
    on_fail="exception",  # Raise an error on hallucination
)

source_doc = "Our API rate limit is 100 requests per minute for free tier users."

try:
    result = guard(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Answer using ONLY this info: {source_doc}"},
            {"role": "user", "content": "What are the rate limits?"},
        ],
        metadata={"reference": source_doc},
    )
    print(result.validated_output)
except Exception as e:
    print(f"Blocked hallucination: {e}")

You can also stack multiple validators. Pair FactualConsistency with RestrictToTopic to catch both hallucinated facts and off-topic drift. The on_fail parameter controls what happens when a validator trips – "exception" for hard stops, "reask" to retry with a corrective prompt, or "noop" to log and pass through.

Grounding with RAG

Retrieval-augmented generation is the single best architectural decision for reducing hallucinations. Instead of relying on the model’s parametric memory, you retrieve relevant documents and force the model to answer from them.

But RAG alone doesn’t eliminate hallucinations. You need to combine it with the detection methods above. Here’s the pattern that works:

  1. Retrieve relevant chunks from your vector store
  2. Generate an answer grounded in those chunks
  3. Verify the answer against the retrieved chunks using NLI
  4. Cite specific chunks so users can verify claims themselves
def grounded_generate(query: str, retriever, llm_client, nli_checker) -> dict:
    """RAG pipeline with built-in hallucination verification."""
    # Step 1: Retrieve
    chunks = retriever.search(query, top_k=5)
    context = "\n\n".join([c.text for c in chunks])

    # Step 2: Generate with citation instructions
    response = llm_client.chat.completions.create(
        model="gpt-4o",
        temperature=0.1,  # Low temperature for factual tasks
        messages=[
            {"role": "system", "content": (
                "Answer the question using ONLY the provided context. "
                "Cite sources with [1], [2], etc. "
                "If the context doesn't contain the answer, say so explicitly."
            )},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    answer = response.choices[0].message.content

    # Step 3: Verify against source chunks
    hallucinations = nli_checker(source=context, generated=answer)

    return {
        "answer": answer,
        "sources": chunks,
        "hallucinations_detected": len(hallucinations),
        "flagged_claims": hallucinations,
        "is_grounded": len(hallucinations) == 0,
    }

Two things matter more than anything else for RAG groundedness. First, keep temperature low – 0.0 to 0.2 for factual tasks. Higher temperatures encourage the model to get creative, which is a polite way of saying “make things up.” Second, always include the escape hatch: tell the model it’s okay to say “I don’t know.” Models hallucinate most aggressively when they feel cornered into producing an answer.

Measuring Hallucination Rates

You can’t improve what you don’t measure. Set up a hallucination rate metric and track it over time.

import json
import re
from datetime import datetime

def evaluate_hallucination_rate(
    test_cases: list[dict],
    detect_fn,
    output_path: str = "hallucination_report.json",
) -> dict:
    """Run hallucination detection over a test suite and compute rates."""
    results = []
    total_claims = 0
    hallucinated_claims = 0

    for case in test_cases:
        issues = detect_fn(source=case["source"], generated=case["output"])
        # Count sentences by splitting on end punctuation, matching how the detector splits
        n_sentences = len([s for s in re.split(r'(?<=[.!?])\s+', case["output"]) if s.strip()])
        total_claims += n_sentences
        hallucinated_claims += len(issues)

        results.append({
            "query": case["query"],
            "hallucinations": len(issues),
            "total_sentences": n_sentences,
            "details": issues,
        })

    report = {
        "timestamp": datetime.now().isoformat(),
        "total_cases": len(test_cases),
        "total_claims": total_claims,
        "hallucinated_claims": hallucinated_claims,
        "hallucination_rate": round(hallucinated_claims / max(total_claims, 1), 4),
        "results": results,
    }

    with open(output_path, "w") as f:
        json.dump(report, f, indent=2)

    return report

Track this weekly. A good target for production RAG apps is under 5% hallucination rate on your test suite. If you’re above 10%, your retrieval pipeline probably needs work before you throw more detection at it.

The most useful breakdown is by query type. Factoid questions (“What is X?”) should have near-zero hallucination rates. Synthesis questions (“Compare X and Y”) will be higher. Multi-hop reasoning questions (“What caused X, and how did that affect Y?”) will be highest. If your factoid hallucination rate is high, fix your retrieval. If only multi-hop questions hallucinate, add chain-of-thought prompting with source citations.
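The per-type breakdown drops out of the same report with a few lines of aggregation. A sketch, assuming you tag each test case with a hypothetical query_type field ("factoid", "synthesis", "multi_hop") alongside the per-case counts:

```python
from collections import defaultdict

def rate_by_query_type(results: list[dict]) -> dict[str, float]:
    """Aggregate hallucination rates per query type. Assumes each result
    dict carries a 'query_type' tag (a field you add to your test cases)
    plus the 'hallucinations' and 'total_sentences' counts."""
    flagged = defaultdict(int)
    total = defaultdict(int)
    for r in results:
        flagged[r["query_type"]] += r["hallucinations"]
        total[r["query_type"]] += r["total_sentences"]
    return {t: round(flagged[t] / max(total[t], 1), 4) for t in total}

results = [
    {"query_type": "factoid", "hallucinations": 0, "total_sentences": 4},
    {"query_type": "factoid", "hallucinations": 1, "total_sentences": 6},
    {"query_type": "multi_hop", "hallucinations": 3, "total_sentences": 10},
]
print(rate_by_query_type(results))
# → {'factoid': 0.1, 'multi_hop': 0.3}
```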

Common Errors

NLI model flags everything as contradiction. You’re probably sending text that’s too long. DeBERTa-based NLI models work best with inputs under 512 tokens. Split your source into paragraphs and check each one individually against the claim.
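In practice that means checking the claim paragraph by paragraph and keeping the strongest contradiction. A sketch with a pluggable nli_fn(premise, hypothesis) callable (assumed to return a {'label': ..., 'score': ...} dict) so it isn't tied to one model:

```python
def check_long_source(source: str, claim: str, nli_fn, threshold: float = 0.7) -> dict:
    """Check a claim against a long source by splitting it into paragraphs
    (each safely under typical 512-token NLI limits) and keeping the
    strongest contradiction found across them."""
    paragraphs = [p.strip() for p in source.split("\n\n") if p.strip()]
    worst = {"label": "neutral", "score": 0.0, "paragraph": None}
    for para in paragraphs:
        result = nli_fn(para, claim)  # e.g. wrap your transformers pipeline here
        if result["label"].lower() == "contradiction" and result["score"] > worst["score"]:
            worst = {"label": "contradiction", "score": result["score"], "paragraph": para}
    worst["is_contradicted"] = worst["score"] > threshold
    return worst
```

Because the wrapper takes any callable, you can unit-test the chunking logic with a stub before wiring in the real model.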

Self-consistency check shows inconsistency on every question. Make sure you’re using temperature=1.0 for sampling but comparing answers semantically, not with string matching. Two answers can be worded completely differently and still agree factually.

Guardrails validator throws ModuleNotFoundError. Guardrails Hub validators are installed separately. Run guardrails hub install hub://guardrails/factual_consistency before importing.

RAG pipeline still hallucinating despite good retrieval. Check whether the model is actually using the retrieved context. Add an instruction like “Quote the specific text you’re basing your answer on” and verify the quotes exist in the source. If the model invents quotes, you have a faithfulness problem – lower the temperature and consider a smaller, more steerable model.
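A quick way to enforce that quoting instruction automatically: extract the quoted spans from the answer and confirm each appears verbatim in the context. A stdlib sketch, assuming the model wraps its quotes in plain double quotation marks:

```python
import re

def verify_quotes(answer: str, context: str) -> dict:
    """Check that every double-quoted span in the answer exists verbatim
    in the provided context. Invented quotes signal a faithfulness failure."""
    quotes = re.findall(r'"([^"]+)"', answer)
    missing = [q for q in quotes if q not in context]
    return {"quotes": quotes, "missing": missing, "faithful": not missing}

context = "The free tier allows 100 requests per minute."
answer = 'The limit is 100/minute: "The free tier allows 100 requests per minute."'
print(verify_quotes(answer, context))
# → {'quotes': ['The free tier allows 100 requests per minute.'], 'missing': [], 'faithful': True}
```

Exact substring matching is strict by design: if the model so much as paraphrases inside quotation marks, the quote gets flagged, which is exactly the behavior you want here.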

Hallucination rate metric is suspiciously low. Your test suite probably doesn’t have enough adversarial cases. Add questions where the answer isn’t in the source material, questions that require reasoning across multiple chunks, and questions with common misconceptions that the model might parrot from training data.