LLMs hallucinate. That’s not a bug you can patch – it’s baked into how autoregressive generation works. If you’re building anything that touches real users, you need a scoring system that tells you how much of a generated response is actually grounded in your source documents.
NLI (Natural Language Inference) models solve this. They take a premise-hypothesis pair and classify it as entailment, contradiction, or neutral. Feed your source document as the premise and each sentence from the LLM output as the hypothesis, and you get a per-sentence grounding score.
Here’s the simplest version – score a single claim against a source passage:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "cross-encoder/nli-deberta-v3-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

source = "The Apollo 11 mission landed on the Moon on July 20, 1969."
claim = "Apollo 11 reached the Moon in July 1969."

inputs = tokenizer(source, claim, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    logits = model(**inputs).logits

# Labels: 0 = contradiction, 1 = entailment, 2 = neutral
probs = torch.softmax(logits, dim=-1).squeeze()
labels = ["contradiction", "entailment", "neutral"]
for label, prob in zip(labels, probs):
    print(f"{label}: {prob:.4f}")
# contradiction: 0.0031
# entailment: 0.9712
# neutral: 0.0257
```
That gives you a probability distribution across all three classes. The entailment score is your grounding confidence.
## NLI-Based Grounding Scores

The `cross-encoder/nli-deberta-v3-base` model works as a cross-encoder, meaning it processes both texts together through the full transformer stack. This gives you much better accuracy than bi-encoder approaches for this task.
Wrap the scoring logic into a reusable function:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
from dataclasses import dataclass

model_name = "cross-encoder/nli-deberta-v3-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()


@dataclass
class NLIResult:
    contradiction: float
    entailment: float
    neutral: float

    @property
    def is_grounded(self) -> bool:
        return self.entailment > self.contradiction and self.entailment > 0.5

    @property
    def is_hallucinated(self) -> bool:
        return self.contradiction > self.entailment and self.contradiction > 0.5


def score_claim(premise: str, hypothesis: str) -> NLIResult:
    """Score a single claim against a source passage using NLI."""
    inputs = tokenizer(
        premise, hypothesis,
        return_tensors="pt",
        truncation=True,
        max_length=512,
    )
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1).squeeze().tolist()
    # Model label order: contradiction=0, entailment=1, neutral=2
    return NLIResult(
        contradiction=probs[0],
        entailment=probs[1],
        neutral=probs[2],
    )


# Test it
source = "Python 3.12 was released in October 2023 with a new type statement."
claims = [
    "Python 3.12 came out in October 2023.",    # grounded
    "Python 3.12 was released in March 2024.",  # hallucinated
    "Python supports dynamic typing.",          # neutral (not in source)
]

for claim in claims:
    result = score_claim(source, claim)
    status = "GROUNDED" if result.is_grounded else (
        "HALLUCINATED" if result.is_hallucinated else "UNSUPPORTED"
    )
    print(f"[{status}] {claim}")
    print(f"  entailment={result.entailment:.3f}, "
          f"contradiction={result.contradiction:.3f}, "
          f"neutral={result.neutral:.3f}")
```
The key insight: neutral doesn’t mean “fine.” A neutral score means the source material neither supports nor contradicts the claim. For grounding verification, neutral claims are unsupported – the LLM added information that isn’t backed by your documents.
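To make the three-way split concrete, here is the decision rule in isolation, fed hand-picked probability triples rather than real model output (the numbers are illustrative):

```python
def classify(contradiction: float, entailment: float, neutral: float) -> str:
    """Map an NLI probability triple to a grounding verdict.

    A label must beat its opposite AND clear an absolute 0.5 bar;
    anything else, including a dominant neutral, is merely unsupported.
    """
    if entailment > contradiction and entailment > 0.5:
        return "GROUNDED"
    if contradiction > entailment and contradiction > 0.5:
        return "HALLUCINATED"
    return "UNSUPPORTED"

print(classify(0.003, 0.971, 0.026))  # GROUNDED
print(classify(0.912, 0.041, 0.047))  # HALLUCINATED
print(classify(0.100, 0.150, 0.750))  # UNSUPPORTED
```

Note the third case: neutral dominates, so the claim is neither grounded nor hallucinated, it simply lacks support.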
## Sentence-Level Hallucination Scoring
Real LLM outputs are paragraphs, not single sentences. You need to break them apart and score each sentence independently. Some sentences will be grounded, others won’t – and you want to know exactly which ones failed.
```python
import nltk
nltk.download("punkt_tab", quiet=True)
from nltk.tokenize import sent_tokenize
from dataclasses import dataclass, field


@dataclass
class SentenceScore:
    sentence: str
    result: NLIResult
    best_source: str


@dataclass
class DocumentScore:
    sentence_scores: list[SentenceScore] = field(default_factory=list)

    @property
    def grounded_ratio(self) -> float:
        if not self.sentence_scores:
            return 0.0
        grounded = sum(1 for s in self.sentence_scores if s.result.is_grounded)
        return grounded / len(self.sentence_scores)

    @property
    def hallucination_ratio(self) -> float:
        if not self.sentence_scores:
            return 0.0
        hallucinated = sum(1 for s in self.sentence_scores if s.result.is_hallucinated)
        return hallucinated / len(self.sentence_scores)


def score_response(
    llm_output: str,
    source_passages: list[str],
) -> DocumentScore:
    """Score each sentence of an LLM response against source passages."""
    sentences = sent_tokenize(llm_output)
    doc_score = DocumentScore()
    for sentence in sentences:
        # Skip very short fragments
        if len(sentence.split()) < 3:
            continue
        best_entailment = -1.0
        best_result = None
        best_source = ""
        # Score against each source passage, keep the best match
        for passage in source_passages:
            result = score_claim(passage, sentence)
            if result.entailment > best_entailment:
                best_entailment = result.entailment
                best_result = result
                best_source = passage
        doc_score.sentence_scores.append(
            SentenceScore(
                sentence=sentence,
                result=best_result,
                best_source=best_source,
            )
        )
    return doc_score


# Example usage
sources = [
    "Tesla was founded in 2003 by Martin Eberhard and Marc Tarpenning.",
    "Elon Musk joined Tesla in 2004 as chairman of the board after leading the Series A funding round.",
]
llm_output = (
    "Tesla was founded in 2003 by Martin Eberhard and Marc Tarpenning. "
    "Elon Musk co-founded Tesla alongside them in 2003. "
    "The company went public in 2010 with a successful IPO."
)

doc = score_response(llm_output, sources)
for ss in doc.sentence_scores:
    tag = "GROUNDED" if ss.result.is_grounded else (
        "HALLUCINATED" if ss.result.is_hallucinated else "UNSUPPORTED"
    )
    print(f"[{tag}] {ss.sentence}")
    print(f"  entailment={ss.result.entailment:.3f}")

print(f"\nGrounded: {doc.grounded_ratio:.0%}")
print(f"Hallucinated: {doc.hallucination_ratio:.0%}")
```
This catches the classic “Elon Musk co-founded Tesla” hallucination. The source says he joined in 2004, not that he co-founded it. The NLI model flags the contradiction. The IPO sentence gets marked as unsupported – it may be true, but it’s not backed by the provided sources.
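One cost note: `score_response` runs one cross-encoder pass per (passage, sentence) pair. Since only the best match survives, you can short-circuit as soon as a passage clears your grounding bar. A minimal sketch of that loop, using a stubbed scoring function in place of the real `score_claim` (the stub and the 0.9 cutoff are both illustrative):

```python
def best_match(sentence, passages, score_fn, early_exit=0.9):
    """Return (best_score, best_passage); stop early on a strong match."""
    best_score, best_passage = -1.0, ""
    for passage in passages:
        s = score_fn(passage, sentence)
        if s > best_score:
            best_score, best_passage = s, passage
        if s >= early_exit:  # good enough, skip the remaining passages
            break
    return best_score, best_passage

# Stub scorer: pretend only passage "A" entails the claim
fake_entailment = {("A", "claim"): 0.95, ("B", "claim"): 0.10}
score = lambda premise, hypothesis: fake_entailment[(premise, hypothesis)]

print(best_match("claim", ["A", "B"], score))  # (0.95, 'A') and "B" is never scored
```

With real retrieval results sorted by relevance, the first passage usually wins, so this cuts most of the NLI calls.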
## Aggregate Scoring and Thresholds
Per-sentence scores are useful for debugging, but you need aggregate numbers to make automated decisions. Here’s how to compute document-level hallucination scores and set actionable thresholds.
```python
from enum import Enum


class GroundingVerdict(Enum):
    PASS = "pass"
    WARN = "warn"
    FAIL = "fail"


@dataclass
class GroundingConfig:
    """Thresholds for grounding decisions."""
    min_grounded_ratio: float = 0.7       # At least 70% of sentences must be grounded
    max_hallucination_ratio: float = 0.1  # No more than 10% contradictions allowed
    warn_grounded_ratio: float = 0.85     # Warn if below 85% grounded


def evaluate_grounding(
    doc_score: DocumentScore,
    config: GroundingConfig = GroundingConfig(),
) -> tuple[GroundingVerdict, dict]:
    """Evaluate a document score against thresholds."""
    grounded = doc_score.grounded_ratio
    hallucinated = doc_score.hallucination_ratio
    total_sentences = len(doc_score.sentence_scores)
    # Collect flagged sentences for the report
    flagged = [
        {
            "sentence": ss.sentence,
            "entailment": round(ss.result.entailment, 3),
            "contradiction": round(ss.result.contradiction, 3),
        }
        for ss in doc_score.sentence_scores
        if ss.result.is_hallucinated or not ss.result.is_grounded
    ]
    report = {
        "total_sentences": total_sentences,
        "grounded_ratio": round(grounded, 3),
        "hallucination_ratio": round(hallucinated, 3),
        "flagged_sentences": flagged,
    }
    if hallucinated > config.max_hallucination_ratio:
        return GroundingVerdict.FAIL, report
    if grounded < config.min_grounded_ratio:
        return GroundingVerdict.FAIL, report
    if grounded < config.warn_grounded_ratio:
        return GroundingVerdict.WARN, report
    return GroundingVerdict.PASS, report


# Using the doc from the earlier example
config = GroundingConfig(
    min_grounded_ratio=0.7,
    max_hallucination_ratio=0.1,
    warn_grounded_ratio=0.85,
)
verdict, report = evaluate_grounding(doc, config)
print(f"Verdict: {verdict.value}")
print(f"Grounded: {report['grounded_ratio']:.0%}")
print(f"Hallucinated: {report['hallucination_ratio']:.0%}")
print(f"Flagged sentences: {len(report['flagged_sentences'])}")
```
Pick your thresholds based on your risk tolerance. Medical and legal apps should set `max_hallucination_ratio` close to 0. A casual chatbot can tolerate more. Start strict and loosen up once you see how your model performs on real traffic.
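To get a feel for how the knobs interact, here is the same FAIL-before-WARN precedence run standalone over a few hypothetical ratio pairs, comparing a strict config against a lenient one (both configs are made up for illustration, not recommendations):

```python
def grounding_verdict(grounded, hallucinated, min_grounded, max_halluc, warn_grounded):
    """Same precedence as evaluate_grounding: FAIL checks run before WARN."""
    if hallucinated > max_halluc or grounded < min_grounded:
        return "fail"
    if grounded < warn_grounded:
        return "warn"
    return "pass"

strict = dict(min_grounded=0.9, max_halluc=0.0, warn_grounded=0.95)  # medical/legal style
lenient = dict(min_grounded=0.5, max_halluc=0.2, warn_grounded=0.7)  # casual chatbot style

for g, h in [(1.0, 0.0), (0.8, 0.1), (0.6, 0.3)]:
    print(f"grounded={g:.0%} hallucinated={h:.0%} -> "
          f"strict={grounding_verdict(g, h, **strict)}, "
          f"lenient={grounding_verdict(g, h, **lenient)}")
```

A single contradicted sentence in a ten-sentence answer fails the strict config outright, while the lenient one lets it pass.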
## Building a Verification Middleware
You probably don’t want to manually call scoring functions every time. Wrap everything into a middleware class that sits between your LLM call and the user.
```python
from dataclasses import dataclass


@dataclass
class VerifiedResponse:
    text: str
    verdict: GroundingVerdict
    grounded_ratio: float
    hallucination_ratio: float
    flagged_sentences: list[dict]
    passed: bool


class HallucinationVerifier:
    """Middleware that scores LLM outputs against source documents."""

    def __init__(
        self,
        min_grounded: float = 0.7,
        max_hallucinated: float = 0.1,
    ):
        self.config = GroundingConfig(
            min_grounded_ratio=min_grounded,
            max_hallucination_ratio=max_hallucinated,
        )

    def verify(
        self,
        llm_output: str,
        source_passages: list[str],
    ) -> VerifiedResponse:
        """Score an LLM response and return a verified result."""
        doc_score = score_response(llm_output, source_passages)
        verdict, report = evaluate_grounding(doc_score, self.config)
        return VerifiedResponse(
            text=llm_output,
            verdict=verdict,
            grounded_ratio=report["grounded_ratio"],
            hallucination_ratio=report["hallucination_ratio"],
            flagged_sentences=report["flagged_sentences"],
            passed=(verdict != GroundingVerdict.FAIL),
        )

    def verify_or_reject(
        self,
        llm_output: str,
        source_passages: list[str],
        fallback: str = "I cannot verify this response against the available sources.",
    ) -> str:
        """Return the LLM output if it passes, or a fallback message if it fails."""
        result = self.verify(llm_output, source_passages)
        if result.passed:
            return result.text
        return fallback


# Usage
verifier = HallucinationVerifier(min_grounded=0.7, max_hallucinated=0.1)
sources = [
    "The Python GIL prevents true parallel execution of threads for CPU-bound tasks.",
    "Python 3.13 includes an experimental free-threaded build that removes the GIL.",
]
response = (
    "Python's GIL prevents parallel thread execution. "
    "Python 3.13 offers an experimental build without the GIL. "
    "This makes Python faster than C for CPU-bound work."
)

result = verifier.verify(response, sources)
print(f"Passed: {result.passed}")
print(f"Grounded: {result.grounded_ratio:.0%}")
for item in result.flagged_sentences:
    print(f"  FLAGGED: {item['sentence']}")
    print(f"    contradiction={item['contradiction']}")

# Or use the reject-on-fail version
safe_output = verifier.verify_or_reject(response, sources)
print(f"\nFinal output: {safe_output}")
```
The `verify_or_reject` method is what you'd use in production. If the grounding check fails, return a safe fallback instead of the hallucinated text. You can also use `verify` directly and present flagged sentences to the user with warnings.
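For that second option, surfacing flagged sentences rather than suppressing the whole response, a small helper can attach them as a footer. A sketch (the footer format is arbitrary, and the dicts mimic the flagged_sentences entries):

```python
def annotate(text: str, flagged: list[dict]) -> str:
    """Append a footer listing sentences that failed verification."""
    if not flagged:
        return text
    lines = [text, "", "Unverified claims:"]
    lines += [f"  - {item['sentence']}" for item in flagged]
    return "\n".join(lines)

flagged = [{"sentence": "This makes Python faster than C for CPU-bound work.",
            "entailment": 0.02, "contradiction": 0.71}]
print(annotate("Python's GIL prevents parallel thread execution.", flagged))
```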
## Plugging Into a RAG Pipeline
If you’re using a retrieval-augmented generation setup, slot the verifier in right after the LLM call:
```python
def rag_query(
    question: str,
    retriever,
    llm_client,
    verifier: HallucinationVerifier,
) -> str:
    """RAG pipeline with hallucination verification."""
    # Step 1: Retrieve source documents
    source_docs = retriever.search(question, top_k=5)
    source_texts = [doc.text for doc in source_docs]

    # Step 2: Generate response
    context = "\n\n".join(source_texts)
    llm_output = llm_client.generate(
        prompt=f"Answer based on the context:\n{context}\n\nQuestion: {question}",
    )

    # Step 3: Verify grounding
    result = verifier.verify(llm_output, source_texts)
    if not result.passed:
        return (
            f"I found relevant documents but couldn't generate a fully verified answer. "
            f"Grounding confidence: {result.grounded_ratio:.0%}. "
            f"Please review the source documents directly."
        )
    return result.text
```
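A common extension, not shown above, is a single retry: if verification fails, regenerate (perhaps with a stricter prompt) before giving up. Sketched here with toy stubs standing in for the LLM and the verifier:

```python
def generate_with_retry(generate, is_verified, max_attempts=2):
    """Call generate() until is_verified() passes or attempts run out."""
    text = ""
    for attempt in range(max_attempts):
        text = generate(attempt)
        if is_verified(text):
            return text, True
    return text, False

# Toy stubs: the first draft fails verification, the retry passes
drafts = ["Musk co-founded Tesla in 2003.", "Musk joined Tesla in 2004."]
gen = lambda attempt: drafts[attempt]
ok = lambda text: "joined" in text

print(generate_with_retry(gen, ok))  # ('Musk joined Tesla in 2004.', True)
```

Cap the attempts: each retry costs an LLM call plus a full round of NLI scoring.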
## Common Errors and Fixes

### Tokenizer truncation warnings

```
Token indices sequence length is longer than the specified maximum sequence length
for this model (1024 > 512). Running this sequence through the model will result
in indexing errors.
```
This happens when your source passages are too long. The DeBERTa model has a 512-token limit. Fix it by chunking your source documents before scoring:
```python
def chunk_text(text: str, max_tokens: int = 400) -> list[str]:
    """Split text into chunks that fit within the model's token limit."""
    words = text.split()
    chunks = []
    current_chunk = []
    current_length = 0
    for word in words:
        # Rough heuristic: ~4 characters per token, plus 1 for the boundary
        estimated_tokens = len(word) / 4 + 1
        if current_length + estimated_tokens > max_tokens and current_chunk:
            chunks.append(" ".join(current_chunk))
            current_chunk = []
            current_length = 0
        current_chunk.append(word)
        current_length += estimated_tokens
    if current_chunk:
        chunks.append(" ".join(current_chunk))
    return chunks
```
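One caveat with hard cuts: the evidence for a claim can straddle a chunk boundary, so neither half entails it on its own. A sliding-window variant with overlapping chunks avoids that (the window and overlap sizes here are arbitrary starting points, counted in words for simplicity):

```python
def chunk_with_overlap(text: str, chunk_words: int = 300, overlap: int = 50) -> list[str]:
    """Split into overlapping word windows, so evidence near a boundary
    appears whole in at least one chunk."""
    words = text.split()
    if len(words) <= chunk_words:
        return [text] if words else []
    step = chunk_words - overlap
    return [
        " ".join(words[i:i + chunk_words])
        for i in range(0, len(words) - overlap, step)
    ]

# Tiny windows to make the overlap visible
print(chunk_with_overlap("a b c d e f g h i j", chunk_words=4, overlap=2))
# ['a b c d', 'c d e f', 'e f g h', 'g h i j']
```

Score a sentence against every chunk and keep the best entailment, exactly as score_response already does for multiple passages.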
### CUDA out of memory

```
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 MiB
```
If you’re scoring many sentences, batches can eat up GPU memory fast. Process sentences one at a time or use smaller batch sizes. Also make sure you’re using torch.no_grad() – without it, PyTorch stores intermediate activations for backpropagation you don’t need:
```python
# Always wrap inference in no_grad
with torch.no_grad():
    logits = model(**inputs).logits

# If still hitting OOM, move to CPU for small batches
model = model.to("cpu")
```
### NLTK punkt_tab data missing

```
LookupError: Resource punkt_tab not found.
```
The `sent_tokenize` function requires the punkt_tab tokenizer data. Download it before first use:
```python
import nltk
nltk.download("punkt_tab", quiet=True)
```
If you’re in a Docker container or CI environment where downloads are blocked, pre-download the data and mount it:
```bash
python -c "import nltk; nltk.download('punkt_tab', download_dir='/app/nltk_data')"
export NLTK_DATA=/app/nltk_data
```