LLMs hallucinate. That’s not a bug you can patch – it’s baked into how autoregressive generation works. If you’re building anything that touches real users, you need a scoring system that tells you how much of a generated response is actually grounded in your source documents.
NLI (Natural Language Inference) models solve this. They take a premise-hypothesis pair and classify it as entailment, contradiction, or neutral. Feed your source document as the premise and each sentence from the LLM output as the hypothesis, and you get a per-sentence grounding score.
Here’s the simplest version – score a single claim against a source passage:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "cross-encoder/nli-deberta-v3-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

source = "The Apollo 11 mission landed on the Moon on July 20, 1969."
claim = "Apollo 11 reached the Moon in July 1969."

inputs = tokenizer(source, claim, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    logits = model(**inputs).logits

# Labels: 0 = contradiction, 1 = entailment, 2 = neutral
probs = torch.softmax(logits, dim=-1).squeeze()
labels = ["contradiction", "entailment", "neutral"]
for label, prob in zip(labels, probs):
    print(f"{label}: {prob:.4f}")
# contradiction: 0.0031
# entailment: 0.9712
# neutral: 0.0257
```
That gives you a probability distribution across all three classes. The entailment score is your grounding confidence.
## NLI-Based Grounding Scores

The `cross-encoder/nli-deberta-v3-base` model works as a cross-encoder, meaning it processes both texts together through the full transformer stack. This gives you much better accuracy than bi-encoder approaches for this task.
Wrap the scoring logic into a reusable function:
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
from dataclasses import dataclass

model_name = "cross-encoder/nli-deberta-v3-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()


@dataclass
class NLIResult:
    contradiction: float
    entailment: float
    neutral: float

    @property
    def is_grounded(self) -> bool:
        return self.entailment > self.contradiction and self.entailment > 0.5

    @property
    def is_hallucinated(self) -> bool:
        return self.contradiction > self.entailment and self.contradiction > 0.5


def score_claim(premise: str, hypothesis: str) -> NLIResult:
    """Score a single claim against a source passage using NLI."""
    inputs = tokenizer(
        premise, hypothesis,
        return_tensors="pt",
        truncation=True,
        max_length=512,
    )
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1).squeeze().tolist()
    # Model label order: contradiction=0, entailment=1, neutral=2
    return NLIResult(
        contradiction=probs[0],
        entailment=probs[1],
        neutral=probs[2],
    )


# Test it
source = "Python 3.12 was released in October 2023 with a new type statement."
claims = [
    "Python 3.12 came out in October 2023.",    # grounded
    "Python 3.12 was released in March 2024.",  # hallucinated
    "Python supports dynamic typing.",          # neutral (not in source)
]

for claim in claims:
    result = score_claim(source, claim)
    status = "GROUNDED" if result.is_grounded else (
        "HALLUCINATED" if result.is_hallucinated else "UNSUPPORTED"
    )
    print(f"[{status}] {claim}")
    print(f"  entailment={result.entailment:.3f}, "
          f"contradiction={result.contradiction:.3f}, "
          f"neutral={result.neutral:.3f}")
```
The key insight: neutral doesn’t mean “fine.” A neutral score means the source material neither supports nor contradicts the claim. For grounding verification, neutral claims are unsupported – the LLM added information that isn’t backed by your documents.
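To make the three-way split concrete, here is the decision rule in isolation, fed hand-picked probability triples rather than real model output (the numbers are illustrative):

```python
def classify(contradiction: float, entailment: float, neutral: float) -> str:
    """Map an NLI probability triple to a grounding verdict.

    A label must beat its opposite AND clear an absolute 0.5 bar;
    anything else, including a dominant neutral, is merely unsupported.
    """
    if entailment > contradiction and entailment > 0.5:
        return "GROUNDED"
    if contradiction > entailment and contradiction > 0.5:
        return "HALLUCINATED"
    return "UNSUPPORTED"

print(classify(0.003, 0.971, 0.026))  # GROUNDED
print(classify(0.912, 0.041, 0.047))  # HALLUCINATED
print(classify(0.100, 0.150, 0.750))  # UNSUPPORTED
```

Note the third case: neutral dominates, so the claim is neither grounded nor hallucinated, it simply lacks support.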
## Sentence-Level Hallucination Scoring
Real LLM outputs are paragraphs, not single sentences. You need to break them apart and score each sentence independently. Some sentences will be grounded, others won’t – and you want to know exactly which ones failed.
```python
import nltk
nltk.download("punkt_tab", quiet=True)
from nltk.tokenize import sent_tokenize
from dataclasses import dataclass, field


@dataclass
class SentenceScore:
    sentence: str
    result: NLIResult
    best_source: str


@dataclass
class DocumentScore:
    sentence_scores: list[SentenceScore] = field(default_factory=list)

    @property
    def grounded_ratio(self) -> float:
        if not self.sentence_scores:
            return 0.0
        grounded = sum(1 for s in self.sentence_scores if s.result.is_grounded)
        return grounded / len(self.sentence_scores)

    @property
    def hallucination_ratio(self) -> float:
        if not self.sentence_scores:
            return 0.0
        hallucinated = sum(1 for s in self.sentence_scores if s.result.is_hallucinated)
        return hallucinated / len(self.sentence_scores)


def score_response(
    llm_output: str,
    source_passages: list[str],
) -> DocumentScore:
    """Score each sentence of an LLM response against source passages."""
    sentences = sent_tokenize(llm_output)
    doc_score = DocumentScore()
    for sentence in sentences:
        # Skip very short fragments
        if len(sentence.split()) < 3:
            continue
        best_entailment = -1.0
        best_result = None
        best_source = ""
        # Score against each source passage, keep the best match
        for passage in source_passages:
            result = score_claim(passage, sentence)
            if result.entailment > best_entailment:
                best_entailment = result.entailment
                best_result = result
                best_source = passage
        doc_score.sentence_scores.append(
            SentenceScore(
                sentence=sentence,
                result=best_result,
                best_source=best_source,
            )
        )
    return doc_score


# Example usage
sources = [
    "Tesla was founded in 2003 by Martin Eberhard and Marc Tarpenning.",
    "Elon Musk joined Tesla in 2004 as chairman of the board after leading the Series A funding round.",
]
llm_output = (
    "Tesla was founded in 2003 by Martin Eberhard and Marc Tarpenning. "
    "Elon Musk co-founded Tesla alongside them in 2003. "
    "The company went public in 2010 with a successful IPO."
)

doc = score_response(llm_output, sources)
for ss in doc.sentence_scores:
    tag = "GROUNDED" if ss.result.is_grounded else (
        "HALLUCINATED" if ss.result.is_hallucinated else "UNSUPPORTED"
    )
    print(f"[{tag}] {ss.sentence}")
    print(f"  entailment={ss.result.entailment:.3f}")

print(f"\nGrounded: {doc.grounded_ratio:.0%}")
print(f"Hallucinated: {doc.hallucination_ratio:.0%}")
```
This catches the classic “Elon Musk co-founded Tesla” hallucination. The source says he joined in 2004, not that he co-founded it. The NLI model flags the contradiction. The IPO sentence gets marked as unsupported – it may be true, but it’s not backed by the provided sources.
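One cost note: `score_response` runs one cross-encoder pass per (passage, sentence) pair. Since only the best match survives, you can short-circuit as soon as a passage clears your grounding bar. A minimal sketch of that loop, using a stubbed scoring function in place of the real `score_claim` (the stub and the 0.9 cutoff are both illustrative):

```python
def best_match(sentence, passages, score_fn, early_exit=0.9):
    """Return (best_score, best_passage); stop early on a strong match."""
    best_score, best_passage = -1.0, ""
    for passage in passages:
        s = score_fn(passage, sentence)
        if s > best_score:
            best_score, best_passage = s, passage
        if s >= early_exit:  # good enough, skip the remaining passages
            break
    return best_score, best_passage

# Stub scorer: pretend only passage "A" entails the claim
fake_entailment = {("A", "claim"): 0.95, ("B", "claim"): 0.10}
score = lambda premise, hypothesis: fake_entailment[(premise, hypothesis)]

print(best_match("claim", ["A", "B"], score))  # (0.95, 'A') and "B" is never scored
```

With real retrieval results sorted by relevance, the first passage usually wins, so this cuts most of the NLI calls.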
## Aggregate Scoring and Thresholds
Per-sentence scores are useful for debugging, but you need aggregate numbers to make automated decisions. Here’s how to compute document-level hallucination scores and set actionable thresholds.
```python
from enum import Enum


class GroundingVerdict(Enum):
    PASS = "pass"
    WARN = "warn"
    FAIL = "fail"


@dataclass
class GroundingConfig:
    """Thresholds for grounding decisions."""
    min_grounded_ratio: float = 0.7       # At least 70% of sentences must be grounded
    max_hallucination_ratio: float = 0.1  # No more than 10% contradictions allowed
    warn_grounded_ratio: float = 0.85     # Warn if below 85% grounded


def evaluate_grounding(
    doc_score: DocumentScore,
    config: GroundingConfig = GroundingConfig(),
) -> tuple[GroundingVerdict, dict]:
    """Evaluate a document score against thresholds."""
    grounded = doc_score.grounded_ratio
    hallucinated = doc_score.hallucination_ratio
    total_sentences = len(doc_score.sentence_scores)
    # Collect flagged sentences for the report
    flagged = [
        {
            "sentence": ss.sentence,
            "entailment": round(ss.result.entailment, 3),
            "contradiction": round(ss.result.contradiction, 3),
        }
        for ss in doc_score.sentence_scores
        if ss.result.is_hallucinated or not ss.result.is_grounded
    ]
    report = {
        "total_sentences": total_sentences,
        "grounded_ratio": round(grounded, 3),
        "hallucination_ratio": round(hallucinated, 3),
        "flagged_sentences": flagged,
    }
    if hallucinated > config.max_hallucination_ratio:
        return GroundingVerdict.FAIL, report
    if grounded < config.min_grounded_ratio:
        return GroundingVerdict.FAIL, report
    if grounded < config.warn_grounded_ratio:
        return GroundingVerdict.WARN, report
    return GroundingVerdict.PASS, report


# Using the doc from the earlier example
config = GroundingConfig(
    min_grounded_ratio=0.7,
    max_hallucination_ratio=0.1,
    warn_grounded_ratio=0.85,
)
verdict, report = evaluate_grounding(doc, config)
print(f"Verdict: {verdict.value}")
print(f"Grounded: {report['grounded_ratio']:.0%}")
print(f"Hallucinated: {report['hallucination_ratio']:.0%}")
print(f"Flagged sentences: {len(report['flagged_sentences'])}")
```
Pick your thresholds based on your risk tolerance. Medical and legal apps should set `max_hallucination_ratio` close to 0. A casual chatbot can tolerate more. Start strict and loosen up once you see how your model performs on real traffic.
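To get a feel for how the knobs interact, here is the same FAIL-before-WARN precedence run standalone over a few hypothetical ratio pairs, comparing a strict config against a lenient one (both configs are made up for illustration, not recommendations):

```python
def grounding_verdict(grounded, hallucinated, min_grounded, max_halluc, warn_grounded):
    """Same precedence as evaluate_grounding: FAIL checks run before WARN."""
    if hallucinated > max_halluc or grounded < min_grounded:
        return "fail"
    if grounded < warn_grounded:
        return "warn"
    return "pass"

strict = dict(min_grounded=0.9, max_halluc=0.0, warn_grounded=0.95)  # medical/legal style
lenient = dict(min_grounded=0.5, max_halluc=0.2, warn_grounded=0.7)  # casual chatbot style

for g, h in [(1.0, 0.0), (0.8, 0.1), (0.6, 0.3)]:
    print(f"grounded={g:.0%} hallucinated={h:.0%} -> "
          f"strict={grounding_verdict(g, h, **strict)}, "
          f"lenient={grounding_verdict(g, h, **lenient)}")
```

A single contradicted sentence in a ten-sentence answer fails the strict config outright, while the lenient one lets it pass.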
## Building a Verification Middleware
You probably don’t want to manually call scoring functions every time. Wrap everything into a middleware class that sits between your LLM call and the user.
```python
from dataclasses import dataclass


@dataclass
class VerifiedResponse:
    text: str
    verdict: GroundingVerdict
    grounded_ratio: float
    hallucination_ratio: float
    flagged_sentences: list[dict]
    passed: bool


class HallucinationVerifier:
    """Middleware that scores LLM outputs against source documents."""

    def __init__(
        self,
        min_grounded: float = 0.7,
        max_hallucinated: float = 0.1,
    ):
        self.config = GroundingConfig(
            min_grounded_ratio=min_grounded,
            max_hallucination_ratio=max_hallucinated,
        )

    def verify(
        self,
        llm_output: str,
        source_passages: list[str],
    ) -> VerifiedResponse:
        """Score an LLM response and return a verified result."""
        doc_score = score_response(llm_output, source_passages)
        verdict, report = evaluate_grounding(doc_score, self.config)
        return VerifiedResponse(
            text=llm_output,
            verdict=verdict,
            grounded_ratio=report["grounded_ratio"],
            hallucination_ratio=report["hallucination_ratio"],
            flagged_sentences=report["flagged_sentences"],
            passed=(verdict != GroundingVerdict.FAIL),
        )

    def verify_or_reject(
        self,
        llm_output: str,
        source_passages: list[str],
        fallback: str = "I cannot verify this response against the available sources.",
    ) -> str:
        """Return the LLM output if it passes, or a fallback message if it fails."""
        result = self.verify(llm_output, source_passages)
        if result.passed:
            return result.text
        return fallback


# Usage
verifier = HallucinationVerifier(min_grounded=0.7, max_hallucinated=0.1)
sources = [
    "The Python GIL prevents true parallel execution of threads for CPU-bound tasks.",
    "Python 3.13 includes an experimental free-threaded build that removes the GIL.",
]
response = (
    "Python's GIL prevents parallel thread execution. "
    "Python 3.13 offers an experimental build without the GIL. "
    "This makes Python faster than C for CPU-bound work."
)

result = verifier.verify(response, sources)
print(f"Passed: {result.passed}")
print(f"Grounded: {result.grounded_ratio:.0%}")
for item in result.flagged_sentences:
    print(f"  FLAGGED: {item['sentence']}")
    print(f"    contradiction={item['contradiction']}")

# Or use the reject-on-fail version
safe_output = verifier.verify_or_reject(response, sources)
print(f"\nFinal output: {safe_output}")
```
The `verify_or_reject` method is what you'd use in production. If the grounding check fails, return a safe fallback instead of the hallucinated text. You can also use `verify` directly and present flagged sentences to the user with warnings.
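For that second option, surfacing flagged sentences rather than suppressing the whole response, a small helper can attach them as a footer. A sketch (the footer format is arbitrary, and the dicts mimic the flagged_sentences entries):

```python
def annotate(text: str, flagged: list[dict]) -> str:
    """Append a footer listing sentences that failed verification."""
    if not flagged:
        return text
    lines = [text, "", "Unverified claims:"]
    lines += [f"  - {item['sentence']}" for item in flagged]
    return "\n".join(lines)

flagged = [{"sentence": "This makes Python faster than C for CPU-bound work.",
            "entailment": 0.02, "contradiction": 0.71}]
print(annotate("Python's GIL prevents parallel thread execution.", flagged))
```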
## Plugging Into a RAG Pipeline
If you’re using a retrieval-augmented generation setup, slot the verifier in right after the LLM call:
```python
def rag_query(
    question: str,
    retriever,
    llm_client,
    verifier: HallucinationVerifier,
) -> str:
    """RAG pipeline with hallucination verification."""
    # Step 1: Retrieve source documents
    source_docs = retriever.search(question, top_k=5)
    source_texts = [doc.text for doc in source_docs]

    # Step 2: Generate response
    context = "\n\n".join(source_texts)
    llm_output = llm_client.generate(
        prompt=f"Answer based on the context:\n{context}\n\nQuestion: {question}",
    )

    # Step 3: Verify grounding
    result = verifier.verify(llm_output, source_texts)
    if not result.passed:
        return (
            f"I found relevant documents but couldn't generate a fully verified answer. "
            f"Grounding confidence: {result.grounded_ratio:.0%}. "
            f"Please review the source documents directly."
        )
    return result.text
```
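A common extension, not shown above, is a single retry: if verification fails, regenerate (perhaps with a stricter prompt) before giving up. Sketched here with toy stubs standing in for the LLM and the verifier:

```python
def generate_with_retry(generate, is_verified, max_attempts=2):
    """Call generate() until is_verified() passes or attempts run out."""
    text = ""
    for attempt in range(max_attempts):
        text = generate(attempt)
        if is_verified(text):
            return text, True
    return text, False

# Toy stubs: the first draft fails verification, the retry passes
drafts = ["Musk co-founded Tesla in 2003.", "Musk joined Tesla in 2004."]
gen = lambda attempt: drafts[attempt]
ok = lambda text: "joined" in text

print(generate_with_retry(gen, ok))  # ('Musk joined Tesla in 2004.', True)
```

Cap the attempts: each retry costs an LLM call plus a full round of NLI scoring.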
## Common Errors and Fixes

### Tokenizer truncation warnings

```
Token indices sequence length is longer than the specified maximum sequence length
for this model (1024 > 512). Running this sequence through the model will result
in indexing errors.
```
This happens when your source passages are too long. The DeBERTa model has a 512-token limit. Fix it by chunking your source documents before scoring:
```python
def chunk_text(text: str, max_tokens: int = 400) -> list[str]:
    """Split text into chunks that fit within the model's token limit."""
    words = text.split()
    chunks = []
    current_chunk = []
    current_length = 0
    for word in words:
        # Rough heuristic: ~4 characters per token, plus 1 for the boundary
        estimated_tokens = len(word) / 4 + 1
        if current_length + estimated_tokens > max_tokens and current_chunk:
            chunks.append(" ".join(current_chunk))
            current_chunk = []
            current_length = 0
        current_chunk.append(word)
        current_length += estimated_tokens
    if current_chunk:
        chunks.append(" ".join(current_chunk))
    return chunks
```
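One caveat with hard cuts: the evidence for a claim can straddle a chunk boundary, so neither half entails it on its own. A sliding-window variant with overlapping chunks avoids that (the window and overlap sizes here are arbitrary starting points, counted in words for simplicity):

```python
def chunk_with_overlap(text: str, chunk_words: int = 300, overlap: int = 50) -> list[str]:
    """Split into overlapping word windows, so evidence near a boundary
    appears whole in at least one chunk."""
    words = text.split()
    if len(words) <= chunk_words:
        return [text] if words else []
    step = chunk_words - overlap
    return [
        " ".join(words[i:i + chunk_words])
        for i in range(0, len(words) - overlap, step)
    ]

# Tiny windows to make the overlap visible
print(chunk_with_overlap("a b c d e f g h i j", chunk_words=4, overlap=2))
# ['a b c d', 'c d e f', 'e f g h', 'g h i j']
```

Score a sentence against every chunk and keep the best entailment, exactly as score_response already does for multiple passages.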
### CUDA out of memory

```
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 256.00 MiB
```
If you’re scoring many sentences, batches can eat up GPU memory fast. Process sentences one at a time or use smaller batch sizes. Also make sure you’re using torch.no_grad() – without it, PyTorch stores intermediate activations for backpropagation you don’t need:
```python
# Always wrap inference in no_grad
with torch.no_grad():
    logits = model(**inputs).logits

# If still hitting OOM, move to CPU for small batches
model = model.to("cpu")
```
### NLTK punkt_tab data missing

```
LookupError: Resource punkt_tab not found.
```
The `sent_tokenize` function requires the punkt_tab tokenizer data. Download it before first use:
```python
import nltk
nltk.download("punkt_tab", quiet=True)
```
If you’re in a Docker container or CI environment where downloads are blocked, pre-download the data and mount it:
```bash
python -c "import nltk; nltk.download('punkt_tab', download_dir='/app/nltk_data')"
export NLTK_DATA=/app/nltk_data
```