## The Core Idea: Verify Every Claim Against Sources
LLMs generate fluent text that sounds authoritative even when it’s wrong. If you’re building a RAG app, you already have source documents. The missing piece is a verification layer that checks whether the LLM’s output is actually supported by those sources.
The approach breaks down into three steps:
- Extract claims from the LLM response
- Score each claim against source documents using an NLI model
- Filter or flag responses that contain unsupported claims
This gives you a grounding score per claim and per response, plus a clear audit trail of what’s supported and what isn’t.
Before you can verify anything, you need to break the LLM’s response into individual, verifiable claims. A single paragraph might contain five distinct factual assertions. You need each one isolated.
Use the LLM itself for claim extraction – it’s genuinely good at this task:
```python
from openai import OpenAI
import json

client = OpenAI()

def extract_claims(text: str) -> list[str]:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        response_format={"type": "json_object"},
        messages=[
            {
                "role": "system",
                "content": (
                    "Extract every factual claim from the text. "
                    "Return JSON: {\"claims\": [\"claim1\", \"claim2\", ...]}. "
                    "Each claim should be a single, self-contained statement. "
                    "Skip opinions, hedged statements, and filler."
                ),
            },
            {"role": "user", "content": text},
        ],
    )
    result = json.loads(response.choices[0].message.content)
    return result.get("claims", [])

# Example
llm_output = (
    "Python was created by Guido van Rossum and first released in 1991. "
    "It uses dynamic typing and garbage collection. "
    "Python 3.12 introduced the new type statement for type aliases."
)

claims = extract_claims(llm_output)
for claim in claims:
    print(f"- {claim}")
```
Output:
```text
- Python was created by Guido van Rossum
- Python was first released in 1991
- Python uses dynamic typing
- Python uses garbage collection
- Python 3.12 introduced the new type statement for type aliases
```
Each claim is now a standalone sentence you can verify independently.
## Scoring Claims with NLI
Natural Language Inference classifies the relationship between a premise (your source document) and a hypothesis (the extracted claim) into three labels: entailment (supported), contradiction (refuted), or neutral (neither supported nor refuted).
The cross-encoder/nli-deberta-v3-base model is a strong open-source option for this task: it’s fast, accurate, and runs on CPU for moderate workloads.
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

MODEL_NAME = "cross-encoder/nli-deberta-v3-base"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

LABELS = ["contradiction", "entailment", "neutral"]

def score_claim(claim: str, source: str) -> dict:
    """Score a single claim against a source passage using NLI."""
    inputs = tokenizer(
        source, claim, return_tensors="pt", truncation=True, max_length=512
    )
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1).squeeze().tolist()
    scores = {label: round(prob, 4) for label, prob in zip(LABELS, probs)}
    scores["verdict"] = LABELS[probs.index(max(probs))]
    return scores

# Verify a claim against a source document
source_doc = (
    "Python is a high-level programming language created by Guido van Rossum. "
    "It was first released in 1991. Python supports multiple programming paradigms "
    "including procedural, object-oriented, and functional programming."
)

result = score_claim("Python was created by Guido van Rossum", source_doc)
print(result)
# {'contradiction': 0.0012, 'entailment': 0.9953, 'neutral': 0.0035, 'verdict': 'entailment'}
```
The model gives you probability scores across all three labels. A high entailment score means the source directly supports the claim. A high contradiction score means the source actively disputes it. Neutral means the source doesn’t address the claim at all – which is its own kind of red flag in a RAG context.
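One way to act on those three probabilities is a small triage helper. This is an illustrative sketch, not part of the pipeline above; the cutoff values (0.8, 0.5) are assumptions you would tune, and it consumes the score dict shape that `score_claim` returns:

```python
# Map an NLI score dict to a triage decision. The thresholds are
# illustrative assumptions, not tuned values.
def triage(scores: dict) -> str:
    if scores["entailment"] >= 0.8:
        return "accept"  # source directly supports the claim
    if scores["contradiction"] >= 0.5:
        return "reject"  # source actively disputes the claim
    return "flag"        # mostly neutral: source doesn't address it

print(triage({"contradiction": 0.001, "entailment": 0.995, "neutral": 0.004}))  # accept
print(triage({"contradiction": 0.7, "entailment": 0.1, "neutral": 0.2}))        # reject
print(triage({"contradiction": 0.05, "entailment": 0.15, "neutral": 0.8}))      # flag
```

Treating "flag" separately from "reject" matters: a neutral claim may just need better retrieval, while a contradicted claim should never reach the user.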
## Building the Full Grounding Pipeline
Now combine claim extraction and NLI scoring into a single pipeline that takes an LLM response plus its source documents and returns a grounding report:
```python
from dataclasses import dataclass

@dataclass
class ClaimResult:
    claim: str
    best_source: str
    entailment: float
    contradiction: float
    neutral: float
    verdict: str

@dataclass
class GroundingReport:
    claims: list[ClaimResult]
    grounding_score: float
    passed: bool

def check_grounding(
    llm_response: str,
    source_docs: list[str],
    threshold: float = 0.7,
) -> GroundingReport:
    """Run full grounding check on an LLM response against source documents."""
    claims = extract_claims(llm_response)
    results = []

    for claim in claims:
        best_score = None
        best_source = ""
        for doc in source_docs:
            scores = score_claim(claim, doc)
            if best_score is None or scores["entailment"] > best_score["entailment"]:
                best_score = scores
                best_source = doc[:100] + "..."
        results.append(
            ClaimResult(
                claim=claim,
                best_source=best_source,
                entailment=best_score["entailment"],
                contradiction=best_score["contradiction"],
                neutral=best_score["neutral"],
                verdict=best_score["verdict"],
            )
        )

    supported_count = sum(1 for r in results if r.verdict == "entailment")
    grounding_score = supported_count / len(results) if results else 0.0

    return GroundingReport(
        claims=results,
        grounding_score=round(grounding_score, 3),
        passed=grounding_score >= threshold,
    )
```
The threshold parameter controls how strict the check is. At 0.7, you require 70% of claims to be directly supported by source documents. For high-stakes applications (medical, legal, financial), push this to 0.9 or higher.
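To make the aggregation concrete, here is the same score computation in isolation, using a hypothetical list of per-claim verdicts (8 of 10 entailed):

```python
# Hypothetical verdicts for one response: 8 of 10 claims entailed.
verdicts = ["entailment"] * 8 + ["neutral", "contradiction"]

# This mirrors the aggregation inside check_grounding.
grounding_score = sum(v == "entailment" for v in verdicts) / len(verdicts)

print(grounding_score)         # 0.8
print(grounding_score >= 0.7)  # True  -- passes an internal-tool bar
print(grounding_score >= 0.9)  # False -- fails a medical/legal bar
```

The same response passes or fails depending only on where you set the bar, which is why the threshold belongs in configuration rather than code.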
## Integrating Grounding Checks into a RAG Pipeline
Here’s where this becomes practical. In a RAG app, you already retrieve context before generating. Add the grounding check as a post-generation step:
```python
def rag_with_grounding(query: str, retrieved_docs: list[str]) -> dict:
    """RAG pipeline with grounding verification."""
    context = "\n\n---\n\n".join(retrieved_docs)

    # Generate response
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": (
                    "Answer the question using only the provided context. "
                    "If the context doesn't contain enough information, say so."
                ),
            },
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {query}",
            },
        ],
    )
    answer = response.choices[0].message.content

    # Verify grounding
    report = check_grounding(answer, retrieved_docs, threshold=0.7)

    if not report.passed:
        flagged_claims = [
            r.claim for r in report.claims if r.verdict != "entailment"
        ]
        return {
            "answer": answer,
            "grounded": False,
            "grounding_score": report.grounding_score,
            "flagged_claims": flagged_claims,
            "action": "Response contains ungrounded claims. Review before showing to user.",
        }

    return {
        "answer": answer,
        "grounded": True,
        "grounding_score": report.grounding_score,
    }

# Usage
docs = [
    "The Eiffel Tower is 330 meters tall and located in Paris, France. "
    "It was completed in 1889 for the World's Fair.",
    "The tower receives about 7 million visitors per year. "
    "Gustave Eiffel's company designed and built the structure.",
]

result = rag_with_grounding("How tall is the Eiffel Tower?", docs)
print(f"Grounded: {result['grounded']}")
print(f"Score: {result['grounding_score']}")
```
When the grounding check fails, you have options: return a warning to the user, regenerate with stricter instructions, or fall back to an “I don’t have enough information” response. Pick the strategy that matches your risk tolerance.
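Those three strategies can be sketched as a single dispatcher over the dict that `rag_with_grounding` returns. This is an illustrative sketch: the function name and message wording are made up, and the regenerate branch is stubbed rather than wired to an LLM client:

```python
# Sketch of the three recovery strategies. The "regenerate" branch would
# re-call the LLM with stricter instructions; here it's left as a stub.
def handle_failed_grounding(result: dict, strategy: str = "fallback") -> str:
    if strategy == "warn":
        flagged = "; ".join(result["flagged_claims"])
        return f"{result['answer']}\n\n[Unverified claims: {flagged}]"
    if strategy == "regenerate":
        raise NotImplementedError(
            "re-prompt with e.g. 'Only state facts you can quote from the context.'"
        )
    # Default: refuse rather than risk showing an ungrounded answer.
    return "I don't have enough information in my sources to answer that."

failed = {
    "answer": "The tower is 350 meters tall.",
    "flagged_claims": ["The tower is 350 meters tall"],
}
print(handle_failed_grounding(failed, "warn"))
print(handle_failed_grounding(failed))
```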
## Tuning Thresholds for Your Use Case
Not every application needs the same grounding strictness. Here’s a practical breakdown:
| Use Case | Threshold | Rationale |
|---|---|---|
| Customer support bot | 0.6 | Some paraphrasing is fine |
| Internal knowledge base | 0.7 | Good balance of accuracy and coverage |
| Medical/legal Q&A | 0.9 | Nearly every claim must be sourced |
| Research assistant | 0.5 | Exploratory answers are acceptable |
You can also apply per-claim thresholds instead of aggregate scores. Flag any individual claim with entailment below 0.5, even if the overall score is high:
```python
def flag_weak_claims(report: GroundingReport, min_entailment: float = 0.5) -> list[str]:
    """Find individual claims with low entailment scores."""
    return [
        f"[{r.entailment:.2f}] {r.claim}"
        for r in report.claims
        if r.entailment < min_entailment
    ]
```
This catches the case where nine out of ten claims are grounded but one is completely fabricated – the overall score looks fine, but that single bad claim could cause real harm.
## Common Errors and Fixes
**RuntimeError: CUDA out of memory when loading the NLI model**
The DeBERTa model runs fine on CPU for low-to-medium throughput. Force CPU explicitly:
```python
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME).to("cpu")
```
For high throughput, batch your inputs and use a GPU with at least 4GB VRAM.
**NLI model gives “neutral” for everything**
This usually means your source text is too long and gets truncated at 512 tokens. Split source documents into paragraph-level chunks before scoring:
```python
def chunk_text(text: str, max_words: int = 200) -> list[str]:
    words = text.split()
    return [
        " ".join(words[i : i + max_words])
        for i in range(0, len(words), max_words)
    ]
```
Score each chunk separately and take the highest entailment score across chunks.
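The chunk-and-take-max step can be sketched as below. To keep the example runnable without the NLI model, the scorer is passed in as a parameter and stubbed with canned values; in the real pipeline you would pass `score_claim` from earlier. The helper name `score_claim_chunked` is an assumption:

```python
def chunk_text(text: str, max_words: int = 200) -> list[str]:
    words = text.split()
    return [
        " ".join(words[i : i + max_words])
        for i in range(0, len(words), max_words)
    ]

def score_claim_chunked(claim: str, source: str, score_fn, max_words: int = 200) -> dict:
    """Score a claim against each chunk and keep the best entailment."""
    best = {"entailment": -1.0}
    for chunk in chunk_text(source, max_words):
        scores = score_fn(claim, chunk)
        if scores["entailment"] > best["entailment"]:
            best = scores
    return best

# Stub scorer: pretends chunks mentioning "1991" entail the claim.
def fake_score(claim: str, chunk: str) -> dict:
    e = 0.95 if "1991" in chunk else 0.05
    return {"entailment": e, "contradiction": 0.01, "neutral": round(0.99 - e, 2)}

doc = "Python is popular. " * 30 + "Python was first released in 1991."
best = score_claim_chunked("Python was released in 1991", doc, fake_score, max_words=20)
print(best["entailment"])  # 0.95
```

Taking the max across chunks means a claim only needs support in one place in the document, which matches how retrieval actually scatters evidence.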
**Claim extraction returns vague or compound claims**
Add explicit instructions to the extraction prompt: “Each claim must contain exactly one verifiable fact. Split compound sentences into separate claims.” You can also add few-shot examples to the system prompt showing the expected granularity.
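A tightened prompt with a one-shot example might look like this. The wording is illustrative, not the article's original prompt:

```python
# Illustrative stricter extraction prompt with a one-shot example.
EXTRACTION_PROMPT = (
    "Extract every factual claim from the text. "
    'Return JSON: {"claims": ["claim1", ...]}. '
    "Each claim must contain exactly one verifiable fact. "
    "Split compound sentences into separate claims.\n\n"
    'Example input: "The tower, completed in 1889, is 330 meters tall."\n'
    'Example output: {"claims": ["The tower was completed in 1889", '
    '"The tower is 330 meters tall"]}'
)

print(EXTRACTION_PROMPT)
```

Swap this string in for the system message in `extract_claims` and spot-check the granularity on a handful of real responses before trusting it.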
**High latency in production**
The NLI model inference is the bottleneck. Two fixes: (1) use ONNX Runtime for 2-3x speedup on CPU, or (2) batch all claim-source pairs into a single forward pass instead of scoring them one at a time.
```bash
pip install optimum[onnxruntime]
```
```python
from optimum.onnxruntime import ORTModelForSequenceClassification

model = ORTModelForSequenceClassification.from_pretrained(
    MODEL_NAME, export=True
)
```
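For fix (2), batching, a sketch might look like the following: all (source, claim) pairs go through the tokenizer and model in one padded forward pass instead of a Python loop. The function name `score_batch` is an assumption, and this is a sketch rather than a drop-in replacement for `score_claim`:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

MODEL_NAME = "cross-encoder/nli-deberta-v3-base"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

LABELS = ["contradiction", "entailment", "neutral"]

def score_batch(pairs: list[tuple[str, str]]) -> list[dict]:
    """Score (source, claim) pairs in a single padded forward pass."""
    sources = [p[0] for p in pairs]
    claims = [p[1] for p in pairs]
    inputs = tokenizer(
        sources, claims, return_tensors="pt",
        padding=True, truncation=True, max_length=512,
    )
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1)
    return [dict(zip(LABELS, row.tolist())) for row in probs]
```

Batch size becomes your latency/memory knob: larger batches amortize overhead but need more RAM or VRAM.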
**Grounding score is too strict for paraphrased answers**
NLI models handle paraphrasing well, but extreme rewording can drop entailment scores. If the LLM summarizes rather than quotes, you might see entailment scores around 0.6-0.7 for claims that are technically correct. Lower your threshold, or add a sentence-similarity pre-filter so NLI scores each claim only against the most relevant passages.
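A pre-filter of that kind can be sketched with plain lexical overlap. This uses Jaccard word overlap purely so the example runs without a model; in production you would likely swap `jaccard` for embedding similarity (e.g. via sentence-transformers). The function names are assumptions:

```python
# Rank source docs by word overlap with the claim, then only run NLI
# on the top-k. Jaccard is a cheap stand-in for embedding similarity.
def jaccard(a: str, b: str) -> float:
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def prefilter_sources(claim: str, docs: list[str], top_k: int = 2) -> list[str]:
    return sorted(docs, key=lambda d: jaccard(claim, d), reverse=True)[:top_k]

docs = [
    "The Eiffel Tower is 330 meters tall and located in Paris.",
    "Pandas is a Python library for data analysis.",
    "Gustave Eiffel's company built the tower in 1889.",
]
top = prefilter_sources("The Eiffel Tower is 330 meters tall", docs, top_k=1)
print(top[0])  # the height doc wins on word overlap
```

Cutting the candidate set from all retrieved docs to the top two or three also reduces the number of NLI forward passes, which helps with the latency problem above.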