## The Core Idea: Verify Every Claim Against Sources
LLMs generate fluent text that sounds authoritative even when it’s wrong. If you’re building a RAG app, you already have source documents. The missing piece is a verification layer that checks whether the LLM’s output is actually supported by those sources.
The approach breaks down into three steps:
- Extract claims from the LLM response
- Score each claim against source documents using an NLI model
- Filter or flag responses that contain unsupported claims
This gives you a grounding score per claim and per response, plus a clear audit trail of what’s supported and what isn’t.
Before you can verify anything, you need to break the LLM’s response into individual, verifiable claims. A single paragraph might contain five distinct factual assertions. You need each one isolated.
Use the LLM itself for claim extraction – it’s genuinely good at this task:
```python
from openai import OpenAI
import json

client = OpenAI()

def extract_claims(text: str) -> list[str]:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        response_format={"type": "json_object"},
        messages=[
            {
                "role": "system",
                "content": (
                    "Extract every factual claim from the text. "
                    "Return JSON: {\"claims\": [\"claim1\", \"claim2\", ...]}. "
                    "Each claim should be a single, self-contained statement. "
                    "Skip opinions, hedged statements, and filler."
                ),
            },
            {"role": "user", "content": text},
        ],
    )
    result = json.loads(response.choices[0].message.content)
    return result.get("claims", [])

# Example
llm_output = (
    "Python was created by Guido van Rossum and first released in 1991. "
    "It uses dynamic typing and garbage collection. "
    "Python 3.12 introduced the new type statement for type aliases."
)

claims = extract_claims(llm_output)
for claim in claims:
    print(f"- {claim}")
```
Output:
```text
- Python was created by Guido van Rossum
- Python was first released in 1991
- Python uses dynamic typing
- Python uses garbage collection
- Python 3.12 introduced the new type statement for type aliases
```
Each claim is now a standalone sentence you can verify independently.
## Scoring Claims with NLI
Natural Language Inference classifies the relationship between a premise (your source document) and a hypothesis (the extracted claim) into three labels: entailment (supported), contradiction (refuted), or neutral (neither supported nor refuted).
The cross-encoder/nli-deberta-v3-base model is a strong open-source option for this task: it’s fast, accurate, and runs on CPU for moderate workloads.
```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

MODEL_NAME = "cross-encoder/nli-deberta-v3-base"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

LABELS = ["contradiction", "entailment", "neutral"]

def score_claim(claim: str, source: str) -> dict:
    """Score a single claim against a source passage using NLI."""
    inputs = tokenizer(
        source, claim, return_tensors="pt", truncation=True, max_length=512
    )
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1).squeeze().tolist()
    scores = {label: round(prob, 4) for label, prob in zip(LABELS, probs)}
    scores["verdict"] = LABELS[probs.index(max(probs))]
    return scores

# Verify a claim against a source document
source_doc = (
    "Python is a high-level programming language created by Guido van Rossum. "
    "It was first released in 1991. Python supports multiple programming paradigms "
    "including procedural, object-oriented, and functional programming."
)

result = score_claim("Python was created by Guido van Rossum", source_doc)
print(result)
# {'contradiction': 0.0012, 'entailment': 0.9953, 'neutral': 0.0035, 'verdict': 'entailment'}
```
The model gives you probability scores across all three labels. A high entailment score means the source directly supports the claim. A high contradiction score means the source actively disputes it. Neutral means the source doesn’t address the claim at all – which is its own kind of red flag in a RAG context.
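One way to act on those three probabilities is a small triage helper. This is an illustrative sketch, not part of the pipeline above; the cutoff values (0.8, 0.5) are assumptions you would tune, and it consumes the score dict shape that `score_claim` returns:

```python
# Map an NLI score dict to a triage decision. The thresholds are
# illustrative assumptions, not tuned values.
def triage(scores: dict) -> str:
    if scores["entailment"] >= 0.8:
        return "accept"  # source directly supports the claim
    if scores["contradiction"] >= 0.5:
        return "reject"  # source actively disputes the claim
    return "flag"        # mostly neutral: source doesn't address it

print(triage({"contradiction": 0.001, "entailment": 0.995, "neutral": 0.004}))  # accept
print(triage({"contradiction": 0.7, "entailment": 0.1, "neutral": 0.2}))        # reject
print(triage({"contradiction": 0.05, "entailment": 0.15, "neutral": 0.8}))      # flag
```

Treating "flag" separately from "reject" matters: a neutral claim may just need better retrieval, while a contradicted claim should never reach the user.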
## Building the Full Grounding Pipeline
Now combine claim extraction and NLI scoring into a single pipeline that takes an LLM response plus its source documents and returns a grounding report:
```python
from dataclasses import dataclass

@dataclass
class ClaimResult:
    claim: str
    best_source: str
    entailment: float
    contradiction: float
    neutral: float
    verdict: str

@dataclass
class GroundingReport:
    claims: list[ClaimResult]
    grounding_score: float
    passed: bool

def check_grounding(
    llm_response: str,
    source_docs: list[str],
    threshold: float = 0.7,
) -> GroundingReport:
    """Run full grounding check on an LLM response against source documents."""
    claims = extract_claims(llm_response)
    results = []

    for claim in claims:
        best_score = None
        best_source = ""
        for doc in source_docs:
            scores = score_claim(claim, doc)
            if best_score is None or scores["entailment"] > best_score["entailment"]:
                best_score = scores
                best_source = doc[:100] + "..."
        results.append(
            ClaimResult(
                claim=claim,
                best_source=best_source,
                entailment=best_score["entailment"],
                contradiction=best_score["contradiction"],
                neutral=best_score["neutral"],
                verdict=best_score["verdict"],
            )
        )

    supported_count = sum(1 for r in results if r.verdict == "entailment")
    grounding_score = supported_count / len(results) if results else 0.0

    return GroundingReport(
        claims=results,
        grounding_score=round(grounding_score, 3),
        passed=grounding_score >= threshold,
    )
```
The threshold parameter controls how strict the check is. At 0.7, you require 70% of claims to be directly supported by source documents. For high-stakes applications (medical, legal, financial), push this to 0.9 or higher.
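To make the aggregation concrete, here is the same score computation in isolation, using a hypothetical list of per-claim verdicts (8 of 10 entailed):

```python
# Hypothetical verdicts for one response: 8 of 10 claims entailed.
verdicts = ["entailment"] * 8 + ["neutral", "contradiction"]

# This mirrors the aggregation inside check_grounding.
grounding_score = sum(v == "entailment" for v in verdicts) / len(verdicts)

print(grounding_score)         # 0.8
print(grounding_score >= 0.7)  # True  -- passes an internal-tool bar
print(grounding_score >= 0.9)  # False -- fails a medical/legal bar
```

The same response passes or fails depending only on where you set the bar, which is why the threshold belongs in configuration rather than code.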
## Integrating Grounding Checks into a RAG Pipeline
Here’s where this becomes practical. In a RAG app, you already retrieve context before generating. Add the grounding check as a post-generation step:
```python
def rag_with_grounding(query: str, retrieved_docs: list[str]) -> dict:
    """RAG pipeline with grounding verification."""
    context = "\n\n---\n\n".join(retrieved_docs)

    # Generate response
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": (
                    "Answer the question using only the provided context. "
                    "If the context doesn't contain enough information, say so."
                ),
            },
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {query}",
            },
        ],
    )
    answer = response.choices[0].message.content

    # Verify grounding
    report = check_grounding(answer, retrieved_docs, threshold=0.7)

    if not report.passed:
        flagged_claims = [
            r.claim for r in report.claims if r.verdict != "entailment"
        ]
        return {
            "answer": answer,
            "grounded": False,
            "grounding_score": report.grounding_score,
            "flagged_claims": flagged_claims,
            "action": "Response contains ungrounded claims. Review before showing to user.",
        }

    return {
        "answer": answer,
        "grounded": True,
        "grounding_score": report.grounding_score,
    }

# Usage
docs = [
    "The Eiffel Tower is 330 meters tall and located in Paris, France. "
    "It was completed in 1889 for the World's Fair.",
    "The tower receives about 7 million visitors per year. "
    "Gustave Eiffel's company designed and built the structure.",
]

result = rag_with_grounding("How tall is the Eiffel Tower?", docs)
print(f"Grounded: {result['grounded']}")
print(f"Score: {result['grounding_score']}")
```
When the grounding check fails, you have options: return a warning to the user, regenerate with stricter instructions, or fall back to an “I don’t have enough information” response. Pick the strategy that matches your risk tolerance.
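Those three strategies can be sketched as a single dispatcher over the dict that `rag_with_grounding` returns. This is an illustrative sketch: the function name and message wording are made up, and the regenerate branch is stubbed rather than wired to an LLM client:

```python
# Sketch of the three recovery strategies. The "regenerate" branch would
# re-call the LLM with stricter instructions; here it's left as a stub.
def handle_failed_grounding(result: dict, strategy: str = "fallback") -> str:
    if strategy == "warn":
        flagged = "; ".join(result["flagged_claims"])
        return f"{result['answer']}\n\n[Unverified claims: {flagged}]"
    if strategy == "regenerate":
        raise NotImplementedError(
            "re-prompt with e.g. 'Only state facts you can quote from the context.'"
        )
    # Default: refuse rather than risk showing an ungrounded answer.
    return "I don't have enough information in my sources to answer that."

failed = {
    "answer": "The tower is 350 meters tall.",
    "flagged_claims": ["The tower is 350 meters tall"],
}
print(handle_failed_grounding(failed, "warn"))
print(handle_failed_grounding(failed))
```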
## Tuning Thresholds for Your Use Case
Not every application needs the same grounding strictness. Here’s a practical breakdown:
| Use Case | Threshold | Rationale |
|---|---|---|
| Customer support bot | 0.6 | Some paraphrasing is fine |
| Internal knowledge base | 0.7 | Good balance of accuracy and coverage |
| Medical/legal Q&A | 0.9 | Nearly every claim must be sourced |
| Research assistant | 0.5 | Exploratory answers are acceptable |
You can also apply per-claim thresholds instead of aggregate scores. Flag any individual claim with entailment below 0.5, even if the overall score is high:
```python
def flag_weak_claims(report: GroundingReport, min_entailment: float = 0.5) -> list[str]:
    """Find individual claims with low entailment scores."""
    return [
        f"[{r.entailment:.2f}] {r.claim}"
        for r in report.claims
        if r.entailment < min_entailment
    ]
```
This catches the case where nine out of ten claims are grounded but one is completely fabricated – the overall score looks fine, but that single bad claim could cause real harm.
## Common Errors and Fixes
**RuntimeError: CUDA out of memory when loading the NLI model**
The DeBERTa model runs fine on CPU for low-to-medium throughput. Force CPU explicitly:
```python
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME).to("cpu")
```
For high throughput, batch your inputs and use a GPU with at least 4GB VRAM.
**NLI model gives “neutral” for everything**
This usually means your source text is too long and gets truncated at 512 tokens. Split source documents into paragraph-level chunks before scoring:
```python
def chunk_text(text: str, max_words: int = 200) -> list[str]:
    words = text.split()
    return [
        " ".join(words[i : i + max_words])
        for i in range(0, len(words), max_words)
    ]
```
Score each chunk separately and take the highest entailment score across chunks.
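The chunk-and-take-max step can be sketched as below. To keep the example runnable without the NLI model, the scorer is passed in as a parameter and stubbed with canned values; in the real pipeline you would pass `score_claim` from earlier. The helper name `score_claim_chunked` is an assumption:

```python
def chunk_text(text: str, max_words: int = 200) -> list[str]:
    words = text.split()
    return [
        " ".join(words[i : i + max_words])
        for i in range(0, len(words), max_words)
    ]

def score_claim_chunked(claim: str, source: str, score_fn, max_words: int = 200) -> dict:
    """Score a claim against each chunk and keep the best entailment."""
    best = {"entailment": -1.0}
    for chunk in chunk_text(source, max_words):
        scores = score_fn(claim, chunk)
        if scores["entailment"] > best["entailment"]:
            best = scores
    return best

# Stub scorer: pretends chunks mentioning "1991" entail the claim.
def fake_score(claim: str, chunk: str) -> dict:
    e = 0.95 if "1991" in chunk else 0.05
    return {"entailment": e, "contradiction": 0.01, "neutral": round(0.99 - e, 2)}

doc = "Python is popular. " * 30 + "Python was first released in 1991."
best = score_claim_chunked("Python was released in 1991", doc, fake_score, max_words=20)
print(best["entailment"])  # 0.95
```

Taking the max across chunks means a claim only needs support in one place in the document, which matches how retrieval actually scatters evidence.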
**Claim extraction returns vague or compound claims**
Add explicit instructions to the extraction prompt: “Each claim must contain exactly one verifiable fact. Split compound sentences into separate claims.” You can also add few-shot examples to the system prompt showing the expected granularity.
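A tightened prompt with a one-shot example might look like this. The wording is illustrative, not the article's original prompt:

```python
# Illustrative stricter extraction prompt with a one-shot example.
EXTRACTION_PROMPT = (
    "Extract every factual claim from the text. "
    'Return JSON: {"claims": ["claim1", ...]}. '
    "Each claim must contain exactly one verifiable fact. "
    "Split compound sentences into separate claims.\n\n"
    'Example input: "The tower, completed in 1889, is 330 meters tall."\n'
    'Example output: {"claims": ["The tower was completed in 1889", '
    '"The tower is 330 meters tall"]}'
)

print(EXTRACTION_PROMPT)
```

Swap this string in for the system message in `extract_claims` and spot-check the granularity on a handful of real responses before trusting it.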
**High latency in production**
The NLI model inference is the bottleneck. Two fixes: (1) use ONNX Runtime for 2-3x speedup on CPU, or (2) batch all claim-source pairs into a single forward pass instead of scoring them one at a time.
```bash
pip install optimum[onnxruntime]
```
```python
from optimum.onnxruntime import ORTModelForSequenceClassification

model = ORTModelForSequenceClassification.from_pretrained(
    MODEL_NAME, export=True
)
```
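For fix (2), batching, a sketch might look like the following: all (source, claim) pairs go through the tokenizer and model in one padded forward pass instead of a Python loop. The function name `score_batch` is an assumption, and this is a sketch rather than a drop-in replacement for `score_claim`:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

MODEL_NAME = "cross-encoder/nli-deberta-v3-base"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

LABELS = ["contradiction", "entailment", "neutral"]

def score_batch(pairs: list[tuple[str, str]]) -> list[dict]:
    """Score (source, claim) pairs in a single padded forward pass."""
    sources = [p[0] for p in pairs]
    claims = [p[1] for p in pairs]
    inputs = tokenizer(
        sources, claims, return_tensors="pt",
        padding=True, truncation=True, max_length=512,
    )
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1)
    return [dict(zip(LABELS, row.tolist())) for row in probs]
```

Batch size becomes your latency/memory knob: larger batches amortize overhead but need more RAM or VRAM.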
**Grounding score is too strict for paraphrased answers**
NLI models handle paraphrasing well, but extreme rewording can drop entailment scores. If the LLM summarizes rather than quotes, you might see entailment scores around 0.6-0.7 for claims that are technically correct. Lower your threshold, or add a sentence-similarity pre-filter so NLI scores each claim only against the most relevant passages.
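A pre-filter of that kind can be sketched with plain lexical overlap. This uses Jaccard word overlap purely so the example runs without a model; in production you would likely swap `jaccard` for embedding similarity (e.g. via sentence-transformers). The function names are assumptions:

```python
# Rank source docs by word overlap with the claim, then only run NLI
# on the top-k. Jaccard is a cheap stand-in for embedding similarity.
def jaccard(a: str, b: str) -> float:
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def prefilter_sources(claim: str, docs: list[str], top_k: int = 2) -> list[str]:
    return sorted(docs, key=lambda d: jaccard(claim, d), reverse=True)[:top_k]

docs = [
    "The Eiffel Tower is 330 meters tall and located in Paris.",
    "Pandas is a Python library for data analysis.",
    "Gustave Eiffel's company built the tower in 1889.",
]
top = prefilter_sources("The Eiffel Tower is 330 meters tall", docs, top_k=1)
print(top[0])  # the height doc wins on word overlap
```

Cutting the candidate set from all retrieved docs to the top two or three also reduces the number of NLI forward passes, which helps with the latency problem above.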