An LLM that can’t show its work is an LLM you can’t trust in production. When your RAG app says “the API rate limit is 500 requests per minute,” users need to know where that number came from. Did the model pull it from your docs, or did it hallucinate?
Here’s a working citation pipeline that takes an LLM response and maps each claim back to source chunks using embedding similarity:
```python
import numpy as np
from openai import OpenAI
from sklearn.metrics.pairwise import cosine_similarity

client = OpenAI()


def get_embeddings(texts: list[str]) -> np.ndarray:
    """Get embeddings for a list of texts using OpenAI."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts,
    )
    return np.array([item.embedding for item in response.data])


def attribute_claims(claims: list[str], sources: list[str], threshold: float = 0.45) -> list[dict]:
    """Map each claim to its most relevant source chunk."""
    claim_embeddings = get_embeddings(claims)
    source_embeddings = get_embeddings(sources)
    similarities = cosine_similarity(claim_embeddings, source_embeddings)

    results = []
    for i, claim in enumerate(claims):
        best_idx = int(np.argmax(similarities[i]))
        best_score = float(similarities[i][best_idx])
        results.append({
            "claim": claim,
            "source": sources[best_idx],
            "source_index": best_idx,
            "similarity": round(best_score, 4),
            "attributed": best_score >= threshold,
        })
    return results


# Example
sources = [
    "The free tier allows 100 requests per minute. Paid plans support up to 1000 rpm.",
    "Authentication requires a Bearer token in the Authorization header.",
    "All responses are returned in JSON format with a 'data' field.",
]
claims = [
    "Free tier users can make 100 requests per minute.",
    "You need a Bearer token to authenticate.",
    "The premium plan costs $49 per month.",  # Not in sources
]

for result in attribute_claims(claims, sources):
    status = "ATTRIBUTED" if result["attributed"] else "UNATTRIBUTED"
    print(f"[{status}] {result['claim']}")
    print(f"  -> Source: {result['source'][:80]}... (similarity: {result['similarity']})")
```
That third claim about pricing? It scores low on similarity because no source mentions pricing. That’s the signal you need to flag or remove unsupported claims.
## Why Citations Matter
Without citations, every LLM output is an assertion you’re asking users to take on faith. That’s a problem for three reasons.
**Hallucination detection.** If the model claims something and you can’t trace it back to a source, that claim is suspect. Citation gaps are a direct proxy for hallucination risk. The attribution pipeline above catches exactly this – when a claim scores below your similarity threshold, it’s either hallucinated or poorly grounded.
**User trust.** People trust answers they can verify. A response that says “According to the API documentation [2], the rate limit is 100 rpm” is fundamentally more credible than one that just states the limit. This isn’t about aesthetics – it changes whether users act on your app’s outputs.
**Legal and compliance.** If your app operates in regulated domains (healthcare, finance, legal), you may be required to show provenance for every recommendation. An audit trail of source-to-claim mappings is your defense.
## Inline Citation with RAG
The most user-friendly approach is getting the LLM to cite sources inline using [1], [2] markers as it generates. This requires careful prompting and post-generation validation.
```python
import re

from openai import OpenAI

client = OpenAI()


def generate_with_citations(query: str, sources: list[str]) -> dict:
    """Generate a response with inline citations and validate them."""
    # Build numbered source context
    source_text = ""
    for i, source in enumerate(sources, 1):
        source_text += f"[{i}] {source}\n\n"

    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.1,
        messages=[
            {
                "role": "system",
                "content": (
                    "Answer the question using ONLY the provided sources. "
                    "Cite every factual claim with the source number in brackets, e.g. [1]. "
                    "A single sentence can have multiple citations. "
                    "If no source supports a claim, do not make that claim. "
                    "If you cannot answer from the sources, say so."
                ),
            },
            {
                "role": "user",
                "content": f"Sources:\n{source_text}\nQuestion: {query}",
            },
        ],
    )
    answer = response.choices[0].message.content

    # Validate: extract cited numbers and check they exist
    cited_numbers = set(int(n) for n in re.findall(r"\[(\d+)\]", answer))
    valid_range = set(range(1, len(sources) + 1))
    invalid_citations = cited_numbers - valid_range

    return {
        "answer": answer,
        "cited_sources": cited_numbers,
        "invalid_citations": invalid_citations,
        "all_citations_valid": len(invalid_citations) == 0,
    }


sources = [
    "Python 3.12 introduced per-interpreter GIL as an experimental feature.",
    "The match statement was added in Python 3.10 for structural pattern matching.",
    "Python 3.13 ships with an experimental JIT compiler.",
]

result = generate_with_citations("What are the recent Python features?", sources)
print(result["answer"])
print(f"Valid citations: {result['all_citations_valid']}")
print(f"Sources used: {result['cited_sources']}")
```
The validation step is critical. Models sometimes cite [7] when you only have 3 sources, or cite [0] which doesn’t exist. Always check that every cited number maps to a real source.
## Structured Citation Outputs
Inline [1] markers are fine for display, but if you need to programmatically process citations – build a citation graph, score attribution quality, or feed results into another system – structured outputs are the way to go.
Use OpenAI’s tools parameter to force the model into returning citations as structured JSON:
```python
import json

from openai import OpenAI

client = OpenAI()

citation_tool = {
    "type": "function",
    "function": {
        "name": "provide_cited_answer",
        "description": "Provide an answer with citations mapped to source documents.",
        "parameters": {
            "type": "object",
            "properties": {
                "answer": {
                    "type": "string",
                    "description": "The complete answer to the user's question.",
                },
                "citations": {
                    "type": "array",
                    "items": {
                        "type": "object",
                        "properties": {
                            "claim": {
                                "type": "string",
                                "description": "A specific factual claim from the answer.",
                            },
                            "source_index": {
                                "type": "integer",
                                "description": "The 1-based index of the source that supports this claim.",
                            },
                            "quote": {
                                "type": "string",
                                "description": "The exact quote from the source supporting the claim.",
                            },
                        },
                        "required": ["claim", "source_index", "quote"],
                    },
                    "description": "List of citations mapping claims to sources.",
                },
            },
            "required": ["answer", "citations"],
        },
    },
}


def generate_structured_citations(query: str, sources: list[str]) -> dict:
    """Generate an answer with structured, machine-readable citations."""
    source_text = "\n\n".join(f"[{i}] {s}" for i, s in enumerate(sources, 1))

    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.1,
        messages=[
            {
                "role": "system",
                "content": (
                    "Answer the question using ONLY the provided sources. "
                    "Use the provide_cited_answer function to return your response "
                    "with exact quotes from each source you reference."
                ),
            },
            {
                "role": "user",
                "content": f"Sources:\n{source_text}\n\nQuestion: {query}",
            },
        ],
        tools=[citation_tool],
        tool_choice={"type": "function", "function": {"name": "provide_cited_answer"}},
    )
    tool_call = response.choices[0].message.tool_calls[0]
    return json.loads(tool_call.function.arguments)


sources = [
    "GPT-4o processes text at 128K context window and images up to 2048x2048.",
    "Pricing for GPT-4o is $2.50 per million input tokens and $10 per million output tokens.",
]

result = generate_structured_citations("What are GPT-4o's specs and pricing?", sources)
print(f"Answer: {result['answer']}\n")
for cite in result["citations"]:
    print(f"Claim: {cite['claim']}")
    print(f"  Source [{cite['source_index']}]: \"{cite['quote']}\"")
```
The quote field is the secret weapon here. By asking the model to extract the exact supporting text, you get something you can verify programmatically – check whether that quote actually exists in the source document.
## Verification Pipeline
Citations are only useful if they’re accurate. Here’s a verification pipeline that checks whether cited sources actually support the claims made:
```python
import numpy as np
from openai import OpenAI
from sklearn.metrics.pairwise import cosine_similarity

client = OpenAI()


def verify_citations(citations: list[dict], sources: list[str]) -> list[dict]:
    """Verify that each citation's claimed source actually supports the claim.

    Each citation dict must have: claim, source_index, quote.
    Sources is the original 0-indexed list of source texts.
    """
    verified = []
    for cite in citations:
        source_idx = cite["source_index"] - 1  # Convert to 0-indexed
        checks = {"claim": cite["claim"], "source_index": cite["source_index"]}

        # Check 1: Does the source index exist?
        if source_idx < 0 or source_idx >= len(sources):
            checks["index_valid"] = False
            checks["quote_found"] = False
            checks["semantic_match"] = 0.0
            checks["verdict"] = "INVALID_SOURCE"
            verified.append(checks)
            continue
        checks["index_valid"] = True
        source_text = sources[source_idx]

        # Check 2: Does the quoted text appear in the source?
        quote = cite.get("quote", "")
        checks["quote_found"] = quote.lower() in source_text.lower()

        # Check 3: Semantic similarity between claim and source
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=[cite["claim"], source_text],
        )
        claim_emb = np.array(response.data[0].embedding).reshape(1, -1)
        source_emb = np.array(response.data[1].embedding).reshape(1, -1)
        similarity = float(cosine_similarity(claim_emb, source_emb)[0][0])
        checks["semantic_match"] = round(similarity, 4)

        # Verdict
        if not checks["quote_found"] and checks["semantic_match"] < 0.4:
            checks["verdict"] = "UNSUPPORTED"
        elif not checks["quote_found"]:
            checks["verdict"] = "PARAPHRASED"
        else:
            checks["verdict"] = "VERIFIED"
        verified.append(checks)
    return verified


# Example: verify citations from the structured output
citations = [
    {
        "claim": "GPT-4o has a 128K context window",
        "source_index": 1,
        "quote": "128K context window",
    },
    {
        "claim": "GPT-4o costs $5 per million input tokens",
        "source_index": 2,
        "quote": "$5 per million input tokens",
    },
]
sources = [
    "GPT-4o processes text at 128K context window and images up to 2048x2048.",
    "Pricing for GPT-4o is $2.50 per million input tokens and $10 per million output tokens.",
]

for v in verify_citations(citations, sources):
    print(f"[{v['verdict']}] {v['claim']}")
    print(f"  Quote found: {v['quote_found']}, Semantic match: {v['semantic_match']}")
```
The second citation is interesting – both the claim and quote say “$5,” but the actual source says “$2.50.” The quote_found check fails because “$5 per million” doesn’t appear in the source text, so the citation is demoted from VERIFIED despite high semantic similarity – exactly the kind of result to inspect by hand. Embedding similarity alone won’t catch a swapped number; for numbers and dates, add an explicit extraction-and-comparison step on top of it.
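One way to sketch that numeric check (the `numbers_match` helper here is hypothetical, not part of the pipeline above): extract the numeric tokens from the claim and require every one of them to appear in the source.

```python
import re


def extract_numbers(text: str) -> set[str]:
    """Pull numeric tokens (prices, counts, versions) out of a string."""
    return set(re.findall(r"\d+(?:\.\d+)?", text))


def numbers_match(claim: str, source: str) -> bool:
    """True only if every number in the claim also appears in the source."""
    return extract_numbers(claim) <= extract_numbers(source)


# "$5" does not appear in a source that says "$2.50", so this fails
print(numbers_match(
    "GPT-4o costs $5 per million input tokens",
    "Pricing for GPT-4o is $2.50 per million input tokens.",
))  # prints False
```

Run this alongside the semantic check: a citation whose numbers don’t match is unsupported no matter how high its cosine similarity.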
## Building a Citation Quality Scorer
Combine everything into a single score that tells you how well-cited a response is:
```python
import re


def score_citation_quality(
    answer: str,
    citations: list[dict],
    sources: list[str],
    verification_results: list[dict],
) -> dict:
    """Score the overall citation quality of a response.

    Returns a score from 0.0 to 1.0 with a breakdown.
    """
    # Extract sentences from the answer (rough claim count)
    sentences = [s.strip() for s in re.split(r"[.!?]+", answer) if len(s.strip()) > 10]
    total_claims = max(len(sentences), 1)

    # Coverage: what fraction of claims have any citation?
    cited_claims = len(citations)
    coverage = min(cited_claims / total_claims, 1.0)

    # Accuracy: what fraction of citations are verified?
    if not verification_results:
        accuracy = 0.0
    else:
        verified_count = sum(
            1 for v in verification_results if v["verdict"] in ("VERIFIED", "PARAPHRASED")
        )
        accuracy = verified_count / len(verification_results)

    # Source diversity: are we citing multiple sources or just one?
    unique_sources = len(set(c["source_index"] for c in citations)) if citations else 0
    diversity = unique_sources / max(len(sources), 1)

    # Weighted final score
    score = (0.4 * coverage) + (0.4 * accuracy) + (0.2 * diversity)

    return {
        "overall_score": round(score, 4),
        "coverage": round(coverage, 4),
        "accuracy": round(accuracy, 4),
        "diversity": round(diversity, 4),
        "total_claims": total_claims,
        "cited_claims": cited_claims,
        "verified_citations": sum(
            1 for v in verification_results if v["verdict"] == "VERIFIED"
        ),
        "paraphrased_citations": sum(
            1 for v in verification_results if v["verdict"] == "PARAPHRASED"
        ),
        "unsupported_citations": sum(
            1 for v in verification_results if v["verdict"] == "UNSUPPORTED"
        ),
    }
```
A score above 0.7 generally means the response is well-cited. Below 0.5 means either the model isn’t citing enough, or the citations it provides don’t check out. Track this metric across your test suite and set alerts when it drops.
The three sub-scores tell you different things. Low coverage means the model is making uncited claims – tighten your system prompt to require citations on every factual statement. Low accuracy means the model is citing sources that don’t actually support its claims – this is a hallucination signal. Low diversity means the model is over-relying on one source and ignoring others – adjust your retrieval to surface more diverse chunks.
## Common Errors and Fixes
**cosine_similarity returns all zeros or NaN.** Your embeddings are probably empty or malformed. Check that the OpenAI API actually returned embedding vectors by printing len(response.data[0].embedding) – it should be 1536 for text-embedding-3-small. If the input text is empty or whitespace-only, the API returns a zero vector.
**Model cites [0] or numbers outside the source range.** This happens constantly. Always validate citation indices against your actual source list before rendering them to users. Strip or flag any citation where source_index < 1 or source_index > len(sources).
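That filter is a one-liner; a hypothetical `valid_citations` helper might look like:

```python
def valid_citations(citations: list[dict], sources: list[str]) -> list[dict]:
    """Drop any citation whose 1-based source_index falls outside the source list."""
    return [c for c in citations if 1 <= c.get("source_index", 0) <= len(sources)]
```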
**Structured output returns malformed JSON.** Even with tool_choice set to force a specific function, the model occasionally produces JSON that doesn’t parse. Wrap json.loads() in a try/except and retry once with temperature=0. If it fails twice, fall back to the inline citation approach.
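A minimal retry wrapper, sketched with a hypothetical `call_model` callable standing in for the chat completion request:

```python
import json


def parse_tool_arguments(call_model, max_retries: int = 1):
    """Parse the tool call's JSON arguments, retrying on malformed output.

    `call_model` is any zero-argument callable that returns the raw
    arguments string. Returns None once retries are exhausted so the
    caller can fall back to inline citations.
    """
    for _ in range(max_retries + 1):
        raw = call_model()
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            continue
    return None
```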
**Quote verification has too many false negatives.** Exact string matching is brittle – the model might slightly rephrase the quote or change capitalization. Use case-insensitive comparison (as shown in the code above) and consider fuzzy matching with difflib.SequenceMatcher for a similarity ratio above 0.8 instead of exact containment.
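One way to sketch that fallback (the `quote_found_fuzzy` name is ours): try containment first, then slide a quote-sized window across the source and take the best SequenceMatcher ratio.

```python
from difflib import SequenceMatcher


def quote_found_fuzzy(quote: str, source: str, min_ratio: float = 0.8) -> bool:
    """Containment check first, then a fuzzy ratio over quote-sized windows."""
    quote, source = quote.lower(), source.lower()
    if quote in source:
        return True
    # Compare the quote against every quote-sized span of the source
    window = len(quote)
    best = max(
        (SequenceMatcher(None, quote, source[i:i + window]).ratio()
         for i in range(max(len(source) - window + 1, 1))),
        default=0.0,
    )
    return best >= min_ratio
```

This drops in as a replacement for the `quote.lower() in source_text.lower()` line in verify_citations.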
**Embedding similarity scores are all clustered around 0.7-0.8.** This is normal for text-embedding-3-small when comparing related texts. Your threshold needs tuning on your specific domain. Build a small labeled dataset of 20-30 claim-source pairs with known “supported” and “unsupported” labels, compute similarities, and find the threshold that best separates the two groups. It’s usually between 0.4 and 0.55.
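The sweep itself is short; the labeled pairs below are illustrative, not real measurements:

```python
def best_threshold(pairs: list[tuple[float, bool]]) -> float:
    """Find the cutoff that maximizes accuracy over (similarity, supported) pairs."""
    best_t, best_acc = 0.5, 0.0
    for t in sorted({sim for sim, _ in pairs}):
        # A pair is classified correctly when (sim >= t) agrees with its label
        acc = sum((sim >= t) == supported for sim, supported in pairs) / len(pairs)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t


# Illustrative labeled pairs: (cosine similarity, human-labeled "supported")
labeled = [(0.82, True), (0.74, True), (0.69, True),
           (0.52, False), (0.41, False), (0.33, False)]
print(best_threshold(labeled))  # prints 0.69
```

With a real 20-30 pair dataset you would run the same sweep and hold the resulting threshold fixed in production.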
**Citation pipeline is too slow for real-time use.** Each verification call to the embedding API adds latency. Batch your embedding requests – send all claims and sources in a single client.embeddings.create() call instead of one per citation. The code in the first section already does this. For the verification pipeline, refactor to batch all claims and sources together before computing similarities.
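A batched refactor of the similarity check might look like this, with `embed` standing in for any batch embedding function (e.g. the get_embeddings helper from the first section):

```python
import numpy as np


def batch_verify_similarity(citations: list[dict], sources: list[str], embed) -> list[float]:
    """Compute each claim's similarity to its cited source in one embedding call.

    `embed` maps a list of texts to an (n, d) numpy array.
    """
    claims = [c["claim"] for c in citations]
    embs = embed(claims + sources)  # one API round trip instead of one per citation
    # Normalize rows, then a single matrix product gives all cosine similarities
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    sims = embs[:len(claims)] @ embs[len(claims):].T
    # Index into the matrix at each citation's claimed (1-based) source
    return [float(sims[i][c["source_index"] - 1]) for i, c in enumerate(citations)]
```

The verdict logic from verify_citations then runs over these precomputed scores instead of making a fresh API call per citation.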