Most RAG tutorials stop at “stuff your documents into the prompt.” That gets you 60% of the way. The other 40% — source tracking, citation enforcement, and grounding validation — is what separates a demo from a production system. Here’s the pattern I use for every retrieval-augmented prompt I build.
The core idea: wrap each retrieved chunk in labeled delimiters, tell the model to cite sources by label, then verify the output actually references the provided context.
```python
from openai import OpenAI

client = OpenAI()

# Simulated retrieved chunks with source metadata
chunks = [
    {
        "text": "Transformer models use self-attention mechanisms to weigh the importance of different tokens in a sequence. The attention score is computed as softmax(QK^T / sqrt(d_k))V.",
        "source": "Vaswani et al., Attention Is All You Need, 2017",
        "doc_id": "doc_001",
    },
    {
        "text": "Retrieval-augmented generation combines a retriever component that fetches relevant documents with a generator that produces answers conditioned on those documents. This reduces hallucination rates by 30-50% compared to closed-book generation.",
        "source": "Lewis et al., RAG: Retrieval-Augmented Generation, 2020",
        "doc_id": "doc_002",
    },
    {
        "text": "Prompt injection attacks can cause LLMs to ignore their system instructions. Grounding the model's output in retrieved context limits the attack surface by constraining what the model can reference.",
        "source": "OWASP LLM Top 10, 2024",
        "doc_id": "doc_003",
    },
]


def build_grounded_prompt(query: str, chunks: list[dict]) -> list[dict]:
    """Build a prompt with labeled context chunks and citation instructions."""
    context_block = ""
    for i, chunk in enumerate(chunks):
        label = f"[SOURCE_{i+1}]"
        context_block += f"{label}\n"
        context_block += f"Document: {chunk['source']}\n"
        context_block += f"Content: {chunk['text']}\n"
        context_block += f"[/SOURCE_{i+1}]\n\n"

    system_message = """You are a research assistant. Answer the user's question using ONLY the provided context sources. Follow these rules strictly:
1. Only use information found in the SOURCE blocks below.
2. Cite every claim with the source label, e.g. [SOURCE_1].
3. If the context does not contain enough information to answer, say "I cannot answer this based on the provided sources."
4. Never fabricate information beyond what the sources state."""

    user_message = f"""CONTEXT:
{context_block}
QUESTION: {query}

Answer with inline citations:"""

    return [
        {"role": "system", "content": system_message},
        {"role": "user", "content": user_message},
    ]


messages = build_grounded_prompt(
    "How does retrieval-augmented generation reduce hallucinations?",
    chunks,
)
response = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    temperature=0.1,
)
print(response.choices[0].message.content)
```
That prints something like: “Retrieval-augmented generation combines a retriever that fetches relevant documents with a generator conditioned on those documents, reducing hallucination rates by 30-50% compared to closed-book generation [SOURCE_2]. Additionally, grounding model output in retrieved context constrains what the model can reference [SOURCE_3].”
The key details: low temperature (0.1) keeps the model close to the source text, labeled delimiters make citations unambiguous, and the system prompt explicitly forbids fabrication.
## Structuring Context Chunks with Delimiters
The delimiter pattern matters more than you’d think. I’ve tested several formats, and labeled XML-style tags ([SOURCE_N]...[/SOURCE_N]) outperform plain numbered lists or markdown headers for citation accuracy. The model treats them as distinct blocks and references them consistently.
A few rules for chunk formatting:
- Keep chunks between 100 and 500 tokens. Shorter chunks give more precise citations; longer chunks dilute the signal.
- Include metadata inside the delimiter. Document title, author, date — whatever helps the model disambiguate similar content.
- Order chunks by relevance. Put the most relevant chunk first. Models pay more attention to content that appears earlier in the context window.
- Cap at 5-10 chunks per prompt. Beyond that, citation accuracy drops. If you need more context, re-rank and trim.
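The re-rank-and-trim step can be sketched as a small helper. This assumes your retriever attaches a numeric `score` to each chunk — the field name and the helper itself are illustrative, not part of the example above:

```python
def rerank_and_trim(chunks: list[dict], max_chunks: int = 8) -> list[dict]:
    """Keep only the most relevant chunks, most relevant first."""
    # Sort by retrieval score, descending; chunks without a score sink to the end
    ranked = sorted(chunks, key=lambda c: c.get("score", 0.0), reverse=True)
    return ranked[:max_chunks]
```

Run this before building the prompt, so the highest-scoring chunk ends up as [SOURCE_1] at the front of the context window.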
```python
def format_chunks_with_metadata(chunks: list[dict], max_chunks: int = 8) -> str:
    """Format retrieved chunks with labeled delimiters and metadata."""
    trimmed = chunks[:max_chunks]
    blocks = []
    for i, chunk in enumerate(trimmed):
        label = f"SOURCE_{i+1}"
        block = f"[{label}]\n"
        block += f"Title: {chunk.get('title', 'Untitled')}\n"
        block += f"Author: {chunk.get('author', 'Unknown')}\n"
        block += f"Date: {chunk.get('date', 'N/A')}\n"
        block += f"Content: {chunk['text']}\n"
        block += f"[/{label}]"
        blocks.append(block)
    return "\n\n".join(blocks)
```
## Enforcing Citation Tracking in the System Prompt
The system prompt is where you set the grounding rules. Vague instructions like “cite your sources” produce inconsistent results. Be specific about the citation format, when to cite, and what to do when context is insufficient.
Here’s the system prompt template I recommend:
```python
GROUNDED_SYSTEM_PROMPT = """You are a technical assistant that answers questions using ONLY the provided source documents.

CITATION RULES:
- Every factual claim MUST include a citation in the format [SOURCE_N].
- A single sentence can have multiple citations if it draws from multiple sources.
- Direct quotes must be wrapped in quotation marks with the citation immediately after.
- If you cannot find relevant information in any source, respond with: "The provided sources do not contain information to answer this question."

PROHIBITED:
- Do not use prior knowledge. Only reference the provided SOURCE blocks.
- Do not paraphrase sources in a way that changes their meaning.
- Do not combine information from multiple sources to infer something none of them explicitly state."""
```
The “PROHIBITED” section is critical. Without it, the model will happily synthesize claims that no single source supports. That synthesis is often correct, but it’s not grounded — and in a production system, you need to know exactly where each claim came from.
## Handling Multi-Source Answers
When an answer draws from multiple chunks, you want the model to cite each one inline rather than dumping all citations at the end. The instruction “A single sentence can have multiple citations” handles this:
```python
# Example of what the model should produce:
# "Transformers use self-attention to weigh token importance [SOURCE_1],
# and when combined with retrieval, they reduce hallucination rates
# by 30-50% [SOURCE_2]."
```
## Validating Grounding in the Response
Getting citations in the output is step one. Step two is verifying those citations actually match the source content. Here’s a grounding validator that checks whether cited claims appear in the referenced chunks.
```python
import re


def extract_citations(response_text: str) -> list[str]:
    """Pull all [SOURCE_N] citations from the model's response."""
    return re.findall(r"\[SOURCE_(\d+)\]", response_text)


def validate_grounding(
    response_text: str,
    chunks: list[dict],
    similarity_threshold: float = 0.3,
) -> dict:
    """Check if cited sources actually support the claims made.

    Returns a grounding report with pass/fail per citation.
    """
    cited_indices = extract_citations(response_text)
    unique_indices = set(cited_indices)

    report = {
        "total_citations": len(cited_indices),
        "unique_sources_cited": len(unique_indices),
        "total_sources_available": len(chunks),
        "uncited_sources": [],
        "invalid_citations": [],
        "grounding_checks": [],
    }

    # Check for citations referencing non-existent sources
    for idx_str in unique_indices:
        idx = int(idx_str) - 1  # Convert to 0-based
        if idx < 0 or idx >= len(chunks):
            report["invalid_citations"].append(f"SOURCE_{idx_str}")

    # Check which sources were never cited
    cited_set = {int(i) - 1 for i in unique_indices}
    for i in range(len(chunks)):
        if i not in cited_set:
            report["uncited_sources"].append(f"SOURCE_{i+1}")

    # Split response into sentences and check each cited sentence
    sentences = re.split(r"(?<=[.!?])\s+", response_text)
    for sentence in sentences:
        sentence_citations = re.findall(r"\[SOURCE_(\d+)\]", sentence)
        if not sentence_citations:
            continue
        clean_sentence = re.sub(r"\[SOURCE_\d+\]", "", sentence).strip().lower()
        words_in_sentence = set(clean_sentence.split())
        for idx_str in sentence_citations:
            idx = int(idx_str) - 1
            if idx < 0 or idx >= len(chunks):
                continue
            source_text = chunks[idx]["text"].lower()
            source_words = set(source_text.split())
            # Word overlap as a simple grounding signal
            overlap = words_in_sentence & source_words
            overlap_ratio = len(overlap) / max(len(words_in_sentence), 1)
            report["grounding_checks"].append({
                "sentence": sentence[:100],
                "cited_source": f"SOURCE_{idx_str}",
                "word_overlap_ratio": round(overlap_ratio, 2),
                "grounded": overlap_ratio >= similarity_threshold,
            })

    return report


# Run the validator on our earlier response
sample_response = (
    "Retrieval-augmented generation combines a retriever that fetches "
    "relevant documents with a generator conditioned on those documents, "
    "reducing hallucination rates by 30-50% compared to closed-book "
    "generation [SOURCE_2]. Grounding model output in retrieved context "
    "constrains what the model can reference, limiting prompt injection "
    "attack surface [SOURCE_3]."
)

grounding_report = validate_grounding(sample_response, chunks)
print(f"Total citations: {grounding_report['total_citations']}")
print(f"Invalid citations: {grounding_report['invalid_citations']}")
for check in grounding_report["grounding_checks"]:
    status = "PASS" if check["grounded"] else "FAIL"
    print(f"  [{status}] {check['cited_source']} — overlap: {check['word_overlap_ratio']}")
```
This word-overlap approach is fast and catches obvious fabrications. For production systems, swap in embedding-based similarity (cosine similarity between the sentence embedding and the source chunk embedding) for better accuracy. But word overlap gets you surprisingly far as a first pass.
## Putting It All Together
Here’s the full pipeline — retrieve, build the prompt, query the model, validate grounding:
```python
from openai import OpenAI

client = OpenAI()


def retrieval_augmented_query(
    query: str,
    chunks: list[dict],
    model: str = "gpt-4o",
    max_chunks: int = 8,
) -> dict:
    """End-to-end grounded RAG query with citation validation."""
    trimmed_chunks = chunks[:max_chunks]
    messages = build_grounded_prompt(query, trimmed_chunks)

    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0.1,
        max_tokens=1024,
    )
    answer = response.choices[0].message.content

    grounding = validate_grounding(answer, trimmed_chunks)
    failed_checks = [
        c for c in grounding["grounding_checks"] if not c["grounded"]
    ]

    return {
        "answer": answer,
        "model": model,
        "sources_used": grounding["unique_sources_cited"],
        "total_citations": grounding["total_citations"],
        "grounding_failures": len(failed_checks),
        "all_grounded": len(failed_checks) == 0,
        "report": grounding,
    }


result = retrieval_augmented_query(
    "What is the relationship between retrieval-augmented generation and hallucination reduction?",
    chunks,
)
print(f"Answer: {result['answer']}\n")
print(f"Sources cited: {result['sources_used']}")
print(f"All claims grounded: {result['all_grounded']}")
if not result["all_grounded"]:
    print(f"Grounding failures: {result['grounding_failures']}")
```
If `all_grounded` comes back `False`, you have a few options: retry with a stricter system prompt, increase the similarity threshold to catch borderline cases, or flag the response for human review.
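Those options can be wired into a small triage helper that maps a grounding report to an action. The policy and the `triage_response` name below are an illustrative sketch, not a library API — tune the thresholds for your own system:

```python
def triage_response(report: dict, max_failure_ratio: float = 0.0) -> str:
    """Map a grounding report to an action: 'accept', 'retry', or 'review'."""
    if report["invalid_citations"]:
        return "retry"  # cited a source that does not exist
    checks = report["grounding_checks"]
    if not checks:
        return "review"  # no citations at all — needs a human look
    failures = sum(1 for c in checks if not c["grounded"])
    if failures / len(checks) <= max_failure_ratio:
        return "accept"
    # Some claims grounded, some not: retry with a stricter prompt first;
    # if every check failed, escalate straight to human review
    return "retry" if failures < len(checks) else "review"
```

On a `"retry"`, regenerate with a tightened system prompt (or lower temperature) before giving up and escalating.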
## Common Errors and Fixes
**`openai.AuthenticationError: Incorrect API key provided`**

Your `OPENAI_API_KEY` environment variable is missing or wrong. Set it before running:
```bash
export OPENAI_API_KEY="sk-..."
```
**Model ignores citation instructions and responds without `[SOURCE_N]` tags**

This happens most often with smaller models or high temperatures. Fix it by dropping temperature to 0.0-0.2 and adding a one-shot example in the prompt showing the expected citation format. If you're using `gpt-3.5-turbo`, switch to `gpt-4o-mini` at minimum for reliable instruction following.
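A one-shot example can be spliced in as an extra user/assistant turn between the system message and the real question. This is a minimal sketch — the example content and the `with_citation_example` helper are invented for illustration:

```python
# A tiny worked example demonstrating the expected [SOURCE_N] citation format
CITATION_EXAMPLE = [
    {
        "role": "user",
        "content": (
            "CONTEXT:\n[SOURCE_1]\nContent: Water boils at 100 degrees Celsius "
            "at sea level.\n[/SOURCE_1]\n\nQUESTION: At what temperature does "
            "water boil?\nAnswer with inline citations:"
        ),
    },
    {
        "role": "assistant",
        "content": "Water boils at 100 degrees Celsius at sea level [SOURCE_1].",
    },
]


def with_citation_example(messages: list[dict]) -> list[dict]:
    """Insert the one-shot citation example right after the system message."""
    return [messages[0], *CITATION_EXAMPLE, *messages[1:]]
```

Wrap the output of `build_grounded_prompt` with this before sending it to the model; the assistant turn shows the model exactly what a correctly cited answer looks like.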
**`KeyError` when accessing chunk metadata fields**

Use `.get()` with defaults instead of direct key access. The `format_chunks_with_metadata` function above handles this with `chunk.get('title', 'Untitled')`.
**Grounding validator reports false positives (marks grounded claims as FAIL)**

The word-overlap method breaks down when the model heavily paraphrases a source. Lower the `similarity_threshold` from 0.3 to 0.2, or switch to embedding-based similarity:
```python
def embedding_similarity(text_a: str, text_b: str) -> float:
    """Compute cosine similarity between two texts using OpenAI embeddings."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=[text_a, text_b],
    )
    vec_a = response.data[0].embedding
    vec_b = response.data[1].embedding
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = sum(a * a for a in vec_a) ** 0.5
    norm_b = sum(b * b for b in vec_b) ** 0.5
    return dot / (norm_a * norm_b)
```
**Response says "I cannot answer this" even though context contains the answer**
Your chunks are probably too long and the relevant detail is buried. Split chunks to 100-300 tokens and re-rank by relevance before injection. Shorter, more targeted chunks give better results than long passages.
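A quick way to do that split — approximating tokens by whitespace words, with an overlap so details at chunk boundaries are not lost. The `split_chunk` helper and its defaults are illustrative, not from a library:

```python
def split_chunk(text: str, max_words: int = 200, overlap: int = 20) -> list[str]:
    """Split a long chunk into overlapping word-window pieces."""
    words = text.split()
    if len(words) <= max_words:
        return [text]
    pieces = []
    start = 0
    while start < len(words):
        pieces.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
        start += max_words - overlap  # slide the window, keeping some overlap
    return pieces
```

For real token counts, swap the word split for a tokenizer such as `tiktoken`; the windowing logic stays the same.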
**`openai.RateLimitError` during batch grounding validation**
The embedding-based validator makes API calls per sentence-source pair. Add a simple delay or batch your embedding requests:
```python
# Batch all sentence/source texts into a single embeddings call
# instead of one call per sentence-source pair
all_texts = [sentence_a, sentence_b, source_text_a, source_text_b]
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=all_texts,
)
vectors = [d.embedding for d in response.data]
# Then compute pairwise cosine similarities locally from `vectors`
```
This cuts your API calls from O(n*m) to O(1) per validation pass.