Standard RAG has a blind spot. It always runs the same retrieval strategy no matter what the user asks. A factual lookup, a fuzzy semantic question, and a structured data query all get shoved through the same vector search. That works sometimes. It fails badly when the query needs keyword precision or a SQL lookup instead.
Agentic RAG fixes this by adding two capabilities: query routing (classify the query and pick the best retrieval path) and self-reflection (check if the retrieved context actually answers the question before generating a response). If the context is weak, the agent retries with a different strategy instead of hallucinating.
Here is the full setup.
```shell
pip install langchain langchain-openai langchain-community chromadb
```
```python
import os

os.environ["OPENAI_API_KEY"] = "sk-your-key-here"
```
Classify Queries for Routing
The first step is a classifier that looks at the incoming query and decides which retrieval strategy to use. You do not need a fine-tuned model for this. A well-prompted LLM handles it reliably.
```python
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

classify_prompt = ChatPromptTemplate.from_messages([
    ("system", """You are a query classifier for a retrieval system. Classify the user query into exactly one category:
- "vector" — open-ended, conceptual, or semantic questions (e.g., "What are best practices for fine-tuning?")
- "keyword" — queries looking for specific terms, error messages, or exact matches (e.g., "CUDA out of memory error fix")
- "sql" — queries about structured data, metrics, counts, or comparisons (e.g., "How many users signed up last month?")
Respond with ONLY the category name: vector, keyword, or sql"""),
    ("human", "{query}")
])

classifier = classify_prompt | llm | StrOutputParser()

# Test it
route = classifier.invoke({"query": "What are the tradeoffs between LoRA and full fine-tuning?"})
print(route)  # "vector"

route = classifier.invoke({"query": "RuntimeError: CUDA error: device-side assert triggered"})
print(route)  # "keyword"

route = classifier.invoke({"query": "Which model had the highest accuracy last quarter?"})
print(route)  # "sql"
```
Setting temperature=0 matters here: you want deterministic classification, not creative answers. Using gpt-4o-mini keeps this step fast and cheap, since classification is a simple task.
Build the Retrieval Backends
Each route needs its own retriever. In a real system these would hit different data sources. Here is a working setup with all three.
```python
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_community.retrievers import BM25Retriever
from langchain_core.documents import Document

embedding_model = OpenAIEmbeddings(model="text-embedding-3-small")

# --- Vector retriever (semantic search) ---
docs = [
    Document(page_content="LoRA fine-tuning updates low-rank matrices instead of all weights, reducing memory by 10x.", metadata={"source": "fine-tuning-guide"}),
    Document(page_content="Full fine-tuning gives the best quality but requires 4-8x more GPU memory than LoRA.", metadata={"source": "fine-tuning-guide"}),
    Document(page_content="QLoRA combines 4-bit quantization with LoRA for fine-tuning 70B models on a single 48GB GPU.", metadata={"source": "qlora-paper"}),
]
vectorstore = Chroma.from_documents(docs, embedding_model, collection_name="knowledge_base")
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

# --- Keyword retriever (BM25-style exact match) ---
keyword_docs = [
    Document(page_content="RuntimeError: CUDA error: device-side assert triggered. Fix: check your label indices are within range of num_classes.", metadata={"source": "error-db"}),
    Document(page_content="torch.cuda.OutOfMemoryError: reduce batch size, enable gradient checkpointing, or use mixed precision training.", metadata={"source": "error-db"}),
    Document(page_content="ImportError: cannot import name 'LLMChain' from 'langchain'. Fix: pip install langchain --upgrade", metadata={"source": "error-db"}),
]
keyword_retriever = BM25Retriever.from_documents(keyword_docs, k=3)

# --- SQL retriever (structured data lookup) ---
def sql_retriever(query: str) -> list[Document]:
    """Simulate a SQL-backed retrieval. In production, this runs actual SQL."""
    # In a real system: generate SQL with an LLM, execute it against the DB, format results
    sample_data = {
        "model_stats": "Model A: 94.2% accuracy (Q4 2025), Model B: 91.8% accuracy (Q4 2025), Model C: 89.5% accuracy (Q4 2025)",
        "user_signups": "Last month: 12,847 signups. Previous month: 11,203 signups. Growth: 14.7%",
    }
    # Return all structured data as context
    return [Document(page_content=v, metadata={"source": "database"}) for v in sample_data.values()]
```
The BM25 retriever handles keyword matching without embeddings. It excels at error messages, log lines, and anything where exact term overlap matters more than semantic similarity. For the SQL path, you would wire in actual database queries. The pattern shown here is the interface your routing logic expects.
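As a sketch of what a real SQL path might look like, here is a minimal SQLite-backed version. The `model_stats` table and its columns are hypothetical, the SQL is hard-coded rather than LLM-generated, and it returns plain strings instead of `Document` objects to stay self-contained:

```python
import sqlite3

def sqlite_retriever(query: str, db_path: str = ":memory:") -> list[str]:
    """Hypothetical SQL-backed retrieval: execute a query and format rows as context strings."""
    conn = sqlite3.connect(db_path)
    # Demo schema and data; a real system would connect to an existing database
    conn.execute("CREATE TABLE IF NOT EXISTS model_stats (model TEXT, accuracy REAL, quarter TEXT)")
    conn.executemany(
        "INSERT INTO model_stats VALUES (?, ?, ?)",
        [("Model A", 94.2, "Q4 2025"), ("Model B", 91.8, "Q4 2025")],
    )
    # In production, an LLM would translate `query` into SQL; here it is fixed
    rows = conn.execute(
        "SELECT model, accuracy, quarter FROM model_stats ORDER BY accuracy DESC"
    ).fetchall()
    conn.close()
    # Format each row as a context string the answer chain can consume
    return [f"{model}: {accuracy}% accuracy ({quarter})" for model, accuracy, quarter in rows]
```

Wrapping each row's text in a `Document` with `metadata={"source": "database"}` would make it a drop-in replacement for the simulated `sql_retriever` above.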
Route Queries to the Right Retriever
Now connect the classifier to the retrievers with a router function.
```python
def route_and_retrieve(query: str) -> dict:
    """Classify the query, route to the right retriever, return context."""
    route = classifier.invoke({"query": query})
    route = route.strip().lower()
    if route == "keyword":
        docs = keyword_retriever.invoke(query)
    elif route == "sql":
        docs = sql_retriever(query)
    else:
        # Default to vector search
        docs = vector_retriever.invoke(query)
    context = "\n\n".join(doc.page_content for doc in docs)
    return {"query": query, "route": route, "context": context, "docs": docs}
```
Defaulting to vector search is intentional. If the classifier is uncertain, semantic search is the safest fallback because it handles the widest range of queries.
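If you want that fallback to be explicit rather than implied by the `else` branch, a small normalizer (a hypothetical helper, not part of LangChain) can map any unexpected classifier output to the default route:

```python
VALID_ROUTES = {"vector", "keyword", "sql"}

def normalize_route(raw: str, default: str = "vector") -> str:
    """Map raw classifier output to a known route, falling back to the default."""
    route = raw.strip().lower().strip('".')
    # Salvage rambling outputs like "I would classify this as keyword"
    for candidate in VALID_ROUTES:
        if candidate in route:
            return candidate
    return default
```

Calling `normalize_route(classifier.invoke({"query": query}))` inside the router makes the vector-search default visible and testable instead of buried in control flow.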
Add Self-Reflection to Verify Context Quality
This is where agentic RAG separates from basic RAG. Before generating the final answer, the agent evaluates whether the retrieved context actually answers the question. If it does not, the agent can retry with a different retrieval strategy.
```python
reflection_prompt = ChatPromptTemplate.from_messages([
    ("system", """You are a retrieval quality evaluator. Given a user query and retrieved context, determine if the context contains enough information to answer the query.
Respond with exactly one of:
- "sufficient" — the context directly answers or strongly supports answering the query
- "insufficient" — the context is missing key information or is irrelevant
Be strict. If the context only partially covers the query, respond "insufficient"."""),
    ("human", "Query: {query}\n\nRetrieved Context:\n{context}")
])

reflection_chain = reflection_prompt | llm | StrOutputParser()

def evaluate_context(query: str, context: str) -> bool:
    """Return True if the context is sufficient to answer the query."""
    result = reflection_chain.invoke({"query": query, "context": context})
    # Exact match, not substring: "insufficient" contains "sufficient",
    # so a substring check would wave bad context through
    return result.strip().lower() == "sufficient"
```
The Full Agentic Loop
Put it all together. The agent classifies, retrieves, reflects, and either answers or retries with a fallback strategy.
```python
answer_prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer the user's question using ONLY the provided context. If the context does not contain the answer, say so clearly. Do not make up information."),
    ("human", "Context:\n{context}\n\nQuestion: {query}")
])
answer_chain = answer_prompt | llm | StrOutputParser()

FALLBACK_ORDER = {
    "vector": ["keyword", "sql"],
    "keyword": ["vector", "sql"],
    "sql": ["vector", "keyword"],
}

RETRIEVER_MAP = {
    "vector": lambda q: vector_retriever.invoke(q),
    "keyword": lambda q: keyword_retriever.invoke(q),
    "sql": lambda q: sql_retriever(q),
}

def agentic_rag(query: str, max_retries: int = 2) -> dict:
    """Run the full agentic RAG loop with routing and self-reflection."""
    result = route_and_retrieve(query)
    tried_routes = [result["route"]]

    # Self-reflection: check if context is good enough
    if evaluate_context(query, result["context"]):
        answer = answer_chain.invoke({"query": query, "context": result["context"]})
        return {"answer": answer, "route": result["route"], "retries": 0}

    # Context was insufficient — try fallback routes
    fallbacks = FALLBACK_ORDER.get(result["route"], ["vector"])
    for i, fallback_route in enumerate(fallbacks[:max_retries]):
        if fallback_route in tried_routes:
            continue
        docs = RETRIEVER_MAP[fallback_route](query)
        context = "\n\n".join(doc.page_content for doc in docs)
        tried_routes.append(fallback_route)
        if evaluate_context(query, context):
            answer = answer_chain.invoke({"query": query, "context": context})
            return {"answer": answer, "route": fallback_route, "retries": i + 1}

    # All routes exhausted — fall back to the original route's context
    answer = answer_chain.invoke({"query": query, "context": result["context"]})
    return {
        "answer": answer,
        "route": result["route"],
        "retries": len(tried_routes) - 1,
        "warning": "No retrieval strategy produced sufficient context",
    }

# Run it
response = agentic_rag("What are the tradeoffs between LoRA and full fine-tuning?")
print(f"Route: {response['route']}")
print(f"Retries: {response['retries']}")
print(f"Answer: {response['answer']}")
```
The max_retries parameter caps how many fallback strategies the agent tries. Two is a good default. Going higher adds latency without much benefit since you only have three retrieval strategies anyway.
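To make the latency and cost tradeoff concrete, here is a back-of-envelope helper (purely illustrative, not part of the pipeline) counting worst-case LLM calls per query under this loop:

```python
def max_llm_calls(max_retries: int, num_routes: int = 3) -> int:
    """Worst-case LLM calls: 1 classification + 1 reflection per attempted route + 1 answer."""
    # The initial route plus up to max_retries fallbacks, capped by how many routes exist
    attempts = 1 + min(max_retries, num_routes - 1)
    return 1 + attempts + 1  # classify + reflections + final answer
```

With the default `max_retries=2` and three routes, a worst-case query costs five LLM calls, which is the real reason raising the cap past two buys you little.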
Notice the agent always returns an answer, even when all routes fail the reflection check. This is a design choice. In a production system you might want to return a “I don’t have enough information” message instead, or escalate to a human.
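If you prefer the refusal behavior, one option is a small wrapper (a hypothetical helper) that checks the warning field on the result dict before showing the answer to the user:

```python
def finalize(result: dict, refuse_on_warning: bool = True) -> str:
    """Turn an agentic RAG result dict into a user-facing reply.

    When no retrieval strategy produced sufficient context, optionally refuse
    instead of returning an answer generated from weak context.
    """
    if refuse_on_warning and "warning" in result:
        return "I don't have enough information to answer that reliably."
    return result["answer"]
```

Escalation to a human would slot into the same branch: log the query and warning, then return the refusal message.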
Common Errors and Fixes
ValueError: Could not import rank_bm25 python package
BM25Retriever depends on the rank_bm25 package, which is not installed automatically. Install it separately with pip install rank_bm25.
chromadb.errors.InvalidCollectionException: Collection not found
This happens when you try to load a collection that was not persisted. Make sure you pass persist_directory when creating the Chroma instance, or use Chroma.from_documents() to create the collection first.
Classifier returns unexpected values like “I would classify this as vector search”
Your classification prompt is not strict enough. Add "Respond with ONLY the category name" to the system message and set temperature=0. If the model still rambles, add a one-shot example showing the expected format.
Self-reflection always returns “sufficient” even for bad context
The reflection prompt needs to be strict. Tell it to respond “insufficient” if the context only partially covers the query. You can also add few-shot examples of insufficient context to calibrate the evaluator.
openai.RateLimitError: Rate limit reached
The agentic loop makes multiple LLM calls per query (classification + reflection + answer, potentially doubled on retries). Use gpt-4o-mini for classification and reflection to keep costs and rate limits manageable. Reserve gpt-4o for the final answer generation if you need higher quality.
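When you do hit rate limits, a simple retry wrapper with exponential backoff keeps the loop resilient. This is a generic sketch, not an OpenAI-specific API; in production you would catch openai.RateLimitError specifically rather than a bare Exception:

```python
import random
import time

def with_backoff(fn, max_attempts: int = 5, base_delay: float = 1.0):
    """Call fn(), retrying on exceptions with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the error
            # Delays grow roughly 1s, 2s, 4s, ...; jitter avoids thundering herds
            time.sleep(base_delay * (2 ** attempt + random.random()))
```

Usage would look like `answer = with_backoff(lambda: answer_chain.invoke({"query": query, "context": context}))`.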