Most RAG pipelines are dumb pipes: embed the query, fetch top-K, stuff into a prompt. That works until your corpus grows and retrieval noise drowns the signal. A retrieval agent fixes this by putting an LLM in the driver’s seat – it decides when to search, reranks what comes back, and generates a grounded answer with citations.
Here’s the stack: Claude for tool calling and answer generation, ChromaDB for vector search, and Cohere Rerank to filter out the noise. Install everything first:
```bash
pip install anthropic chromadb cohere sentence-transformers
```
## Set Up the Vector Store
We need a document collection to search against. ChromaDB handles embedding and storage in one shot.
```python
import chromadb
from chromadb.utils import embedding_functions

# Use a local sentence-transformers model for embeddings
ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)

chroma_client = chromadb.Client()
collection = chroma_client.create_collection(
    name="knowledge_base",
    embedding_function=ef,
)

# Sample documents -- replace with your actual corpus
documents = [
    "FAISS supports both CPU and GPU indexing. For datasets under 1M vectors, a flat index works fine. Beyond that, use IVF with nprobe tuning.",
    "Cohere Rerank v3 accepts up to 1000 documents per request. It returns a relevance score between 0 and 1 for each document relative to the query.",
    "Claude's tool calling works by defining tools in the API request. The model returns a tool_use block when it wants to call a tool, and you send results back as tool_result.",
    "Retrieval-augmented generation reduces hallucination by grounding answers in retrieved documents. The key bottleneck is retrieval quality, not generation.",
    "ChromaDB stores embeddings locally with automatic persistence. It supports metadata filtering and can use custom embedding functions.",
    "Cross-encoder rerankers score query-document pairs jointly, unlike bi-encoders which encode them separately. This makes them slower but much more accurate for reranking.",
    "Vector search recall drops significantly when the query is ambiguous or multi-hop. Query decomposition helps -- break the question into sub-queries.",
    "The sentence-transformers library provides pre-trained models for semantic search. all-MiniLM-L6-v2 is a good balance of speed and quality.",
]

doc_ids = [f"doc_{i}" for i in range(len(documents))]
collection.add(documents=documents, ids=doc_ids)
```
Claude needs a tool definition to know it can search. The tool takes a query string and returns relevant documents. We do the vector search + Cohere reranking inside the tool function.
```python
import cohere
import json

co = cohere.Client()  # Uses COHERE_API_KEY env var

search_tool = {
    "name": "search_knowledge_base",
    "description": (
        "Search the internal knowledge base for information. Use this when "
        "the user asks a factual question that requires looking up documentation. "
        "Do NOT use this for simple greetings or opinions."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "The search query to find relevant documents",
            }
        },
        "required": ["query"],
    },
}

def search_and_rerank(query: str, top_k: int = 5, top_n: int = 3) -> str:
    """Search ChromaDB, rerank with Cohere, return top results."""
    # Step 1: Vector search -- cast a wide net
    results = collection.query(query_texts=[query], n_results=top_k)
    candidate_docs = results["documents"][0]
    if not candidate_docs:
        return json.dumps({"results": [], "message": "No documents found."})

    # Step 2: Rerank with Cohere -- narrow to the best matches
    rerank_response = co.rerank(
        model="rerank-v3.5",
        query=query,
        documents=candidate_docs,
        top_n=top_n,
    )

    reranked = []
    for hit in rerank_response.results:
        reranked.append({
            "text": candidate_docs[hit.index],
            "relevance_score": round(hit.relevance_score, 4),
        })

    return json.dumps({"query": query, "results": reranked})
```
The key insight: vector search retrieves `top_k=5` candidates (high recall, lower precision), then Cohere reranks and keeps only `top_n=3` (high precision). This two-stage approach consistently beats single-stage retrieval.
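Stripped of any vendor APIs, the pattern is just "over-fetch, then re-score." A minimal sketch, where the two callables stand in for your vector search and reranker:

```python
def two_stage_retrieve(query, first_stage, reranker, top_k=5, top_n=3):
    """Over-fetch with a cheap retriever, then re-score with an accurate one."""
    candidates = first_stage(query, top_k)            # stage 1: high recall
    scored = [(doc, reranker(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in scored[:top_n]]         # stage 2: high precision
```

Any retriever/reranker pair with these shapes drops straight in; the structure is what matters, not the specific backends.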
## Build the Agent Loop
The agent loop sends the user’s question to Claude with the search tool. Claude either answers directly or calls the tool. When it calls the tool, we execute the search-and-rerank pipeline and feed the results back.
```python
from anthropic import Anthropic

client = Anthropic()

SYSTEM_PROMPT = (
    "You are a helpful research assistant. When answering factual questions, "
    "search the knowledge base first. Cite the specific documents you used. "
    "If the search results don't contain the answer, say so honestly."
)

def run_retrieval_agent(user_query: str, max_turns: int = 5) -> str:
    """Run the retrieval agent with tool calling and reranking."""
    messages = [{"role": "user", "content": user_query}]
    tools = [search_tool]

    for turn in range(max_turns):
        response = client.messages.create(
            model="claude-sonnet-4-5",  # alias; pin a dated snapshot in production
            max_tokens=1024,
            system=SYSTEM_PROMPT,
            tools=tools,
            messages=messages,
        )

        # If Claude wants to use a tool, execute it
        if response.stop_reason == "tool_use":
            assistant_content = response.content
            tool_results = []
            for block in assistant_content:
                if block.type == "tool_use":
                    print(f"  [Search] query={block.input['query']!r}")
                    result = search_and_rerank(block.input["query"])
                    print(f"  [Results] {result[:200]}...")
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": result,
                    })
            messages.append({"role": "assistant", "content": assistant_content})
            messages.append({"role": "user", "content": tool_results})
        else:
            # Claude responded with text -- extract and return
            return "".join(
                block.text for block in response.content if hasattr(block, "text")
            )

    return "Agent did not converge within the turn limit."
```
Run it:
```python
answer = run_retrieval_agent("How does Cohere Rerank work and what are its limits?")
print(answer)
# [Search] query='Cohere Rerank capabilities and limits'
# Claude returns a grounded answer citing the retrieved documents
```
## Swap Cohere for a Local Cross-Encoder
If you want to avoid API costs or keep everything on-prem, replace the Cohere call with a cross-encoder from sentence-transformers. The trade-off: it’s slower on CPU but free and private.
```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def search_and_rerank_local(query: str, top_k: int = 5, top_n: int = 3) -> str:
    """Search ChromaDB, rerank with a local cross-encoder."""
    results = collection.query(query_texts=[query], n_results=top_k)
    candidate_docs = results["documents"][0]
    if not candidate_docs:
        return json.dumps({"results": [], "message": "No documents found."})

    # Score each query-document pair with the cross-encoder
    pairs = [[query, doc] for doc in candidate_docs]
    scores = reranker.predict(pairs)

    # Sort by score descending, take top_n
    scored_docs = sorted(
        zip(candidate_docs, scores), key=lambda x: x[1], reverse=True
    )[:top_n]

    reranked = [
        {"text": doc, "relevance_score": round(float(score), 4)}
        for doc, score in scored_docs
    ]
    return json.dumps({"query": query, "results": reranked})
```
My recommendation: use Cohere Rerank in production (it’s fast, accurate, and the API handles batching well). Use the local cross-encoder for prototyping, testing, or when you can’t send data to an external API.
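If you want both paths in one codebase, a small selector keeps the agent loop unchanged. This is a sketch: the `RERANK_BACKEND` environment variable is a convention invented for this post, and the two function arguments are meant to be `search_and_rerank` and `search_and_rerank_local` from above.

```python
import os

def pick_reranker(cohere_fn, local_fn, env_var="RERANK_BACKEND"):
    """Return whichever rerank function the environment selects.

    Defaults to the Cohere path; set the env var to "local" to use
    the on-prem cross-encoder instead.
    """
    backend = os.environ.get(env_var, "cohere")
    return local_fn if backend == "local" else cohere_fn
```

The agent loop then calls `pick_reranker(search_and_rerank, search_and_rerank_local)(query)` and never needs to know which backend is live.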
## Common Errors and Fixes
### `chromadb.errors.DuplicateIDError: ID doc_0 already exists`
You’re calling `collection.add()` with IDs that already exist. Use `collection.upsert()` instead, or check first with `collection.get()`.
```python
# Fix: use upsert instead of add
collection.upsert(documents=documents, ids=doc_ids)
```
### `cohere.errors.TooManyRequestsError: You exceeded your rate limit`
Cohere’s free tier allows 10 rerank calls per minute. Either add a retry with backoff or batch your documents into fewer calls.
```python
import time

def rerank_with_retry(query, docs, max_retries=3):
    for attempt in range(max_retries):
        try:
            return co.rerank(model="rerank-v3.5", query=query, documents=docs, top_n=3)
        except cohere.errors.TooManyRequestsError:
            wait = 2 ** attempt
            print(f"Rate limited, waiting {wait}s...")
            time.sleep(wait)
    raise RuntimeError("Rerank failed after retries")
```
### `anthropic.BadRequestError: tools.0.input_schema must be a valid JSON Schema`
Your tool’s `input_schema` has a structural problem. Common causes: missing `"type": "object"` at the top level, or nesting `"required"` inside `"properties"` (it belongs at the top level, as a sibling of `"properties"`). Double-check that the schema matches the exact format shown above.
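You can catch these before the API does with a quick sanity check. `check_tool_schema` is a helper written for this post, not part of the Anthropic SDK, and it is deliberately shallow rather than a full JSON Schema validator:

```python
def check_tool_schema(schema: dict) -> list:
    """Return a list of structural problems (empty list means it looks OK)."""
    problems = []
    if schema.get("type") != "object":
        problems.append('missing "type": "object" at the top level')
    props = schema.get("properties")
    if not isinstance(props, dict):
        problems.append('"properties" must be a dict of parameter schemas')
        props = {}
    for name in schema.get("required", []):
        if name not in props:
            problems.append(f'required field "{name}" is not defined in "properties"')
    return problems
```

Run it on `search_tool["input_schema"]` before making the first API call and fail fast on any non-empty result.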
### `ValueError: could not convert string to float` from CrossEncoder
This usually means your `pairs` list contains `None` values. ChromaDB can return `None` for documents that were deleted. Filter them out before scoring:
```python
# Fix: drop deleted documents before building pairs
pairs = [[query, doc] for doc in candidate_docs if doc is not None]
```
### Search returns irrelevant results even after reranking
Your embedding model might not be a good fit for your domain. `all-MiniLM-L6-v2` is a general-purpose model. For domain-specific content, fine-tune an embedding model or switch to a larger one like `all-mpnet-base-v2`. Also check that your `top_k` isn’t too small – if the right document doesn’t make it past the vector search stage, reranking can’t save it.
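To tell whether the first stage is the culprit, measure recall on a handful of hand-labeled queries. `recall_at_k` is a small helper introduced here for that purpose:

```python
def recall_at_k(retrieved_ids: list, relevant_ids: set, k: int) -> float:
    """Fraction of the relevant documents that survive the first retrieval stage.

    If this is low, the reranker never sees the right documents --
    raise top_k before touching anything else.
    """
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)
```

Feed it the IDs from `collection.query()` (before reranking) and a set of known-relevant doc IDs; if recall jumps when you raise `k`, your `top_k` is the bottleneck, not the reranker.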