Embedding-based search is fast but sloppy. You encode your query, find the nearest vectors, and hope the top results are actually relevant. They often aren’t. The embedding model optimizes for general similarity, not for answering your specific question.

Reranking fixes this. A cross-encoder model looks at the query and each candidate document together, scoring true relevance instead of vector proximity. It’s often the single highest-impact improvement you can make to a search or RAG pipeline. Cohere’s Rerank API gives you a production-ready cross-encoder with one API call.

Here’s the minimal version:

import cohere

co = cohere.ClientV2(api_key="your-api-key")

results = co.rerank(
    model="rerank-v3.5",
    query="How do I reset my password?",
    documents=[
        "To reset your password, go to Settings > Security > Reset Password.",
        "Our company was founded in 2019 by three engineers.",
        "Password requirements include 8 characters and one special character.",
        "Contact support at [email protected] for account issues.",
    ],
    top_n=2,
)

for result in results.results:
    print(f"Index: {result.index}, Score: {result.relevance_score:.4f}")

The first document scores highest because it directly answers the question. The third document ranks second because it’s password-related but not a direct answer. The other two drop off. That’s what a cross-encoder gives you that embeddings alone can’t.

Installing and Setting Up

Install the Cohere Python SDK:

pip install cohere

You need a Cohere API key. Sign up at dashboard.cohere.com and grab your key from the API Keys section. The free tier gives you rate-limited access to all models including rerank.

Initialize the client:

import cohere

# Option 1: Pass the key directly
co = cohere.ClientV2(api_key="your-api-key")

# Option 2: Set CO_API_KEY environment variable and skip the argument
# export CO_API_KEY=your-api-key
co = cohere.ClientV2()

The ClientV2 class is the current recommended client in the Cohere Python SDK. It targets the v2 API endpoints.

Cohere offers several rerank models:

Model            | Context Length | Best For
rerank-v3.5      | 4,096 tokens   | General-purpose, low latency
rerank-v4.0-pro  | 32,768 tokens  | Long documents, highest accuracy
rerank-v4.0-fast | 32,768 tokens  | Long documents, faster than pro

For most search and RAG use cases, rerank-v3.5 is the sweet spot. It’s fast, cheap, and handles typical chunk sizes well. Use rerank-v4.0-pro when you’re reranking full documents or need the absolute best accuracy on financial, legal, or business text.
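That selection logic can be captured in a small routing helper. This is just a sketch: `choose_rerank_model` is a hypothetical name, and the 4-characters-per-token estimate is a rough assumption, not Cohere’s actual tokenizer.

```python
def choose_rerank_model(documents: list[str]) -> str:
    """Pick a rerank model from the table above based on rough document length."""
    # ~4 characters per token is a crude heuristic, not Cohere's tokenizer
    max_tokens = max(len(doc) // 4 for doc in documents)
    if max_tokens <= 4096:
        return "rerank-v3.5"      # fits the smaller context; fast and cheap
    return "rerank-v4.0-pro"      # long documents need the 32k context

choose_rerank_model(["short chunk"])  # -> "rerank-v3.5"
```

If latency matters more than peak accuracy for long documents, swapping the second return value for rerank-v4.0-fast follows the same logic.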

Basic Reranking

Here’s a realistic example. Say you have a knowledge base and a user query. Your vector search returns 10 candidates, but the ordering is mediocre:

import cohere

co = cohere.ClientV2(api_key="your-api-key")

query = "What are the health insurance options for part-time employees?"

# These are your search results, in the order your vector DB returned them
search_results = [
    "Full-time employees receive health, dental, and vision coverage starting day one.",
    "Part-time employees working over 20 hours per week are eligible for health insurance.",
    "Our 401k plan matches up to 4% of your salary after one year of employment.",
    "The company picnic is scheduled for July 15th in the main courtyard.",
    "Health insurance plans include PPO and HMO options through Aetna and Blue Cross.",
    "Remote work is available for all employees with manager approval.",
    "Part-time employees can enroll in the dental plan at a subsidized rate.",
    "Open enrollment for health benefits runs from November 1-15 each year.",
    "The employee handbook is available on the internal wiki under HR resources.",
    "Life insurance coverage of 1x annual salary is provided at no cost to full-time staff.",
]

results = co.rerank(
    model="rerank-v3.5",
    query=query,
    documents=search_results,
    top_n=5,
)

print("Reranked results:")
print("-" * 60)
for rank, result in enumerate(results.results, 1):
    doc_text = search_results[result.index]
    print(f"\nRank {rank} (score: {result.relevance_score:.4f}):")
    print(f"  Original position: {result.index}")
    print(f"  {doc_text}")

The reranker pushes the part-time health insurance document to the top, followed by the general health plan options and other insurance-adjacent results. Documents about picnics and remote work drop out entirely.

Notice you access results through results.results. Each item has an index (position in your original list) and a relevance_score between 0 and 1. Higher means more relevant, but the scores aren’t calibrated across queries – a 0.9 on one query doesn’t mean the same thing as a 0.9 on another.
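If downstream logic needs scores that behave more consistently, one pragmatic option is to normalize within each query’s result set. A minimal sketch, not part of the Cohere API:

```python
def normalize_scores(scores: list[float]) -> list[float]:
    """Min-max normalize rerank scores within a single query's result set."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [1.0 for _ in scores]  # all ties: no ordering information
    return [(s - lo) / (hi - lo) for s in scores]
```

This only makes relative gaps within one query easier to reason about; it doesn’t make scores comparable across queries, so rank position is still the safer signal.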

Integrating with a RAG Pipeline

Reranking shines brightest in RAG. The pattern is: retrieve broadly, rerank tightly, then feed only the best documents to your LLM. Here’s a complete pipeline using Cohere for embedding and reranking, with a simple in-memory store:

import cohere
import numpy as np

co = cohere.ClientV2(api_key="your-api-key")

# Your knowledge base
knowledge_base = [
    "Python 3.12 introduced support for sub-interpreters via the new interpreters module.",
    "The GIL in Python prevents true parallel execution of threads for CPU-bound tasks.",
    "asyncio provides cooperative multitasking for I/O-bound Python programs.",
    "multiprocessing spawns separate processes, each with its own GIL, enabling true parallelism.",
    "Threading in Python is useful for I/O-bound tasks despite the GIL limitation.",
    "Cython can release the GIL for C-level computations, enabling thread parallelism.",
    "The free-threaded build of Python 3.13 experimentally removes the GIL.",
    "Ray is a distributed computing framework that scales Python workloads across clusters.",
    "Django uses synchronous request handling by default but supports async views since 4.1.",
    "FastAPI is built on Starlette and supports async request handling natively.",
]

# Step 1: Embed the knowledge base
doc_response = co.embed(
    texts=knowledge_base,
    model="embed-v4.0",
    input_type="search_document",
    embedding_types=["float"],
)
doc_embeddings = np.array(doc_response.embeddings.float_)

# Step 2: Embed the query
query = "How can I run Python code in parallel without the GIL blocking me?"

query_response = co.embed(
    texts=[query],
    model="embed-v4.0",
    input_type="search_query",
    embedding_types=["float"],
)
query_embedding = np.array(query_response.embeddings.float_[0])

# Step 3: Vector search (cosine similarity)
similarities = np.dot(doc_embeddings, query_embedding) / (
    np.linalg.norm(doc_embeddings, axis=1) * np.linalg.norm(query_embedding)
)
top_indices = np.argsort(similarities)[::-1][:7]  # Retrieve top 7 candidates

print("=== Vector search results ===")
candidates = []
for i, idx in enumerate(top_indices):
    print(f"{i+1}. [{similarities[idx]:.3f}] {knowledge_base[idx]}")
    candidates.append(knowledge_base[idx])

# Step 4: Rerank the candidates
rerank_results = co.rerank(
    model="rerank-v3.5",
    query=query,
    documents=candidates,
    top_n=3,
)

print("\n=== After reranking (top 3) ===")
context_docs = []
for rank, result in enumerate(rerank_results.results, 1):
    doc = candidates[result.index]
    print(f"{rank}. [{result.relevance_score:.4f}] {doc}")
    context_docs.append(doc)

# Step 5: Generate answer with the reranked context
context = "\n".join(f"- {doc}" for doc in context_docs)

response = co.chat(
    model="command-r-plus-08-2024",
    messages=[
        {
            "role": "user",
            "content": f"Answer based on this context:\n{context}\n\nQuestion: {query}",
        }
    ],
)

print("\n=== Generated answer ===")
print(response.message.content[0].text)

The vector search gets you into the right neighborhood. The reranker sorts by actual relevance to the question. The LLM only sees the three best documents, which means fewer distracting passages and better answers.

This two-stage retrieve-then-rerank approach is standard practice. Retrieve 20-50 candidates with embeddings (fast and cheap), then rerank to the top 3-5 (slower but much more accurate).
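The pattern can be captured in one reusable function. This is a sketch, with the retrieval and rerank steps injected as callables so the same shape works whether the retriever is the cosine-similarity search above or a real vector DB; `retrieve_then_rerank` is a hypothetical helper, not an SDK function.

```python
def retrieve_then_rerank(query, corpus, retrieve, rerank, k_retrieve=30, k_final=5):
    """Two-stage search: broad retrieval, then tight reranking.

    retrieve(query, corpus, k) -> indices into corpus
    rerank(query, docs, k)     -> indices into docs, best first
    """
    candidate_idx = retrieve(query, corpus, k_retrieve)
    candidates = [corpus[i] for i in candidate_idx]
    keep = rerank(query, candidates, k_final)
    return [candidates[i] for i in keep]
```

In the pipeline above, retrieve would wrap the embed-and-cosine-similarity steps and rerank would wrap the co.rerank call.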

Reranking with Metadata and Filtering

When your documents have structured fields – title, body, source, date – you can tell the reranker which fields to consider. Convert your documents to YAML strings, which the rerank models handle natively:

import cohere
import yaml

co = cohere.ClientV2(api_key="your-api-key")

documents = [
    {
        "title": "Kubernetes Pod Scheduling",
        "body": "Pods are scheduled to nodes based on resource requests, affinity rules, and taints.",
        "source": "infrastructure-docs",
    },
    {
        "title": "Docker Container Networking",
        "body": "Containers communicate through bridge networks by default. Use overlay networks for multi-host setups.",
        "source": "infrastructure-docs",
    },
    {
        "title": "CI/CD Pipeline Best Practices",
        "body": "Run unit tests first, integration tests second. Deploy to staging before production.",
        "source": "engineering-handbook",
    },
    {
        "title": "Kubernetes Resource Limits",
        "body": "Set CPU and memory limits on every pod. Without limits, a single pod can starve the node.",
        "source": "infrastructure-docs",
    },
    {
        "title": "Monitoring with Prometheus",
        "body": "Scrape metrics from /metrics endpoints. Set up alerts for pod restarts and high memory usage.",
        "source": "observability-docs",
    },
]

# Convert to YAML strings for reranking
yaml_docs = [yaml.dump(doc, sort_keys=False) for doc in documents]

query = "How do I set resource limits on Kubernetes pods?"

results = co.rerank(
    model="rerank-v3.5",
    query=query,
    documents=yaml_docs,
    top_n=3,
)

# Filter by relevance score threshold
SCORE_THRESHOLD = 0.1

print("Reranked and filtered results:")
for rank, result in enumerate(results.results, 1):
    if result.relevance_score < SCORE_THRESHOLD:
        print(f"\nRank {rank}: FILTERED (score {result.relevance_score:.4f} below threshold)")
        continue
    doc = documents[result.index]
    print(f"\nRank {rank} (score: {result.relevance_score:.4f}):")
    print(f"  Title: {doc['title']}")
    print(f"  Body: {doc['body']}")
    print(f"  Source: {doc['source']}")

A few things to note about the YAML approach. The rerank model processes both the title and body text together, giving you field-aware ranking without needing separate embeddings for each field. Use sort_keys=False in yaml.dump so your most important fields come first – the model processes fields sequentially, and long documents get truncated from the end.

The relevance score threshold is something you should tune per use case. Start by running 30-50 representative queries, checking which scores correspond to genuinely useful results, and picking a cutoff from there. A threshold of 0.01 to 0.1 is typical depending on your domain.
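That tuning loop can be automated once you have labels. A sketch, assuming you’ve collected (score, is_relevant) pairs from those representative queries; it tries each observed score as a cutoff and keeps the one with the best F1:

```python
def tune_threshold(labeled: list[tuple[float, bool]]) -> float:
    """Pick the score cutoff that maximizes F1 over labeled (score, relevant) pairs."""
    best_t, best_f1 = 0.0, -1.0
    for t in sorted({s for s, _ in labeled}):
        tp = sum(1 for s, rel in labeled if s >= t and rel)
        fp = sum(1 for s, rel in labeled if s >= t and not rel)
        fn = sum(1 for s, rel in labeled if s < t and rel)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t
```

F1 is one reasonable objective; if missing a relevant document is costlier than showing an irrelevant one, optimize recall-weighted metrics instead.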

Common Errors and Fixes

“too many documents” error: The API accepts up to 10,000 document chunks. If your documents are long, each one gets split into multiple chunks internally. With the default max_chunks_per_doc=10, sending 1,500 long documents can produce up to 15,000 chunks and exceed the limit. Either reduce the number of documents or truncate them before sending.

# Cap the number of documents per request if hitting the chunk limit
results = co.rerank(
    model="rerank-v3.5",
    query=query,
    documents=large_doc_list[:1000],
    top_n=10,
)

Empty or None document text: If you pass dictionaries without a text field and don’t convert them to YAML strings, the API throws an error. The rerank endpoint expects either plain strings or YAML-formatted strings. Always convert structured data to YAML before passing it in.

# Wrong - passing raw dicts
results = co.rerank(model="rerank-v3.5", query=query, documents=[{"title": "foo"}])

# Right - convert to YAML strings first
import yaml
yaml_docs = [yaml.dump(doc, sort_keys=False) for doc in documents]
results = co.rerank(model="rerank-v3.5", query=query, documents=yaml_docs)

Rate limiting on free tier: The free Cohere plan has rate limits. If you’re reranking in a loop, add a small delay or batch your requests. For production use, the production API key removes most rate limits. You’ll get a 429 status code when throttled – implement exponential backoff.
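A minimal backoff wrapper might look like the following. The exact exception class raised on a 429 varies by SDK version, so this sketch lets the caller pass the exception type(s) to retry on; `with_backoff` is a hypothetical helper, not part of the Cohere SDK.

```python
import random
import time

def with_backoff(fn, retry_on=Exception, max_retries=5, base_delay=1.0):
    """Call fn(), retrying with exponential backoff plus jitter on retry_on."""
    for attempt in range(max_retries):
        try:
            return fn()
        except retry_on:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the error to the caller
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
```

Usage would look like with_backoff(lambda: co.rerank(model="rerank-v3.5", query=query, documents=docs, top_n=5)), with retry_on narrowed to whatever rate-limit exception your SDK version raises.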

Scores aren’t comparable across queries: A relevance score of 0.8 on one query doesn’t mean the same as 0.8 on another. Don’t use absolute score thresholds that you tuned on one type of query for a completely different type. Calibrate thresholds per query category if your search handles diverse intent types.