Quick Start: Chat with Cohere

Install the SDK and fire off a request. That’s all it takes to get a response from Cohere’s Command model.

pip install cohere
import cohere

co = cohere.ClientV2()  # reads CO_API_KEY from environment

response = co.chat(
    model="command-a-03-2025",
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Explain Python generators in 3 sentences."},
    ],
)

print(response.message.content[0].text)

Set your API key as an environment variable first:

export CO_API_KEY="your-api-key-here"

The command-a-03-2025 model is Cohere’s strongest general-purpose model. It has a 256k context window, excels at tool use and RAG, and supports structured output via JSON schemas. If you need something lighter, command-r7b-12-2024 handles most tasks at lower cost with a 128k context.

Grounded RAG with Built-In Citations

This is where Cohere pulls ahead of other LLM APIs. Pass documents directly to the chat endpoint and the model returns citations mapping every claim to its source. No prompt hacking, no post-processing. It just works.

documents = [
    {
        "data": {
            "text": "Python 3.12 introduced type parameter syntax with PEP 695, "
                    "allowing generic classes and functions to use a cleaner syntax.",
            "title": "Python 3.12 Release Notes",
        }
    },
    {
        "data": {
            "text": "The new 'type' statement in Python 3.12 creates type aliases "
                    "more explicitly than the old TypeAlias annotation.",
            "title": "PEP 695 Summary",
        }
    },
]

response = co.chat(
    model="command-a-03-2025",
    messages=[{"role": "user", "content": "What changed with generics in Python 3.12?"}],
    documents=documents,
)

# Print the answer
print(response.message.content[0].text)

# Print citations -- each one maps text spans to source documents
if response.message.citations:
    for cite in response.message.citations:
        print(f"  [{cite.start}:{cite.end}] -> {cite.sources}")

Each citation includes start and end character positions in the generated text, the cited text itself, and sources pointing back to your documents. You can render these as inline footnotes or highlights in your UI. Few other LLM APIs give you this out of the box.
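Rendering those spans as footnotes takes only a few lines. Here's a sketch; `render_footnotes` is a hypothetical helper (not part of the SDK) that assumes you've pulled each citation's integer start/end offsets into plain tuples:

```python
def render_footnotes(text, spans):
    """Insert [n] markers after each cited span.

    `spans` is a list of (start, end) character offsets into `text`,
    mirroring the start/end fields on Cohere citation objects.
    """
    out = text
    # Insert from the end of the string so earlier offsets stay valid
    for n, (start, end) in sorted(enumerate(spans, 1), key=lambda p: -p[1][1]):
        out = out[:end] + f"[{n}]" + out[end:]
    return out

answer = "PEP 695 added new generic syntax in Python 3.12."
spans = [(0, 7), (14, 32)]
print(render_footnotes(answer, spans))
# PEP 695[1] added new generic syntax[2] in Python 3.12.
```

In a real app you'd build `spans` from `response.message.citations` and keep each citation's sources alongside the footnote number.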

Embeddings with embed-v4.0

Cohere’s embed-v4.0 model supports configurable dimensions (256, 512, 1024, or 1536), handles up to 128k tokens per input, and works with both text and images.

import numpy as np

docs = [
    "Transformers use self-attention to process sequences in parallel.",
    "Recurrent networks process tokens sequentially with hidden states.",
    "Convolutional networks apply learned filters across spatial dimensions.",
]

# Embed documents for storage in a vector DB
doc_embeddings = co.embed(
    model="embed-v4.0",
    input_type="search_document",
    texts=docs,
    embedding_types=["float"],
).embeddings.float

# Embed a query
query_embedding = co.embed(
    model="embed-v4.0",
    input_type="search_query",
    texts=["How do transformers work?"],
    embedding_types=["float"],
).embeddings.float

# Cosine similarity via dot product (embeddings are normalized)
scores = np.dot(query_embedding, np.transpose(doc_embeddings))[0]
ranked = np.argsort(-scores)

for idx in ranked:
    print(f"  {scores[idx]:.4f} | {docs[idx][:60]}...")

The input_type parameter is required. Use "search_document" when embedding your corpus and "search_query" when embedding user queries. This asymmetry improves retrieval accuracy significantly.
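The dot-product ranking above relies on the embeddings being unit-normalized. If you ever swap in vectors where that isn't guaranteed, compute full cosine similarity instead; this is a generic NumPy sketch, not Cohere-specific:

```python
import numpy as np

def cosine_scores(query_vec, doc_matrix):
    """Cosine similarity between one query vector and each row of doc_matrix."""
    q = np.asarray(query_vec, dtype=float)
    D = np.asarray(doc_matrix, dtype=float)
    q = q / np.linalg.norm(q)                           # normalize the query
    D = D / np.linalg.norm(D, axis=1, keepdims=True)    # normalize each document row
    return D @ q

scores = cosine_scores([1.0, 0.0], [[2.0, 0.0], [0.0, 3.0]])
print(scores)  # [1. 0.]
```

For already-normalized embeddings the extra division is a no-op, so the helper is safe either way.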

Semantic Reranking with Rerank 4

Retrieved 50 documents from a vector search but only want the best 5? Reranking is the answer. Cohere’s reranker is a cross-encoder that scores query-document pairs directly, which is more accurate than embedding similarity alone.

query = "How do I deploy a FastAPI app to production?"

# Imagine these came from a vector search
candidates = [
    "FastAPI can be deployed using Uvicorn behind Nginx as a reverse proxy.",
    "Flask is a popular micro-framework for building web applications.",
    "Use gunicorn with uvicorn workers for production FastAPI deployments.",
    "Django REST Framework provides serializers for API development.",
    "Docker containers simplify FastAPI deployment across environments.",
]

results = co.rerank(
    model="rerank-v4.0-pro",
    query=query,
    documents=candidates,
    top_n=3,
)

for r in results.results:
    print(f"  Score: {r.relevance_score:.4f} | {candidates[r.index][:60]}...")

The rerank-v4.0-pro model handles documents up to 32k tokens each. For latency-sensitive applications, swap in rerank-v4.0-fast – same interface, lower latency, slightly less accurate.

Full RAG Pipeline: Embed, Rerank, Generate

Here’s the complete pipeline. Embed your documents, retrieve candidates, rerank them, then generate a grounded answer with citations.

import cohere
import numpy as np

co = cohere.ClientV2()

# Your knowledge base
knowledge_base = [
    "Kubernetes pods are the smallest deployable units in a cluster.",
    "A Kubernetes service exposes pods to network traffic via selectors.",
    "Helm charts package Kubernetes manifests for repeatable deployments.",
    "Kustomize lets you customize Kubernetes YAML without templates.",
    "Istio is a service mesh that manages traffic between microservices.",
    "kubectl apply -f deploys resources defined in YAML manifests.",
    "Horizontal Pod Autoscaler adjusts replica count based on CPU usage.",
    "Kubernetes namespaces provide isolation between workloads.",
]

query = "How do I scale pods automatically in Kubernetes?"

# Step 1: Embed and retrieve
doc_emb = co.embed(
    model="embed-v4.0", input_type="search_document",
    texts=knowledge_base, embedding_types=["float"],
).embeddings.float

query_emb = co.embed(
    model="embed-v4.0", input_type="search_query",
    texts=[query], embedding_types=["float"],
).embeddings.float

scores = np.dot(query_emb, np.transpose(doc_emb))[0]
top_indices = np.argsort(-scores)[:5]
retrieved = [knowledge_base[i] for i in top_indices]

# Step 2: Rerank
reranked = co.rerank(
    model="rerank-v4.0-pro", query=query,
    documents=retrieved, top_n=3,
)
final_docs = [
    {"data": {"text": retrieved[r.index]}} for r in reranked.results
]

# Step 3: Generate with citations
response = co.chat(
    model="command-a-03-2025",
    messages=[{"role": "user", "content": query}],
    documents=final_docs,
)

print(response.message.content[0].text)
for cite in response.message.citations or []:
    print(f"  Citation: '{cite.text}' from {cite.sources}")

This three-step pipeline – embed, rerank, generate – is the standard Cohere RAG pattern. The reranking step typically boosts answer quality by 10-20% compared to raw embedding retrieval alone.

Streaming Responses

For real-time UIs, stream tokens as they’re generated instead of waiting for the full response.

stream = co.chat_stream(
    model="command-a-03-2025",
    messages=[{"role": "user", "content": "Write a haiku about distributed systems."}],
)

for event in stream:
    if event.type == "content-delta":
        print(event.delta.message.content.text, end="", flush=True)
    elif event.type == "citation-start":
        # Handle citation events in RAG streaming
        print(f"\n[citation from: {event.delta.message.citations}]", end="")

The stream emits events in order: message-start, content-start, multiple content-delta events, content-end, and message-end. When using tools, you also get tool-plan-delta, tool-call-start, tool-call-delta, and tool-call-end events.
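The event loop above generalizes to filtering on `event.type`. This sketch uses stand-in objects to show the shape without a live stream; real events come from `co.chat_stream`, and the actual delta text lives under `event.delta.message.content.text` rather than a flat `text` attribute:

```python
from types import SimpleNamespace

def collect_text(events):
    """Concatenate text from content-delta events, ignoring all other event types."""
    parts = []
    for event in events:
        if event.type == "content-delta":
            parts.append(event.text)
    return "".join(parts)

# Stand-in events simulating a stream's ordering
fake_stream = [
    SimpleNamespace(type="message-start", text=None),
    SimpleNamespace(type="content-delta", text="Nodes whisper"),
    SimpleNamespace(type="content-delta", text=" in sync"),
    SimpleNamespace(type="message-end", text=None),
]
print(collect_text(fake_stream))  # Nodes whisper in sync
```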

Tool Use (Function Calling)

Cohere’s tool use works like other LLM APIs but integrates cleanly with the grounding system.

import json

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                    "units": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature units",
                    },
                },
                "required": ["city"],
            },
        },
    }
]

response = co.chat(
    model="command-a-03-2025",
    messages=[{"role": "user", "content": "What's the weather in Berlin?"}],
    tools=tools,
)

# The model returns tool calls instead of text
if response.message.tool_calls:
    for tc in response.message.tool_calls:
        name = tc.function.name
        args = json.loads(tc.function.arguments)
        print(f"Call: {name}({args})")

        # Execute the tool, then send results back
        tool_result = {"temperature": 8, "condition": "cloudy", "city": "Berlin"}

        follow_up = co.chat(
            model="command-a-03-2025",
            messages=[
                {"role": "user", "content": "What's the weather in Berlin?"},
                {"role": "assistant", "tool_calls": [tc]},
                {
                    "role": "tool",
                    "tool_call_id": tc.id,
                    "content": json.dumps(tool_result),
                },
            ],
            tools=tools,
        )
        print(follow_up.message.content[0].text)
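Once you have more than one tool, a name-to-function registry keeps the dispatch step tidy. This is a generic pattern rather than an SDK feature; the call shape mirrors the `name`/`arguments` fields used above, and the weather payload is the same placeholder:

```python
import json

# Map tool names to plain Python callables
TOOL_REGISTRY = {
    "get_weather": lambda city, units="celsius": {
        "city": city, "temperature": 8, "condition": "cloudy", "units": units,
    },
}

def run_tool_call(name, arguments_json):
    """Look up a tool by name and call it with the model's JSON arguments."""
    fn = TOOL_REGISTRY.get(name)
    if fn is None:
        raise KeyError(f"unknown tool: {name}")
    return fn(**json.loads(arguments_json))

result = run_tool_call("get_weather", '{"city": "Berlin"}')
print(result)
```

In the loop above you'd call `run_tool_call(tc.function.name, tc.function.arguments)` and serialize the return value into the tool message.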

Cohere vs Other LLM APIs

A few things set Cohere apart:

  • Built-in RAG grounding: Pass documents to the chat endpoint and get citations automatically. With most other APIs you have to stuff documents into the prompt and build your own citation logic.
  • Reranking as a first-class product: Cohere’s reranker is one of the best cross-encoders available. Most other LLM providers don’t offer hosted reranking, so the alternative is running an open-source cross-encoder like bge-reranker yourself.
  • Embed model quality: embed-v4.0 with Matryoshka embeddings lets you pick your dimension tradeoff. The 128k context window is unusually large for embedding models.
  • Enterprise focus: Cohere offers on-premise deployment and data privacy guarantees that matter if you’re building for regulated industries.
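The Matryoshka property mentioned above means you can embed once at full size and shrink vectors later: keep a prefix of the dimensions and renormalize. A NumPy sketch of that truncation rule (standard for Matryoshka-style models; verify against Cohere's embed docs before relying on it for embed-v4.0):

```python
import numpy as np

def truncate_embedding(vec, dim):
    """Keep the first `dim` components and renormalize to unit length."""
    v = np.asarray(vec, dtype=float)[:dim]
    return v / np.linalg.norm(v)

full = np.random.default_rng(0).normal(size=1536)
small = truncate_embedding(full, 256)
print(small.shape, round(float(np.linalg.norm(small)), 6))  # (256,) 1.0
```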

The tradeoff? Cohere’s models are strong but not always at the frontier for pure chat quality compared to GPT-4o or Claude. Where they shine is the retrieval stack – embed, rerank, and grounded generation work together seamlessly.

Common Errors

CohereAPIError: invalid api token

Your CO_API_KEY is missing or wrong. Double-check it at dashboard.cohere.com/api-keys.

export CO_API_KEY="your-actual-key"

CohereAPIError: model not found

You’re using an old model ID. Cohere deprecates models regularly. Check the models page for current IDs. Common mistake: using command-r-plus instead of command-r-plus-08-2024 or command-a-03-2025.

CohereAPIError: invalid value for input_type

The input_type parameter is required for embed v3+ models. You must specify either "search_document" or "search_query" – omitting it throws an error.

# Wrong -- missing input_type
co.embed(model="embed-v4.0", texts=["hello"])

# Correct
co.embed(model="embed-v4.0", texts=["hello"], input_type="search_query", embedding_types=["float"])

TypeError: Client() got an unexpected keyword argument

You’re mixing v1 and v2 SDK patterns. Use cohere.ClientV2() (not cohere.Client()) for the v2 API. The message format and response structure differ between versions.

Rate Limits

Cohere’s trial tier is generous but has limits. If you hit 429 Too Many Requests, add exponential backoff:

import time

for attempt in range(5):
    try:
        response = co.chat(model="command-a-03-2025", messages=[...])
        break
    except cohere.TooManyRequestsError:
        if attempt == 4:
            raise  # out of retries -- don't fail silently
        time.sleep(2 ** attempt)
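The same logic can be factored into a reusable helper, here with proportional jitter so concurrent clients don't retry in lockstep. This is a generic sketch written against plain exception types; in practice you'd pass `retry_on=(cohere.TooManyRequestsError,)` and `fn=lambda: co.chat(...)`:

```python
import random
import time

def with_backoff(fn, retries=5, base=2.0, retry_on=(Exception,)):
    """Call fn(), retrying on retry_on with exponential backoff plus jitter."""
    for attempt in range(retries):
        try:
            return fn()
        except retry_on:
            if attempt == retries - 1:
                raise  # out of retries -- surface the error
            # base**attempt scaled by a random factor in [1, 2)
            time.sleep((base ** attempt) * (1 + random.random()))
```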