Cerebras runs LLM inference on its custom wafer-scale chips, purpose-built silicon that pushes token generation speeds well beyond what standard GPU clusters deliver. Its inference API is OpenAI-compatible, so switching from OpenAI or any other provider takes about two lines of code.

The available models include llama3.1-8b, llama-3.3-70b, and gpt-oss-120b. Cerebras reports output speeds above 2,000 tokens per second on these models, which makes real-time applications feel genuinely instant.

Getting Started with Cerebras API

Install the SDK:

pip install cerebras-cloud-sdk

Sign up at cloud.cerebras.ai and grab your API key. Set it as an environment variable:

export CEREBRAS_API_KEY="your-api-key-here"

Here’s a basic chat completion call:

import os
from cerebras.cloud.sdk import Cerebras

client = Cerebras(
    api_key=os.environ.get("CEREBRAS_API_KEY"),
)

response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain wafer-scale computing in two sentences."},
    ],
)

print(response.choices[0].message.content)

The SDK handles retries automatically (two retries by default for connection errors, timeouts, and 429/5xx responses). The default timeout is 60 seconds, which you’ll rarely hit given how fast Cerebras inference is.

One detail worth knowing: the SDK warms the TCP connection by default with a request to /v1/tcp_warming when you create the client. This cuts down first-token latency. If you’re creating the client once and reusing it (which you should), this happens transparently.

Streaming Responses

For chat interfaces or any interactive application, you want streaming. Pass stream=True and iterate over chunks:

import os
from cerebras.cloud.sdk import Cerebras

client = Cerebras(
    api_key=os.environ.get("CEREBRAS_API_KEY"),
)

stream = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {"role": "user", "content": "Write a Python function to merge two sorted lists."},
    ],
    stream=True,
)

for chunk in stream:
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)

print()  # newline after stream completes

Usage statistics (usage and time_info) only appear in the final chunk, not in every chunk. If you need to track token counts during streaming, collect them from the last chunk.
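If you want both the text and the token counts from one pass over the stream, a small helper can collect them together. This is a sketch, not part of the SDK: `accumulate_stream` is my own name, and it assumes chunks shaped like the SDK's (content under `choices[0].delta.content`, with `usage` populated only on the final chunk):

```python
def accumulate_stream(chunks):
    """Join text deltas and capture the usage stats from the final chunk."""
    parts = []
    usage = None
    for chunk in chunks:
        # Some chunks (e.g. the final usage-only one) may carry no choices
        if chunk.choices and chunk.choices[0].delta.content:
            parts.append(chunk.choices[0].delta.content)
        # usage is None on every chunk except the last one
        if getattr(chunk, "usage", None) is not None:
            usage = chunk.usage
    return "".join(parts), usage
```

Call it as `text, usage = accumulate_stream(stream)` and read `usage.completion_tokens` afterward.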

The SDK also supports async streaming if you’re building with asyncio:

import os
import asyncio
from cerebras.cloud.sdk import AsyncCerebras

client = AsyncCerebras(
    api_key=os.environ.get("CEREBRAS_API_KEY"),
)

async def stream_response():
    stream = await client.chat.completions.create(
        model="llama3.1-8b",
        messages=[
            {"role": "user", "content": "List three uses of wafer-scale chips."},
        ],
        stream=True,
    )
    async for chunk in stream:
        content = chunk.choices[0].delta.content
        if content:
            print(content, end="", flush=True)
    print()

asyncio.run(stream_response())

Using the OpenAI-Compatible Endpoint

This is where Cerebras really shines for existing codebases. If you already use the OpenAI Python SDK, you can point it at Cerebras with zero changes to your application logic:

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.cerebras.ai/v1",
    api_key=os.environ.get("CEREBRAS_API_KEY"),
)

response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {"role": "user", "content": "What is mixture of experts in LLMs?"},
    ],
    max_tokens=256,
    temperature=0.7,
)

print(response.choices[0].message.content)

That’s the entire migration. Change base_url, swap in a Cerebras API key, pick a Cerebras model name, and everything else stays the same. Streaming works identically through the OpenAI client too.

This also means tools built on the OpenAI SDK, such as LangChain, LiteLLM, and Instructor, work with Cerebras out of the box as long as they let you configure the base URL.
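One way to keep that provider switch explicit is a tiny helper that returns the constructor kwargs for whichever backend you want. This is purely illustrative; the `client_kwargs` function and its provider names are my own, not part of any SDK:

```python
import os

def client_kwargs(provider="cerebras"):
    """Constructor kwargs for the OpenAI SDK, per provider."""
    if provider == "cerebras":
        return {
            "base_url": "https://api.cerebras.ai/v1",
            "api_key": os.environ.get("CEREBRAS_API_KEY"),
        }
    if provider == "openai":
        # No base_url needed: the OpenAI SDK defaults to api.openai.com
        return {"api_key": os.environ.get("OPENAI_API_KEY")}
    raise ValueError(f"unknown provider: {provider!r}")
```

Then `OpenAI(**client_kwargs("cerebras"))` and `OpenAI(**client_kwargs("openai"))` differ only in configuration, not in any call site.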

Comparing Inference Speed

Cerebras advertises massive token-per-second numbers. Here’s how to measure it yourself with a simple benchmark script:

import os
import time
from cerebras.cloud.sdk import Cerebras

client = Cerebras(
    api_key=os.environ.get("CEREBRAS_API_KEY"),
)

prompt = "Explain the transformer architecture in detail, covering attention mechanisms, positional encoding, and layer normalization."

# Measure non-streaming for total time
start = time.perf_counter()
response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": prompt}],
    max_tokens=512,
)
end = time.perf_counter()

total_time = end - start
output_text = response.choices[0].message.content
completion_tokens = response.usage.completion_tokens
prompt_tokens = response.usage.prompt_tokens

print(f"Prompt tokens:     {prompt_tokens}")
print(f"Completion tokens: {completion_tokens}")
print(f"Total time:        {total_time:.2f}s")
print(f"Tokens/second:     {completion_tokens / total_time:.0f}")

# Measure streaming for time-to-first-token
start = time.perf_counter()
first_token_time = None

stream = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": prompt}],
    max_tokens=512,
    stream=True,
)

for chunk in stream:
    if chunk.choices[0].delta.content and first_token_time is None:
        first_token_time = time.perf_counter()

end = time.perf_counter()

ttft = first_token_time - start if first_token_time else None
print(f"\nTime to first token: {ttft:.3f}s" if ttft else "\nNo tokens received")
print(f"Total stream time:   {end - start:.2f}s")

You’ll typically see time-to-first-token under 200ms and output speeds of 1,000-2,000+ tokens per second depending on the model. The llama3.1-8b model is the fastest, while gpt-oss-120b trades some speed for stronger reasoning.
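If you extend the benchmark to record a `time.perf_counter()` timestamp for each content chunk, the headline numbers fall out of a small summary function. A sketch (`stream_stats` is my own helper, not part of the SDK):

```python
def stream_stats(arrivals, start, completion_tokens):
    """Summarize a stream from per-chunk arrival timestamps.

    arrivals: perf_counter() readings taken as each content chunk landed.
    completion_tokens: the count reported in the final chunk's usage.
    """
    if not arrivals:
        return None
    total = arrivals[-1] - start
    return {
        "ttft_s": arrivals[0] - start,
        "tokens_per_second": completion_tokens / total if total > 0 else None,
    }
```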

Common Errors and Fixes

Authentication failure (401)

cerebras.cloud.sdk.AuthenticationError: 401 Unauthorized

Your API key is missing or wrong. Double-check that CEREBRAS_API_KEY is set in your environment. The SDK reads it automatically, so you don't need to pass it explicitly if the env var exists.
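A fail-fast check at startup turns this 401 into an immediate, readable error instead of a failure deep inside a request. A minimal sketch (the `require_api_key` helper is my own, not part of the SDK):

```python
import os

def require_api_key(var="CEREBRAS_API_KEY"):
    """Raise a clear error at startup if the API key env var is missing."""
    key = os.environ.get(var)
    if not key:
        raise RuntimeError(f"{var} is not set; export it before creating the client.")
    return key
```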

Model not found (404)

cerebras.cloud.sdk.NotFoundError: 404 - Model 'llama-2-70b' not found

Cerebras only serves specific models. Check the current list at inference-docs.cerebras.ai/models/overview. Common mistakes: using old model names like llama3.1-70b (now upgraded to llama-3.3-70b) or requesting models from other providers that Cerebras doesn’t host.
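If model names come from config or user input, validating against a known list fails faster than a 404 round trip. A sketch using the model IDs named in this article (the list will drift over time, so verify it against the docs page above):

```python
# Snapshot of the model IDs mentioned in this article; check the docs for the current list.
KNOWN_MODELS = {"llama3.1-8b", "llama-3.3-70b", "gpt-oss-120b"}

def validate_model(name):
    """Raise early on a model ID Cerebras doesn't serve."""
    if name not in KNOWN_MODELS:
        raise ValueError(
            f"unknown Cerebras model {name!r}; expected one of {sorted(KNOWN_MODELS)}"
        )
    return name
```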

Rate limit hit (429)

cerebras.cloud.sdk.RateLimitError: 429 Too Many Requests

The SDK retries 429s automatically (twice by default). If you’re hitting this consistently, add backoff or reduce concurrency. You can increase retries:

client = Cerebras(max_retries=5)
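For sustained load, the built-in retries may not be enough; wrapping calls with exponential backoff and jitter spreads retries out instead of hammering the API. A generic sketch (the `with_backoff` helper is my own; in real code you would catch the SDK's RateLimitError specifically rather than bare Exception):

```python
import random
import time

def with_backoff(fn, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call fn(); on failure, retry with exponentially growing, jittered delays."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # e.g. ~0.5s, ~1s, ~2s, ... each scaled by up to 2x random jitter
            sleep(base_delay * (2 ** attempt) * (1 + random.random()))
```

Use it as `with_backoff(lambda: client.chat.completions.create(...))`.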

Context length exceeded (400)

cerebras.cloud.sdk.BadRequestError: 400 - Input exceeds maximum context length

Each model has a context limit. Trim your input or use a model with a larger window; both llama-3.3-70b and llama3.1-8b support up to 128K tokens of context. If your prompt is genuinely that long, split it into chunks or summarize earlier context before sending.
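When the conversation history itself is the problem, dropping the oldest turns while keeping the system prompt is a common first fix. A rough sketch that budgets by character count (a crude proxy; accurate budgeting needs the model's tokenizer, and `trim_messages` is my own helper):

```python
def trim_messages(messages, max_chars=8000):
    """Drop the oldest non-system turns until the conversation fits the budget."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    def size(msgs):
        return sum(len(m["content"]) for m in msgs)

    while rest and size(system) + size(rest) > max_chars:
        rest.pop(0)  # the oldest turn goes first
    return system + rest
```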

Connection errors

The SDK uses httpx under the hood, so standard httpx configuration applies if you're behind a corporate proxy or firewall. You can also configure granular timeouts:

import httpx
from cerebras.cloud.sdk import Cerebras

client = Cerebras(
    timeout=httpx.Timeout(60.0, read=5.0, write=10.0, connect=2.0),
)