Cerebras runs LLM inference on their custom wafer-scale chips – purpose-built silicon that pushes token generation speeds well beyond what standard GPU clusters deliver. Their inference API is OpenAI-compatible, so switching from OpenAI or any other provider takes about two lines of code.
The available models include llama3.1-8b, llama-3.3-70b, and gpt-oss-120b. Cerebras reports output speeds above 2,000 tokens per second on these models, which makes real-time applications feel genuinely instant.
Getting Started with Cerebras API
Install the SDK:
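Assuming the package is published on PyPI as `cerebras-cloud-sdk` (the name used in the Cerebras docs):

```shell
pip install cerebras-cloud-sdk
```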
Sign up at cloud.cerebras.ai and grab your API key. Set it as an environment variable:
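For example (the key value here is a placeholder):

```shell
export CEREBRAS_API_KEY="your-api-key-here"
```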
Here’s a basic chat completion call:
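A minimal sketch, assuming the SDK exposes the OpenAI-style `chat.completions.create` interface under the import path `cerebras.cloud.sdk`:

```python
import os
from cerebras.cloud.sdk import Cerebras

# The client reads CEREBRAS_API_KEY from the environment by default;
# passing it explicitly just makes the dependency visible.
client = Cerebras(api_key=os.environ.get("CEREBRAS_API_KEY"))

completion = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain wafer-scale chips in one sentence."},
    ],
)
print(completion.choices[0].message.content)
```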
The SDK handles retries automatically (two retries by default for connection errors, timeouts, and 429/5xx responses). The default timeout is 60 seconds, which you’ll rarely hit given how fast Cerebras inference is.
One detail worth knowing: the SDK warms the TCP connection by default with a request to /v1/tcp_warming when you create the client. This cuts down first-token latency. If you’re creating the client once and reusing it (which you should), this happens transparently.
Streaming Responses
For chat interfaces or any interactive application, you want streaming. Pass stream=True and iterate over chunks:
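A minimal streaming sketch, assuming the same OpenAI-style chunk shape (`choices[0].delta.content`):

```python
import os
from cerebras.cloud.sdk import Cerebras

client = Cerebras(api_key=os.environ.get("CEREBRAS_API_KEY"))

stream = client.chat.completions.create(
    model="llama3.1-8b",
    messages=[{"role": "user", "content": "Write a haiku about speed."}],
    stream=True,
)

for chunk in stream:
    # choices can be empty and delta.content None on the final chunk
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```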
Usage statistics (usage and time_info) only appear in the final chunk, not in every chunk. If you need to track token counts during streaming, collect them from the last chunk.
The SDK also supports async streaming if you’re building with asyncio:
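A sketch of the async variant, assuming the SDK ships an `AsyncCerebras` client following the same conventions as the sync one:

```python
import asyncio
import os
from cerebras.cloud.sdk import AsyncCerebras  # async client name assumed from SDK conventions

async def main() -> None:
    client = AsyncCerebras(api_key=os.environ.get("CEREBRAS_API_KEY"))
    stream = await client.chat.completions.create(
        model="llama3.1-8b",
        messages=[{"role": "user", "content": "Count to five."}],
        stream=True,
    )
    async for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)

asyncio.run(main())
```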
Using the OpenAI-Compatible Endpoint
This is where Cerebras really shines for existing codebases. If you already use the OpenAI Python SDK, you can point it at Cerebras with zero changes to your application logic:
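A sketch using the OpenAI Python SDK, assuming the commonly documented Cerebras base URL `https://api.cerebras.ai/v1` (verify against the Cerebras docs):

```python
import os
from openai import OpenAI

# Same OpenAI client — only the base URL, key, and model name change.
client = OpenAI(
    base_url="https://api.cerebras.ai/v1",
    api_key=os.environ.get("CEREBRAS_API_KEY"),
)

response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```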
That’s the entire migration. Change base_url, swap in a Cerebras API key, pick a Cerebras model name, and everything else stays the same. Streaming works identically through the OpenAI client too.
This also means tools built on the OpenAI SDK – LangChain, LiteLLM, Instructor, and others – work with Cerebras out of the box as long as they let you configure the base URL.
Comparing Inference Speed
Cerebras advertises massive token-per-second numbers. Here’s how to measure it yourself with a simple benchmark script:
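A rough benchmark sketch: it times the first streamed token, then divides completion tokens by decode time. It assumes `usage` arrives on the final chunk (as described below); if it doesn't, the chunk count serves as a crude token approximation.

```python
import os
import time
from cerebras.cloud.sdk import Cerebras

client = Cerebras(api_key=os.environ.get("CEREBRAS_API_KEY"))

start = time.perf_counter()
first_token_at = None
chunk_count = 0
usage = None

stream = client.chat.completions.create(
    model="llama3.1-8b",
    messages=[{"role": "user", "content": "Write 300 words about the ocean."}],
    stream=True,
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        chunk_count += 1
    # usage statistics arrive on the final chunk only
    if getattr(chunk, "usage", None):
        usage = chunk.usage

elapsed = time.perf_counter() - start
ttft = (first_token_at - start) if first_token_at else elapsed
tokens = usage.completion_tokens if usage else chunk_count  # chunk count roughly approximates tokens
decode_time = max(elapsed - ttft, 1e-6)
print(f"Time to first token: {ttft * 1000:.0f} ms")
print(f"Output speed: {tokens / decode_time:.0f} tokens/sec")
```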
You’ll typically see time-to-first-token under 200ms and output speeds of 1,000-2,000+ tokens per second depending on the model. The llama3.1-8b model is the fastest, while gpt-oss-120b trades some speed for stronger reasoning.
Common Errors and Fixes
Authentication failure (401)
Your API key is missing or wrong. Double-check CEREBRAS_API_KEY is set in your environment. The SDK reads it automatically – you don’t need to pass it explicitly if the env var exists.
Model not found (404)
Cerebras only serves specific models. Check the current list at inference-docs.cerebras.ai/models/overview. Common mistakes: using old model names like llama3.1-70b (now upgraded to llama-3.3-70b) or requesting models from other providers that Cerebras doesn’t host.
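If you want to handle these failures in code rather than let them crash, a sketch assuming the SDK exports OpenAI-style exception classes (`AuthenticationError`, `NotFoundError`) from the package root:

```python
import os
from cerebras.cloud.sdk import Cerebras, AuthenticationError, NotFoundError

client = Cerebras(api_key=os.environ.get("CEREBRAS_API_KEY"))

try:
    response = client.chat.completions.create(
        model="llama-3.3-70b",
        messages=[{"role": "user", "content": "ping"}],
    )
except AuthenticationError:
    print("401: CEREBRAS_API_KEY is missing or invalid")
except NotFoundError:
    print("404: unknown model name — check the current model list")
```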
Rate limit hit (429)
The SDK retries 429s automatically (twice by default). If you’re hitting this consistently, add backoff or reduce concurrency. You can increase retries:
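A sketch assuming the client accepts a `max_retries` option and exports a `RateLimitError` class, following the same conventions as the OpenAI SDK; the backoff wrapper is a hypothetical helper layered on top:

```python
import os
import time
from cerebras.cloud.sdk import Cerebras, RateLimitError  # exception name assumed from SDK conventions

# Raise the built-in retry count from the default of 2
client = Cerebras(
    api_key=os.environ.get("CEREBRAS_API_KEY"),
    max_retries=5,
)

# Or layer simple exponential backoff on top for sustained 429s
def chat_with_backoff(messages, model="llama3.1-8b", attempts=4):
    for attempt in range(attempts):
        try:
            return client.chat.completions.create(model=model, messages=messages)
        except RateLimitError:
            if attempt == attempts - 1:
                raise
            time.sleep(2 ** attempt)  # wait 1s, 2s, 4s, ...
```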
Context length exceeded (400)
Each model has a context limit. Trim your input or use a model with a larger window. Both llama-3.3-70b and llama3.1-8b support up to 128K tokens of context. If your prompt is genuinely that long, split it into chunks or summarize earlier context before sending.
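One way to trim is to keep only the newest messages that fit a token budget. A sketch using a crude ~4-characters-per-token heuristic (the `trim_messages` helper is hypothetical; swap in a real tokenizer for accurate counts):

```python
CHARS_PER_TOKEN = 4  # rough heuristic, not a real tokenizer

def estimate_tokens(text: str) -> int:
    return max(1, len(text) // CHARS_PER_TOKEN)

def trim_messages(messages, max_tokens: int):
    """Keep the system message (if first) plus the newest messages that fit."""
    system = messages[0] if messages and messages[0]["role"] == "system" else None
    budget = max_tokens
    if system:
        budget -= estimate_tokens(system["content"])
        messages = messages[1:]
    kept = []
    for msg in reversed(messages):  # walk newest-first
        cost = estimate_tokens(msg["content"])
        if cost > budget:
            break
        kept.append(msg)
        budget -= cost
    kept.reverse()
    return ([system] if system else []) + kept
```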
Connection errors
The SDK uses httpx under the hood, so if you’re behind a corporate proxy or firewall you can pass in a custom httpx client. You can also configure timeouts:
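A sketch assuming the client accepts `timeout` and `http_client` options, as Stainless-generated SDKs (including OpenAI's) conventionally do; the proxy URL is a placeholder:

```python
import os
import httpx
from cerebras.cloud.sdk import Cerebras

client = Cerebras(
    api_key=os.environ.get("CEREBRAS_API_KEY"),
    timeout=20.0,  # seconds, down from the 60s default
    http_client=httpx.Client(
        proxy="http://proxy.example.internal:8080",  # hypothetical corporate proxy
    ),
)
```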
Related Guides
- How to Run Fast LLM Inference with the Groq API
- How to Run Open-Source Models with the Replicate API
- How to Use the Stability AI API for Image and Video Generation
- How to Use the AWS Bedrock Converse API for Multi-Model Chat
- How to Use the Anthropic Claude Files API for Large Document Processing
- How to Use the Anthropic Prompt Caching API with Context Blocks
- How to Use the Anthropic Tool Use API for Agentic Workflows
- How to Use the Cohere Rerank API for Search Quality
- How to Use the OpenAI Realtime API for Voice Applications
- How to Use the Anthropic PDF Processing API for Document Analysis