## The Quick Version
Groq runs LLMs on custom LPU (Language Processing Unit) chips that are purpose-built for sequential token generation. The result: inference speeds of 500-800 tokens per second — roughly 10x faster than GPU-based providers for the same models. The API is OpenAI-compatible, so switching takes one line of code.
```bash
pip install groq
export GROQ_API_KEY=gsk_your_key_here
```
```python
from groq import Groq

client = Groq()

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain how a hash table works in 3 sentences."},
    ],
    temperature=0.7,
    max_tokens=256,
)

print(response.choices[0].message.content)
print(f"\nTokens/sec: {response.usage.completion_tokens / response.usage.total_time:.0f}")
```
That query returns in under a second with a full, coherent answer. The `total_time` field in the response usage lets you calculate actual throughput.
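That same arithmetic can be factored into a small helper. This is a sketch; the arguments mirror the `completion_tokens` and `total_time` fields of the usage object shown above:

```python
def tokens_per_second(completion_tokens: int, total_time: float) -> float:
    """Throughput from Groq's usage metadata: completion tokens / wall time."""
    if total_time <= 0:
        raise ValueError("total_time must be positive")
    return completion_tokens / total_time

# e.g. 412 completion tokens generated in 0.8 seconds
print(f"{tokens_per_second(412, 0.8):.0f} tok/s")  # → 515 tok/s
```

Logging this per request is a cheap way to spot when a model or region is underperforming its advertised speed.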
## Using the OpenAI-Compatible Endpoint
If you already use the OpenAI SDK, point it at Groq’s endpoint. Zero code changes beyond the base URL and API key:
```python
from openai import OpenAI

# Drop-in replacement — same SDK, different endpoint
client = OpenAI(
    api_key="gsk_your_groq_key",
    base_url="https://api.groq.com/openai/v1",
)

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "What's the time complexity of quicksort?"}],
    max_tokens=200,
)

print(response.choices[0].message.content)
```
This means any framework that speaks the OpenAI API (LangChain, LlamaIndex, AutoGen) works with Groq out of the box. Just change the base URL.
## Available Models and When to Use Them
Groq hosts several open-source models. Pick based on your speed vs. quality needs:
```python
from groq import Groq

client = Groq()

# Compare models on the same prompt
models = [
    "llama-3.3-70b-versatile",  # Best quality, still very fast
    "llama-3.1-8b-instant",     # Fastest, good for simple tasks
    "mixtral-8x7b-32768",       # 32K context, good for long documents
    "gemma2-9b-it",             # Google's model, strong reasoning
]

prompt = "Write a Python function that checks if a string is a valid IPv4 address."

for model_name in models:
    try:
        response = client.chat.completions.create(
            model=model_name,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=300,
            temperature=0,
        )
        content = response.choices[0].message.content
        tokens = response.usage.completion_tokens
        print(f"\n{model_name}:")
        print(f"  Tokens: {tokens}")
        print(f"  Preview: {content[:100]}...")
    except Exception as e:
        print(f"\n{model_name}: {e}")
```
| Model | Speed | Quality | Context | Best For |
|---|---|---|---|---|
| `llama-3.3-70b-versatile` | ~500 tok/s | Excellent | 128K | Complex reasoning, code |
| `llama-3.1-8b-instant` | ~800 tok/s | Good | 128K | Simple tasks, high throughput |
| `mixtral-8x7b-32768` | ~600 tok/s | Very good | 32K | Long documents, analysis |
| `gemma2-9b-it` | ~700 tok/s | Good | 8K | General chat, instruction following |
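The table above can be turned into a simple routing rule: filter out models whose context window can't hold the prompt, then take the best remaining one. This is a hypothetical helper, not an official API; the numbers mirror the table:

```python
# Context windows, rough speeds (tok/s), and a coarse quality rank from the table above.
MODELS = {
    "llama-3.3-70b-versatile": {"speed": 500, "context": 128_000, "quality": 3},
    "llama-3.1-8b-instant":    {"speed": 800, "context": 128_000, "quality": 1},
    "mixtral-8x7b-32768":      {"speed": 600, "context": 32_000,  "quality": 2},
    "gemma2-9b-it":            {"speed": 700, "context": 8_000,   "quality": 1},
}

def pick_model(prompt_tokens: int, prefer: str = "quality") -> str:
    """Pick the best (or fastest) model whose context window fits the prompt."""
    candidates = [m for m, v in MODELS.items() if v["context"] >= prompt_tokens]
    if not candidates:
        raise ValueError("prompt exceeds every model's context window")
    key = "quality" if prefer == "quality" else "speed"
    return max(candidates, key=lambda m: MODELS[m][key])

print(pick_model(50_000))          # → llama-3.3-70b-versatile
print(pick_model(2_000, "speed"))  # → llama-3.1-8b-instant
```

Keep the numbers in sync with `client.models.list()`, since Groq's lineup changes over time.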
## Streaming for Real-Time Applications
Groq’s streaming is where the speed difference is most noticeable. First tokens arrive in under 100ms:
```python
from groq import Groq

client = Groq()

stream = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[
        {"role": "user", "content": "Write a step-by-step guide to deploying a FastAPI app to AWS Lambda."},
    ],
    max_tokens=1024,
    stream=True,
)

for chunk in stream:
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)
print()
```
For web applications, pipe this stream directly to the frontend via Server-Sent Events:
```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from groq import Groq

app = FastAPI()
client = Groq()

@app.get("/chat")
async def chat(q: str):
    def generate():
        stream = client.chat.completions.create(
            model="llama-3.1-8b-instant",
            messages=[{"role": "user", "content": q}],
            max_tokens=512,
            stream=True,
        )
        for chunk in stream:
            content = chunk.choices[0].delta.content
            if content:
                yield f"data: {content}\n\n"
        yield "data: [DONE]\n\n"
    return StreamingResponse(generate(), media_type="text/event-stream")
```
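On the receiving side, the frames take only a few lines to consume. A minimal sketch, assuming the exact `data: ...` framing the endpoint above emits (note that real SSE payloads containing newlines would need escaping, e.g. JSON-encoding each chunk):

```python
def parse_sse(lines):
    """Yield the payload of each `data: ...` frame, stopping at [DONE]."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data: "):
            continue  # skip blank separator lines
        payload = line[len("data: "):]
        if payload == "[DONE]":
            return
        yield payload

frames = ["data: Hello", "", "data:  world", "", "data: [DONE]", "data: ignored"]
print("".join(parse_sse(frames)))  # → Hello world
```

In a browser you would use `EventSource` instead; this sketch is the same logic for tests or non-browser clients.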
## Function Calling

Groq supports OpenAI-compatible function calling with Llama models:
```python
import json

from groq import Groq

client = Groq()

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City name"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["location"],
            },
        },
    },
]

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
    tool_choice="auto",
)

msg = response.choices[0].message
if msg.tool_calls:
    for tc in msg.tool_calls:
        args = json.loads(tc.function.arguments)
        print(f"Function: {tc.function.name}")
        print(f"Args: {args}")
        # Call your actual function here, then send the result back
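The "call your actual function" step is usually a name-to-callable lookup. A sketch with a hypothetical local `get_weather` implementation (the registry and dispatch function are illustrative, not part of the Groq SDK):

```python
import json

# Hypothetical local implementation backing the get_weather tool above.
def get_weather(location: str, unit: str = "celsius") -> dict:
    return {"location": location, "temp": 21, "unit": unit}

TOOL_REGISTRY = {"get_weather": get_weather}

def dispatch_tool_call(name: str, arguments: str) -> str:
    """Look up the function the model asked for, parse its JSON arguments,
    and serialize the result for the follow-up `tool` message."""
    fn = TOOL_REGISTRY.get(name)
    if fn is None:
        return json.dumps({"error": f"unknown tool: {name}"})
    return json.dumps(fn(**json.loads(arguments)))

print(dispatch_tool_call("get_weather", '{"location": "Tokyo"}'))
# → {"location": "Tokyo", "temp": 21, "unit": "celsius"}
```

The returned string goes back to the model as a `{"role": "tool", ...}` message so it can compose the final answer.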
## JSON Mode for Structured Output
Force the model to return valid JSON — useful for data extraction and API responses:
```python
import json

from groq import Groq

client = Groq()

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[
        {
            "role": "system",
            "content": "You are a data extraction API. Always respond in JSON format.",
        },
        {
            "role": "user",
            "content": "Extract the entities from: 'Apple CEO Tim Cook announced the new M4 chip at WWDC in San Jose.'",
        },
    ],
    response_format={"type": "json_object"},
    temperature=0,
)

data = json.loads(response.choices[0].message.content)
print(json.dumps(data, indent=2))
```
## Common Errors and Fixes
**`RateLimitError: Rate limit reached`**
Groq has per-minute token and request limits that vary by model and plan. For free tier: ~30 requests/min. Implement exponential backoff:
```python
import time

from groq import Groq, RateLimitError

client = Groq()

def query_with_retry(messages, max_retries=3):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="llama-3.3-70b-versatile",
                messages=messages,
            )
        except RateLimitError:
            wait = 2 ** attempt  # 1s, 2s, 4s
            print(f"Rate limited, waiting {wait}s...")
            time.sleep(wait)
    raise Exception("Max retries exceeded")
```
**Response cuts off mid-sentence**

`max_tokens` is too low. Groq doesn't auto-extend — if you set 256 tokens and the answer needs 300, it cuts off. Set a generous limit, or check for `finish_reason == "length"` and continue the conversation.
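One way to continue after a truncation is to append the partial reply as an assistant turn and ask the model to pick up where it stopped. A sketch; the exact continuation prompt wording is an assumption:

```python
def continuation_messages(messages, truncated_reply):
    """Build the follow-up message list after finish_reason == "length":
    keep the partial answer as an assistant turn, then ask for the rest."""
    return messages + [
        {"role": "assistant", "content": truncated_reply},
        {"role": "user", "content": "Continue exactly where you left off."},
    ]

msgs = [{"role": "user", "content": "List every HTTP status code."}]
followup = continuation_messages(msgs, "200 OK, 201 Created, ...")
print(len(followup))  # → 3
```

Concatenate the truncated reply with the continuation to reconstruct the full answer; watch for a repeated word or phrase at the seam.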
**Model not found error**

Groq's model list changes as they add and retire models. Check available models with `client.models.list()` or the Groq docs. Model IDs are case-sensitive.
**Responses differ from the same model on other providers**

Groq uses the same model weights but different inference infrastructure. Numerical differences in sampling can produce different outputs even at `temperature=0`. This is normal and doesn't indicate a quality issue.
## Groq vs. Other Providers
- **Use Groq** when latency matters most — chatbots, real-time agents, interactive applications. At 500+ tokens/second, users perceive responses as near-instant.
- **Use OpenAI/Anthropic** when you need the latest frontier models (GPT-4o, Claude Opus), vision capabilities, or features Groq doesn't support yet.
- **Use local inference** (Ollama, vLLM) when you need data privacy, have consistent high throughput, or want to avoid per-token costs entirely.
The sweet spot for Groq: applications that need open-source model quality at cloud speed without managing infrastructure.