## The Quick Version
Groq runs LLMs on custom LPU (Language Processing Unit) chips that are purpose-built for sequential token generation. The result: inference speeds of 500-800 tokens per second — roughly 10x faster than GPU-based providers for the same models. The API is OpenAI-compatible, so switching takes one line of code.
```bash
pip install groq
export GROQ_API_KEY=gsk_your_key_here
```
```python
from groq import Groq

client = Groq()

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain how a hash table works in 3 sentences."},
    ],
    temperature=0.7,
    max_tokens=256,
)

print(response.choices[0].message.content)
print(f"\nTokens/sec: {response.usage.completion_tokens / response.usage.total_time:.0f}")
```
That query returns in under a second with a full, coherent answer. The `total_time` field in the response usage lets you calculate actual throughput.
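That same arithmetic can be factored into a small helper. This is a sketch; the arguments mirror the `completion_tokens` and `total_time` fields of the usage object shown above:

```python
def tokens_per_second(completion_tokens: int, total_time: float) -> float:
    """Throughput from Groq's usage metadata: completion tokens / wall time."""
    if total_time <= 0:
        raise ValueError("total_time must be positive")
    return completion_tokens / total_time

# e.g. 412 completion tokens generated in 0.8 seconds
print(f"{tokens_per_second(412, 0.8):.0f} tok/s")  # → 515 tok/s
```

Logging this per request is a cheap way to spot when a model or region is underperforming its advertised speed.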
## Using the OpenAI-Compatible Endpoint
If you already use the OpenAI SDK, point it at Groq’s endpoint. Zero code changes beyond the base URL and API key:
```python
from openai import OpenAI

# Drop-in replacement — same SDK, different endpoint
client = OpenAI(
    api_key="gsk_your_groq_key",
    base_url="https://api.groq.com/openai/v1",
)

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "What's the time complexity of quicksort?"}],
    max_tokens=200,
)

print(response.choices[0].message.content)
```
This means any framework that speaks the OpenAI API (LangChain, LlamaIndex, AutoGen) works with Groq out of the box. Just change the base URL.
## Available Models and When to Use Them
Groq hosts several open-source models. Pick based on your speed vs. quality needs:
```python
from groq import Groq

client = Groq()

# Compare models on the same prompt
models = [
    "llama-3.3-70b-versatile",  # Best quality, still very fast
    "llama-3.1-8b-instant",     # Fastest, good for simple tasks
    "mixtral-8x7b-32768",       # 32K context, good for long documents
    "gemma2-9b-it",             # Google's model, strong reasoning
]

prompt = "Write a Python function that checks if a string is a valid IPv4 address."

for model_name in models:
    try:
        response = client.chat.completions.create(
            model=model_name,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=300,
            temperature=0,
        )
        content = response.choices[0].message.content
        tokens = response.usage.completion_tokens
        print(f"\n{model_name}:")
        print(f"  Tokens: {tokens}")
        print(f"  Preview: {content[:100]}...")
    except Exception as e:
        print(f"\n{model_name}: {e}")
```
| Model | Speed | Quality | Context | Best For |
|---|---|---|---|---|
| `llama-3.3-70b-versatile` | ~500 tok/s | Excellent | 128K | Complex reasoning, code |
| `llama-3.1-8b-instant` | ~800 tok/s | Good | 128K | Simple tasks, high throughput |
| `mixtral-8x7b-32768` | ~600 tok/s | Very good | 32K | Long documents, analysis |
| `gemma2-9b-it` | ~700 tok/s | Good | 8K | General chat, instruction following |
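The table above can be turned into a simple routing rule: filter out models whose context window can't hold the prompt, then take the best remaining one. This is a hypothetical helper, not an official API; the numbers mirror the table:

```python
# Context windows, rough speeds (tok/s), and a coarse quality rank from the table above.
MODELS = {
    "llama-3.3-70b-versatile": {"speed": 500, "context": 128_000, "quality": 3},
    "llama-3.1-8b-instant":    {"speed": 800, "context": 128_000, "quality": 1},
    "mixtral-8x7b-32768":      {"speed": 600, "context": 32_000,  "quality": 2},
    "gemma2-9b-it":            {"speed": 700, "context": 8_000,   "quality": 1},
}

def pick_model(prompt_tokens: int, prefer: str = "quality") -> str:
    """Pick the best (or fastest) model whose context window fits the prompt."""
    candidates = [m for m, v in MODELS.items() if v["context"] >= prompt_tokens]
    if not candidates:
        raise ValueError("prompt exceeds every model's context window")
    key = "quality" if prefer == "quality" else "speed"
    return max(candidates, key=lambda m: MODELS[m][key])

print(pick_model(50_000))          # → llama-3.3-70b-versatile
print(pick_model(2_000, "speed"))  # → llama-3.1-8b-instant
```

Keep the numbers in sync with `client.models.list()`, since Groq's lineup changes over time.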
## Streaming for Real-Time Applications
Groq’s streaming is where the speed difference is most noticeable. First tokens arrive in under 100ms:
```python
from groq import Groq

client = Groq()

stream = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[
        {"role": "user", "content": "Write a step-by-step guide to deploying a FastAPI app to AWS Lambda."},
    ],
    max_tokens=1024,
    stream=True,
)

for chunk in stream:
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)
print()
```
For web applications, pipe this stream directly to the frontend via Server-Sent Events:
```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from groq import Groq

app = FastAPI()
client = Groq()

@app.get("/chat")
async def chat(q: str):
    def generate():
        stream = client.chat.completions.create(
            model="llama-3.1-8b-instant",
            messages=[{"role": "user", "content": q}],
            max_tokens=512,
            stream=True,
        )
        for chunk in stream:
            content = chunk.choices[0].delta.content
            if content:
                yield f"data: {content}\n\n"
        yield "data: [DONE]\n\n"
    return StreamingResponse(generate(), media_type="text/event-stream")
```
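On the receiving side, the frames take only a few lines to consume. A minimal sketch, assuming the exact `data: ...` framing the endpoint above emits (note that real SSE payloads containing newlines would need escaping, e.g. JSON-encoding each chunk):

```python
def parse_sse(lines):
    """Yield the payload of each `data: ...` frame, stopping at [DONE]."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data: "):
            continue  # skip blank separator lines
        payload = line[len("data: "):]
        if payload == "[DONE]":
            return
        yield payload

frames = ["data: Hello", "", "data:  world", "", "data: [DONE]", "data: ignored"]
print("".join(parse_sse(frames)))  # → Hello world
```

In a browser you would use `EventSource` instead; this sketch is the same logic for tests or non-browser clients.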
## Function Calling

Groq supports OpenAI-compatible function calling with Llama models:
```python
import json

from groq import Groq

client = Groq()

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string", "description": "City name"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["location"],
            },
        },
    },
]

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
    tool_choice="auto",
)

msg = response.choices[0].message
if msg.tool_calls:
    for tc in msg.tool_calls:
        args = json.loads(tc.function.arguments)
        print(f"Function: {tc.function.name}")
        print(f"Args: {args}")
        # Call your actual function here, then send the result back
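The "call your actual function" step is usually a name-to-callable lookup. A sketch with a hypothetical local `get_weather` implementation (the registry and dispatch function are illustrative, not part of the Groq SDK):

```python
import json

# Hypothetical local implementation backing the get_weather tool above.
def get_weather(location: str, unit: str = "celsius") -> dict:
    return {"location": location, "temp": 21, "unit": unit}

TOOL_REGISTRY = {"get_weather": get_weather}

def dispatch_tool_call(name: str, arguments: str) -> str:
    """Look up the function the model asked for, parse its JSON arguments,
    and serialize the result for the follow-up `tool` message."""
    fn = TOOL_REGISTRY.get(name)
    if fn is None:
        return json.dumps({"error": f"unknown tool: {name}"})
    return json.dumps(fn(**json.loads(arguments)))

print(dispatch_tool_call("get_weather", '{"location": "Tokyo"}'))
# → {"location": "Tokyo", "temp": 21, "unit": "celsius"}
```

The returned string goes back to the model as a `{"role": "tool", ...}` message so it can compose the final answer.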
## JSON Mode for Structured Output
Force the model to return valid JSON — useful for data extraction and API responses:
```python
import json

from groq import Groq

client = Groq()

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[
        {
            "role": "system",
            "content": "You are a data extraction API. Always respond in JSON format.",
        },
        {
            "role": "user",
            "content": "Extract the entities from: 'Apple CEO Tim Cook announced the new M4 chip at WWDC in San Jose.'",
        },
    ],
    response_format={"type": "json_object"},
    temperature=0,
)

data = json.loads(response.choices[0].message.content)
print(json.dumps(data, indent=2))
```
## Common Errors and Fixes
**`RateLimitError: Rate limit reached`**
Groq has per-minute token and request limits that vary by model and plan. For free tier: ~30 requests/min. Implement exponential backoff:
```python
import time

from groq import Groq, RateLimitError

client = Groq()

def query_with_retry(messages, max_retries=3):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="llama-3.3-70b-versatile",
                messages=messages,
            )
        except RateLimitError:
            wait = 2 ** attempt  # 1s, 2s, 4s
            print(f"Rate limited, waiting {wait}s...")
            time.sleep(wait)
    raise Exception("Max retries exceeded")
```
**Response cuts off mid-sentence**

`max_tokens` is too low. Groq doesn't auto-extend — if you set 256 tokens and the answer needs 300, it cuts off. Set a generous limit, or check for `finish_reason == "length"` and continue the conversation.
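One way to continue after a truncation is to append the partial reply as an assistant turn and ask the model to pick up where it stopped. A sketch; the exact continuation prompt wording is an assumption:

```python
def continuation_messages(messages, truncated_reply):
    """Build the follow-up message list after finish_reason == "length":
    keep the partial answer as an assistant turn, then ask for the rest."""
    return messages + [
        {"role": "assistant", "content": truncated_reply},
        {"role": "user", "content": "Continue exactly where you left off."},
    ]

msgs = [{"role": "user", "content": "List every HTTP status code."}]
followup = continuation_messages(msgs, "200 OK, 201 Created, ...")
print(len(followup))  # → 3
```

Concatenate the truncated reply with the continuation to reconstruct the full answer; watch for a repeated word or phrase at the seam.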
**Model not found error**

Groq's model list changes as they add and retire models. Check available models with `client.models.list()` or the Groq docs. Model IDs are case-sensitive.
**Responses differ from the same model on other providers**

Groq uses the same model weights but different inference infrastructure. Numerical differences in sampling can produce different outputs even at `temperature=0`. This is normal and doesn't indicate a quality issue.
## Groq vs. Other Providers
- **Use Groq** when latency matters most — chatbots, real-time agents, interactive applications. At 500+ tokens/second, users perceive responses as near-instant.
- **Use OpenAI/Anthropic** when you need the latest frontier models (GPT-4o, Claude Opus), vision capabilities, or features Groq doesn't support yet.
- **Use local inference** (Ollama, vLLM) when you need data privacy, have consistent high throughput, or want to avoid per-token costs entirely.
The sweet spot for Groq: applications that need open-source model quality at cloud speed without managing infrastructure.