Fireworks AI is one of the fastest inference providers for open-source LLMs. They host Llama 3.1, Mixtral, DeepSeek, and dozens of other models behind an OpenAI-compatible API. That means you can swap out your OpenAI client’s base_url, point it at Fireworks, and get sub-second responses from 70B parameter models.

Install the OpenAI SDK and set your Fireworks API key:

pip install openai
export FIREWORKS_API_KEY="your-fireworks-api-key"

Then create a client:

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["FIREWORKS_API_KEY"],
    base_url="https://api.fireworks.ai/inference/v1",
)

That client works with client.chat.completions.create() exactly like OpenAI’s. The only difference is the model ID format: Fireworks uses accounts/fireworks/models/<model-name> instead of gpt-4o.

Chat Completions with Llama 3.1 and Mixtral

Here is a straightforward chat completion call using Llama 3.1 70B Instruct:

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["FIREWORKS_API_KEY"],
    base_url="https://api.fireworks.ai/inference/v1",
)

response = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p1-70b-instruct",
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a Python function to flatten a nested list."},
    ],
    max_tokens=512,
    temperature=0.7,
)

print(response.choices[0].message.content)

Swap the model ID to use Mixtral 8x22B instead:

response = client.chat.completions.create(
    model="accounts/fireworks/models/mixtral-8x22b-instruct",
    messages=[
        {"role": "user", "content": "Explain the difference between TCP and UDP in two sentences."},
    ],
    max_tokens=256,
    temperature=0.3,
)

print(response.choices[0].message.content)

One thing to note: if your prompt plus max_tokens exceeds the model’s context window, Fireworks automatically reduces max_tokens instead of returning an error, whereas OpenAI returns a 400 in the same situation. This is convenient: you don’t need to compute the remaining token budget yourself.
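The server-side behavior is roughly equivalent to the clamp below (a sketch; the real limit check happens on Fireworks’ side, and clamp_max_tokens is a hypothetical helper, not part of any SDK):

```python
def clamp_max_tokens(prompt_tokens: int, requested_max: int, context_window: int) -> int:
    """Mimic Fireworks' behavior: cap max_tokens to the space left in the context window."""
    remaining = context_window - prompt_tokens
    return max(0, min(requested_max, remaining))

# A 128K-context model with a 127,800-token prompt leaves room for only 200 tokens
print(clamp_max_tokens(127_800, 512, 128_000))  # 200
print(clamp_max_tokens(1_000, 512, 128_000))    # 512, unchanged
```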

Structured Output and JSON Mode

Fireworks supports two approaches for getting structured responses: JSON mode (free-form JSON) and JSON schema mode (enforced structure).

For basic JSON mode, set response_format to {"type": "json_object"} and tell the model to output JSON in your prompt:

response = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p1-70b-instruct",
    response_format={"type": "json_object"},
    messages=[
        {
            "role": "user",
            "content": "List the top 3 Python web frameworks with their main use case. Reply in JSON format.",
        }
    ],
    max_tokens=256,
)

print(response.choices[0].message.content)
# {"frameworks": [{"name": "Django", "use_case": "Full-stack web applications"}, ...]}
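JSON mode output should already parse cleanly, but models occasionally wrap JSON in markdown fences, so a defensive parser costs little. A minimal sketch (parse_json_response is a hypothetical helper, not part of any SDK):

```python
import json

def parse_json_response(text: str) -> dict:
    """Parse model output as JSON, tolerating an optional markdown code fence."""
    cleaned = text.strip()
    if cleaned.startswith("```"):
        # Drop the opening fence (and any language tag) and the closing fence
        cleaned = cleaned.split("\n", 1)[1]
        cleaned = cleaned.rsplit("```", 1)[0]
    return json.loads(cleaned)

print(parse_json_response('{"name": "Django"}'))
print(parse_json_response('```json\n{"name": "Flask"}\n```'))
```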

For stricter control, define a JSON schema with a Pydantic model and pass it via response_format:

from pydantic import BaseModel

class MovieReview(BaseModel):
    title: str
    rating: float
    summary: str
    recommended: bool

response = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p1-70b-instruct",
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "MovieReview",
            "schema": MovieReview.model_json_schema(),
        },
    },
    messages=[
        {
            "role": "user",
            "content": "Review the movie 'Inception' as JSON matching this schema: title, rating (0-10), summary, recommended (bool).",
        }
    ],
    max_tokens=256,
)

import json
data = json.loads(response.choices[0].message.content)
review = MovieReview(**data)
print(f"{review.title}: {review.rating}/10 - Recommended: {review.recommended}")

Include your schema description in the prompt too. Fireworks enforces the schema during generation, so the output will always be valid JSON matching your structure. One limitation: oneOf composition and string length constraints like minLength are not supported yet.
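Since unsupported keywords only fail at request time, it can help to scan a schema before sending it. A sketch of such a pre-flight check (find_unsupported_keywords is a hypothetical helper; the keyword list reflects the limitations noted above):

```python
def find_unsupported_keywords(schema, unsupported=("oneOf", "minLength", "maxLength")):
    """Recursively collect JSON schema keywords that Fireworks' json_schema mode rejects."""
    found = []
    if isinstance(schema, dict):
        for key, value in schema.items():
            if key in unsupported:
                found.append(key)
            found.extend(find_unsupported_keywords(value, unsupported))
    elif isinstance(schema, list):
        for item in schema:
            found.extend(find_unsupported_keywords(item, unsupported))
    return found

schema = {"type": "object", "properties": {"name": {"type": "string", "minLength": 1}}}
print(find_unsupported_keywords(schema))  # ['minLength']
```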

Function Calling

Llama 3.1 models on Fireworks support function calling through the standard OpenAI tools parameter. Define your tools and let the model decide when to call them:

import json

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {
                        "type": "string",
                        "description": "The city name, e.g. San Francisco",
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature unit",
                    },
                },
                "required": ["city"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p1-70b-instruct",
    messages=[
        {"role": "user", "content": "What's the weather like in Tokyo?"},
    ],
    tools=tools,
    tool_choice="auto",
    temperature=0.1,
)

message = response.choices[0].message

if message.tool_calls:
    for tool_call in message.tool_calls:
        print(f"Function: {tool_call.function.name}")
        print(f"Arguments: {tool_call.function.arguments}")
        args = json.loads(tool_call.function.arguments)
        # Call your actual weather function here, then send the result back
        result = {"temperature": 22, "condition": "partly cloudy", "city": args["city"]}

        # Send the tool result back to the model
        followup = client.chat.completions.create(
            model="accounts/fireworks/models/llama-v3p1-70b-instruct",
            messages=[
                {"role": "user", "content": "What's the weather like in Tokyo?"},
                message,
                {
                    "role": "tool",
                    "tool_call_id": tool_call.id,
                    "content": json.dumps(result),
                },
            ],
            temperature=0.1,
        )
        print(followup.choices[0].message.content)

Keep temperature low (0.1 or so) for function calling. Higher temperatures can cause the model to hallucinate function arguments or call the wrong function.
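Once you support more than one tool, a dispatch table keeps the handler loop tidy. A sketch (TOOL_REGISTRY and execute_tool_call are hypothetical helpers you would write yourself):

```python
import json

def get_weather(city: str, unit: str = "celsius") -> dict:
    # Stub; replace with a real weather lookup
    return {"city": city, "temperature": 22, "unit": unit}

# Map function names (as the model emits them) to Python callables
TOOL_REGISTRY = {"get_weather": get_weather}

def execute_tool_call(name: str, arguments_json: str) -> str:
    """Look up the function by name and run it with the model's JSON arguments."""
    func = TOOL_REGISTRY[name]
    args = json.loads(arguments_json)
    return json.dumps(func(**args))

print(execute_tool_call("get_weather", '{"city": "Tokyo"}'))
```

The returned string goes directly into the "content" field of the tool message you send back to the model.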

Streaming Responses

Streaming works exactly like the OpenAI SDK. Set stream=True and iterate over the chunks:

stream = client.chat.completions.create(
    model="accounts/fireworks/models/llama-v3p1-70b-instruct",
    messages=[
        {"role": "user", "content": "Explain how garbage collection works in Python."},
    ],
    max_tokens=512,
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)
print()

Fireworks returns token usage stats in the final streamed chunk, which is a nice bonus. OpenAI also does this now, but Fireworks has supported it longer. You can grab it from the last chunk’s usage field for cost tracking.
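A small collector makes it easy to capture both the full text and that final usage object in one pass. A sketch using SimpleNamespace stand-ins for the SDK's chunk objects (collect_stream is a hypothetical helper):

```python
from types import SimpleNamespace

def collect_stream(chunks):
    """Join streamed delta content and capture usage from the final chunk, if present."""
    parts, usage = [], None
    for chunk in chunks:
        delta = chunk.choices[0].delta
        if delta.content:
            parts.append(delta.content)
        if getattr(chunk, "usage", None):
            usage = chunk.usage
    return "".join(parts), usage

# Simulated chunks shaped like the SDK's streaming objects
fake = [
    SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content="Hel"))], usage=None),
    SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content="lo"))], usage=None),
    SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=None))],
                    usage=SimpleNamespace(total_tokens=5)),
]
text, usage = collect_stream(fake)
print(text, usage.total_tokens)  # Hello 5
```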

Latency and Pricing Compared to Other Providers

Fireworks is optimized for speed. Their inference stack uses custom kernels and speculative decoding under the hood. For Llama 3.1 70B, you can expect time-to-first-token under 200ms and throughput above 70 tokens per second on a warm endpoint. That puts them in the same speed tier as Groq and Together AI.
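You can verify time-to-first-token yourself by timing the first streamed chunk. A minimal sketch, with a generator standing in for a real streaming response:

```python
import time

def measure_ttft(stream):
    """Return (seconds until the stream yields its first chunk, that chunk)."""
    start = time.perf_counter()
    first = next(iter(stream))
    return time.perf_counter() - start, first

def fake_stream():
    time.sleep(0.05)  # simulated network + prefill latency
    yield "first token"
    yield "second token"

ttft, first = measure_ttft(fake_stream())
print(f"TTFT: {ttft * 1000:.0f} ms, first chunk: {first}")
```

Swap fake_stream() for the stream returned by client.chat.completions.create(..., stream=True) to measure a live endpoint.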

Here is a rough pricing comparison for 70B-class models (per 1M tokens):

| Provider       | Model         | Input | Output |
|----------------|---------------|-------|--------|
| Fireworks      | Llama 3.1 70B | $0.90 | $0.90  |
| Fireworks      | Mixtral 8x22B | $1.20 | $1.20  |
| Together AI    | Llama 3.1 70B | $0.88 | $0.88  |
| Groq           | Llama 3.1 70B | $0.59 | $0.79  |
| Amazon Bedrock | Llama 3.1 70B | $0.72 | $0.72  |

Fireworks also offers 50% discounts on cached input tokens (similar to Anthropic’s prompt caching) and 50% off for batch inference jobs. If you are running high-volume workloads, batch mode at $0.45 per 1M tokens for Llama 3.1 70B is hard to beat.
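Per-1M-token pricing translates to request costs with simple arithmetic. A sketch (cost_usd is a hypothetical helper; prices are the Fireworks Llama 3.1 70B rates from the table above):

```python
def cost_usd(input_tokens: int, output_tokens: int,
             input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost in USD given per-1M-token prices."""
    return (input_tokens * input_price_per_m + output_tokens * output_price_per_m) / 1_000_000

# 2M input + 500K output tokens on Llama 3.1 70B at $0.90/$0.90 per 1M
print(round(cost_usd(2_000_000, 500_000, 0.90, 0.90), 2))  # 2.25
```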

The real differentiator is not raw price but the combination of speed, OpenAI compatibility, and feature completeness. Fireworks supports function calling, structured outputs, and streaming on open-source models where some providers only give you basic chat completions.

Common Errors and Fixes

AuthenticationError: Invalid API key – Make sure you are passing your Fireworks API key, not your OpenAI key. The environment variable should be FIREWORKS_API_KEY, and you need to pass it explicitly since the OpenAI SDK defaults to reading OPENAI_API_KEY.

Model not found – Fireworks model IDs use the format accounts/fireworks/models/<model-name>. A common mistake is passing just the model name like llama-v3p1-70b-instruct without the full path prefix.

400 Bad Request on function calling – Not all models on Fireworks support function calling. Stick to Llama 3.1 models or Fireworks’ own FireFunction models. Mixtral 8x22B Instruct does not reliably handle tool calls.

JSON mode returns plain text – You must include instructions to output JSON in the prompt itself, not just set the response_format. Fireworks enforces the format during generation, but the model still needs prompting to know what JSON to produce.

Streaming hangs or times out – Set a reasonable max_tokens value. Without it, the model may generate until hitting the full context window, which takes longer than you would expect on a 128K context model.

response_format with json_schema fails – Check that your schema does not use unsupported features like oneOf, anyOf, or string length constraints (minLength, maxLength). Simplify the schema and test again.