Fireworks AI is one of the fastest inference providers for open-source LLMs. They host Llama 3.1, Mixtral, DeepSeek, and dozens of other models behind an OpenAI-compatible API. That means you can swap out your OpenAI client’s base_url, point it at Fireworks, and get sub-second responses from 70B parameter models.
Install the OpenAI SDK and set your Fireworks API key:
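A minimal setup sketch, assuming Python with pip and a POSIX shell (the key value is a placeholder):

```shell
# Install the official OpenAI Python SDK, which speaks Fireworks' compatible API
pip install openai

# Export your Fireworks key (placeholder value; copy yours from the Fireworks dashboard)
export FIREWORKS_API_KEY="fw-your-key-here"
```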
Then create a client:
That client works with client.chat.completions.create() exactly like OpenAI’s. The only difference is the model ID format: Fireworks uses accounts/fireworks/models/<model-name> instead of gpt-4o.
Chat Completions with Llama 3.1 and Mixtral
Here is a straightforward chat completion call using Llama 3.1 70B Instruct:
Swap the model ID to use Mixtral 8x22B instead:
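Only the model ID changes; the ID below is assumed from Fireworks' catalog naming:

```python
# Same request shape as before, different model path.
MIXTRAL_8X22B = "accounts/fireworks/models/mixtral-8x22b-instruct"

# response = client.chat.completions.create(model=MIXTRAL_8X22B, messages=[...])
```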
One thing to note: if your prompt plus max_tokens exceeds the model's context window, Fireworks automatically reduces max_tokens to fit rather than throwing an error, whereas OpenAI returns a 400 error in the same situation. This is convenient: you do not need to calculate the remaining token budget yourself.
Structured Output and JSON Mode
Fireworks supports two approaches for getting structured responses: JSON mode (free-form JSON) and JSON schema mode (enforced structure).
For basic JSON mode, set response_format to {"type": "json_object"} and tell the model to output JSON in your prompt:
For stricter control, define a JSON schema with a Pydantic model and pass it via response_format:
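A sketch using a Pydantic model as the schema. The `Invoice` model is a hypothetical example, and the `{"type": "json_object", "schema": ...}` shape assumes Fireworks' documented structured-output convention:

```python
from pydantic import BaseModel

class Invoice(BaseModel):
    vendor: str
    total: float
    currency: str

# Fireworks accepts the JSON schema inline in response_format.
response_format = {
    "type": "json_object",
    "schema": Invoice.model_json_schema(),
}

# Pass it along with a prompt that describes the same fields:
# response = client.chat.completions.create(
#     model="accounts/fireworks/models/llama-v3p1-70b-instruct",
#     response_format=response_format,
#     messages=[...],
# )
# invoice = Invoice.model_validate_json(response.choices[0].message.content)
```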
Include your schema description in the prompt too. Fireworks enforces the schema during generation, so the output will always be valid JSON matching your structure. One limitation: oneOf composition and string length constraints like minLength are not supported yet.
Function Calling
Llama 3.1 models on Fireworks support function calling through the standard OpenAI tools parameter. Define your tools and let the model decide when to call them:
Keep temperature low (0.1 or so) for function calling. Higher temperatures cause the model to hallucinate function arguments or call the wrong function.
Streaming Responses
Streaming works exactly like the OpenAI SDK. Set stream=True and iterate over the chunks:
Fireworks returns token usage stats in the final streamed chunk, which is a nice bonus. OpenAI now does this too (when you pass stream_options={"include_usage": True}), but Fireworks has supported it longer. You can grab it from the last chunk's usage field for cost tracking.
Latency and Pricing Compared to Other Providers
Fireworks is optimized for speed. Their inference stack uses custom kernels and speculative decoding under the hood. For Llama 3.1 70B, you can expect time-to-first-token under 200ms and throughput above 70 tokens per second on a warm endpoint. That puts them in the same speed tier as Groq and Together AI.
Here is a rough pricing comparison for 70B-class models (per 1M tokens):
| Provider | Model | Input | Output |
|---|---|---|---|
| Fireworks | Llama 3.1 70B | $0.90 | $0.90 |
| Fireworks | Mixtral 8x22B | $1.20 | $1.20 |
| Together AI | Llama 3.1 70B | $0.88 | $0.88 |
| Groq | Llama 3.1 70B | $0.59 | $0.79 |
| Amazon Bedrock | Llama 3.1 70B | $0.72 | $0.72 |
Fireworks also offers 50% discounts on cached input tokens (similar to Anthropic’s prompt caching) and 50% off for batch inference jobs. If you are running high-volume workloads, batch mode at $0.45 per 1M tokens for Llama 3.1 70B is hard to beat.
The real differentiator is not raw price but the combination of speed, OpenAI compatibility, and feature completeness. Fireworks supports function calling, structured outputs, and streaming on open-source models where some providers only give you basic chat completions.
Common Errors and Fixes
AuthenticationError: Invalid API key – Make sure you are passing your Fireworks API key, not your OpenAI key. The environment variable should be FIREWORKS_API_KEY, and you need to pass it explicitly since the OpenAI SDK defaults to reading OPENAI_API_KEY.
Model not found – Fireworks model IDs use the format accounts/fireworks/models/<model-name>. A common mistake is passing just the model name like llama-v3p1-70b-instruct without the full path prefix.
400 Bad Request on function calling – Not all models on Fireworks support function calling. Stick to Llama 3.1 models or Fireworks’ own FireFunction models. Mixtral 8x22B Instruct does not reliably handle tool calls.
JSON mode returns plain text – You must include instructions to output JSON in the prompt itself, not just set the response_format. Fireworks enforces the format during generation, but the model still needs prompting to know what JSON to produce.
Streaming hangs or times out – Set a reasonable max_tokens value. Without it, the model may generate until hitting the full context window, which takes longer than you would expect on a 128K context model.
response_format with json_schema fails – Check that your schema does not use unsupported features like oneOf, anyOf, or string length constraints (minLength, maxLength). Simplify the schema and test again.
Related Guides
- How to Use the Together AI API for Open-Source LLMs
- How to Use the OpenAI Realtime API for Voice Applications
- How to Use the OpenRouter API for Multi-Provider LLM Access
- How to Use the xAI Grok API for Chat and Function Calling
- How to Use the Anthropic Multi-Turn Conversation API with Tool Use
- How to Use the Anthropic Tool Use API for Agentic Workflows
- How to Use the Perplexity API for AI-Powered Search
- How to Run Open-Source Models with the Replicate API
- How to Use the AWS Bedrock Converse API for Multi-Model Chat
- How to Build Apps with the Gemini API and Python SDK