Together AI gives you production-grade access to the best open-source LLMs without managing your own infrastructure. You get Llama 3, Mixtral, DeepSeek, and dozens of other models through a simple API that’s cheaper and often faster than running your own GPU cluster.
Here’s what makes Together worth using: sub-second latency for most models, competitive pricing (often 5-10x cheaper than GPT-4), and they actually keep up with the latest open-source releases. If you’re building with open-source models, this is your shortcut to production.
Getting Started with the Python SDK
Install the Together SDK and set your API key:
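A minimal setup sketch; the SDK reads the key from the TOGETHER_API_KEY environment variable, so you rarely need to pass it in code:

```shell
# install the official Python SDK
pip install together

# the client reads this automatically
export TOGETHER_API_KEY="your-api-key"
```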
The simplest chat completion looks like this:
That’s it. No complex setup, no CUDA drivers, no begging for GPU quota. You’re running a 70B parameter model in production.
Streaming Responses for Real-Time UX
Streaming is critical for chat interfaces. Nobody wants to wait 10 seconds staring at a blank screen. Together’s streaming is fast:
Tokens start flowing in 200-500ms. Compare that to self-hosted models where you’re waiting for cold starts and batching delays.
Async for High-Throughput Workloads
If you’re processing hundreds of prompts (batch classification, data labeling, eval runs), use async to parallelize:
This processes all three prompts concurrently instead of sequentially. For 100 prompts, you go from 100 seconds to under 10.
Function Calling and JSON Mode
Together supports OpenAI-compatible function calling. Use it for tool use, structured data extraction, or agent workflows:
For guaranteed JSON output without function calling, use response_format:
Embeddings for RAG and Semantic Search
Together hosts solid embedding models if you’re building RAG systems:
The m2-bert model handles 8k token context windows and is tuned for retrieval. For production RAG, pair it with a vector store like Pinecone or Qdrant.
Fine-Tuning Your Own Models
Together’s fine-tuning is one of the easiest ways to customize open-source models. Upload your training data (JSONL format with prompt/completion pairs):
Check job status with together fine-tuning list and use your fine-tuned model as soon as it’s ready. Pricing is per-token during training (check their dashboard for current rates).
Pricing and Performance vs Other Providers
Together’s pricing is token-based and varies by model. As of early 2026:
- Llama 3.1 8B: ~$0.10 per 1M input tokens, ~$0.20 per 1M output
- Llama 3.1 70B: ~$0.80 per 1M input, ~$1.20 per 1M output
- Mixtral 8x7B: ~$0.60 per 1M input, ~$0.90 per 1M output
Compare that to GPT-4 Turbo at $10/$30 per 1M tokens. You’re looking at 10-15x cost savings for comparable quality on many tasks.
Latency is where Together shines. They use custom inference infrastructure (not just vLLM) and aggressive caching. Typical time-to-first-token for Llama 3.1 70B is under 400ms, total completion time for 512 tokens is 2-4 seconds. That’s competitive with OpenAI on smaller models.
Throughput limits depend on your plan. Free tier gets you 60 requests/minute. Paid plans scale to thousands of concurrent requests.
Best Practices for Production
Model selection: Start with Llama 3.1 8B for simple tasks (classification, summarization). Upgrade to 70B when you need better reasoning or complex instructions. Mixtral is the sweet spot for balanced cost/quality. DeepSeek V3 is excellent for code generation.
Caching: Together caches prompts server-side. Reusing system messages and common prefixes reduces latency and cost. Structure your prompts to maximize cache hits.
Error handling: Always wrap API calls in try/except. Together returns standard HTTP errors. Watch for 429 (rate limit) and 503 (temporary overload):
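One way to sketch the retry pattern; exception classes vary across SDK versions, so this inspects a generic status attribute rather than importing specific error types:

```python
import random
import time

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    # exponential growth (1s, 2s, 4s, ...) capped at `cap`, with jitter in [0.5x, 1x]
    return min(cap, base * (2 ** attempt)) * (0.5 + random.random() / 2)

def with_retries(call, max_attempts: int = 5):
    # retry on rate limits (429) and temporary overload (503); re-raise anything else
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as exc:
            status = getattr(exc, "http_status", None) or getattr(exc, "status_code", None)
            if status in (429, 503) and attempt < max_attempts - 1:
                time.sleep(backoff_delay(attempt))
            else:
                raise
```

Wrap the SDK call in a zero-argument callable, e.g. with_retries(lambda: client.chat.completions.create(...)).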
Monitoring: Log token usage per request. Together’s response objects include usage.prompt_tokens and usage.completion_tokens. Track these to optimize costs and catch prompt bloat.
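A tiny helper for that, sketched against the usage fields named above:

```python
def log_usage(response, logger=print) -> None:
    # works with any response object exposing usage.prompt_tokens etc.
    u = response.usage
    logger(f"prompt={u.prompt_tokens} completion={u.completion_tokens} total={u.total_tokens}")
```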
Common Errors and Fixes
Error: “Invalid model name”
Together’s model naming uses the HuggingFace convention: org/model-name. Check their docs for the exact string. Common mistake: using llama-3 instead of meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo.
Error: “Context length exceeded”
Each model has a max context window (8k, 32k, 128k). Count your tokens before sending. Use tiktoken or Together’s tokenizer to estimate:
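For a quick pre-flight check, a characters-per-token heuristic is often enough, sketched below; for exact counts, load the model's own tokenizer (for example via Hugging Face transformers), since tiktoken only approximates Llama tokenization:

```python
def rough_token_estimate(text: str) -> int:
    # ~4 characters per token is a reasonable heuristic for English text
    return max(1, len(text) // 4)

def fits_context(prompt: str, context_window: int = 8192, reserve_for_output: int = 512) -> bool:
    # leave headroom for the completion you plan to generate
    return rough_token_estimate(prompt) + reserve_for_output <= context_window
```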
Error: “Rate limit exceeded” (429)
You’re hitting requests-per-minute limits. Implement exponential backoff (see Best Practices). For sustained high throughput, upgrade your plan or contact Together for higher limits.
Slow responses or timeouts
If you’re seeing 10+ second responses, check your max_tokens setting. Generating 4096 tokens takes longer than 512. Also verify network latency—if you’re in Asia and Together’s servers are US-based, expect 200-300ms extra RTT. Use streaming to improve perceived latency.
Inconsistent output quality
Lower the temperature (0.1-0.3) for factual tasks and raise it (0.7-0.9) for creative tasks. If the model ignores your system message, put critical instructions in the user message too. Some models follow system prompts better than others: Llama 3.1 70B follows instructions more reliably than 8B.
Related Guides
- How to Use the Voyage AI API for Code and Text Embeddings
- How to Use the Anthropic Token Efficient Tool Use API
- How to Use the Fireworks AI API for Fast Open-Source LLMs
- How to Use the DeepSeek API for Code and Reasoning Tasks
- How to Run Open-Source Models with the Replicate API
- How to Use the AWS Bedrock Converse API for Multi-Model Chat
- How to Use the OpenAI Realtime API for Voice Applications
- How to Use the Cerebras API for Fast LLM Inference
- How to Use the OpenRouter API for Multi-Provider LLM Access
- How to Use the Anthropic Token Counting API for Cost Estimation