Together AI gives you production-grade access to the best open-source LLMs without managing your own infrastructure. You get Llama 3, Mixtral, DeepSeek, and dozens of other models through a simple API that’s cheaper and often faster than running your own GPU cluster.
Here’s what makes Together worth using: sub-second latency for most models, competitive pricing (often 5-10x cheaper than GPT-4), and they actually keep up with the latest open-source releases. If you’re building with open-source models, this is your shortcut to production.
Getting Started with the Python SDK
Install the Together SDK and set your API key:
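A minimal setup sketch; the SDK reads the key from the TOGETHER_API_KEY environment variable, so you rarely need to pass it in code:

```shell
# install the official Python SDK
pip install together

# the client reads this automatically
export TOGETHER_API_KEY="your-api-key"
```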
The simplest chat completion looks like this:
That’s it. No complex setup, no CUDA drivers, no begging for GPU quota. You’re running a 70B parameter model in production.
Streaming Responses for Real-Time UX
Streaming is critical for chat interfaces. Nobody wants to wait 10 seconds staring at a blank screen. Together’s streaming is fast:
Tokens start flowing in 200-500ms. Compare that to self-hosted models where you’re waiting for cold starts and batching delays.
Async for High-Throughput Workloads
If you’re processing hundreds of prompts (batch classification, data labeling, eval runs), use async to parallelize:
This processes all three prompts concurrently instead of sequentially. For 100 prompts, you go from 100 seconds to under 10.
Function Calling and JSON Mode
Together supports OpenAI-compatible function calling. Use it for tool use, structured data extraction, or agent workflows:
For guaranteed JSON output without function calling, use response_format:
Embeddings for RAG and Semantic Search
Together hosts solid embedding models if you’re building RAG systems:
The m2-bert model handles 8k token context windows and is tuned for retrieval. For production RAG, pair it with a vector store like Pinecone or Qdrant.
Fine-Tuning Your Own Models
Together’s fine-tuning is one of the easiest ways to customize open-source models. Upload your training data (JSONL format with prompt/completion pairs):
Check job status with together fine-tuning list and use your fine-tuned model as soon as it’s ready. Pricing is per-token during training (check their dashboard for current rates).
Pricing and Performance vs Other Providers
Together’s pricing is token-based and varies by model. As of early 2026:
- Llama 3.1 8B: ~$0.10 per 1M input tokens, ~$0.20 per 1M output
- Llama 3.1 70B: ~$0.80 per 1M input, ~$1.20 per 1M output
- Mixtral 8x7B: ~$0.60 per 1M input, ~$0.90 per 1M output
Compare that to GPT-4 Turbo at $10/$30 per 1M tokens. You’re looking at 10-15x cost savings for comparable quality on many tasks.
Latency is where Together shines. They use custom inference infrastructure (not just vLLM) and aggressive caching. Typical time-to-first-token for Llama 3.1 70B is under 400ms, total completion time for 512 tokens is 2-4 seconds. That’s competitive with OpenAI on smaller models.
Throughput limits depend on your plan. Free tier gets you 60 requests/minute. Paid plans scale to thousands of concurrent requests.
Best Practices for Production
Model selection: Start with Llama 3.1 8B for simple tasks (classification, summarization). Upgrade to 70B when you need better reasoning or complex instructions. Mixtral is the sweet spot for balanced cost/quality. DeepSeek V3 is excellent for code generation.
Caching: Together caches prompts server-side. Reusing system messages and common prefixes reduces latency and cost. Structure your prompts to maximize cache hits.
Error handling: Always wrap API calls in try/except. Together returns standard HTTP errors. Watch for 429 (rate limit) and 503 (temporary overload):
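One way to sketch the retry pattern; exception classes vary across SDK versions, so this inspects a generic status attribute rather than importing specific error types:

```python
import random
import time

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    # exponential growth (1s, 2s, 4s, ...) capped at `cap`, with jitter in [0.5x, 1x]
    return min(cap, base * (2 ** attempt)) * (0.5 + random.random() / 2)

def with_retries(call, max_attempts: int = 5):
    # retry on rate limits (429) and temporary overload (503); re-raise anything else
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as exc:
            status = getattr(exc, "http_status", None) or getattr(exc, "status_code", None)
            if status in (429, 503) and attempt < max_attempts - 1:
                time.sleep(backoff_delay(attempt))
            else:
                raise
```

Wrap the SDK call in a zero-argument callable, e.g. with_retries(lambda: client.chat.completions.create(...)).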
Monitoring: Log token usage per request. Together’s response objects include usage.prompt_tokens and usage.completion_tokens. Track these to optimize costs and catch prompt bloat.
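A tiny helper for that, sketched against the usage fields named above:

```python
def log_usage(response, logger=print) -> None:
    # works with any response object exposing usage.prompt_tokens etc.
    u = response.usage
    logger(f"prompt={u.prompt_tokens} completion={u.completion_tokens} total={u.total_tokens}")
```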
Common Errors and Fixes
Error: “Invalid model name”
Together’s model naming uses the HuggingFace convention: org/model-name. Check their docs for the exact string. Common mistake: using llama-3 instead of meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo.
Error: “Context length exceeded”
Each model has a max context window (8k, 32k, 128k). Count your tokens before sending. Use tiktoken or Together’s tokenizer to estimate:
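For a quick pre-flight check, a characters-per-token heuristic is often enough, sketched below; for exact counts, load the model's own tokenizer (for example via Hugging Face transformers), since tiktoken only approximates Llama tokenization:

```python
def rough_token_estimate(text: str) -> int:
    # ~4 characters per token is a reasonable heuristic for English text
    return max(1, len(text) // 4)

def fits_context(prompt: str, context_window: int = 8192, reserve_for_output: int = 512) -> bool:
    # leave headroom for the completion you plan to generate
    return rough_token_estimate(prompt) + reserve_for_output <= context_window
```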
Error: “Rate limit exceeded” (429)
You’re hitting requests-per-minute limits. Implement exponential backoff (see Best Practices). For sustained high throughput, upgrade your plan or contact Together for higher limits.
Slow responses or timeouts
If you’re seeing 10+ second responses, check your max_tokens setting. Generating 4096 tokens takes longer than 512. Also verify network latency—if you’re in Asia and Together’s servers are US-based, expect 200-300ms extra RTT. Use streaming to improve perceived latency.
Inconsistent output quality
Lower the temperature (0.1-0.3) for factual tasks and raise it (0.7-0.9) for creative tasks. If the model ignores your system message, put critical instructions in the user message too. Some models follow system prompts better than others: Llama 3.1 70B follows instructions more reliably than 8B.
Related Guides
- How to Use the Voyage AI API for Code and Text Embeddings
- How to Use the Anthropic Token Efficient Tool Use API
- How to Use the Fireworks AI API for Fast Open-Source LLMs
- How to Use the DeepSeek API for Code and Reasoning Tasks
- How to Run Open-Source Models with the Replicate API
- How to Use the AWS Bedrock Converse API for Multi-Model Chat
- How to Use the OpenAI Realtime API for Voice Applications
- How to Use the Cerebras API for Fast LLM Inference
- How to Use the OpenRouter API for Multi-Provider LLM Access
- How to Use the Anthropic Token Counting API for Cost Estimation