LiteLLM gives you a single completion() function that works with OpenAI, Anthropic, Azure, Google Vertex AI, Bedrock, Groq, Ollama, and 100+ other LLM providers. You write OpenAI-format code once, swap in any model string, and LiteLLM handles the provider-specific translation. It also ships a proxy server that acts as a full API gateway with authentication, load balancing, fallbacks, and per-key spend tracking.
## Install and Make Your First Call
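The SDK installs with a single pip command (the proxy extras come later):

```shell
pip install litellm
```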
Set your provider API keys as environment variables, then call completion() with the provider prefix:
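A minimal first call might look like the sketch below; the API key value is a placeholder and the prompt is illustrative:

```python
import os
from litellm import completion

# Keys are read from environment variables named per provider
os.environ["ANTHROPIC_API_KEY"] = "sk-ant-..."  # or export in your shell

response = completion(
    model="anthropic/claude-sonnet-4-20250514",  # provider prefix + model name
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)

# The response is always OpenAI-format, regardless of provider
print(response.choices[0].message.content)
```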
The model string format is provider/model-name. LiteLLM translates the OpenAI-style request into whatever format the target provider expects, including auth headers, message schemas, and streaming protocols. The response always comes back in the OpenAI format, so your downstream code never changes.
Streaming works the same way:
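A sketch of the streaming variant, using the same illustrative model string. With stream=True, LiteLLM yields OpenAI-style chunks no matter which provider is behind the call:

```python
from litellm import completion

response = completion(
    model="anthropic/claude-sonnet-4-20250514",
    messages=[{"role": "user", "content": "Write a haiku about gateways."}],
    stream=True,
)

# Each chunk follows the OpenAI delta format
for chunk in response:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```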
## Set Up Fallbacks for Reliability
LLM APIs go down. Rate limits hit. The simplest fallback pattern uses the fallbacks parameter:
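A minimal sketch of the fallbacks parameter; the fallback model strings here are illustrative, not a recommendation:

```python
from litellm import completion

response = completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Summarize this in one line."}],
    # Tried in order if the primary model fails
    fallbacks=["anthropic/claude-sonnet-4-20250514", "groq/llama-3.1-8b-instant"],
    num_retries=2,  # retries per model before moving to the next fallback
)
```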
If the OpenAI call fails for any reason, LiteLLM automatically tries Anthropic, then Groq. The num_retries parameter controls how many times each model is retried before moving to the next fallback.
For production workloads, use the Router class instead. It gives you load balancing across multiple deployments of the same model, cooldown logic for failing endpoints, and configurable routing strategies:
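One way this might be configured, assuming an OpenAI and an Azure deployment of the same model; the keys and endpoint are placeholders:

```python
from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "gpt-4o",  # logical name shared by both deployments
            "litellm_params": {"model": "openai/gpt-4o", "api_key": "sk-..."},
        },
        {
            "model_name": "gpt-4o",
            "litellm_params": {
                "model": "azure/gpt-4o",
                "api_key": "azure-key-...",
                "api_base": "https://my-endpoint.openai.azure.com",
            },
        },
    ],
    routing_strategy="latency-based-routing",
    allowed_fails=3,   # failures tolerated before a deployment is cooled down
    cooldown_time=30,  # seconds a failing deployment stays out of rotation
)

# Callers use the logical name; the router picks the deployment
response = router.completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello"}],
)
```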
Both deployments share the logical name gpt-4o. The router distributes requests based on the latency-based-routing strategy, which tracks response times and sends traffic to the fastest endpoint. Other strategies include simple-shuffle (weighted random), least-busy (fewest in-flight requests), and cost-based-routing (cheapest provider first).
When a deployment fails, the router puts it in cooldown for cooldown_time seconds and routes around it. After the cooldown expires, it re-enters the pool.
## Run the Proxy Server
The proxy turns LiteLLM into a standalone API gateway. Any application that speaks the OpenAI API format – whether it is written in Python, Go, TypeScript, or just uses curl – can send requests through it.
Install the proxy extras and create a config file:
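The proxy ships as an optional extra:

```shell
pip install 'litellm[proxy]'
```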
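A minimal config.yaml might look like this; the logical model names are illustrative:

```yaml
model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
  - model_name: claude-sonnet-4
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY
```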
The os.environ/ prefix tells LiteLLM to read secrets from environment variables at startup – never hardcode API keys in the YAML.
Start the proxy:
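By default the proxy listens on port 4000:

```shell
litellm --config config.yaml
# Proxy running on http://0.0.0.0:4000
```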
Now any OpenAI-compatible client works against it:
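For example, with curl; the bearer token here is a placeholder virtual key, not a provider key:

```shell
curl http://localhost:4000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-1234" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Hello"}]
  }'
```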
Or use the standard OpenAI Python SDK by pointing it at your proxy:
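A sketch with the official OpenAI client; base_url and the virtual key are placeholders for your deployment:

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:4000",  # your LiteLLM proxy, not api.openai.com
    api_key="sk-1234",                 # a proxy virtual key, not a provider key
)

response = client.chat.completions.create(
    model="gpt-4o",  # logical name from the proxy's model_list
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)
```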
This is the pattern that makes LiteLLM valuable in team settings. Backend developers, data scientists, and CI pipelines all hit the same gateway. You manage keys and routing in one place instead of distributing provider credentials everywhere.
## Track Costs Per Key and User
The proxy automatically tracks spend when you configure virtual keys with a database backend. Add a PostgreSQL connection string to your config:
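One way this might look in config.yaml; the connection string format is standard PostgreSQL, with values read from the environment:

```yaml
general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY
  # e.g. postgresql://user:password@host:5432/litellm
  database_url: os.environ/DATABASE_URL
```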
Every request through the proxy logs the cost, token counts, model, and provider. The response headers include x-litellm-response-cost so clients can see per-request spend immediately.
Pull spend data per user or team:
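A sketch of querying spend logs over HTTP; the endpoint path and query parameter are taken from the LiteLLM proxy docs and may differ by version, and the user ID and master key are placeholders:

```shell
curl "http://localhost:4000/spend/logs?user_id=alice" \
  -H "Authorization: Bearer sk-master-key"
```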
You can also set budget limits per model:
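A sketch of a per-deployment budget in config.yaml, assuming your LiteLLM version supports max_budget and budget_duration in litellm_params; the dollar amount and window are illustrative:

```yaml
model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
      max_budget: 100.0     # USD spend cap for this deployment
      budget_duration: 30d  # window after which the budget resets
```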
When a model hits its budget, the proxy returns a 429 and routes to the next available deployment if fallbacks are configured.
## Handle Errors Across Providers
LiteLLM maps every provider’s error format to OpenAI-compatible exceptions. This means you write one set of error handlers regardless of which backend you are calling.
| Status Code | LiteLLM Exception | When It Fires |
|---|---|---|
| 401 | AuthenticationError | Invalid or missing API key |
| 429 | RateLimitError | Provider rate limit exceeded |
| 400 | ContextWindowExceededError | Input too long for the model |
| 408 | Timeout | Request exceeded timeout |
| 500 | APIConnectionError | Network or provider outage |
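A minimal sketch of handling these exceptions in one place; the model string is illustrative and the handlers just print, where real code would back off or reroute:

```python
import litellm
from litellm import completion

try:
    response = completion(
        model="anthropic/claude-sonnet-4-20250514",
        messages=[{"role": "user", "content": "Hello"}],
    )
except litellm.exceptions.RateLimitError as e:
    # e.llm_provider identifies which backend rate-limited us
    print(f"Rate limited by {e.llm_provider}, backing off")
except litellm.exceptions.ContextWindowExceededError:
    print("Prompt too long for this model, truncate and retry")
except litellm.exceptions.APIConnectionError as e:
    print(f"Provider outage (status {e.status_code}), trying a fallback")
```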
All LiteLLM exceptions inherit from their OpenAI counterparts, so existing openai.* except blocks work without changes. Each exception also carries llm_provider and status_code attributes for provider-specific logic.
For debugging, enable verbose logging:
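In the SDK, one way to do this is the set_verbose flag (newer releases also expose a debug helper; check your version's docs):

```python
import litellm

# Prints the raw request/response for every call
litellm.set_verbose = True
```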
Or set the environment variable for the proxy:
| |
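For the proxy, the equivalent is an environment variable set before startup:

```shell
export LITELLM_LOG=DEBUG
litellm --config config.yaml
```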
## Common Pitfalls
**Wrong model prefix:** If you get a BadRequestError immediately, you probably used the wrong provider prefix. It is `anthropic/claude-sonnet-4-20250514`, not just `claude-sonnet-4-20250514`. Check the LiteLLM provider docs for the exact prefix each provider expects.

**Environment variable not loaded:** The `os.environ/VARIABLE_NAME` syntax in YAML config files only works at proxy startup. If you change an env var after the proxy is running, you need to restart it.

**Mixing SDK and proxy:** The `litellm.completion()` SDK call goes directly to providers. The proxy is a separate HTTP server. Do not set `base_url` on the litellm SDK to point at your own proxy – use the OpenAI SDK for that.

**Cooldown masking failures:** If `allowed_fails` is set too low and `cooldown_time` is too high, a single transient error can take a deployment out of rotation for minutes. Start with `allowed_fails=3` and `cooldown_time=30` in production, then tune based on your error rates.
## Related Guides
- How to Use the Anthropic Extended Thinking API for Complex Reasoning
- How to Use the Anthropic Batch API for High-Volume Processing
- How to Use the Anthropic Citations API for Grounded Responses
- How to Use the Anthropic Token Streaming API for Real-Time UIs
- How to Use the Anthropic Message Batches API for Async Workloads
- How to Use the xAI Grok API for Chat and Function Calling
- How to Use the Anthropic Token Efficient Tool Use API
- How to Use the Together AI API for Open-Source LLMs
- How to Use the Anthropic Token Counting API for Cost Estimation
- How to Use the Anthropic Claude Files API for Large Document Processing