LiteLLM gives you a single completion() function that works with OpenAI, Anthropic, Azure, Google Vertex AI, Bedrock, Groq, Ollama, and 100+ other LLM providers. You write OpenAI-format code once, swap in any model string, and LiteLLM handles the provider-specific translation. It also ships a proxy server that acts as a full API gateway with authentication, load balancing, fallbacks, and per-key spend tracking.

Install and Make Your First Call

pip install litellm

Set your provider API keys as environment variables, then call completion() with the provider prefix:

from litellm import completion
import os

os.environ["OPENAI_API_KEY"] = "sk-..."
os.environ["ANTHROPIC_API_KEY"] = "sk-ant-..."

# Call OpenAI
response = completion(
    model="openai/gpt-4o",
    messages=[{"role": "user", "content": "Explain gradient descent in two sentences."}]
)
print(response.choices[0].message.content)

# Same code, different provider -- just change the model string
response = completion(
    model="anthropic/claude-sonnet-4-20250514",
    messages=[{"role": "user", "content": "Explain gradient descent in two sentences."}]
)
print(response.choices[0].message.content)

The model string format is provider/model-name. LiteLLM translates the OpenAI-style request into whatever format the target provider expects, including auth headers, message schemas, and streaming protocols. The response always comes back in the OpenAI format, so your downstream code never changes.

Streaming works the same way:

response = completion(
    model="openai/gpt-4o",
    messages=[{"role": "user", "content": "Write a haiku about distributed systems."}],
    stream=True
)
for chunk in response:
    print(chunk.choices[0].delta.content or "", end="")

Set Up Fallbacks for Reliability

LLM APIs go down. Rate limits hit. The simplest fallback pattern uses the fallbacks parameter:

from litellm import completion

response = completion(
    model="openai/gpt-4o",
    messages=[{"role": "user", "content": "Summarize this document."}],
    fallbacks=["anthropic/claude-sonnet-4-20250514", "groq/llama-3.3-70b-versatile"],
    num_retries=2
)

If the OpenAI call fails for any reason, LiteLLM automatically tries Anthropic, then Groq. The num_retries parameter controls how many times each model is retried before moving to the next fallback.

For production workloads, use the Router class instead. It gives you load balancing across multiple deployments of the same model, cooldown logic for failing endpoints, and configurable routing strategies:

from litellm import Router
import os

model_list = [
    {
        "model_name": "gpt-4o",
        "litellm_params": {
            "model": "azure/gpt-4o-prod",
            "api_key": os.getenv("AZURE_API_KEY"),
            "api_base": os.getenv("AZURE_API_BASE"),
            "api_version": "2024-06-01",
            "rpm": 900
        }
    },
    {
        "model_name": "gpt-4o",
        "litellm_params": {
            "model": "openai/gpt-4o",
            "api_key": os.getenv("OPENAI_API_KEY"),
            "rpm": 500
        }
    }
]

router = Router(
    model_list=model_list,
    routing_strategy="latency-based-routing",
    num_retries=3,
    allowed_fails=1,
    cooldown_time=30,
    retry_after=5
)

response = router.completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}]
)

Both deployments share the logical name gpt-4o. The router distributes requests based on the latency-based-routing strategy, which tracks response times and sends traffic to the fastest endpoint. Other strategies include simple-shuffle (weighted random), least-busy (fewest in-flight requests), and cost-based-routing (cheapest provider first).

When a deployment fails, the router puts it in cooldown for cooldown_time seconds and routes around it. After the cooldown expires, it re-enters the pool.

Run the Proxy Server

The proxy turns LiteLLM into a standalone API gateway. Any application that speaks the OpenAI API format – whether it is written in Python, Go, TypeScript, or just uses curl – can send requests through it.

Install the proxy extras and create a config file:

pip install 'litellm[proxy]'
# litellm_config.yaml
model_list:
  - model_name: gpt-4o
    litellm_params:
      model: azure/gpt-4o-prod
      api_base: https://my-deployment.openai.azure.com/
      api_key: os.environ/AZURE_API_KEY
      api_version: "2024-06-01"
      rpm: 900
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
      rpm: 500
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-sonnet-4-20250514
      api_key: os.environ/ANTHROPIC_API_KEY

router_settings:
  routing_strategy: latency-based-routing
  num_retries: 3
  allowed_fails: 2
  cooldown_time: 60

general_settings:
  master_key: sk-your-proxy-master-key

The os.environ/ prefix tells LiteLLM to read secrets from environment variables at startup – never hardcode API keys in the YAML.
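
In practice that means exporting every variable the config references before launching the process. The key values below are placeholders:

```shell
# The proxy resolves each os.environ/ reference from these at startup.
export AZURE_API_KEY="your-azure-key"
export OPENAI_API_KEY="sk-your-openai-key"
export ANTHROPIC_API_KEY="sk-ant-your-anthropic-key"
```

Because the values are read once at startup, restart the proxy whenever any of them change.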

Start the proxy:

litellm --config litellm_config.yaml
# INFO: Proxy running on http://0.0.0.0:4000

Now any OpenAI-compatible client works against it:

curl http://0.0.0.0:4000/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer sk-your-proxy-master-key" \
  -d '{
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "What is LiteLLM?"}]
  }'

Or use the standard OpenAI Python SDK by pointing it at your proxy:

import openai

client = openai.OpenAI(
    api_key="sk-your-proxy-master-key",
    base_url="http://0.0.0.0:4000"
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is LiteLLM?"}]
)

This is the pattern that makes LiteLLM valuable in team settings. Backend developers, data scientists, and CI pipelines all hit the same gateway. You manage keys and routing in one place instead of distributing provider credentials everywhere.

Track Costs Per Key and User

The proxy automatically tracks spend when you configure virtual keys with a database backend. Add a PostgreSQL connection string to your config:

general_settings:
  master_key: sk-your-proxy-master-key
  database_url: os.environ/DATABASE_URL

Every request through the proxy logs the cost, token counts, model, and provider. The response headers include x-litellm-response-cost so clients can see per-request spend immediately.

Pull spend data per user or team:

# Per-user spend
curl 'http://0.0.0.0:4000/user/info?user_id=jane_smith' \
  -H 'Authorization: Bearer sk-your-proxy-master-key'

# Daily breakdown
curl 'http://0.0.0.0:4000/user/daily/activity?start_date=2026-02-01&end_date=2026-02-14' \
  -H 'Authorization: Bearer sk-your-proxy-master-key'

# Team spend report
curl 'http://0.0.0.0:4000/global/spend/report?start_date=2026-01-01&end_date=2026-02-14&group_by=team' \
  -H 'Authorization: Bearer sk-your-proxy-master-key'

You can also set budget limits per model:

model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
      max_budget: 10.0        # $10/day cap
      budget_duration: "1d"

When a model hits its budget, the proxy returns a 429 and routes to the next available deployment if fallbacks are configured.

Handle Errors Across Providers

LiteLLM maps every provider’s error format to OpenAI-compatible exceptions. This means you write one set of error handlers regardless of which backend you are calling.

Status Code | LiteLLM Exception          | When It Fires
----------- | -------------------------- | -----------------------------
401         | AuthenticationError        | Invalid or missing API key
429         | RateLimitError             | Provider rate limit exceeded
400         | ContextWindowExceededError | Input too long for the model
408         | Timeout                    | Request exceeded timeout
500         | APIConnectionError         | Network or provider outage
import litellm
import openai

try:
    response = litellm.completion(
        model="openai/gpt-4o",
        messages=[{"role": "user", "content": "Hello"}],
        timeout=10
    )
except openai.AuthenticationError as e:
    print(f"Bad API key for {e.llm_provider}: {e.message}")
except openai.RateLimitError as e:
    print(f"Rate limited by {e.llm_provider}, retry after backoff")
except openai.APITimeoutError as e:
    print(f"Request timed out: {e.message}")

All LiteLLM exceptions inherit from their OpenAI counterparts, so existing openai.* except blocks work without changes. Each exception also carries llm_provider and status_code attributes for provider-specific logic.

For debugging, enable verbose logging:

litellm.set_verbose = True

Or set the environment variable for the proxy:

export LITELLM_LOG_LEVEL=DEBUG
litellm --config litellm_config.yaml

Common Pitfalls

Wrong model prefix: If you get a BadRequestError immediately, you probably used the wrong provider prefix. It is anthropic/claude-sonnet-4-20250514, not just claude-sonnet-4-20250514. Check the LiteLLM provider docs for the exact prefix each provider expects.

Environment variable not loaded: The os.environ/VARIABLE_NAME syntax in YAML config files only works at proxy startup. If you change an env var after the proxy is running, you need to restart it.

Mixing SDK and proxy: The litellm.completion() SDK call goes directly to providers. The proxy is a separate HTTP server. Do not set base_url on the litellm SDK to point at your own proxy – use the OpenAI SDK for that.

Cooldown masking failures: If allowed_fails is set too low and cooldown_time is too high, a single transient error can take a deployment out of rotation for minutes. Start with allowed_fails=3 and cooldown_time=30 in production, then tune based on your error rates.