Production LLM apps break. OpenAI hits rate limits at 3 AM, Anthropic goes down for maintenance, and your users see a blank screen. The fix is a fallback chain – a pipeline that tries Provider A, catches the failure, and automatically switches to Provider B or C. Here’s how to build one that actually works.
Basic Fallback Chain Across Providers
The simplest pattern is a try/except cascade. You attempt OpenAI first, fall back to Anthropic, and if both fail, hit a local Ollama instance as a last resort.
```python
import httpx
from openai import OpenAI, APIError, RateLimitError, APIConnectionError
from anthropic import Anthropic, APIStatusError

openai_client = OpenAI()        # uses OPENAI_API_KEY env var
anthropic_client = Anthropic()  # uses ANTHROPIC_API_KEY env var
OLLAMA_BASE = "http://localhost:11434"

def prompt_with_fallback(prompt: str, system: str = "You are a helpful assistant.") -> str:
    """Try OpenAI -> Anthropic -> Ollama. Returns the first successful response."""
    # Attempt 1: OpenAI
    try:
        response = openai_client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": system},
                {"role": "user", "content": prompt},
            ],
            timeout=30,
        )
        return response.choices[0].message.content
    except (RateLimitError, APIConnectionError, APIError) as e:
        print(f"OpenAI failed: {e}")

    # Attempt 2: Anthropic
    try:
        response = anthropic_client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=1024,
            system=system,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.content[0].text
    except APIStatusError as e:
        print(f"Anthropic failed: {e}")

    # Attempt 3: Local Ollama
    try:
        resp = httpx.post(
            f"{OLLAMA_BASE}/api/chat",
            json={
                "model": "llama3.1:8b",
                "messages": [
                    {"role": "system", "content": system},
                    {"role": "user", "content": prompt},
                ],
                "stream": False,
            },
            timeout=60,
        )
        resp.raise_for_status()
        return resp.json()["message"]["content"]
    except (httpx.HTTPError, KeyError) as e:
        print(f"Ollama failed: {e}")

    raise RuntimeError("All LLM providers failed")

# Usage
result = prompt_with_fallback("Explain gradient descent in two sentences.")
print(result)
```
This works fine for simple cases. Each provider has its own SDK and error types, so you catch them individually. The order matters – put your cheapest or fastest provider first, and the most reliable (like a local model) last.
Unified Fallback with LiteLLM
Writing separate try/except blocks per provider gets tedious. LiteLLM wraps 100+ models behind a single completion() call and has built-in fallback support.
```python
import litellm
from litellm import completion

# Keep LiteLLM's debug logging quiet
litellm.set_verbose = False

# Define your fallback order
models = ["gpt-4o", "claude-sonnet-4-20250514", "ollama/llama3.1:8b"]

def prompt_with_litellm_fallback(prompt: str) -> str:
    """Manual fallback loop over LiteLLM's unified completion() call."""
    for model in models:
        try:
            response = completion(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                timeout=30,
            )
            return response.choices[0].message.content
        except Exception as e:
            print(f"{model} failed: {e}")
            continue
    raise RuntimeError("All models in fallback chain failed")

# Or use LiteLLM's native Router for smarter fallback.
# Both deployments share the logical name "primary"; the Router retries
# and fails over between them automatically.
from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "primary",
            "litellm_params": {"model": "gpt-4o", "api_key": "sk-..."},
        },
        {
            "model_name": "primary",
            "litellm_params": {"model": "claude-sonnet-4-20250514", "api_key": "sk-ant-..."},
        },
    ],
    num_retries=2,
    retry_after=5,
)

response = router.completion(
    model="primary",
    messages=[{"role": "user", "content": "What is backpropagation?"}],
)
print(response.choices[0].message.content)
```
LiteLLM’s Router gives you retry counts, cooldown periods, and automatic failover without writing the loop yourself. The model_list maps a logical name (“primary”) to multiple physical deployments. When one fails, it tries the next deployment with the same logical name.
Retry Logic with Exponential Backoff
Rate limit errors are temporary. You don’t want to burn through your fallback chain on a transient 429. Add exponential backoff before switching providers.
```python
import time
import random
from openai import OpenAI, RateLimitError, APIError, APIConnectionError

client = OpenAI()

def retry_with_backoff(
    prompt: str,
    model: str = "gpt-4o",
    max_retries: int = 3,
    base_delay: float = 1.0,
) -> str:
    """Retry a single provider with exponential backoff + jitter."""
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                timeout=30,
            )
            return response.choices[0].message.content
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            print(f"Rate limited. Retrying in {delay:.1f}s (attempt {attempt + 1}/{max_retries})")
            time.sleep(delay)
        except (APIConnectionError, APIError):
            raise  # Don't retry server errors; let the caller fall through to the next provider
    raise RuntimeError(f"Max retries exceeded for {model}")

def full_fallback_with_retries(prompt: str) -> str:
    """Combine per-provider retries with cross-provider fallback."""
    providers = [
        {"fn": lambda p: retry_with_backoff(p, model="gpt-4o"), "name": "OpenAI"},
        {"fn": lambda p: retry_with_backoff(p, model="gpt-4o-mini"), "name": "OpenAI Mini"},
    ]
    for provider in providers:
        try:
            return provider["fn"](prompt)
        except Exception as e:
            print(f"{provider['name']} exhausted: {e}")
    raise RuntimeError("All providers exhausted after retries")
```
The key detail: only retry on rate limits (429). Server errors (500) and auth errors (401) won’t fix themselves with retries, so let those fall through to the next provider immediately.
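That policy can be made explicit as a tiny classifier (a sketch; the status-code sets mirror the rule above and are worth tuning per provider):

```python
# Transient errors worth retrying on the same provider, versus errors
# that should immediately fall through to the next provider in the chain.
RETRYABLE_STATUS = {429}          # rate limits: back off, then retry
FALL_THROUGH_STATUS = {401, 500}  # auth/server errors: switch providers

def should_retry_same_provider(status_code: int) -> bool:
    """True if the error is transient enough to retry the same provider."""
    return status_code in RETRYABLE_STATUS
```

Centralizing the decision in one function keeps the retry loop and the fallback loop from each growing their own ad-hoc status-code checks.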
Health Checks and Circuit Breaker Pattern
Retrying a dead provider wastes time. A circuit breaker tracks failures and skips providers that are known to be down.
```python
import time
from dataclasses import dataclass, field

@dataclass
class CircuitBreaker:
    failure_threshold: int = 3
    recovery_timeout: float = 60.0  # seconds before retrying a tripped breaker
    _failure_count: int = field(default=0, init=False)
    _last_failure_time: float = field(default=0.0, init=False)
    _state: str = field(default="closed", init=False)  # closed = healthy, open = broken

    def record_failure(self):
        self._failure_count += 1
        self._last_failure_time = time.time()
        if self._failure_count >= self.failure_threshold:
            self._state = "open"
            print(f"Circuit breaker OPEN after {self._failure_count} failures")

    def record_success(self):
        self._failure_count = 0
        self._state = "closed"

    def is_available(self) -> bool:
        if self._state == "closed":
            return True
        # Check if enough time has passed to try again (half-open)
        if time.time() - self._last_failure_time > self.recovery_timeout:
            self._state = "half-open"
            return True
        return False

class ResilientLLMChain:
    def __init__(self):
        self.providers = {
            "openai": {"breaker": CircuitBreaker(), "model": "gpt-4o"},
            "anthropic": {"breaker": CircuitBreaker(), "model": "claude-sonnet-4-20250514"},
            "ollama": {"breaker": CircuitBreaker(failure_threshold=5), "model": "llama3.1:8b"},
        }

    def call(self, prompt: str) -> str:
        for name, provider in self.providers.items():
            breaker = provider["breaker"]
            if not breaker.is_available():
                print(f"Skipping {name} (circuit open)")
                continue
            try:
                result = self._call_provider(name, provider["model"], prompt)
                breaker.record_success()
                return result
            except Exception as e:
                print(f"{name} failed: {e}")
                breaker.record_failure()
        raise RuntimeError("All providers unavailable")

    def _call_provider(self, name: str, model: str, prompt: str) -> str:
        if name == "openai":
            from openai import OpenAI
            client = OpenAI()
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                timeout=30,
            )
            return resp.choices[0].message.content
        elif name == "anthropic":
            from anthropic import Anthropic
            client = Anthropic()
            resp = client.messages.create(
                model=model,
                max_tokens=1024,
                messages=[{"role": "user", "content": prompt}],
            )
            return resp.content[0].text
        elif name == "ollama":
            import httpx
            resp = httpx.post(
                "http://localhost:11434/api/chat",
                json={
                    "model": model,
                    "messages": [{"role": "user", "content": prompt}],
                    "stream": False,
                },
                timeout=60,
            )
            resp.raise_for_status()
            return resp.json()["message"]["content"]
        raise ValueError(f"Unknown provider: {name}")

# Usage
chain = ResilientLLMChain()
answer = chain.call("Summarize the attention mechanism in transformers.")
print(answer)
```
The circuit breaker has three states. Closed means healthy – requests go through normally. After three consecutive failures, it flips to open and all requests skip that provider. After the recovery timeout (60 seconds), it enters half-open – one test request goes through. If it succeeds, the breaker resets. If it fails, back to open.
This pattern saves you from burning timeout seconds on a provider you already know is down.
Common Errors and Fixes
openai.RateLimitError: Error code: 429
You’ve hit your requests-per-minute or tokens-per-minute limit. Add exponential backoff with jitter. If it’s persistent, lower your max_tokens or switch to a cheaper model like gpt-4o-mini as a fallback.
anthropic.APIStatusError: 529 Overloaded
Anthropic’s servers are at capacity. This is temporary. Retry after 5-10 seconds. Don’t burn through your whole fallback chain – add a short sleep first.
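One way to add that pause without special-casing every call site is a small wrapper (a sketch; it assumes the raised exception carries a status_code attribute, as Anthropic's APIStatusError does):

```python
import time

def retry_on_overloaded(call, retries: int = 2, delay: float = 5.0):
    """Retry `call` (a zero-argument callable) on 529 Overloaded before
    letting the error propagate to the fallback chain."""
    for attempt in range(retries + 1):
        try:
            return call()
        except Exception as e:
            # Only absorb capacity errors; anything else propagates so the
            # fallback chain can switch providers immediately.
            if getattr(e, "status_code", None) != 529 or attempt == retries:
                raise
            time.sleep(delay)
```

Wrap the Anthropic call as `retry_on_overloaded(lambda: anthropic_call(prompt))` so the short sleep happens before the chain moves on.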
httpx.ConnectError: [Errno 111] Connection refused (Ollama)
Ollama isn’t running. Start it with ollama serve in another terminal. If you’re running inside Docker, make sure Ollama’s port (11434) is exposed and use the correct host (not localhost – use host.docker.internal on Mac/Windows or the container’s network IP on Linux).
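If the same code has to run both on the host and inside a container, you can resolve the base URL at startup (a sketch; OLLAMA_HOST is an assumed override variable, and the /.dockerenv check is a heuristic that works on most Docker setups):

```python
import os

def ollama_base_url() -> str:
    """Resolve the Ollama endpoint: explicit override > in-Docker default > localhost."""
    override = os.environ.get("OLLAMA_HOST")
    if override:
        return override
    if os.path.exists("/.dockerenv"):  # heuristic: are we inside a container?
        return "http://host.docker.internal:11434"
    return "http://localhost:11434"
```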
litellm.exceptions.AuthenticationError: OpenAIException - Incorrect API key
Your OPENAI_API_KEY or ANTHROPIC_API_KEY env var isn’t set or is wrong. LiteLLM reads from the same environment variables as the native SDKs. Double-check with echo $OPENAI_API_KEY.
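A fail-fast check at startup surfaces this before the first request instead of mid-fallback-chain (a sketch; check_provider_keys is a hypothetical helper):

```python
import os

def check_provider_keys(required=("OPENAI_API_KEY", "ANTHROPIC_API_KEY")) -> None:
    """Raise at startup if any provider key is missing or empty, instead of
    discovering it later as an AuthenticationError."""
    missing = [name for name in required if not os.environ.get(name)]
    if missing:
        raise EnvironmentError(f"Missing API keys: {', '.join(missing)}")
```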
openai.APIConnectionError: Connection error
Network issue or the API endpoint is unreachable. This could be a proxy, firewall, or DNS problem. Check if you’re behind a corporate VPN. You can also set openai.base_url to route through a different endpoint.
TypeError: Completions.create() got an unexpected keyword argument 'max_tokens_to_sample'
You’re mixing up Anthropic’s old API parameters with OpenAI’s SDK. The OpenAI SDK uses max_tokens (or max_completion_tokens for newer models). Anthropic’s SDK also uses max_tokens. The old max_tokens_to_sample parameter is from Anthropic’s deprecated v1 API.
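If you're porting code from the deprecated API, a tiny normalizer can rename the old parameter before the call (a sketch; normalize_max_tokens is a hypothetical helper):

```python
def normalize_max_tokens(params: dict) -> dict:
    """Rename Anthropic's deprecated v1 parameter to the current name."""
    params = dict(params)  # don't mutate the caller's dict
    if "max_tokens_to_sample" in params:
        params["max_tokens"] = params.pop("max_tokens_to_sample")
    return params
```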