Production LLM apps break. OpenAI hits rate limits at 3 AM, Anthropic goes down for maintenance, and your users see a blank screen. The fix is a fallback chain – a pipeline that tries Provider A, catches the failure, and automatically switches to Provider B or C. Here’s how to build one that actually works.
Basic Fallback Chain Across Providers
The simplest pattern is a try/except cascade. You attempt OpenAI first, fall back to Anthropic, and if both fail, hit a local Ollama instance as a last resort.
```python
import httpx
from openai import OpenAI, APIError, RateLimitError, APIConnectionError
from anthropic import Anthropic, APIStatusError

openai_client = OpenAI()        # uses OPENAI_API_KEY env var
anthropic_client = Anthropic()  # uses ANTHROPIC_API_KEY env var
OLLAMA_BASE = "http://localhost:11434"

def prompt_with_fallback(prompt: str, system: str = "You are a helpful assistant.") -> str:
    """Try OpenAI -> Anthropic -> Ollama. Returns the first successful response."""
    # Attempt 1: OpenAI
    try:
        response = openai_client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": system},
                {"role": "user", "content": prompt},
            ],
            timeout=30,
        )
        return response.choices[0].message.content
    except (RateLimitError, APIConnectionError, APIError) as e:
        print(f"OpenAI failed: {e}")

    # Attempt 2: Anthropic
    try:
        response = anthropic_client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=1024,
            system=system,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.content[0].text
    except APIStatusError as e:
        print(f"Anthropic failed: {e}")

    # Attempt 3: Local Ollama
    try:
        resp = httpx.post(
            f"{OLLAMA_BASE}/api/chat",
            json={
                "model": "llama3.1:8b",
                "messages": [
                    {"role": "system", "content": system},
                    {"role": "user", "content": prompt},
                ],
                "stream": False,
            },
            timeout=60,
        )
        resp.raise_for_status()
        return resp.json()["message"]["content"]
    except (httpx.HTTPError, KeyError) as e:
        print(f"Ollama failed: {e}")

    raise RuntimeError("All LLM providers failed")

# Usage
result = prompt_with_fallback("Explain gradient descent in two sentences.")
print(result)
```
This works fine for simple cases. Each provider has its own SDK and error types, so you catch them individually. The order matters – put your cheapest or fastest provider first, and the most reliable (like a local model) last.
Unified Fallback with LiteLLM
Writing separate try/except blocks per provider gets tedious. LiteLLM wraps 100+ models behind a single completion() call and has built-in fallback support.
```python
import litellm
from litellm import completion

# Keep LiteLLM's debug logging quiet
litellm.set_verbose = False

# Define your fallback order
models = ["gpt-4o", "claude-sonnet-4-20250514", "ollama/llama3.1:8b"]

def prompt_with_litellm_fallback(prompt: str) -> str:
    """Manual fallback loop over LiteLLM's unified completion() call."""
    for model in models:
        try:
            response = completion(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                timeout=30,
            )
            return response.choices[0].message.content
        except Exception as e:
            print(f"{model} failed: {e}")
            continue
    raise RuntimeError("All models in fallback chain failed")

# Or use LiteLLM's native Router for smarter fallback.
# Both deployments share the logical name "primary"; the Router retries
# and fails over between them automatically.
from litellm import Router

router = Router(
    model_list=[
        {
            "model_name": "primary",
            "litellm_params": {"model": "gpt-4o", "api_key": "sk-..."},
        },
        {
            "model_name": "primary",
            "litellm_params": {"model": "claude-sonnet-4-20250514", "api_key": "sk-ant-..."},
        },
    ],
    num_retries=2,
    retry_after=5,
)

response = router.completion(
    model="primary",
    messages=[{"role": "user", "content": "What is backpropagation?"}],
)
print(response.choices[0].message.content)
```
LiteLLM’s Router gives you retry counts, cooldown periods, and automatic failover without writing the loop yourself. The model_list maps a logical name (“primary”) to multiple physical deployments. When one fails, it tries the next deployment with the same logical name.
Retry Logic with Exponential Backoff
Rate limit errors are temporary. You don’t want to burn through your fallback chain on a transient 429. Add exponential backoff before switching providers.
```python
import time
import random
from openai import OpenAI, RateLimitError, APIError, APIConnectionError

client = OpenAI()

def retry_with_backoff(
    prompt: str,
    model: str = "gpt-4o",
    max_retries: int = 3,
    base_delay: float = 1.0,
) -> str:
    """Retry a single provider with exponential backoff + jitter."""
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                timeout=30,
            )
            return response.choices[0].message.content
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            print(f"Rate limited. Retrying in {delay:.1f}s (attempt {attempt + 1}/{max_retries})")
            time.sleep(delay)
        except (APIConnectionError, APIError):
            raise  # Don't retry server errors; let the caller fall through to the next provider
    raise RuntimeError(f"Max retries exceeded for {model}")

def full_fallback_with_retries(prompt: str) -> str:
    """Combine per-provider retries with cross-provider fallback."""
    providers = [
        {"fn": lambda p: retry_with_backoff(p, model="gpt-4o"), "name": "OpenAI"},
        {"fn": lambda p: retry_with_backoff(p, model="gpt-4o-mini"), "name": "OpenAI Mini"},
    ]
    for provider in providers:
        try:
            return provider["fn"](prompt)
        except Exception as e:
            print(f"{provider['name']} exhausted: {e}")
    raise RuntimeError("All providers exhausted after retries")
```
The key detail: only retry on rate limits (429). Server errors (500) and auth errors (401) won’t fix themselves with retries, so let those fall through to the next provider immediately.
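That policy can be made explicit as a tiny classifier (a sketch; the status-code sets mirror the rule above and are worth tuning per provider):

```python
# Transient errors worth retrying on the same provider, versus errors
# that should immediately fall through to the next provider in the chain.
RETRYABLE_STATUS = {429}          # rate limits: back off, then retry
FALL_THROUGH_STATUS = {401, 500}  # auth/server errors: switch providers

def should_retry_same_provider(status_code: int) -> bool:
    """True if the error is transient enough to retry the same provider."""
    return status_code in RETRYABLE_STATUS
```

Centralizing the decision in one function keeps the retry loop and the fallback loop from each growing their own ad-hoc status-code checks.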
Health Checks and Circuit Breaker Pattern
Retrying a dead provider wastes time. A circuit breaker tracks failures and skips providers that are known to be down.
```python
import time
from dataclasses import dataclass, field

@dataclass
class CircuitBreaker:
    failure_threshold: int = 3
    recovery_timeout: float = 60.0  # seconds before retrying a tripped breaker
    _failure_count: int = field(default=0, init=False)
    _last_failure_time: float = field(default=0.0, init=False)
    _state: str = field(default="closed", init=False)  # closed = healthy, open = broken

    def record_failure(self):
        self._failure_count += 1
        self._last_failure_time = time.time()
        if self._failure_count >= self.failure_threshold:
            self._state = "open"
            print(f"Circuit breaker OPEN after {self._failure_count} failures")

    def record_success(self):
        self._failure_count = 0
        self._state = "closed"

    def is_available(self) -> bool:
        if self._state == "closed":
            return True
        # Check if enough time has passed to try again (half-open)
        if time.time() - self._last_failure_time > self.recovery_timeout:
            self._state = "half-open"
            return True
        return False

class ResilientLLMChain:
    def __init__(self):
        self.providers = {
            "openai": {"breaker": CircuitBreaker(), "model": "gpt-4o"},
            "anthropic": {"breaker": CircuitBreaker(), "model": "claude-sonnet-4-20250514"},
            "ollama": {"breaker": CircuitBreaker(failure_threshold=5), "model": "llama3.1:8b"},
        }

    def call(self, prompt: str) -> str:
        for name, provider in self.providers.items():
            breaker = provider["breaker"]
            if not breaker.is_available():
                print(f"Skipping {name} (circuit open)")
                continue
            try:
                result = self._call_provider(name, provider["model"], prompt)
                breaker.record_success()
                return result
            except Exception as e:
                print(f"{name} failed: {e}")
                breaker.record_failure()
        raise RuntimeError("All providers unavailable")

    def _call_provider(self, name: str, model: str, prompt: str) -> str:
        if name == "openai":
            from openai import OpenAI
            client = OpenAI()
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
                timeout=30,
            )
            return resp.choices[0].message.content
        elif name == "anthropic":
            from anthropic import Anthropic
            client = Anthropic()
            resp = client.messages.create(
                model=model,
                max_tokens=1024,
                messages=[{"role": "user", "content": prompt}],
            )
            return resp.content[0].text
        elif name == "ollama":
            import httpx
            resp = httpx.post(
                "http://localhost:11434/api/chat",
                json={
                    "model": model,
                    "messages": [{"role": "user", "content": prompt}],
                    "stream": False,
                },
                timeout=60,
            )
            resp.raise_for_status()
            return resp.json()["message"]["content"]
        raise ValueError(f"Unknown provider: {name}")

# Usage
chain = ResilientLLMChain()
answer = chain.call("Summarize the attention mechanism in transformers.")
print(answer)
```
The circuit breaker has three states. Closed means healthy – requests go through normally. After three consecutive failures, it flips to open and all requests skip that provider. After the recovery timeout (60 seconds), it enters half-open – one test request goes through. If it succeeds, the breaker resets. If it fails, back to open.
This pattern saves you from burning timeout seconds on a provider you already know is down.
Common Errors and Fixes
openai.RateLimitError: Error code: 429
You’ve hit your requests-per-minute or tokens-per-minute limit. Add exponential backoff with jitter. If it’s persistent, lower your max_tokens or switch to a cheaper model like gpt-4o-mini as a fallback.
anthropic.APIStatusError: 529 Overloaded
Anthropic’s servers are at capacity. This is temporary. Retry after 5-10 seconds. Don’t burn through your whole fallback chain – add a short sleep first.
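One way to add that pause without special-casing every call site is a small wrapper (a sketch; it assumes the raised exception carries a status_code attribute, as Anthropic's APIStatusError does):

```python
import time

def retry_on_overloaded(call, retries: int = 2, delay: float = 5.0):
    """Retry `call` (a zero-argument callable) on 529 Overloaded before
    letting the error propagate to the fallback chain."""
    for attempt in range(retries + 1):
        try:
            return call()
        except Exception as e:
            # Only absorb capacity errors; anything else propagates so the
            # fallback chain can switch providers immediately.
            if getattr(e, "status_code", None) != 529 or attempt == retries:
                raise
            time.sleep(delay)
```

Wrap the Anthropic call as `retry_on_overloaded(lambda: anthropic_call(prompt))` so the short sleep happens before the chain moves on.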
httpx.ConnectError: [Errno 111] Connection refused (Ollama)
Ollama isn’t running. Start it with ollama serve in another terminal. If you’re running inside Docker, make sure Ollama’s port (11434) is exposed and use the correct host (not localhost – use host.docker.internal on Mac/Windows or the container’s network IP on Linux).
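If the same code has to run both on the host and inside a container, you can resolve the base URL at startup (a sketch; OLLAMA_HOST is an assumed override variable, and the /.dockerenv check is a heuristic that works on most Docker setups):

```python
import os

def ollama_base_url() -> str:
    """Resolve the Ollama endpoint: explicit override > in-Docker default > localhost."""
    override = os.environ.get("OLLAMA_HOST")
    if override:
        return override
    if os.path.exists("/.dockerenv"):  # heuristic: are we inside a container?
        return "http://host.docker.internal:11434"
    return "http://localhost:11434"
```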
litellm.exceptions.AuthenticationError: OpenAIException - Incorrect API key
Your OPENAI_API_KEY or ANTHROPIC_API_KEY env var isn’t set or is wrong. LiteLLM reads from the same environment variables as the native SDKs. Double-check with echo $OPENAI_API_KEY.
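A fail-fast check at startup surfaces this before the first request instead of mid-fallback-chain (a sketch; check_provider_keys is a hypothetical helper):

```python
import os

def check_provider_keys(required=("OPENAI_API_KEY", "ANTHROPIC_API_KEY")) -> None:
    """Raise at startup if any provider key is missing or empty, instead of
    discovering it later as an AuthenticationError."""
    missing = [name for name in required if not os.environ.get(name)]
    if missing:
        raise EnvironmentError(f"Missing API keys: {', '.join(missing)}")
```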
openai.APIConnectionError: Connection error
Network issue or the API endpoint is unreachable. This could be a proxy, firewall, or DNS problem. Check if you’re behind a corporate VPN. You can also set openai.base_url to route through a different endpoint.
TypeError: Completions.create() got an unexpected keyword argument 'max_tokens_to_sample'
You’re mixing up Anthropic’s old API parameters with OpenAI’s SDK. The OpenAI SDK uses max_tokens (or max_completion_tokens for newer models). Anthropic’s SDK also uses max_tokens. The old max_tokens_to_sample parameter is from Anthropic’s deprecated v1 API.
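If you're porting code from the deprecated API, a tiny normalizer can rename the old parameter before the call (a sketch; normalize_max_tokens is a hypothetical helper):

```python
def normalize_max_tokens(params: dict) -> dict:
    """Rename Anthropic's deprecated v1 parameter to the current name."""
    params = dict(params)  # don't mutate the caller's dict
    if "max_tokens_to_sample" in params:
        params["max_tokens"] = params.pop("max_tokens_to_sample")
    return params
```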