Not every prompt deserves the same model. A “what’s 2+2” question burning tokens on Claude Opus is money on fire. A nuanced legal analysis on GPT-4o-mini will give you garbage. The solution: classify each prompt, route it to the cheapest model that can handle it, and cascade to stronger models if the first one chokes.

pip install litellm tiktoken

LiteLLM gives you a unified completion() call across OpenAI, Anthropic, Google, Groq, and 100+ other providers. You change the model string, not your code.
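
To make that concrete, here's a minimal sketch. No network call happens here -- completion(**req) would need the provider's API key set -- and the helper name build_request is just for illustration:

```python
# Sketch: the only per-provider difference is the model string. No network
# call is made here -- completion(**req) would need the provider's API key.
MODELS = [
    "openai/gpt-4o-mini",
    "anthropic/claude-haiku-4-20250514",
    "groq/llama-3.3-70b-versatile",
]

def build_request(model: str, prompt: str) -> dict:
    """Same request shape for every provider; LiteLLM translates it."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "timeout": 30,
    }

for m in MODELS:
    req = build_request(m, "What is the capital of France?")
    # completion(**req) would dispatch to the right provider SDK here
    print(req["model"])
```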

Classify Prompts by Complexity

Before routing, you need to know what you’re dealing with. A keyword-and-length heuristic gets you surprisingly far. You don’t need an embedding model for the classifier itself – save that complexity for when simple rules stop working.

import re
from enum import Enum


class PromptTier(Enum):
    SIMPLE = "simple"
    MODERATE = "moderate"
    COMPLEX = "complex"


def classify_prompt(prompt: str) -> PromptTier:
    """Classify a prompt into a complexity tier based on heuristics."""
    word_count = len(prompt.split())

    complex_signals = [
        r"\banalyze\b", r"\bcompare\b", r"\bdesign\b", r"\barchitect\b",
        r"\brefactor\b", r"\bdebug\b", r"\boptimize\b", r"\bprove\b",
        r"\btrade-?offs?\b", r"\bimplications?\b",
    ]
    moderate_signals = [
        r"\bexplain\b", r"\bwrite\b", r"\bimplement\b", r"\bcreate\b",
        r"\bsummarize\b", r"\btranslate\b",
    ]

    lower = prompt.lower()
    complex_hits = sum(1 for p in complex_signals if re.search(p, lower))
    moderate_hits = sum(1 for p in moderate_signals if re.search(p, lower))

    # Multiple complex signals, or one signal plus a long prompt -> complex
    if complex_hits >= 2 or (complex_hits >= 1 and word_count > 50):
        return PromptTier.COMPLEX
    # Moderate signals or medium length -> moderate
    if moderate_hits >= 1 or word_count > 30:
        return PromptTier.MODERATE
    return PromptTier.SIMPLE


# Test it
print(classify_prompt("What is the capital of France?"))
# PromptTier.SIMPLE

print(classify_prompt("Explain how gradient descent works with momentum"))
# PromptTier.MODERATE

print(classify_prompt("Analyze the tradeoffs between LoRA and full fine-tuning for a 70B parameter model, compare memory usage and downstream task performance"))
# PromptTier.COMPLEX

This is intentionally blunt. A keyword classifier won’t catch every edge case, but it handles 80% of traffic correctly and costs zero tokens. For the remaining 20%, the cascading fallback picks up the slack.

Build the Router with Cascading Fallback

Here’s the full router class. Each tier maps to an ordered list of models. If the first model fails (API error, timeout, rate limit), it cascades to the next one automatically. LiteLLM’s completion() handles the provider translation.

import time
import litellm
from litellm import completion
from dataclasses import dataclass, field


@dataclass
class RequestLog:
    prompt_tier: str
    model_used: str
    attempts: int
    cost_usd: float
    latency_ms: float


@dataclass
class PromptRouter:
    """Routes prompts to the best model tier with cascading fallback."""

    tier_models: dict[str, list[str]] = field(default_factory=lambda: {
        "simple": [
            "anthropic/claude-haiku-4-20250514",
            "openai/gpt-4o-mini",
            "groq/llama-3.3-70b-versatile",
        ],
        "moderate": [
            "openai/gpt-4o",
            "anthropic/claude-sonnet-4-20250514",
            "groq/llama-3.3-70b-versatile",
        ],
        "complex": [
            "anthropic/claude-sonnet-4-20250514",
            "openai/gpt-4o",
            "anthropic/claude-haiku-4-20250514",
        ],
    })
    request_log: list[RequestLog] = field(default_factory=list)

    def route(self, prompt: str, system: str = "You are a helpful assistant.") -> str:
        """Classify the prompt, pick a model tier, and cascade on failure."""
        tier = classify_prompt(prompt)
        models = self.tier_models[tier.value]

        last_error = None
        for attempt, model in enumerate(models, 1):
            start = time.time()
            try:
                response = completion(
                    model=model,
                    messages=[
                        {"role": "system", "content": system},
                        {"role": "user", "content": prompt},
                    ],
                    timeout=30,
                    num_retries=1,
                )

                latency_ms = (time.time() - start) * 1000
                cost = litellm.completion_cost(completion_response=response)

                self.request_log.append(RequestLog(
                    prompt_tier=tier.value,
                    model_used=model,
                    attempts=attempt,
                    cost_usd=cost,
                    latency_ms=latency_ms,
                ))

                return response.choices[0].message.content

            except Exception as e:
                last_error = e
                print(f"[cascade] {model} failed (attempt {attempt}/{len(models)}): {e}")

        raise RuntimeError(f"All models failed for tier '{tier.value}'. Last error: {last_error}")

    def total_cost(self) -> float:
        """Return total cost across all logged requests."""
        return sum(r.cost_usd for r in self.request_log)

    def cost_breakdown(self) -> dict[str, float]:
        """Return cost grouped by model."""
        breakdown: dict[str, float] = {}
        for r in self.request_log:
            breakdown[r.model_used] = breakdown.get(r.model_used, 0) + r.cost_usd
        return breakdown

Use it like this:

router = PromptRouter()

# Simple question -> hits claude-haiku first (cheap and fast)
answer = router.route("What is the capital of France?")
print(answer)

# Complex analysis -> hits claude-sonnet first
answer = router.route(
    "Analyze the tradeoffs between vector databases and traditional "
    "full-text search for a RAG pipeline handling 10M documents"
)
print(answer)

# Check spending
print(f"Total cost: ${router.total_cost():.6f}")
print(f"Breakdown: {router.cost_breakdown()}")

The key insight: simple prompts hit Haiku at ~$0.25/M input tokens. Complex prompts hit Sonnet at ~$3/M input tokens. If 70% of your traffic is simple, your blended cost drops dramatically compared to sending everything to a frontier model.
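
The back-of-envelope arithmetic, using the approximate list prices above (actual rates vary by provider and date):

```python
# Illustrative per-million-input-token prices from the discussion above.
HAIKU_PER_M = 0.25
SONNET_PER_M = 3.00

def blended_cost_per_m(simple_share: float) -> float:
    """Blended input cost per million tokens, given the fraction of
    traffic that routes to the cheap tier."""
    return simple_share * HAIKU_PER_M + (1 - simple_share) * SONNET_PER_M

print(f"All frontier model: ${blended_cost_per_m(0.0):.2f}/M")
print(f"70% simple traffic: ${blended_cost_per_m(0.7):.2f}/M")
```

With 70% of traffic on the cheap tier, the blend works out to about $1.08/M instead of $3.00/M -- roughly a 64% cut on input tokens alone.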

Track Costs and Optimize Routing

The RequestLog entries give you data to tune the classifier over time. Export them to a CSV or database, then look for patterns: which prompts get misrouted? Which models fail most often?

import csv
import io


def export_logs_csv(router: PromptRouter) -> str:
    """Export request logs to CSV string for analysis."""
    output = io.StringIO()
    writer = csv.DictWriter(output, fieldnames=[
        "prompt_tier", "model_used", "attempts", "cost_usd", "latency_ms",
    ])
    writer.writeheader()
    for log_entry in router.request_log:
        writer.writerow({
            "prompt_tier": log_entry.prompt_tier,
            "model_used": log_entry.model_used,
            "attempts": log_entry.attempts,
            "cost_usd": f"{log_entry.cost_usd:.8f}",
            "latency_ms": f"{log_entry.latency_ms:.1f}",
        })
    return output.getvalue()


# After running a batch of prompts through the router
csv_data = export_logs_csv(router)
print(csv_data)

A few things worth monitoring:

  • Cascade rate – if more than 10% of requests cascade to a second model, your primary model for that tier is unreliable. Swap it out or increase num_retries.
  • Cost per tier – if your “simple” tier costs more than expected, your classifier might be letting complex prompts through.
  • Latency spikes – cascading adds latency. If the first model times out after 30 seconds before falling back, your user is waiting 30+ seconds. Reduce the timeout for fast-tier models to 10 seconds.
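
The cascade rate is trivial to compute from the request log. A standalone sketch, using a minimal stand-in for RequestLog so it runs on its own (with the real router you'd pass router.request_log):

```python
from dataclasses import dataclass

@dataclass
class Entry:
    """Stand-in with just the fields the metric needs."""
    model_used: str
    attempts: int

def cascade_rate(logs) -> float:
    """Fraction of requests that needed more than one model."""
    if not logs:
        return 0.0
    return sum(1 for e in logs if e.attempts > 1) / len(logs)

sample = [
    Entry("anthropic/claude-haiku-4-20250514", 1),
    Entry("anthropic/claude-haiku-4-20250514", 1),
    Entry("openai/gpt-4o-mini", 2),  # this one cascaded
    Entry("anthropic/claude-haiku-4-20250514", 1),
]
print(f"Cascade rate: {cascade_rate(sample):.0%}")  # 25% -- well over the 10% alarm line
```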

Add Quality-Based Cascading

API errors aren’t the only reason to cascade. Sometimes a model returns a response, but it’s low quality – too short, off-topic, or nonsensical. You can add a quality check that triggers cascading even on successful API calls.

def check_response_quality(prompt: str, response_text: str) -> bool:
    """Basic quality check. Returns True if the response seems adequate."""
    # Too short for the prompt length
    if len(prompt.split()) > 20 and len(response_text.split()) < 15:
        return False

    # Empty or whitespace-only
    if not response_text.strip():
        return False

    # Refusal detection
    refusal_phrases = [
        "i cannot", "i can't", "i'm unable to", "as an ai",
        "i don't have the ability",
    ]
    lower_response = response_text.lower()
    if any(phrase in lower_response for phrase in refusal_phrases):
        return False

    return True


class QualityCascadeRouter(PromptRouter):
    """Extends PromptRouter with quality-based cascading."""

    def route(self, prompt: str, system: str = "You are a helpful assistant.") -> str:
        tier = classify_prompt(prompt)
        models = self.tier_models[tier.value]

        last_error = None
        for attempt, model in enumerate(models, 1):
            start = time.time()
            try:
                response = completion(
                    model=model,
                    messages=[
                        {"role": "system", "content": system},
                        {"role": "user", "content": prompt},
                    ],
                    timeout=30,
                    num_retries=1,
                )
                # content can be None (e.g. empty responses); normalize it
                response_text = response.choices[0].message.content or ""
                latency_ms = (time.time() - start) * 1000
                cost = litellm.completion_cost(completion_response=response)

                self.request_log.append(RequestLog(
                    prompt_tier=tier.value,
                    model_used=model,
                    attempts=attempt,
                    cost_usd=cost,
                    latency_ms=latency_ms,
                ))

                if check_response_quality(prompt, response_text):
                    return response_text

                print(f"[quality] {model} response failed quality check, cascading...")

            except Exception as e:
                last_error = e
                print(f"[cascade] {model} failed: {e}")

        raise RuntimeError(f"All models failed for tier '{tier.value}'. Last error: {last_error}")

This catches the case where a cheap model technically responds but gives you a one-word answer to a paragraph-long question. The quality gate pushes it up to the next model in the cascade.

Common Errors and Fixes

litellm.exceptions.AuthenticationError: OpenAIException - Incorrect API key provided

Your OPENAI_API_KEY environment variable is missing or wrong. LiteLLM reads provider keys from environment variables automatically. Set them in your shell before running your script:

export OPENAI_API_KEY="sk-proj-..."
export ANTHROPIC_API_KEY="sk-ant-api03-..."

litellm.exceptions.RateLimitError: Rate limit reached for gpt-4o-mini

This is actually good news – it means your cascade is about to kick in. If you’re seeing it too often, bump num_retries in your completion() call (the router above passes num_retries=1) so LiteLLM retries with exponential backoff before cascading. You can also set litellm.set_verbose = True to see the retry logs.
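
For intuition, exponential backoff just doubles the wait ceiling between retries, usually with jitter so concurrent clients don't retry in lockstep. A generic sketch of the pattern (the parameters LiteLLM uses internally may differ):

```python
import random

def backoff_delays(retries: int, base: float = 1.0, cap: float = 30.0) -> list[float]:
    """Exponential backoff with full jitter: the ceiling doubles each
    attempt, capped, and the actual sleep is sampled below it."""
    return [random.uniform(0, min(cap, base * 2 ** i)) for i in range(retries)]

for delay in backoff_delays(3):
    print(f"would sleep {delay:.2f}s before retrying")
```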

litellm.exceptions.BadRequestError: model is not available

You’re using a model string that LiteLLM doesn’t recognize. Always use the provider/model-name format: openai/gpt-4o-mini, anthropic/claude-haiku-4-20250514. Check available models with:

import litellm
print(litellm.model_list)  # all supported models

TypeError: completion_cost() got an unexpected keyword argument

You’re on an old version of LiteLLM. The cost tracking API has changed between versions. Update to the latest:

pip install --upgrade litellm

Response quality is inconsistent across tiers

Your classifier is probably misrouting prompts. Add logging to classify_prompt() and review a sample of 50-100 real prompts. Adjust the keyword lists and word count thresholds based on what you see. There’s no universal threshold – it depends on your traffic patterns.
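
A low-effort way to run that review: pair each sampled prompt with the tier your classifier assigns and eyeball the output. The sketch below uses a toy stand-in classifier so it runs on its own – pass the real classify_prompt instead:

```python
def audit(classify, prompts):
    """Pair each prompt with its predicted tier for manual review."""
    return [(classify(p), p) for p in prompts]

# Toy stand-in; swap in classify_prompt from earlier.
def toy_classify(prompt: str) -> str:
    return "complex" if "analyze" in prompt.lower() else "simple"

sample = [
    "What is 2+2?",
    "Analyze the failure modes of our retry logic",
]
for tier, prompt in audit(toy_classify, sample):
    print(f"{tier:8s} | {prompt}")
```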