Not every prompt deserves the same model. A “what’s 2+2” question burning tokens on Claude Opus is money on fire. A nuanced legal analysis on GPT-4o-mini will give you garbage. The solution: classify each prompt, route it to the cheapest model that can handle it, and cascade to stronger models if the first one chokes.
```shell
pip install litellm tiktoken
```
LiteLLM gives you a unified completion() call across OpenAI, Anthropic, Google, Groq, and 100+ other providers. You change the model string, not your code.
Classify Prompts by Complexity
Before routing, you need to know what you’re dealing with. A keyword-and-length heuristic gets you surprisingly far. You don’t need an embedding model for the classifier itself – save that complexity for when simple rules stop working.
```python
import re
from enum import Enum


class PromptTier(Enum):
    SIMPLE = "simple"
    MODERATE = "moderate"
    COMPLEX = "complex"


def classify_prompt(prompt: str) -> PromptTier:
    """Classify a prompt into a complexity tier based on heuristics."""
    word_count = len(prompt.split())

    complex_signals = [
        r"\banalyze\b", r"\bcompare\b", r"\bdesign\b", r"\barchitect\b",
        r"\brefactor\b", r"\bdebug\b", r"\boptimize\b", r"\bprove\b",
        r"\btrade-?offs?\b", r"\bimplications?\b",
    ]
    moderate_signals = [
        r"\bexplain\b", r"\bwrite\b", r"\bimplement\b", r"\bcreate\b",
        r"\bsummarize\b", r"\btranslate\b",
    ]

    lower = prompt.lower()
    complex_hits = sum(1 for p in complex_signals if re.search(p, lower))
    moderate_hits = sum(1 for p in moderate_signals if re.search(p, lower))

    # Long prompts with multiple complex signals -> complex
    if complex_hits >= 2 or (complex_hits >= 1 and word_count > 50):
        return PromptTier.COMPLEX
    # Moderate signals or medium length -> moderate
    if moderate_hits >= 1 or word_count > 30:
        return PromptTier.MODERATE
    return PromptTier.SIMPLE


# Test it
print(classify_prompt("What is the capital of France?"))
# PromptTier.SIMPLE
print(classify_prompt("Explain how gradient descent works with momentum"))
# PromptTier.MODERATE
print(classify_prompt(
    "Analyze the tradeoffs between LoRA and full fine-tuning for a 70B "
    "parameter model, compare memory usage and downstream task performance"
))
# PromptTier.COMPLEX
```
This is intentionally blunt. A keyword classifier won’t catch every edge case, but it handles 80% of traffic correctly and costs zero tokens. For the remaining 20%, the cascading fallback picks up the slack.
Build the Router with Cascading Fallback
Here’s the full router class. Each tier maps to an ordered list of models. If the first model fails (API error, timeout, rate limit), it cascades to the next one automatically. LiteLLM’s completion() handles the provider translation.
```python
import time
from dataclasses import dataclass, field

import litellm
from litellm import completion


@dataclass
class RequestLog:
    prompt_tier: str
    model_used: str
    attempts: int
    cost_usd: float
    latency_ms: float


@dataclass
class PromptRouter:
    """Routes prompts to the best model tier with cascading fallback."""

    tier_models: dict[str, list[str]] = field(default_factory=lambda: {
        "simple": [
            "anthropic/claude-haiku-4-20250514",
            "openai/gpt-4o-mini",
            "groq/llama-3.3-70b-versatile",
        ],
        "moderate": [
            "openai/gpt-4o",
            "anthropic/claude-sonnet-4-20250514",
            "groq/llama-3.3-70b-versatile",
        ],
        "complex": [
            "anthropic/claude-sonnet-4-20250514",
            "openai/gpt-4o",
            "anthropic/claude-haiku-4-20250514",
        ],
    })
    request_log: list[RequestLog] = field(default_factory=list)

    def route(self, prompt: str, system: str = "You are a helpful assistant.") -> str:
        """Classify the prompt, pick a model tier, and cascade on failure."""
        tier = classify_prompt(prompt)
        models = self.tier_models[tier.value]
        last_error = None

        for attempt, model in enumerate(models, 1):
            start = time.time()
            try:
                response = completion(
                    model=model,
                    messages=[
                        {"role": "system", "content": system},
                        {"role": "user", "content": prompt},
                    ],
                    timeout=30,
                    num_retries=1,
                )
                latency_ms = (time.time() - start) * 1000
                cost = litellm.completion_cost(completion_response=response)
                self.request_log.append(RequestLog(
                    prompt_tier=tier.value,
                    model_used=model,
                    attempts=attempt,
                    cost_usd=cost,
                    latency_ms=latency_ms,
                ))
                return response.choices[0].message.content
            except Exception as e:
                last_error = e
                print(f"[cascade] {model} failed (attempt {attempt}/{len(models)}): {e}")

        raise RuntimeError(
            f"All models failed for tier '{tier.value}'. Last error: {last_error}"
        )

    def total_cost(self) -> float:
        """Return total cost across all logged requests."""
        return sum(r.cost_usd for r in self.request_log)

    def cost_breakdown(self) -> dict[str, float]:
        """Return cost grouped by model."""
        breakdown: dict[str, float] = {}
        for r in self.request_log:
            breakdown[r.model_used] = breakdown.get(r.model_used, 0.0) + r.cost_usd
        return breakdown
```
Use it like this:
```python
router = PromptRouter()

# Simple question -> hits claude-haiku first (cheap and fast)
answer = router.route("What is the capital of France?")
print(answer)

# Complex analysis -> hits claude-sonnet first
answer = router.route(
    "Analyze the tradeoffs between vector databases and traditional "
    "full-text search for a RAG pipeline handling 10M documents"
)
print(answer)

# Check spending
print(f"Total cost: ${router.total_cost():.6f}")
print(f"Breakdown: {router.cost_breakdown()}")
```
The key insight: simple prompts hit Haiku at ~$0.25/M input tokens. Complex prompts hit Sonnet at ~$3/M input tokens. If 70% of your traffic is simple, your blended cost drops dramatically compared to sending everything to a frontier model.
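The arithmetic is worth doing explicitly. A back-of-envelope sketch, assuming a hypothetical 70/30 simple/complex split with the moderate tier ignored for simplicity (the prices are the approximate input-token rates quoted above):

```python
# Assumed traffic mix and approximate input-token prices -- illustrative only.
haiku_per_m = 0.25    # $/M input tokens, simple tier
sonnet_per_m = 3.00   # $/M input tokens, complex tier

mix = {"simple": 0.70, "complex": 0.30}
blended = mix["simple"] * haiku_per_m + mix["complex"] * sonnet_per_m
print(f"Blended: ${blended:.3f}/M vs ${sonnet_per_m:.2f}/M all-Sonnet")
# Blended: $1.075/M vs $3.00/M all-Sonnet
```

That works out to roughly a 64% saving on input tokens, before the moderate tier even enters the picture.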
Track Costs and Optimize Routing
The RequestLog entries give you data to tune the classifier over time. Export them to a CSV or database, then look for patterns: which prompts get misrouted? Which models fail most often?
```python
import csv
import io


def export_logs_csv(router: PromptRouter) -> str:
    """Export request logs to a CSV string for analysis."""
    output = io.StringIO()
    writer = csv.DictWriter(output, fieldnames=[
        "prompt_tier", "model_used", "attempts", "cost_usd", "latency_ms",
    ])
    writer.writeheader()
    for log_entry in router.request_log:
        writer.writerow({
            "prompt_tier": log_entry.prompt_tier,
            "model_used": log_entry.model_used,
            "attempts": log_entry.attempts,
            "cost_usd": f"{log_entry.cost_usd:.8f}",
            "latency_ms": f"{log_entry.latency_ms:.1f}",
        })
    return output.getvalue()


# After running a batch of prompts through the router
csv_data = export_logs_csv(router)
print(csv_data)
```
A few things worth monitoring:
- Cascade rate – if more than 10% of requests cascade to a second model, your primary model for that tier is unreliable. Swap it out or increase num_retries.
- Cost per tier – if your "simple" tier costs more than expected, your classifier might be letting complex prompts through.
- Latency spikes – cascading adds latency. If the first model times out after 30 seconds before falling back, your user is waiting 30+ seconds. Reduce the timeout for fast-tier models to 10 seconds.
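The cascade rate falls straight out of the log, since each entry records which attempt succeeded. A minimal sketch – cascade_rate is a helper not defined earlier, and RequestLog is redefined here so the snippet runs standalone:

```python
from dataclasses import dataclass


@dataclass
class RequestLog:  # same shape as the router's log entries
    prompt_tier: str
    model_used: str
    attempts: int
    cost_usd: float
    latency_ms: float


def cascade_rate(log: list[RequestLog]) -> float:
    """Fraction of requests that needed more than one attempt."""
    if not log:
        return 0.0
    return sum(1 for r in log if r.attempts > 1) / len(log)


log = [
    RequestLog("simple", "anthropic/claude-haiku-4-20250514", 1, 1.2e-5, 410.0),
    RequestLog("simple", "openai/gpt-4o-mini", 2, 0.9e-5, 950.0),
]
print(f"Cascade rate: {cascade_rate(log):.0%}")
# Cascade rate: 50%
```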
Add Quality-Based Cascading
API errors aren’t the only reason to cascade. Sometimes a model returns a response, but it’s low quality – too short, off-topic, or nonsensical. You can add a quality check that triggers cascading even on successful API calls.
```python
def check_response_quality(prompt: str, response_text: str) -> bool:
    """Basic quality check. Returns True if the response seems adequate."""
    # Too short for the prompt length
    if len(prompt.split()) > 20 and len(response_text.split()) < 15:
        return False
    # Empty or whitespace-only
    if not response_text.strip():
        return False
    # Refusal detection
    refusal_phrases = [
        "i cannot", "i can't", "i'm unable to", "as an ai",
        "i don't have the ability",
    ]
    lower_response = response_text.lower()
    if any(phrase in lower_response for phrase in refusal_phrases):
        return False
    return True


class QualityCascadeRouter(PromptRouter):
    """Extends PromptRouter with quality-based cascading."""

    def route(self, prompt: str, system: str = "You are a helpful assistant.") -> str:
        tier = classify_prompt(prompt)
        models = self.tier_models[tier.value]
        last_error = None

        for attempt, model in enumerate(models, 1):
            start = time.time()
            try:
                response = completion(
                    model=model,
                    messages=[
                        {"role": "system", "content": system},
                        {"role": "user", "content": prompt},
                    ],
                    timeout=30,
                    num_retries=1,
                )
                response_text = response.choices[0].message.content
                latency_ms = (time.time() - start) * 1000
                cost = litellm.completion_cost(completion_response=response)
                self.request_log.append(RequestLog(
                    prompt_tier=tier.value,
                    model_used=model,
                    attempts=attempt,
                    cost_usd=cost,
                    latency_ms=latency_ms,
                ))
                if check_response_quality(prompt, response_text):
                    return response_text
                print(f"[quality] {model} response failed quality check, cascading...")
            except Exception as e:
                last_error = e
                print(f"[cascade] {model} failed: {e}")

        raise RuntimeError(
            f"All models failed for tier '{tier.value}'. Last error: {last_error}"
        )
```
This catches the case where a cheap model technically responds but gives you a one-word answer to a paragraph-long question. The quality gate pushes it up to the next model in the cascade.
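The length heuristic is worth sanity-checking against real traffic before you trust it. Here too_short is a hypothetical standalone mirror of the gate's first check, not a function from the router:

```python
def too_short(prompt: str, response_text: str) -> bool:
    """Mirror of the quality gate's length check: long prompt, terse answer."""
    return len(prompt.split()) > 20 and len(response_text.split()) < 15


long_prompt = "Please analyze " + "the quarterly revenue figures " * 5  # 27 words
print(too_short(long_prompt, "Revenue went up."))  # True -> would cascade
print(too_short("What is 2+2?", "4"))              # False -> fine as-is
```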
Common Errors and Fixes
litellm.exceptions.AuthenticationError: OpenAIException - Incorrect API key provided
Your OPENAI_API_KEY environment variable is missing or wrong. LiteLLM reads provider keys from env vars automatically. Set them in your shell before starting the process:
```shell
export OPENAI_API_KEY="sk-proj-..."
export ANTHROPIC_API_KEY="sk-ant-api03-..."
```
litellm.exceptions.RateLimitError: Rate limit reached for gpt-4o-mini
This is actually good news – it means your cascade is about to kick in. If you’re seeing it too often, add num_retries=2 to your completion() call so LiteLLM retries with exponential backoff before cascading. You can also set litellm.set_verbose = True to see the retry logs.
litellm.exceptions.BadRequestError: model is not available
You’re using a model string that LiteLLM doesn’t recognize. Always use the provider/model-name format: openai/gpt-4o-mini, anthropic/claude-haiku-4-20250514. Check available models with:
```python
import litellm
print(litellm.model_list)  # all supported models
```
TypeError: completion_cost() got an unexpected keyword argument
You’re on an old version of LiteLLM. The cost tracking API has changed between versions. Update to the latest:
```shell
pip install --upgrade litellm
```
Response quality is inconsistent across tiers
Your classifier is probably misrouting prompts. Add logging to classify_prompt() and review a sample of 50-100 real prompts. Adjust the keyword lists and word count thresholds based on what you see. There’s no universal threshold – it depends on your traffic patterns.
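A sketch of what that logging could look like – classify_with_logging is a hypothetical wrapper, and the one-line stub stands in for the real classify_prompt so the snippet runs standalone:

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("router.classifier")


def classify_prompt(prompt: str) -> str:
    """Stub standing in for the article's classifier (returns a tier name)."""
    return "complex" if "analyze" in prompt.lower() else "simple"


def classify_with_logging(prompt: str) -> str:
    tier = classify_prompt(prompt)
    # Log the tier plus the first 80 chars so you can audit misroutes later
    log.info("tier=%s prompt=%r", tier, prompt[:80])
    return tier


print(classify_with_logging("Analyze our churn data"))
# complex
```

Pipe the log lines into the same CSV export and you can eyeball which prompts landed in which tier.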