Most teams default to routing all traffic to one model. That’s expensive and unnecessary. A customer support query asking “what are your business hours?” doesn’t need GPT-4o. A query asking you to debug a 200-line Python function probably does.

Intelligent model routing solves this by classifying queries at runtime and dispatching each one to the cheapest model that can handle it. Done well, this can cut inference costs by 60–80% with no measurable quality drop for your users.

The Multi-Model Routing Pattern

The pattern is straightforward: you put a router in front of your LLM calls. The router inspects each query, scores its complexity, and picks the appropriate model tier.

User Query → Complexity Classifier → [cheap | mid | frontier] → LLM → Response

Three tiers work well for most applications:

| Tier | Example Models | Cost (input / M tokens) | Use Case |
| --- | --- | --- | --- |
| Cheap | Gemini 2.5 Flash, GPT-4o mini | $0.15–$0.60 | FAQ, summarization, classification |
| Mid | Claude 3.5 Haiku, Gemini 2.5 Pro | $1.00–$1.25 | Structured extraction, code review |
| Frontier | GPT-4o, Claude 3.5 Sonnet | $3.00–$5.00 | Complex reasoning, multi-step plans |

Pricing from OpenAI, Anthropic, and Google as of February 2026. Check their pricing pages for current rates since they shift frequently.
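To see where the 60–80% figure comes from, here's a back-of-envelope sketch. The traffic volume, tokens per query, and tier mix are hypothetical assumptions; the prices are the low end of each tier in the table above.

```python
# Back-of-envelope input-token cost estimate. Traffic volume, tokens per
# query, and the tier mix are hypothetical; prices are the low end of
# each tier in the table above.
PRICE_PER_M_INPUT = {"cheap": 0.15, "mid": 1.00, "frontier": 3.00}

def monthly_input_cost(
    queries_per_month: int, tokens_per_query: int, mix: dict[str, float]
) -> float:
    """Estimated monthly input-token spend (USD) for a given tier mix."""
    total_tokens = queries_per_month * tokens_per_query
    return sum(
        share * total_tokens / 1_000_000 * PRICE_PER_M_INPUT[tier]
        for tier, share in mix.items()
    )

# 1M queries/month at ~500 input tokens each:
all_frontier = monthly_input_cost(1_000_000, 500, {"frontier": 1.0})
routed = monthly_input_cost(
    1_000_000, 500, {"cheap": 0.7, "mid": 0.2, "frontier": 0.1}
)
print(f"${all_frontier:.2f} -> ${routed:.2f}")  # $1500.00 -> $302.50
```

With a 70/20/10 split, input spend drops by roughly 80% — output tokens (typically priced higher) follow a similar curve.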

Setting Up LiteLLM Router

LiteLLM’s Router class handles multi-model dispatch, fallbacks, and retries in one package. Install it:

pip install "litellm>=1.30.0"

Configure a router with all three tiers as named model groups:

import os
from litellm import Router

model_list = [
    # Cheap tier
    {
        "model_name": "cheap",
        "litellm_params": {
            "model": "gemini/gemini-2.0-flash",
            "api_key": os.environ["GEMINI_API_KEY"],
        },
    },
    {
        "model_name": "cheap",
        "litellm_params": {
            "model": "openai/gpt-4o-mini",
            "api_key": os.environ["OPENAI_API_KEY"],
        },
    },
    # Mid tier
    {
        "model_name": "mid",
        "litellm_params": {
            "model": "anthropic/claude-3-5-haiku-20241022",
            "api_key": os.environ["ANTHROPIC_API_KEY"],
        },
    },
    # Frontier tier
    {
        "model_name": "frontier",
        "litellm_params": {
            "model": "openai/gpt-4o",
            "api_key": os.environ["OPENAI_API_KEY"],
        },
    },
    {
        "model_name": "frontier",
        "litellm_params": {
            "model": "anthropic/claude-3-5-sonnet-20241022",
            "api_key": os.environ["ANTHROPIC_API_KEY"],
        },
    },
]

router = Router(
    model_list=model_list,
    routing_strategy="least-busy",  # balances across providers in the same tier
    allowed_fails=2,
    cooldown_time=60,
    num_retries=2,
)

Multiple entries with the same model_name (“cheap”, “frontier”, etc.) form a group. The router load-balances within a group and falls back across groups when you configure it that way.

Complexity Detection Heuristics

You need a way to classify incoming queries before dispatching them. The simplest approach uses a combination of signal types:

import re
from dataclasses import dataclass
from enum import Enum

class ModelTier(Enum):
    CHEAP = "cheap"
    MID = "mid"
    FRONTIER = "frontier"

@dataclass
class RoutingDecision:
    tier: ModelTier
    reason: str
    confidence: float

# Keywords that signal complex reasoning tasks
FRONTIER_KEYWORDS = [
    "debug", "analyze", "explain why", "compare", "evaluate",
    "design", "architect", "optimize", "refactor", "review code",
    "step by step", "trade-offs", "pros and cons",
]

# Keywords that signal simple lookup / classification tasks
CHEAP_KEYWORDS = [
    "what is", "define", "list", "summarize", "translate",
    "format", "convert", "classify", "is this", "yes or no",
]

def classify_query(query: str, system_prompt: str = "") -> RoutingDecision:
    query_lower = query.lower()
    token_estimate = len(query.split()) * 1.3  # rough token count

    # Long queries almost always need more capable models
    if token_estimate > 800:
        return RoutingDecision(
            tier=ModelTier.FRONTIER,
            reason="long_query",
            confidence=0.9,
        )

    # Check for code blocks — code tasks need mid or frontier
    if "```" in query or re.search(r"\bdef \w+|class \w+|function \w+", query):
        if any(kw in query_lower for kw in ["debug", "fix", "why", "error"]):
            return RoutingDecision(
                tier=ModelTier.FRONTIER,
                reason="code_debugging",
                confidence=0.85,
            )
        return RoutingDecision(
            tier=ModelTier.MID,
            reason="code_present",
            confidence=0.8,
        )

    # Keyword scoring
    frontier_score = sum(1 for kw in FRONTIER_KEYWORDS if kw in query_lower)
    cheap_score = sum(1 for kw in CHEAP_KEYWORDS if kw in query_lower)

    if frontier_score >= 2 or (frontier_score >= 1 and token_estimate > 200):
        return RoutingDecision(
            tier=ModelTier.FRONTIER,
            reason="frontier_keywords",
            confidence=min(0.6 + frontier_score * 0.1, 0.95),
        )

    if cheap_score >= 1 and token_estimate < 150:
        return RoutingDecision(
            tier=ModelTier.CHEAP,
            reason="cheap_keywords",
            confidence=min(0.6 + cheap_score * 0.1, 0.9),
        )

    # Default to mid when uncertain
    return RoutingDecision(
        tier=ModelTier.MID,
        reason="default",
        confidence=0.5,
    )

This gives you a deterministic, zero-latency classifier with no extra LLM calls. It’s not perfect but it’s a solid starting point.

Putting It Together: The Router Function

Wire the classifier to the LiteLLM router:

import asyncio
import logging
from typing import Any

logger = logging.getLogger(__name__)

async def routed_completion(
    messages: list[dict],
    system_prompt: str = "",
    force_tier: ModelTier | None = None,
    **kwargs: Any,
) -> dict:
    """Send a completion request to the appropriate model tier."""
    # Classify on the most recent user turn, not the first
    user_message = next(
        (m["content"] for m in reversed(messages) if m["role"] == "user"), ""
    )

    if force_tier:
        decision = RoutingDecision(
            tier=force_tier, reason="forced", confidence=1.0
        )
    else:
        decision = classify_query(user_message, system_prompt)

    logger.info(
        "routing",
        extra={
            "tier": decision.tier.value,
            "reason": decision.reason,
            "confidence": decision.confidence,
        },
    )

    try:
        response = await router.acompletion(
            model=decision.tier.value,
            messages=messages,
            **kwargs,
        )
        return {
            "response": response,
            "tier_used": decision.tier.value,
            "routing_reason": decision.reason,
        }
    except Exception as e:
        logger.error(f"Routing failed for tier {decision.tier.value}: {e}")
        # Escalate to frontier on failure
        if decision.tier != ModelTier.FRONTIER:
            logger.info("Escalating to frontier tier after failure")
            response = await router.acompletion(
                model=ModelTier.FRONTIER.value,
                messages=messages,
                **kwargs,
            )
            return {
                "response": response,
                "tier_used": "frontier",
                "routing_reason": "escalated_on_failure",
            }
        raise


# Usage
async def main():
    result = await routed_completion(
        messages=[
            {"role": "user", "content": "What is the capital of France?"}
        ]
    )
    print(f"Used tier: {result['tier_used']}")  # -> cheap
    print(result["response"].choices[0].message.content)

if __name__ == "__main__":
    asyncio.run(main())

Fallback and Retry Configuration

When a provider goes down or rate-limits you, LiteLLM handles retries automatically within a group. For cross-tier fallbacks, configure fallbacks in the router:

router = Router(
    model_list=model_list,
    routing_strategy="least-busy",
    fallbacks=[
        {"cheap": ["mid"]},    # cheap fails -> try mid
        {"mid": ["frontier"]}, # mid fails -> try frontier
    ],
    allowed_fails=2,
    cooldown_time=60,       # seconds before retrying a failed deployment
    num_retries=3,
    retry_after=5,          # minimum wait between retries
)

The cooldown_time is important: if a deployment fails allowed_fails times, LiteLLM pulls it from rotation for cooldown_time seconds. Without this, a degraded provider will keep getting requests and piling up latency.

Cost Monitoring

Track spend by tier so you know if your routing decisions are working:

import litellm

# LiteLLM exposes cost data on every response
async def routed_completion_with_cost(messages, **kwargs):
    result = await routed_completion(messages, **kwargs)
    response = result["response"]

    # Cost is calculated automatically from token usage
    cost = litellm.completion_cost(completion_response=response)

    logger.info(
        "completion_cost",
        extra={
            "tier": result["tier_used"],
            "cost_usd": cost,
            "input_tokens": response.usage.prompt_tokens,
            "output_tokens": response.usage.completion_tokens,
        },
    )
    return result

Ship these logs to your observability stack (Datadog, Grafana, whatever you use). Build a dashboard that shows cost per tier per day. You’ll quickly see whether your routing heuristics are doing their job — if 70% of traffic is hitting the frontier tier, your classifier is too conservative.
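A minimal in-process sketch of the per-tier aggregation (the helper names are hypothetical; in production these numbers would come from your metrics backend, not application memory):

```python
from collections import defaultdict

class TierCostTracker:
    """In-process per-tier spend aggregation. Hypothetical helper for
    illustration -- production systems should aggregate from logs."""

    def __init__(self) -> None:
        self.cost_by_tier: defaultdict[str, float] = defaultdict(float)
        self.calls_by_tier: defaultdict[str, int] = defaultdict(int)

    def record(self, tier: str, cost_usd: float) -> None:
        self.cost_by_tier[tier] += cost_usd
        self.calls_by_tier[tier] += 1

    def frontier_share(self) -> float:
        """Fraction of calls hitting the frontier tier -- the number to
        watch for an over-conservative classifier."""
        total = sum(self.calls_by_tier.values())
        return self.calls_by_tier["frontier"] / total if total else 0.0

tracker = TierCostTracker()
tracker.record("cheap", 0.0002)
tracker.record("cheap", 0.0001)
tracker.record("frontier", 0.012)
```

Here `frontier_share()` returns one third — well under the 70% red flag mentioned above.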

Tuning the Classifier

The keyword list approach is a starting point. In production, you’ll want to refine it:

  • Shadow mode: Run both the classifier and a ground-truth model (ask GPT-4o to rate each query’s complexity 1-5). Compare classifications weekly and adjust thresholds.
  • Task-type routing: If your app has known task types (summarize, classify, generate), add a task_type parameter to routed_completion and map task types directly to tiers. This is more reliable than heuristics.
  • Confidence thresholds: If confidence < 0.6, route to mid instead of cheap to reduce the risk of misclassifying borderline queries.
  • Feedback loop: If users rate responses negatively and the request was routed to cheap, flag it for review. Use this data to tune keyword lists.

Common Pitfalls

Not accounting for context window. If you’re including conversation history, the total token count can push a “simple” query past your cheap model’s context limit. Always count tokens in the full messages array, not just the latest user turn.
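A sketch of counting over the full messages array, reusing the article's rough words × 1.3 estimate (for real budgeting use the model's actual tokenizer, e.g. tiktoken for OpenAI models):

```python
def estimate_total_tokens(messages: list[dict], system_prompt: str = "") -> int:
    """Rough token estimate over the FULL conversation -- system prompt
    plus every turn -- not just the latest user message. Uses the same
    whitespace heuristic as classify_query; a real tokenizer is more
    accurate."""
    parts = [system_prompt] + [m.get("content", "") for m in messages]
    text = " ".join(p for p in parts if p)
    return int(len(text.split()) * 1.3)
```

Compare this total against the cheap model's context limit before routing; a short final question on top of a long history still needs the bigger window.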

Cheap models don’t follow complex instructions well. System prompts with many bullet points, conditional logic, or JSON schemas work poorly on smaller models. Keep system prompts simple for cheap-tier routes, or maintain separate system prompts per tier.

Latency can surprise you. Cheap models are usually faster, but Gemini Flash can occasionally have higher p99 latency than GPT-4o. Always measure tail latency per tier, not just average latency.