The Quick Version

Not every prompt needs GPT-4o or Claude Opus. Simple questions like “What’s the capital of France?” cost the same on a $60/M-token model as they do on a $0.25/M-token one. A semantic router classifies incoming prompts by complexity, then sends each to the cheapest model that can handle it well.

pip install openai numpy scikit-learn
import numpy as np
from openai import OpenAI

client = OpenAI()

# Define route examples — what kinds of prompts go where
routes = {
    "simple": {
        "model": "gpt-4o-mini",
        "examples": [
            "What is the capital of France?",
            "Convert 100 celsius to fahrenheit",
            "What does HTTP stand for?",
            "List 5 colors",
            "What year did World War 2 end?",
        ],
    },
    "code": {
        "model": "gpt-4o",
        "examples": [
            "Write a Python function to merge two sorted lists",
            "Debug this SQL query that's returning duplicates",
            "Implement a binary search tree in TypeScript",
            "Refactor this class to use the strategy pattern",
            "Write unit tests for this API endpoint",
        ],
    },
    "reasoning": {
        "model": "o1",
        "examples": [
            "Analyze the tradeoffs between microservices and monolith for our 50-person team",
            "Why is my distributed system experiencing split-brain and how do I fix it?",
            "Design a database schema for a multi-tenant SaaS with row-level security",
            "Compare the CAP theorem implications of Cassandra vs CockroachDB",
            "Prove that this algorithm runs in O(n log n) time",
        ],
    },
}

def get_embedding(text: str) -> np.ndarray:
    response = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(response.data[0].embedding)

# Build route embeddings (do this once, cache the result)
route_embeddings = {}
for route_name, route_config in routes.items():
    embeddings = [get_embedding(ex) for ex in route_config["examples"]]
    route_embeddings[route_name] = {
        "centroid": np.mean(embeddings, axis=0),
        "model": route_config["model"],
    }

def route_prompt(prompt: str) -> tuple[str, str]:
    """Pick the best model for this prompt."""
    prompt_emb = get_embedding(prompt)
    best_route = max(
        route_embeddings.items(),
        key=lambda item: np.dot(prompt_emb, item[1]["centroid"])
            / (np.linalg.norm(prompt_emb) * np.linalg.norm(item[1]["centroid"])),
    )
    return best_route[0], best_route[1]["model"]

# Test it
route, model = route_prompt("Write a recursive fibonacci function in Rust")
print(f"Route: {route}, Model: {model}")
# Route: code, Model: gpt-4o

This saves 60-80% on API costs for workloads with mixed complexity. Simple factual queries go to mini models, code goes to capable models, and hard reasoning goes to the best available.

Building a Production Router with Thresholds

The centroid approach works but doesn’t handle edge cases — what if a prompt doesn’t match any route well? Add a confidence threshold and a fallback.

from dataclasses import dataclass

@dataclass
class RouteResult:
    route: str
    model: str
    confidence: float

def route_with_confidence(prompt: str, threshold: float = 0.75) -> RouteResult:
    """Route a prompt with confidence scoring and fallback."""
    prompt_emb = get_embedding(prompt)

    scores = {}
    for route_name, route_data in route_embeddings.items():
        centroid = route_data["centroid"]
        similarity = np.dot(prompt_emb, centroid) / (
            np.linalg.norm(prompt_emb) * np.linalg.norm(centroid)
        )
        scores[route_name] = similarity

    best_route = max(scores, key=scores.get)
    confidence = scores[best_route]

    if confidence < threshold:
        # Low confidence — fall back to the most capable model
        return RouteResult(route="fallback", model="gpt-4o", confidence=confidence)

    return RouteResult(
        route=best_route,
        model=route_embeddings[best_route]["model"],
        confidence=confidence,
    )

# Usage
result = route_with_confidence("Explain quantum entanglement to a 5-year-old")
print(f"Route: {result.route}, Model: {result.model}, Confidence: {result.confidence:.3f}")

Set the threshold based on your tolerance for misroutes. Lower thresholds (around 0.6) route more aggressively to cheap models; higher thresholds (around 0.85) hit the fallback more often but avoid misroutes.
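One way to pick the number empirically is to sweep candidate thresholds over a batch of confidence scores and watch the fallback rate. A minimal sketch, using synthetic scores as a stand-in for real output collected from route_with_confidence:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for confidence scores collected from route_with_confidence
scores = rng.uniform(0.5, 0.95, size=200)

# Fraction of prompts that would hit the fallback at each candidate threshold
rates = {t: float(np.mean(scores < t)) for t in (0.60, 0.75, 0.85)}
for t, r in rates.items():
    print(f"threshold={t:.2f} -> fallback rate {r:.0%}")
```

Plotting fallback rate against threshold on real traffic makes the cost/accuracy tradeoff concrete before you commit to a value.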

Classifier-Based Routing

For higher accuracy, train a lightweight classifier instead of using centroid similarity. This works better when routes overlap or when you have enough labeled data.

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
import pickle

# Collect embeddings and labels from route examples
X_train = []
y_train = []

for route_name, route_config in routes.items():
    for example in route_config["examples"]:
        X_train.append(get_embedding(example))
        y_train.append(route_name)

X_train = np.array(X_train)
le = LabelEncoder()
y_encoded = le.fit_transform(y_train)

# Train a simple classifier
clf = LogisticRegression(max_iter=1000)  # multinomial by default for multiclass targets
clf.fit(X_train, y_encoded)

# Save for production use
with open("router_model.pkl", "wb") as f:
    pickle.dump({"classifier": clf, "label_encoder": le, "routes": routes}, f)

def classify_route(prompt: str) -> RouteResult:
    """Route using trained classifier with probability scores."""
    emb = get_embedding(prompt).reshape(1, -1)
    probs = clf.predict_proba(emb)[0]
    best_idx = np.argmax(probs)
    route_name = le.inverse_transform([best_idx])[0]

    return RouteResult(
        route=route_name,
        model=routes[route_name]["model"],
        confidence=float(probs[best_idx]),
    )

result = classify_route("Help me optimize this PostgreSQL query with 3 JOINs")
print(f"{result.route} -> {result.model} ({result.confidence:.2f})")

With 50+ examples per route, logistic regression outperforms centroid matching on ambiguous prompts. The training takes milliseconds and inference is sub-millisecond — negligible overhead per request.
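Before trusting the classifier in production, estimate its accuracy with cross-validation. The sketch below uses synthetic clustered vectors as a stand-in for real embeddings; swap in your cached embeddings and route labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
# Synthetic stand-in: 3 routes, 50 examples each, clustered in a 64-dim space
centers = rng.normal(size=(3, 64))
X = np.vstack([c + 0.3 * rng.normal(size=(50, 64)) for c in centers])
y = np.repeat(["simple", "code", "reasoning"], 50)

clf = LogisticRegression(max_iter=1000)
cv_scores = cross_val_score(clf, X, y, cv=5)
print(f"mean accuracy: {cv_scores.mean():.2f}")
```

If cross-validated accuracy on real data drops much below the high 90s, the routes likely overlap and need more distinct examples.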

End-to-End Router with Cost Tracking

Tie the router to actual LLM calls and track how much you’re saving:

import time

# Model pricing per 1M tokens (approximate, input tokens)
PRICING = {
    "gpt-4o-mini": 0.15,
    "gpt-4o": 2.50,
    "o1": 15.00,
}

class LLMRouter:
    def __init__(self):
        self.client = OpenAI()
        self.total_cost = 0.0
        self.total_cost_without_routing = 0.0
        self.call_count = 0

    def query(self, prompt: str, fallback_model: str = "gpt-4o") -> dict:
        route = classify_route(prompt)

        response = self.client.chat.completions.create(
            model=route.model,
            messages=[{"role": "user", "content": prompt}],
        )

        # Track costs (input tokens only; see the output-pricing note under Common Errors)
        usage = response.usage
        input_tokens = usage.prompt_tokens
        output_tokens = usage.completion_tokens  # unused here; price these for accurate tracking
        actual_cost = (input_tokens * PRICING.get(route.model, 2.50)) / 1_000_000
        baseline_cost = (input_tokens * PRICING[fallback_model]) / 1_000_000

        self.total_cost += actual_cost
        self.total_cost_without_routing += baseline_cost
        self.call_count += 1

        return {
            "response": response.choices[0].message.content,
            "model_used": route.model,
            "route": route.route,
            "confidence": route.confidence,
            "cost": actual_cost,
            "savings": baseline_cost - actual_cost,
        }

    def stats(self) -> dict:
        saved = self.total_cost_without_routing - self.total_cost
        pct = (saved / max(self.total_cost_without_routing, 1e-10)) * 100
        return {
            "total_calls": self.call_count,
            "total_cost": f"${self.total_cost:.4f}",
            "cost_without_routing": f"${self.total_cost_without_routing:.4f}",
            "savings": f"${saved:.4f} ({pct:.1f}%)",
        }

router = LLMRouter()
result = router.query("What is 2 + 2?")
print(f"Model: {result['model_used']}, Route: {result['route']}")
print(f"Savings: ${result['savings']:.6f}")

Common Errors and Fixes

Router always picks the same route

Your example prompts are too similar across routes. Make them more distinct. Add at least 10 diverse examples per route and ensure there’s clear separation between categories.
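A quick diagnostic is pairwise cosine similarity between route centroids; values near 1.0 mean two routes are nearly indistinguishable. Sketched here with random vectors standing in for the real centroids in route_embeddings:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(7)
# Stand-ins for route_embeddings[...]["centroid"]
centroids = {name: rng.normal(size=64) for name in ("simple", "code", "reasoning")}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

overlaps = {f"{a}/{b}": cosine(centroids[a], centroids[b])
            for a, b in combinations(centroids, 2)}
for pair, sim in overlaps.items():
    print(f"{pair}: {sim:.3f}")
```

If any pair of real centroids scores above roughly 0.9, rewrite the examples in one of those routes until the pair separates.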

Embedding API calls add latency

Cache embeddings for repeated prompts. Use text-embedding-3-small (fastest) instead of text-embedding-3-large. For sub-10ms routing, pre-compute route centroids and use a local embedding model like sentence-transformers/all-MiniLM-L6-v2.
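For in-process caching, memoizing the embedding call is often enough. A sketch with functools.lru_cache, using a counting stub in place of the real OpenAI call:

```python
from functools import lru_cache

api_calls = 0

@lru_cache(maxsize=10_000)
def get_embedding_cached(text: str) -> tuple:
    # Stub standing in for client.embeddings.create(...); returns a fake embedding
    global api_calls
    api_calls += 1
    return tuple(float(ord(c)) for c in text[:8])

get_embedding_cached("What is 2 + 2?")
get_embedding_cached("What is 2 + 2?")  # second call served from the cache
print(f"API calls made: {api_calls}")
```

For repeated prompts across processes, back the cache with Redis or a database keyed by a hash of the prompt text instead.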

Complex prompt gets sent to the cheap model

Your threshold is too low or the “simple” route examples overlap with complex ones. Increase the threshold to 0.85 and audit misrouted prompts, then add each one as a training example to its correct route; with a classifier, those examples implicitly push the decision boundary away from the wrong route.

Cost tracking doesn’t match actual bills

Output tokens are typically 3-4x more expensive than input tokens. The example above only tracks input cost. Add output token pricing for accurate tracking.
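A sketch of a pricing helper that charges input and output tokens separately. The per-1M rates below are illustrative; verify them against current provider pricing:

```python
# Illustrative (input, output) prices per 1M tokens; check current rates
PRICING_FULL = {
    "gpt-4o-mini": (0.15, 0.60),
    "gpt-4o": (2.50, 10.00),
    "o1": (15.00, 60.00),
}

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    in_price, out_price = PRICING_FULL[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

cost = call_cost("gpt-4o", input_tokens=1_000, output_tokens=500)
print(f"${cost:.6f}")  # $0.007500
```

Dropping this into LLMRouter.query in place of the input-only calculation brings the tracker in line with actual bills.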

When to Use Routing vs. a Single Model

Use routing when you have high volume (1000+ calls/day) with mixed complexity. The 60-80% cost savings compound quickly at scale.

Stick with a single model when your workload is uniformly complex, when latency from the embedding call matters more than cost, or when you’re still figuring out what model works best for your use case.