## The Quick Version
Not every prompt needs GPT-4o or Claude Opus. Simple questions like “What’s the capital of France?” cost the same on a $60/M-token model as they do on a $0.25/M-token one. A semantic router classifies incoming prompts by complexity, then sends each to the cheapest model that can handle it well.
```shell
pip install openai numpy scikit-learn
```
```python
import numpy as np
from openai import OpenAI

client = OpenAI()

# Define route examples — what kinds of prompts go where
routes = {
    "simple": {
        "model": "gpt-4o-mini",
        "examples": [
            "What is the capital of France?",
            "Convert 100 celsius to fahrenheit",
            "What does HTTP stand for?",
            "List 5 colors",
            "What year did World War 2 end?",
        ],
    },
    "code": {
        "model": "gpt-4o",
        "examples": [
            "Write a Python function to merge two sorted lists",
            "Debug this SQL query that's returning duplicates",
            "Implement a binary search tree in TypeScript",
            "Refactor this class to use the strategy pattern",
            "Write unit tests for this API endpoint",
        ],
    },
    "reasoning": {
        "model": "o1",
        "examples": [
            "Analyze the tradeoffs between microservices and monolith for our 50-person team",
            "Why is my distributed system experiencing split-brain and how do I fix it?",
            "Design a database schema for a multi-tenant SaaS with row-level security",
            "Compare the CAP theorem implications of Cassandra vs CockroachDB",
            "Prove that this algorithm runs in O(n log n) time",
        ],
    },
}

def get_embedding(text: str) -> np.ndarray:
    response = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(response.data[0].embedding)

# Build route embeddings (do this once, cache the result)
route_embeddings = {}
for route_name, route_config in routes.items():
    embeddings = [get_embedding(ex) for ex in route_config["examples"]]
    route_embeddings[route_name] = {
        "centroid": np.mean(embeddings, axis=0),
        "model": route_config["model"],
    }

def route_prompt(prompt: str) -> tuple[str, str]:
    """Pick the best route and model for this prompt."""
    prompt_emb = get_embedding(prompt)
    best_route = max(
        route_embeddings.items(),
        key=lambda item: np.dot(prompt_emb, item[1]["centroid"])
        / (np.linalg.norm(prompt_emb) * np.linalg.norm(item[1]["centroid"])),
    )
    return best_route[0], best_route[1]["model"]

# Test it
route, model = route_prompt("Write a recursive fibonacci function in Rust")
print(f"Route: {route}, Model: {model}")
# Route: code, Model: gpt-4o
```
For workloads with mixed complexity, this kind of routing typically cuts API costs by 60-80%. Simple factual queries go to mini models, code goes to capable models, and hard reasoning goes to the best available.
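The arithmetic behind that range is easy to check. A sketch, assuming a hypothetical traffic mix (90% simple, 8% code, 2% reasoning) and the approximate per-million-token input prices used later in this post:

```python
# Hypothetical traffic mix and approximate $ per 1M input tokens.
# All numbers here are illustrative assumptions, not measurements.
PRICES = {"gpt-4o-mini": 0.15, "gpt-4o": 2.50, "o1": 15.00}
MIX = {"gpt-4o-mini": 0.90, "gpt-4o": 0.08, "o1": 0.02}
TOKENS_PER_CALL = 500

# Blended cost per call with routing vs sending everything to gpt-4o
routed = sum(share * PRICES[m] for m, share in MIX.items()) * TOKENS_PER_CALL / 1e6
baseline = PRICES["gpt-4o"] * TOKENS_PER_CALL / 1e6

savings_pct = (1 - routed / baseline) * 100
print(f"Cost per call: ${routed:.6f} routed vs ${baseline:.6f} baseline")
print(f"Savings: {savings_pct:.0f}%")
```

With this mix the blended rate works out to roughly 75% savings; a mix heavier on the reasoning route would save less, which is why the range is wide.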
## Building a Production Router with Thresholds
The centroid approach works but doesn’t handle edge cases — what if a prompt doesn’t match any route well? Add a confidence threshold and a fallback.
```python
from dataclasses import dataclass

@dataclass
class RouteResult:
    route: str
    model: str
    confidence: float

def route_with_confidence(prompt: str, threshold: float = 0.75) -> RouteResult:
    """Route a prompt with confidence scoring and fallback."""
    prompt_emb = get_embedding(prompt)
    scores = {}
    for route_name, route_data in route_embeddings.items():
        centroid = route_data["centroid"]
        similarity = np.dot(prompt_emb, centroid) / (
            np.linalg.norm(prompt_emb) * np.linalg.norm(centroid)
        )
        scores[route_name] = similarity
    best_route = max(scores, key=scores.get)
    confidence = scores[best_route]
    if confidence < threshold:
        # Low confidence — fall back to the most capable model
        return RouteResult(route="fallback", model="gpt-4o", confidence=confidence)
    return RouteResult(
        route=best_route,
        model=route_embeddings[best_route]["model"],
        confidence=confidence,
    )

# Usage
result = route_with_confidence("Explain quantum entanglement to a 5-year-old")
print(f"Route: {result.route}, Model: {result.model}, Confidence: {result.confidence:.3f}")
```
Set the threshold based on your tolerance for misroutes. Lower thresholds (around 0.6) route more aggressively to cheap models; higher thresholds (around 0.85) use the fallback more often but avoid misroutes.
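One way to pick the number empirically is to sweep thresholds over a labeled validation set. A self-contained sketch, using hypothetical validation data in place of real router output (each entry is the top-route confidence and whether that route was correct):

```python
# Hypothetical validation results: (top-route confidence, was the route correct?)
validation = [
    (0.92, True), (0.88, True), (0.81, True), (0.79, False),
    (0.74, True), (0.71, False), (0.66, False), (0.62, False),
]

def sweep(threshold: float) -> tuple[float, float]:
    """Return (accuracy among routed prompts, fraction sent to fallback)."""
    routed = [(c, ok) for c, ok in validation if c >= threshold]
    fallback_rate = 1 - len(routed) / len(validation)
    accuracy = sum(ok for _, ok in routed) / len(routed) if routed else 1.0
    return accuracy, fallback_rate

for t in (0.6, 0.75, 0.85):
    acc, fb = sweep(t)
    print(f"threshold={t:.2f}  routed-accuracy={acc:.0%}  fallback-rate={fb:.0%}")
```

The tradeoff shows up directly: raising the threshold improves accuracy on the prompts you do route, at the cost of sending more traffic to the expensive fallback.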
## Classifier-Based Routing
For higher accuracy, train a lightweight classifier instead of using centroid similarity. This works better when routes overlap or when you have enough labeled data.
```python
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
import pickle

# Collect embeddings and labels from route examples
X_train = []
y_train = []
for route_name, route_config in routes.items():
    for example in route_config["examples"]:
        X_train.append(get_embedding(example))
        y_train.append(route_name)

X_train = np.array(X_train)
le = LabelEncoder()
y_encoded = le.fit_transform(y_train)

# Train a simple classifier (multinomial is the default for multiclass)
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_encoded)

# Save for production use
with open("router_model.pkl", "wb") as f:
    pickle.dump({"classifier": clf, "label_encoder": le, "routes": routes}, f)

def classify_route(prompt: str) -> RouteResult:
    """Route using the trained classifier with probability scores."""
    emb = get_embedding(prompt).reshape(1, -1)
    probs = clf.predict_proba(emb)[0]
    best_idx = np.argmax(probs)
    route_name = le.inverse_transform([best_idx])[0]
    return RouteResult(
        route=route_name,
        model=routes[route_name]["model"],
        confidence=float(probs[best_idx]),
    )

result = classify_route("Help me optimize this PostgreSQL query with 3 JOINs")
print(f"{result.route} → {result.model} ({result.confidence:.2f})")
```
With 50+ examples per route, logistic regression typically outperforms centroid matching on ambiguous prompts. Training takes milliseconds and inference is sub-millisecond, so the routing overhead per request is negligible.
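Before deploying, it's worth cross-validating the classifier on your labeled examples. A sketch with synthetic stand-in vectors in place of real embeddings (in practice you would reuse the cached `get_embedding` outputs):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Stand-in for real embeddings: three route clusters in a 32-dim space,
# 20 examples each. Real routes would use actual embedded prompts.
centers = rng.normal(size=(3, 32))
X = np.vstack([c + 0.3 * rng.normal(size=(20, 32)) for c in centers])
y = np.repeat(["simple", "code", "reasoning"], 20)

clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, X, y, cv=5)
print(f"cross-val accuracy: {scores.mean():.2f} ± {scores.std():.2f}")
```

If cross-validated accuracy is poor on your real data, the routes overlap and need more distinct examples before the classifier can beat the centroid approach.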
## End-to-End Router with Cost Tracking
Tie the router to actual LLM calls and track how much you’re saving:
```python
# Model pricing per 1M input tokens (approximate)
PRICING = {
    "gpt-4o-mini": 0.15,
    "gpt-4o": 2.50,
    "o1": 15.00,
}

class LLMRouter:
    def __init__(self):
        self.client = OpenAI()
        self.total_cost = 0.0
        self.total_cost_without_routing = 0.0
        self.call_count = 0

    def query(self, prompt: str, fallback_model: str = "gpt-4o") -> dict:
        route = classify_route(prompt)
        response = self.client.chat.completions.create(
            model=route.model,
            messages=[{"role": "user", "content": prompt}],
        )
        # Track costs (input tokens only; add output pricing for full accuracy)
        input_tokens = response.usage.prompt_tokens
        actual_cost = (input_tokens * PRICING.get(route.model, 2.50)) / 1_000_000
        baseline_cost = (input_tokens * PRICING[fallback_model]) / 1_000_000
        self.total_cost += actual_cost
        self.total_cost_without_routing += baseline_cost
        self.call_count += 1
        return {
            "response": response.choices[0].message.content,
            "model_used": route.model,
            "route": route.route,
            "confidence": route.confidence,
            "cost": actual_cost,
            "savings": baseline_cost - actual_cost,
        }

    def stats(self) -> dict:
        saved = self.total_cost_without_routing - self.total_cost
        pct = (saved / max(self.total_cost_without_routing, 1e-10)) * 100
        return {
            "total_calls": self.call_count,
            "total_cost": f"${self.total_cost:.4f}",
            "cost_without_routing": f"${self.total_cost_without_routing:.4f}",
            "savings": f"${saved:.4f} ({pct:.1f}%)",
        }

router = LLMRouter()
result = router.query("What is 2 + 2?")
print(f"Model: {result['model_used']}, Route: {result['route']}")
print(f"Savings: ${result['savings']:.6f}")
```
## Common Errors and Fixes
**Router always picks the same route**
Your example prompts are too similar across routes. Make them more distinct. Add at least 10 diverse examples per route and ensure there’s clear separation between categories.
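A quick diagnostic is to compare pairwise centroid similarities: routes whose centroids are nearly parallel will collapse into one. A sketch, using random stand-in vectors in place of real `route_embeddings` centroids:

```python
import numpy as np
from itertools import combinations

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-in centroids; in practice use route_embeddings[name]["centroid"]
rng = np.random.default_rng(1)
centroids = {name: rng.normal(size=64) for name in ("simple", "code", "reasoning")}

for (a, ca), (b, cb) in combinations(centroids.items(), 2):
    sim = cosine(ca, cb)
    flag = "  <-- too close, routes may collapse" if sim > 0.9 else ""
    print(f"{a} vs {b}: {sim:.3f}{flag}")
```

The 0.9 cutoff is an illustrative assumption; real embedding centroids tend to sit much closer together than random vectors, so calibrate against your own data.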
**Embedding API calls add latency**
Cache embeddings for repeated prompts. Use text-embedding-3-small (fastest) instead of text-embedding-3-large. For sub-10ms routing, pre-compute route centroids and use a local embedding model like sentence-transformers/all-MiniLM-L6-v2.
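The cache can be as simple as a dict keyed by the prompt text. A self-contained sketch, with the real API call replaced by a stub so the counting is visible:

```python
import numpy as np

_cache: dict[str, np.ndarray] = {}
calls = 0  # counts how often the underlying API would actually be hit

def get_embedding(text: str) -> np.ndarray:
    """Stub standing in for the real embeddings API call."""
    global calls
    calls += 1
    return np.ones(8)

def cached_embedding(text: str) -> np.ndarray:
    if text not in _cache:
        _cache[text] = get_embedding(text)
    return _cache[text]

cached_embedding("What is 2 + 2?")
cached_embedding("What is 2 + 2?")  # second lookup hits the cache
print(f"API calls: {calls}")  # 1, not 2
```

In production you would bound the cache (an LRU policy, or a TTL in Redis) so repeated prompts stop paying the embedding round-trip without the cache growing forever.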
**Complex prompt gets sent to the cheap model**
Your threshold is too low or the “simple” route examples overlap with complex ones. Raise the threshold toward 0.85 and audit misrouted prompts, then add each one as a training example to the correct route — which, for the classifier, implicitly makes it a negative example for the route that misclaimed it.
**Cost tracking doesn’t match actual bills**
Output tokens are typically 3-4x more expensive than input tokens. The example above only tracks input cost. Add output token pricing for accurate tracking.
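A minimal fix is a second price table for output tokens. The rates below are approximate and drift over time, so check current pricing before relying on them:

```python
# $ per 1M tokens; output rates run several times the input rates
INPUT_PRICING = {"gpt-4o-mini": 0.15, "gpt-4o": 2.50, "o1": 15.00}
OUTPUT_PRICING = {"gpt-4o-mini": 0.60, "gpt-4o": 10.00, "o1": 60.00}

def call_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Total cost of one call, pricing input and output tokens separately."""
    return (
        input_tokens * INPUT_PRICING[model]
        + output_tokens * OUTPUT_PRICING[model]
    ) / 1_000_000

# A 500-in / 300-out call on gpt-4o
print(f"${call_cost('gpt-4o', 500, 300):.6f}")
```

Wiring this into `LLMRouter.query` means passing `response.usage.completion_tokens` alongside the prompt tokens instead of discarding it.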
## When to Use Routing vs. a Single Model
Use routing when you have high volume (1000+ calls/day) with mixed complexity. The 60-80% cost savings compound quickly at scale.
Stick with a single model when your workload is uniformly complex, when latency from the embedding call matters more than cost, or when you’re still figuring out what model works best for your use case.