Running every prompt through the same model is wasteful. A basic factual question doesn’t need the same horsepower as a multi-step math derivation, and a creative writing request has different needs than a debugging session. By encoding your route categories as embeddings and comparing incoming prompts against them, you can automatically send each query to the model that handles it best – cutting costs on easy tasks and improving quality on hard ones.
The approach here uses sentence-transformers for local embedding (no API calls needed for classification), cosine similarity for matching, and the OpenAI API for the actual LLM calls. Everything runs in a single Python script.
```bash
pip install sentence-transformers numpy scikit-learn openai
```
## Define Your Route Categories
Each route needs a name, a target model, and a handful of example prompts that represent the kind of query it should handle. More examples per route means better classification. Five to ten examples is a good starting point.
```python
routes = {
    "coding": {
        "model": "gpt-4o",
        "system_prompt": "You are an expert software engineer. Write clean, production-ready code.",
        "examples": [
            "Write a Python function that finds the longest palindromic substring",
            "How do I set up a PostgreSQL connection pool in asyncpg?",
            "Debug this React component that re-renders on every keystroke",
            "Convert this synchronous Python code to use async/await",
            "Write a GitHub Actions workflow that runs pytest on pull requests",
            "Implement rate limiting middleware for a FastAPI application",
        ],
    },
    "creative": {
        "model": "gpt-4o",
        "system_prompt": "You are a creative writer. Be vivid, original, and engaging.",
        "examples": [
            "Write a short story about a robot discovering music for the first time",
            "Come up with 10 taglines for a coffee brand aimed at programmers",
            "Rewrite this paragraph to sound more conversational",
            "Write a poem about debugging at 3am",
            "Help me brainstorm names for a tech startup in the sustainability space",
        ],
    },
    "factual": {
        "model": "gpt-4o-mini",
        "system_prompt": "Answer factual questions concisely and accurately.",
        "examples": [
            "What is the population of Tokyo?",
            "When was the Treaty of Versailles signed?",
            "What does the acronym REST stand for?",
            "How many bytes are in a kilobyte?",
            "What programming language is the Linux kernel written in?",
            "Who invented the transistor?",
        ],
    },
    "reasoning": {
        "model": "gpt-4o",
        "system_prompt": "Think step by step. Show your reasoning before giving a final answer.",
        "examples": [
            "If a train leaves Chicago at 60mph and another leaves New York at 80mph, when do they meet?",
            "Prove that the square root of 2 is irrational",
            "What are the tradeoffs of using a B-tree vs a hash index in a database?",
            "Analyze the time complexity of this recursive function",
            "A farmer has 100 meters of fencing. What rectangle dimensions maximize area?",
        ],
    },
}
```
Notice each route also carries a system_prompt. This matters: once you know a query is about coding, you can prime the model with a specialized system message instead of a generic one.
## Build the Embedding Router
Here’s the core of the system. We use sentence-transformers to embed all the example prompts locally, compute a centroid for each route, then classify new prompts by cosine similarity against those centroids.
```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity


class PromptRouter:
    def __init__(self, routes: dict):
        self.model = SentenceTransformer("all-MiniLM-L6-v2")
        self.routes = routes
        self.route_centroids = {}
        self._build_centroids()

    def _build_centroids(self):
        for route_name, config in self.routes.items():
            embeddings = self.model.encode(config["examples"])
            centroid = np.mean(embeddings, axis=0)
            self.route_centroids[route_name] = centroid

    def classify(self, prompt: str) -> tuple[str, float]:
        prompt_embedding = self.model.encode([prompt])
        best_route = None
        best_score = -1.0
        for route_name, centroid in self.route_centroids.items():
            score = cosine_similarity(
                prompt_embedding, centroid.reshape(1, -1)
            )[0][0]
            if score > best_score:
                best_score = score
                best_route = route_name
        return best_route, float(best_score)


router = PromptRouter(routes)

# Test it
test_prompts = [
    "Write a binary search in Rust",
    "What year did the Berlin Wall fall?",
    "Write a haiku about machine learning",
    "If I flip a fair coin 10 times, what's the probability of exactly 7 heads?",
]

for prompt in test_prompts:
    route, score = router.classify(prompt)
    print(f"[{route} ({score:.3f})] {prompt}")
```
Expected output looks something like:
```text
[coding (0.612)] Write a binary search in Rust
[factual (0.578)] What year did the Berlin Wall fall?
[creative (0.534)] Write a haiku about machine learning
[reasoning (0.571)] If I flip a fair coin 10 times, what's the probability of exactly 7 heads?
```
The all-MiniLM-L6-v2 model is small (about 80 MB), fast on CPU, and produces 384-dimensional embeddings. For most routing tasks, this is more than enough. If you need higher accuracy, try all-mpnet-base-v2; it's slower but produces better embeddings.
## Connect Routes to LLM Backends
Now wire the router to actual model calls. Each route maps to a specific OpenAI model and system prompt.
```python
from openai import OpenAI

client = OpenAI()


def route_and_call(prompt: str, router: PromptRouter) -> str:
    route_name, confidence = router.classify(prompt)
    route_config = routes[route_name]
    print(f"Routing to '{route_name}' ({route_config['model']}) "
          f"with confidence {confidence:.3f}")
    response = client.chat.completions.create(
        model=route_config["model"],
        messages=[
            {"role": "system", "content": route_config["system_prompt"]},
            {"role": "user", "content": prompt},
        ],
        temperature=0.7,
    )
    return response.choices[0].message.content


# Route a coding question to gpt-4o
answer = route_and_call("Write a Python decorator that retries failed functions 3 times", router)
print(answer)

# Route a factual question to gpt-4o-mini (cheaper)
answer = route_and_call("What is the speed of light in meters per second?", router)
print(answer)
```
The factual question goes to gpt-4o-mini at a fraction of the cost. The coding question gets the full gpt-4o treatment. You’re paying for power only when you need it.
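You can sanity-check the savings with back-of-the-envelope arithmetic. A minimal sketch, assuming an illustrative traffic mix and per-request costs (the numbers below are placeholders, not current OpenAI prices):

```python
# Illustrative per-request costs (placeholders, not real prices).
COST_PER_REQUEST = {"gpt-4o": 0.010, "gpt-4o-mini": 0.0006}

# Suppose routing sends 40% of traffic to the cheap model.
traffic = {"gpt-4o": 0.6, "gpt-4o-mini": 0.4}

# Blended cost under routing vs. sending everything to the big model.
blended = sum(COST_PER_REQUEST[m] * share for m, share in traffic.items())
baseline = COST_PER_REQUEST["gpt-4o"]
print(f"blended: ${blended:.4f}/req vs baseline: ${baseline:.4f}/req "
      f"({(1 - blended / baseline):.0%} saved)")
```

Even a modest share of cheap-model traffic compounds quickly at scale; rerun the arithmetic with your own routing mix and current pricing.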
## Add Confidence Thresholds
Sometimes a prompt doesn’t fit any route well. Maybe it’s a hybrid question or something completely outside your defined categories. You need a fallback.
```python
def route_and_call_with_fallback(
    prompt: str,
    router: PromptRouter,
    confidence_threshold: float = 0.35,
    fallback_model: str = "gpt-4o",
) -> str:
    route_name, confidence = router.classify(prompt)
    if confidence < confidence_threshold:
        print(f"Low confidence ({confidence:.3f}). Using fallback model: {fallback_model}")
        response = client.chat.completions.create(
            model=fallback_model,
            messages=[
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": prompt},
            ],
            temperature=0.7,
        )
    else:
        route_config = routes[route_name]
        print(f"Routing to '{route_name}' ({route_config['model']}) "
              f"with confidence {confidence:.3f}")
        response = client.chat.completions.create(
            model=route_config["model"],
            messages=[
                {"role": "system", "content": route_config["system_prompt"]},
                {"role": "user", "content": prompt},
            ],
            temperature=0.7,
        )
    return response.choices[0].message.content


# An ambiguous prompt that might not match well
answer = route_and_call_with_fallback(
    "What's the best laptop for video editing under $1500?",
    router,
)
print(answer)
```
Picking the right threshold takes experimentation. Start at 0.35 and adjust. Log your confidence scores in production for a few days, then look at the distribution. If your routes are well-defined, most scores land between 0.45 and 0.70. Anything consistently below 0.35 is likely out-of-domain.
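Once you have logged scores, picking the threshold can be as simple as inspecting percentiles. A minimal sketch with made-up logged values (the list and the "just below the 10th percentile" heuristic are illustrative assumptions, not a rule from this article):

```python
import numpy as np

# Hypothetical confidence scores collected from production logs.
logged_scores = [0.61, 0.55, 0.48, 0.50, 0.52, 0.33, 0.67, 0.58, 0.29, 0.63]

# Look at the low end and the middle of the distribution.
p10, p50 = np.percentile(logged_scores, [10, 50])
print(f"10th percentile: {p10:.3f}, median: {p50:.3f}")

# One heuristic: set the fallback threshold just below the 10th
# percentile, so only clear outliers hit the fallback model.
threshold = round(float(p10) - 0.02, 3)
```

If the 10th percentile sits well above 0.35, you can raise the threshold and route more aggressively; if it sits below, your routes probably need more examples.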
You can also add a secondary check: instead of just the top match, compare the gap between the top two routes. If they’re within 0.02 of each other, treat it as ambiguous and use the fallback.
```python
def classify_with_gap_check(
    prompt: str,
    router: PromptRouter,
    min_gap: float = 0.02,
) -> tuple[str, float, bool]:
    prompt_embedding = router.model.encode([prompt])
    scores = {}
    for route_name, centroid in router.route_centroids.items():
        score = cosine_similarity(
            prompt_embedding, centroid.reshape(1, -1)
        )[0][0]
        scores[route_name] = float(score)
    sorted_routes = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    top_route, top_score = sorted_routes[0]
    runner_up_score = sorted_routes[1][1]
    is_ambiguous = (top_score - runner_up_score) < min_gap
    return top_route, top_score, is_ambiguous
```
## Common Errors and Fixes
RuntimeError: No CUDA GPUs are available when loading sentence-transformers
This happens when PyTorch defaults to looking for a GPU. Sentence-transformers runs fine on CPU for routing workloads. Force CPU explicitly:
```python
model = SentenceTransformer("all-MiniLM-L6-v2", device="cpu")
```
openai.AuthenticationError: Incorrect API key provided
Your OPENAI_API_KEY environment variable isn’t set or is wrong. Set it before running:
```bash
export OPENAI_API_KEY="sk-..."
```
Or pass it directly to the client:
```python
client = OpenAI(api_key="sk-your-key-here")
```
ValueError: shapes (1,384) and (1,768) not aligned in cosine similarity
You changed the sentence-transformers model mid-run without rebuilding centroids. Different models produce different embedding dimensions (all-MiniLM-L6-v2 gives 384, all-mpnet-base-v2 gives 768). Rebuild centroids whenever you swap models:
```python
router = PromptRouter(routes)  # Rebuilds centroids with the new model
```
Low accuracy – all prompts route to the same category
Your example prompts are too similar across categories, or you have too few examples. Add more diverse examples to each route, aiming for at least 6-8 per category. Also check that your categories are actually distinct. If “coding” and “reasoning” share many overlapping examples, merge them or sharpen the boundary.
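One quick diagnostic for overlap is to compare the route centroids against each other: if two centroids are nearly identical, the embedding model cannot separate those categories. A minimal sketch, using toy 3-dimensional centroids in place of the real 384-dimensional ones (centroid_overlap is a helper name introduced here):

```python
import numpy as np

def centroid_overlap(centroids: dict[str, np.ndarray]) -> list[tuple[str, str, float]]:
    """Return pairwise cosine similarities between route centroids, highest first."""
    names = list(centroids)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            u, v = centroids[a], centroids[b]
            sim = float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
            pairs.append((a, b, sim))
    return sorted(pairs, key=lambda p: p[2], reverse=True)

# Toy centroids for illustration: "coding" and "reasoning" nearly collinear.
toy = {
    "coding": np.array([1.0, 0.1, 0.0]),
    "reasoning": np.array([0.9, 0.2, 0.0]),
    "creative": np.array([0.0, 0.0, 1.0]),
}
for a, b, sim in centroid_overlap(toy):
    print(f"{a} vs {b}: {sim:.3f}")
```

Run this on `router.route_centroids` after building the real router; pairs with very high similarity are the ones to merge or sharpen.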