Every LLM call costs money. If your application sends the same question – or a slightly rephrased version of it – hundreds of times a day, you are burning cash for no reason. A semantic inference cache fixes this by recognizing that “What is Python?” and “What’s Python?” should return the same cached answer instead of hitting the model twice.

Exact-Match Caching: The Starting Point

The simplest cache is a direct key-value lookup in Redis. Hash the prompt, store the response.

import hashlib
import json
import redis
from openai import OpenAI

client = OpenAI()
r = redis.Redis(host="localhost", port=6379, db=0, decode_responses=True)

CACHE_TTL = 3600  # 1 hour

def get_cache_key(prompt: str, model: str) -> str:
    raw = f"{model}:{prompt}"
    return f"llm:exact:{hashlib.sha256(raw.encode()).hexdigest()}"

def cached_completion(prompt: str, model: str = "gpt-4o-mini") -> str:
    key = get_cache_key(prompt, model)
    cached = r.get(key)
    if cached:
        return cached

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    result = response.choices[0].message.content
    r.setex(key, CACHE_TTL, result)
    return result

This works fine for identical prompts. But it falls apart fast.

Why Exact-Match Fails

Consider these three prompts:

  • “What is Python?”
  • “What’s Python?”
  • “Explain Python to me”

Do all three produce the same SHA-256 hash? No. Each is a different byte string, so each generates its own cache key. Three API calls, three charges, one answer.

In production, you will see this pattern constantly. Users rephrase, typos happen, and slight variations in system prompts or formatting create cache misses on what should be hits. Exact-match caching catches maybe 10-20% of redundant calls. Semantic caching catches 40-60%.
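A quick check makes the miss pattern concrete. Hashing the three prompts above with the same key scheme as get_cache_key (the gpt-4o-mini model string is just for illustration) yields three distinct keys:

```python
import hashlib

prompts = ["What is Python?", "What's Python?", "Explain Python to me"]
keys = {
    hashlib.sha256(f"gpt-4o-mini:{p}".encode()).hexdigest()
    for p in prompts
}
print(len(keys))  # one key per prompt: three misses
```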

Semantic Hashing with Embeddings

The fix: embed the prompt into a vector, then hash the vector into a bucket. Similar prompts produce similar embeddings, which land in the same bucket.

import numpy as np
from openai import OpenAI

client = OpenAI()

def get_embedding(text: str) -> np.ndarray:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return np.array(response.data[0].embedding, dtype=np.float32)

The text-embedding-3-small model returns a 1536-dimensional vector. Two semantically similar strings will have high cosine similarity – typically above 0.92 for paraphrases.

But you cannot use a 1536-float vector as a Redis key directly. You need to collapse it into a short, deterministic hash that preserves similarity. That is where locality-sensitive hashing comes in.

Locality-Sensitive Hashing for Approximate Matching

Locality-sensitive hashing (LSH) projects high-dimensional vectors into low-dimensional binary codes. Nearby vectors in the original space get the same binary code with high probability.

The simplest approach is random hyperplane LSH. Generate a set of random hyperplanes, then check which side of each hyperplane the vector falls on. Each side is a bit. Concatenate the bits and you have your hash.

import numpy as np
import hashlib

class LSHHasher:
    def __init__(self, input_dim: int = 1536, num_planes: int = 16, seed: int = 42):
        rng = np.random.RandomState(seed)
        self.planes = rng.randn(num_planes, input_dim).astype(np.float32)

    def hash_vector(self, vec: np.ndarray) -> str:
        projections = self.planes @ vec
        bits = (projections > 0).astype(int)
        bit_string = "".join(str(b) for b in bits)
        return hashlib.md5(bit_string.encode()).hexdigest()

With 16 hyperplanes, you get 2^16 = 65,536 possible buckets. That is enough granularity for most workloads. Increase num_planes to reduce false positives (different meanings landing in the same bucket) at the cost of more false negatives (similar meanings landing in different buckets).

The seed is fixed so the same hyperplanes are generated every time your application restarts. Without a fixed seed, your cache becomes useless after a restart because the hash function changes.
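A small self-contained demo (8 input dimensions instead of 1536, to keep it readable) illustrates both properties: buckets are deterministic across restarts because the seed is fixed, negating a vector flips every bit, and a tiny perturbation rarely flips more than a bit or two near decision boundaries:

```python
import numpy as np

def make_planes(seed: int = 42, num_planes: int = 16, dim: int = 8) -> np.ndarray:
    return np.random.RandomState(seed).randn(num_planes, dim).astype(np.float32)

def bucket(planes: np.ndarray, vec: np.ndarray) -> tuple:
    # One bit per hyperplane: which side does the vector fall on?
    return tuple(int(b) for b in (planes @ vec > 0))

planes = make_planes()
rng = np.random.RandomState(0)
base = rng.randn(8).astype(np.float32)

# Same seed -> same planes -> same bucket, so the cache survives restarts.
print(bucket(make_planes(), base) == bucket(planes, base))

# Negating the vector flips the sign of every projection, so every bit flips.
print(bucket(planes, -base) == tuple(1 - b for b in bucket(planes, base)))

# A tiny perturbation leaves the bit pattern almost untouched.
nearby = base + 0.001 * rng.randn(8).astype(np.float32)
hamming = sum(a != b for a, b in zip(bucket(planes, base), bucket(planes, nearby)))
print(hamming)
```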

Full Semantic Cache Implementation

Here is the complete cache class that ties everything together. It tries semantic matching first, falls back to the LLM, and tracks hit rates.

import hashlib
import json
import time
import numpy as np
import redis
from openai import OpenAI

class SemanticInferenceCache:
    def __init__(
        self,
        redis_host: str = "localhost",
        redis_port: int = 6379,
        ttl: int = 3600,
        num_planes: int = 16,
        embedding_model: str = "text-embedding-3-small",
        seed: int = 42,
    ):
        self.r = redis.Redis(
            host=redis_host, port=redis_port, db=0, decode_responses=True
        )
        self.client = OpenAI()
        self.ttl = ttl
        self.embedding_model = embedding_model

        rng = np.random.RandomState(seed)
        self.planes = rng.randn(num_planes, 1536).astype(np.float32)

        # Hit rate tracking keys
        self.hits_key = "cache:stats:hits"
        self.misses_key = "cache:stats:misses"

    def _get_embedding(self, text: str) -> np.ndarray:
        response = self.client.embeddings.create(
            model=self.embedding_model,
            input=text,
        )
        return np.array(response.data[0].embedding, dtype=np.float32)

    def _semantic_hash(self, vec: np.ndarray) -> str:
        projections = self.planes @ vec
        bits = (projections > 0).astype(int)
        bit_string = "".join(str(b) for b in bits)
        return f"llm:semantic:{hashlib.md5(bit_string.encode()).hexdigest()}"

    def _exact_hash(self, prompt: str, model: str) -> str:
        raw = f"{model}:{prompt}"
        return f"llm:exact:{hashlib.sha256(raw.encode()).hexdigest()}"

    def query(self, prompt: str, model: str = "gpt-4o-mini") -> dict:
        # Try exact match first (cheapest lookup)
        exact_key = self._exact_hash(prompt, model)
        cached = self.r.get(exact_key)
        if cached:
            self.r.incr(self.hits_key)
            return {"result": cached, "cache": "exact_hit"}

        # Try semantic match
        embedding = self._get_embedding(prompt)
        semantic_key = self._semantic_hash(embedding)
        cached = self.r.get(semantic_key)
        if cached:
            self.r.incr(self.hits_key)
            # Also store exact key so next identical request skips embedding
            self.r.setex(exact_key, self.ttl, cached)
            return {"result": cached, "cache": "semantic_hit"}

        # Cache miss -- call the LLM
        self.r.incr(self.misses_key)
        response = self.client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        result = response.choices[0].message.content

        # Store under both keys
        self.r.setex(exact_key, self.ttl, result)
        self.r.setex(semantic_key, self.ttl, result)

        return {"result": result, "cache": "miss"}

    def hit_rate(self) -> float:
        hits = int(self.r.get(self.hits_key) or 0)
        misses = int(self.r.get(self.misses_key) or 0)
        total = hits + misses
        if total == 0:
            return 0.0
        return hits / total

    def flush(self) -> int:
        """Delete all cache keys. Returns number of keys deleted."""
        exact_keys = self.r.keys("llm:exact:*")
        semantic_keys = self.r.keys("llm:semantic:*")
        all_keys = exact_keys + semantic_keys
        if all_keys:
            return self.r.delete(*all_keys)
        return 0

Usage is straightforward:

cache = SemanticInferenceCache(redis_host="localhost", ttl=7200)

# First call -- cache miss, hits the LLM
result1 = cache.query("What is Python?")
print(result1["cache"])  # "miss"

# Identical call -- exact hit, no embedding or LLM call
result2 = cache.query("What is Python?")
print(result2["cache"])  # "exact_hit"

# Paraphrase -- semantic hit, one embedding call but no LLM call
result3 = cache.query("Explain what Python is")
print(result3["cache"])  # "semantic_hit"

print(f"Hit rate: {cache.hit_rate():.1%}")  # "Hit rate: 66.7%"

Tuning the Number of Hyperplanes

The num_planes parameter controls the tradeoff between precision and recall:

  • 8 planes (256 buckets): Aggressive caching. More semantic hits but higher risk of returning wrong answers for genuinely different questions.
  • 16 planes (65K buckets): Good default. Catches most paraphrases while keeping false positives low.
  • 24 planes (16M buckets): Conservative. Almost only catches near-identical phrasings. Useful when answer precision matters more than cost savings.

Start with 16 and adjust based on your hit rate and user complaints about wrong cached answers.
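For random hyperplane LSH there is a clean back-of-envelope formula behind this tradeoff: two vectors at angle theta fall on the same side of one random hyperplane with probability 1 - theta/pi, so with k planes they share a bucket with probability (1 - theta/pi)^k. Plugging in the 0.92 cosine similarity cited earlier for paraphrases shows how quickly single-bucket recall drops as planes are added:

```python
import math

def same_bucket_probability(cos_sim: float, num_planes: int) -> float:
    # Probability that two vectors with the given cosine similarity
    # land in the same bucket under random hyperplane LSH.
    theta = math.acos(cos_sim)
    return (1 - theta / math.pi) ** num_planes

for k in (8, 16, 24):
    print(k, round(same_bucket_probability(0.92, k), 3))
```

Real-world hit rates tend to run higher than this single-table estimate suggests, because many paraphrases have cosine similarity well above 0.92 (which shrinks theta), and because recall can be boosted by hashing into several independent tables with different seeds.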

TTL Strategy

Set TTL based on how stable the underlying data is. For general knowledge questions, 24 hours is fine. For anything involving real-time data (stock prices, weather), keep TTL under 5 minutes or skip caching entirely. Use different TTL values for different prompt categories if your application can classify them.
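The per-category idea can be as simple as a lookup table. The category names below are hypothetical; they stand in for whatever taxonomy your application's classifier already produces:

```python
DEFAULT_TTL = 3600  # 1 hour fallback for unclassified prompts

# Hypothetical categories -- adjust to your application's taxonomy.
TTL_BY_CATEGORY = {
    "general_knowledge": 86400,  # stable facts: 24 hours
    "product_docs": 14400,       # docs change occasionally: 4 hours
    "realtime": 300,             # stock prices, weather: 5 minutes max
}

def ttl_for(category: str) -> int:
    return TTL_BY_CATEGORY.get(category, DEFAULT_TTL)
```

The resulting TTL drops straight into the setex calls in SemanticInferenceCache.query in place of the fixed self.ttl.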

Common Errors and Fixes

redis.exceptions.ConnectionError: Error 111 connecting to localhost:6379

Redis is not running. Start it:

# Linux
sudo systemctl start redis-server

# macOS (Homebrew)
brew services start redis

# Docker
docker run -d --name redis -p 6379:6379 redis:latest

openai.RateLimitError on embedding calls

You are hitting the embeddings API too hard. The semantic hash requires one embedding call per cache miss. If you have bursty traffic, batch your embedding calls or add a local in-memory LRU cache for embeddings themselves using functools.lru_cache on the prompt string.
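A sketch of that in-memory layer. Here _embed_remote is a stand-in for the real client.embeddings.create call, with a counter added so the demo can show the API being hit only once:

```python
from functools import lru_cache

api_calls = {"count": 0}  # instrumentation for the demo only

def _embed_remote(text: str) -> tuple:
    # Stand-in for the real embeddings API call.
    api_calls["count"] += 1
    return tuple(float(ord(c)) for c in text)

@lru_cache(maxsize=4096)
def get_embedding_cached(text: str) -> tuple:
    # lru_cache requires hashable arguments and benefits from hashable
    # return values, so return a tuple and convert to np.ndarray at
    # the call site.
    return _embed_remote(text)

get_embedding_cached("What is Python?")
get_embedding_cached("What is Python?")  # served from memory
print(api_calls["count"])  # 1 -- the second call never reached the API
```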

Stale cache returning outdated answers

Lower your TTL or add a version prefix to your cache keys. When you change models or system prompts, bump the version:

def _semantic_hash(self, vec: np.ndarray, version: str = "v1") -> str:
    projections = self.planes @ vec
    bits = (projections > 0).astype(int)
    bit_string = "".join(str(b) for b in bits)
    return f"llm:semantic:{version}:{hashlib.md5(bit_string.encode()).hexdigest()}"

False positives: cache returns answers for the wrong question

Increase num_planes from 16 to 20 or 24. Alternatively, store the original prompt's embedding alongside the cached response and run a cosine similarity check before returning:

def _verify_similarity(self, vec1: np.ndarray, vec2: np.ndarray, threshold: float = 0.92) -> bool:
    cos_sim = np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
    return cos_sim >= threshold

The incoming query's embedding is already computed for the semantic hash, so the check costs only a dot product, not an extra API call. It rejects any bucket collision that falls below the threshold, which removes nearly all false positives. Worth it if your application serves user-facing answers where accuracy matters more than latency.

numpy dtype mismatch causing wrong hash results

Always cast embeddings to float32 explicitly. The OpenAI API returns float values that Python stores as float64. If your hyperplanes are float32 and your embedding is float64, the matrix multiply still works but rounding differences can flip bits near decision boundaries. Keep both as float32 for consistency.
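The safe pattern is to cast once at the boundary where the embedding enters your code. The snippet below also demonstrates the silent promotion the paragraph warns about: a float32-by-float64 matmul quietly produces float64:

```python
import numpy as np

planes = np.random.RandomState(42).randn(16, 8).astype(np.float32)
vec64 = np.random.RandomState(0).randn(8)    # float64, as returned by default
vec32 = np.asarray(vec64, dtype=np.float32)  # cast once at the boundary

print((planes @ vec64).dtype)  # float64 -- silent promotion
print((planes @ vec32).dtype)  # float32 -- consistent with the planes
```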