Every LLM call costs money. If your application sends the same question – or a slightly rephrased version of it – hundreds of times a day, you are burning cash for no reason. A semantic inference cache fixes this by recognizing that “What is Python?” and “What’s Python?” should return the same cached answer instead of hitting the model twice.
Exact-Match Caching: The Starting Point
The simplest cache is a direct key-value lookup in Redis. Hash the prompt, store the response.
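A minimal sketch of that lookup. The `call_llm` helper is a placeholder for your real model call, and `r` is any Redis-style client (e.g. `redis.Redis(decode_responses=True)`):

```python
import hashlib


def call_llm(prompt: str) -> str:
    """Placeholder for the real model call."""
    return f"(answer to: {prompt})"


def prompt_key(prompt: str) -> str:
    """Deterministic cache key: SHA-256 of the exact prompt string."""
    return "llm:" + hashlib.sha256(prompt.encode("utf-8")).hexdigest()


def cached_completion(r, prompt: str, ttl: int = 86400) -> str:
    key = prompt_key(prompt)
    cached = r.get(key)
    if cached is not None:
        return cached  # exact hit: no API call, no charge
    response = call_llm(prompt)
    r.setex(key, ttl, response)  # expire after `ttl` seconds
    return response
```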
This works fine for identical prompts. But it falls apart fast.
Why Exact-Match Fails
Consider these three prompts:
- “What is Python?”
- “What’s Python?”
- “Explain Python to me”
Do all three produce the same SHA-256 hash? No. Each one is a completely different string, so each generates a unique cache key. Three API calls, three charges, one answer.
In production, you will see this pattern constantly. Users rephrase, typos happen, and slight variations in system prompts or formatting create cache misses on what should be hits. Exact-match caching catches maybe 10-20% of redundant calls. Semantic caching catches 40-60%.
Semantic Hashing with Embeddings
The fix: embed the prompt into a vector, then hash the vector into a bucket. Similar prompts produce similar embeddings, which land in the same bucket.
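A sketch of the embedding step. The `client` argument is an `openai.OpenAI()` instance passed in by the caller (nothing is instantiated at import time), and casting to float32 keeps the vector consistent with the hashing code below:

```python
import numpy as np


def embed(text: str, client) -> np.ndarray:
    """Embed `text` with text-embedding-3-small, returned as float32.

    `client` is an openai.OpenAI() instance, or anything exposing the
    same embeddings.create interface.
    """
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.asarray(resp.data[0].embedding, dtype=np.float32)
```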
The text-embedding-3-small model returns a 1536-dimensional vector. Two semantically similar strings will have high cosine similarity – typically above 0.92 for paraphrases.
But you cannot use a 1536-float vector as a Redis key directly. You need to collapse it into a short, deterministic hash that preserves similarity. That is where locality-sensitive hashing comes in.
Locality-Sensitive Hashing for Approximate Matching
Locality-sensitive hashing (LSH) projects high-dimensional vectors into low-dimensional binary codes. Nearby vectors in the original space get the same binary code with high probability.
The simplest approach is random hyperplane LSH. Generate a set of random hyperplanes, then check which side of each hyperplane the vector falls on. Each side is a bit. Concatenate the bits and you have your hash.
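Random hyperplane LSH fits in a few lines. The defaults below (`num_planes=16`, `seed=42`) match the discussion that follows; the seed value itself is arbitrary, it just has to stay fixed:

```python
import numpy as np


def make_hyperplanes(dim: int = 1536, num_planes: int = 16,
                     seed: int = 42) -> np.ndarray:
    """Random hyperplanes through the origin; the fixed seed makes them
    identical across process restarts."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((num_planes, dim)).astype(np.float32)


def lsh_hash(embedding: np.ndarray, planes: np.ndarray) -> str:
    """One bit per hyperplane: which side of the plane the vector falls on."""
    bits = (planes @ embedding.astype(np.float32)) >= 0
    return "".join("1" if b else "0" for b in bits)
```

Two similar embeddings keep the same sign against most of the planes, so they produce the same bit string and land in the same bucket.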
With 16 hyperplanes, you get 2^16 = 65,536 possible buckets. That is enough granularity for most workloads. Increase num_planes to reduce false positives (different meanings landing in the same bucket) at the cost of more false negatives (similar meanings landing in different buckets).
The seed is fixed so the same hyperplanes are generated every time your application restarts. Without a fixed seed, your cache becomes useless after a restart because the hash function changes.
Full Semantic Cache Implementation
Here is the complete cache class that ties everything together. It tries semantic matching first, falls back to the LLM on a miss, and tracks hit rates. Usage is straightforward.
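One possible shape for the class, with the Redis client, embedding function, and LLM call injected so each piece stays swappable. The names here (`SemanticCache`, `embed_fn`, `llm_fn`) are this sketch's own, not a library's; example usage sits at the bottom:

```python
import numpy as np


class SemanticCache:
    """Semantic LLM cache: LSH-bucket the prompt embedding, then look the
    bucket up in Redis before paying for a model call."""

    def __init__(self, client, embed_fn, llm_fn, dim=1536,
                 num_planes=16, seed=42, ttl=86400):
        rng = np.random.default_rng(seed)  # fixed seed: stable across restarts
        self.planes = rng.standard_normal((num_planes, dim)).astype(np.float32)
        self.client = client      # Redis-style get/setex
        self.embed_fn = embed_fn  # str -> np.ndarray
        self.llm_fn = llm_fn      # str -> str
        self.ttl = ttl
        self.hits = 0
        self.misses = 0

    def _key(self, prompt: str) -> str:
        vec = self.embed_fn(prompt).astype(np.float32)
        bits = (self.planes @ vec) >= 0
        return "semcache:" + "".join("1" if b else "0" for b in bits)

    def complete(self, prompt: str) -> str:
        key = self._key(prompt)
        cached = self.client.get(key)
        if cached is not None:
            self.hits += 1
            return cached
        self.misses += 1
        response = self.llm_fn(prompt)
        self.client.setex(key, self.ttl, response)
        return response

    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


# Usage, with real dependencies wired in by you:
# r = redis.Redis(decode_responses=True)
# cache = SemanticCache(r, embed_fn=my_embed, llm_fn=my_llm)
# cache.complete("What is Python?")   # miss: calls the model
# cache.complete("What's Python?")    # likely a semantic hit
# print(f"hit rate: {cache.hit_rate():.0%}")
```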
Tuning the Number of Hyperplanes
The num_planes parameter controls the tradeoff between precision and recall:
- 8 planes (256 buckets): Aggressive caching. More semantic hits but higher risk of returning wrong answers for genuinely different questions.
- 16 planes (65K buckets): Good default. Catches most paraphrases while keeping false positives low.
- 24 planes (16M buckets): Conservative. Almost only catches near-identical phrasings. Useful when answer precision matters more than cost savings.
Start with 16 and adjust based on your hit rate and user complaints about wrong cached answers.
TTL Strategy
Set TTL based on how stable the underlying data is. For general knowledge questions, 24 hours is fine. For anything involving real-time data (stock prices, weather), keep TTL under 5 minutes or skip caching entirely. Use different TTL values for different prompt categories if your application can classify them.
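A per-category TTL table is usually enough. The categories and values below are illustrative, not prescriptive:

```python
# Illustrative TTLs in seconds; tune per application.
TTL_BY_CATEGORY = {
    "general_knowledge": 24 * 3600,  # stable facts: 24 hours
    "realtime": 5 * 60,              # stocks, weather: 5 minutes
}

DEFAULT_TTL = 3600  # fallback when the classifier is unsure (assumption)


def ttl_for(category: str) -> int:
    return TTL_BY_CATEGORY.get(category, DEFAULT_TTL)
```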
Common Errors and Fixes
redis.exceptions.ConnectionError: Error 111 connecting to localhost:6379
Redis is not running. Start it:
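Any of these will do, depending on how Redis is installed (package and service names vary by distro):

```shell
# Foreground, for local development:
redis-server

# Or as a systemd service (the unit may be named `redis` on some distros):
sudo systemctl start redis-server

# Or in Docker:
docker run -d -p 6379:6379 redis:7
```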
openai.RateLimitError on embedding calls
You are hitting the embeddings API too hard. The semantic hash requires one embedding call per cache miss. If you have bursty traffic, batch your embedding calls or add a local in-memory LRU cache for embeddings themselves using functools.lru_cache on the prompt string.
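One way to memoize embeddings in-process. The `embed` function below is a stand-in for the real API call; the tuple conversion keeps cached values immutable so callers cannot corrupt the cache:

```python
from functools import lru_cache

api_calls = {"count": 0}  # instrumentation for the example


def embed(prompt: str) -> list:
    """Stand-in for the real embeddings API call."""
    api_calls["count"] += 1
    return [float(len(prompt)), 0.0, 1.0]


@lru_cache(maxsize=10_000)
def cached_embed(prompt: str) -> tuple:
    # Return a tuple so cached values can't be mutated by callers.
    return tuple(embed(prompt))
```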
Stale cache returning outdated answers
Lower your TTL or add a version prefix to your cache keys. When you change models or system prompts, bump the version:
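The version prefix can live in one constant. Bump it and every old key becomes unreachable, then ages out via TTL (the `semcache:` prefix and `v2` value here are illustrative):

```python
CACHE_VERSION = "v2"  # bump on model or system-prompt changes


def versioned_key(bucket_hash: str) -> str:
    return f"semcache:{CACHE_VERSION}:{bucket_hash}"
```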
False positives: cache returns answers for the wrong question
Increase num_planes from 16 to 20 or 24. Alternatively, store the original prompt alongside the cached response and do a cosine similarity check before returning:
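A sketch of that verification step: store the prompt's embedding next to the response at write time, then compare it against the incoming query's embedding before trusting the bucket. The 0.92 default mirrors the paraphrase threshold mentioned earlier:

```python
import numpy as np


def cosine_similarity(a, b) -> float:
    a = np.asarray(a, dtype=np.float32)
    b = np.asarray(b, dtype=np.float32)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def verified_response(cached_vec, query_vec, cached_response,
                      threshold: float = 0.92):
    """Return the cached response only if the stored prompt really is
    close to the incoming one; otherwise treat it as a miss."""
    if cosine_similarity(cached_vec, query_vec) >= threshold:
        return cached_response
    return None  # caller falls through to the LLM
```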
This adds one extra similarity check per lookup but eliminates false positives above your chosen threshold. Worth it if your application serves user-facing answers where accuracy matters more than latency.
numpy dtype mismatch causing wrong hash results
Always cast embeddings to float32 explicitly. The OpenAI API returns float values that Python stores as float64. If your hyperplanes are float32 and your embedding is float64, the matrix multiply still works but rounding differences can flip bits near decision boundaries. Keep both as float32 for consistency.
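A quick demonstration of the dtype behavior (the values here are arbitrary):

```python
import numpy as np

planes = np.random.default_rng(42).standard_normal((16, 1536)).astype(np.float32)
raw = [0.1] * 1536                          # Python floats are 64-bit

vec64 = np.asarray(raw)                     # float64: silently promotes the product
vec32 = np.asarray(raw, dtype=np.float32)   # explicit cast: stays float32

# Both products "work", but they are computed at different precisions;
# a dot product sitting near zero can land on different sides of a plane.
prod64 = planes @ vec64
prod32 = planes @ vec32
```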
Related Guides
- How to Build a Model Training Queue with Redis and Worker Pools
- How to Build a Model Training Scheduler with Priority Queues and GPU Allocation
- How to Build a Model Training Cost Calculator with Cloud Pricing APIs
- How to Build a Model Inference Queue with Celery and Redis
- How to Build a Model Training Checkpoint Pipeline with PyTorch
- How to Build a Model Experiment Tracking Pipeline with MLflow and DuckDB
- How to Scale ML Training and Inference with Ray
- How to Build a Model Training Pipeline with Composer and FSDP
- How to Build a Model Artifact Cache with S3 and Local Fallback
- How to Build a Model Registry with S3 and DynamoDB