If you’re sending the same system prompt, tool definitions, or document context on every API call, you’re paying full price for tokens the model has already seen. Prompt caching fixes this. Anthropic’s cache reads cost 10% of the base input price. OpenAI gives a 50% discount on cached tokens automatically. Either way, the savings on high-volume workloads are significant.
Here’s what you need to know to implement it on both platforms.
## How Prompt Caching Works
The core idea is simple: the API stores a prefix of your prompt after the first request. On subsequent requests with an identical prefix, the provider reuses the stored representation instead of reprocessing those tokens from scratch. You pay a reduced rate for the cached portion and get faster time-to-first-token.
There are two approaches across providers:
- Anthropic (explicit): You mark content blocks with `cache_control` to tell the API what to cache. You get fine-grained control over caching behavior with up to 4 breakpoints.
- OpenAI (automatic): Caching kicks in automatically for prompts over 1,024 tokens. No code changes needed – the system caches the longest matching prefix behind the scenes.
Both require the prompt prefix to be byte-for-byte identical across requests. Even a whitespace change breaks the cache.
## Anthropic: Explicit Cache Control
With the Anthropic API, you add `"cache_control": {"type": "ephemeral"}` to the content blocks you want cached. The cache hierarchy follows the order: `tools` -> `system` -> `messages`. Anything before the cache breakpoint gets cached as a unit.
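A minimal sketch of the request shape, with the model id and document text as placeholders; you would pass this dict to `client.messages.create(**request)` using the official `anthropic` Python SDK:

```python
# Request payload with one explicit cache breakpoint. Everything from the
# start of the prompt through the block carrying cache_control is cached
# as a single prefix.
big_document = "Section 1. Definitions. ... " * 400  # stand-in for 1,024+ tokens of static context

request = {
    "model": "claude-sonnet-4-20250514",  # assumed model id; use whichever you target
    "max_tokens": 1024,
    "system": [
        {"type": "text", "text": "You are a contract-analysis assistant."},
        {
            "type": "text",
            "text": big_document,
            "cache_control": {"type": "ephemeral"},  # cache breakpoint
        },
    ],
    "messages": [
        {"role": "user", "content": "Summarize the termination clauses."},  # varies per call
    ],
}
print(request["system"][-1]["cache_control"])
```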
The first call writes to cache and you’ll see `cache_creation_input_tokens` populated. The second call with the same prefix hits cache, and `cache_read_input_tokens` shows the reused tokens:
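A small helper makes the write/read distinction easy to log; this sketch assumes the two usage fields come back exactly as named above:

```python
def cache_status(usage: dict) -> str:
    """Classify a call from its usage block (field names per the text above)."""
    if usage.get("cache_read_input_tokens", 0) > 0:
        return "cache hit"
    if usage.get("cache_creation_input_tokens", 0) > 0:
        return "cache write"
    return "no caching"

# First call: the prefix is written to cache.
first = {"input_tokens": 42, "cache_creation_input_tokens": 50_000, "cache_read_input_tokens": 0}
# Second call with an identical prefix: tokens are read back at the reduced rate.
second = {"input_tokens": 42, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 50_000}

print(cache_status(first))   # cache write
print(cache_status(second))  # cache hit
```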
### Anthropic Pricing Math
Cache writes cost 1.25x the base input price. Cache reads cost 0.1x. For Claude Sonnet 4 at $3/MTok base input:
| Token type | Cost per MTok |
|---|---|
| Base input | $3.00 |
| Cache write | $3.75 |
| Cache read | $0.30 |
If you cache 50,000 tokens and read from cache 20 times before it expires, you pay about $0.19 for the write (50,000 tokens at $3.75/MTok) and 20 x $0.015 = $0.30 for the reads – roughly $0.49 total. Without caching, you’d pay for 21 full passes: 21 x 50,000 tokens at $3.00/MTok = $3.15. That’s an 84.5% cost reduction.
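The arithmetic generalizes to any token count and read count; a quick sketch using the rates from the table above:

```python
BASE = 3.00          # $/MTok base input (Claude Sonnet 4, per the table above)
WRITE = 1.25 * BASE  # $3.75/MTok cache write
READ = 0.10 * BASE   # $0.30/MTok cache read

def caching_cost(tokens: int, reads: int) -> tuple[float, float]:
    """Return (cost with caching, cost without) for one write plus `reads` hits."""
    mtok = tokens / 1_000_000
    with_cache = mtok * WRITE + reads * mtok * READ
    without_cache = (reads + 1) * mtok * BASE
    return with_cache, without_cache

with_cache, without = caching_cost(50_000, 20)
print(f"${with_cache:.4f} vs ${without:.2f}")    # $0.4875 vs $3.15
print(f"{1 - with_cache / without:.1%} cheaper")  # 84.5% cheaper
```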
### Minimum Token Requirements
Not every prompt qualifies for caching. The minimums vary by model:
- Claude Sonnet 4 / Opus 4.1: 1,024 tokens
- Claude Opus 4.5 / Opus 4.6: 4,096 tokens
- Claude Haiku 4.5: 4,096 tokens
- Claude Haiku 3: 2,048 tokens
If your cached prefix is shorter than the threshold, the request processes normally without caching – no error, just no cache.
## OpenAI: Automatic Caching
OpenAI’s approach requires zero code changes. Any prompt over 1,024 tokens gets automatically cached. The system stores the longest matching prefix and serves it on subsequent identical requests.
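Since nothing needs to be marked, the only code-level concern is keeping the static content first in the prompt. A sketch (model id assumed; the dict would be passed to `client.chat.completions.create(**request)` with the official `openai` SDK):

```python
system_prompt = "You are a meticulous support agent. " * 200  # pushes the prefix past 1,024 tokens

request = {
    "model": "gpt-4o",  # assumed model id
    "messages": [
        # Static content first: this is the prefix OpenAI can cache.
        {"role": "system", "content": system_prompt},
        # Per-request content last, so it never invalidates the prefix.
        {"role": "user", "content": "Where is my order?"},
    ],
}
print(request["messages"][0]["role"])  # system
```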
When caching works, `response.usage.prompt_tokens_details.cached_tokens` will be a non-zero value. Cached tokens cost 50% of the regular input price – not as steep a discount as Anthropic’s 90%, but it requires no opt-in or code changes.
### OpenAI Cache Duration
OpenAI caches are kept for 5-10 minutes of inactivity and always cleared within one hour. For sustained workloads, the cache stays warm naturally. For bursty traffic, you may need a warm-up request.
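The warm-up pattern for bursty traffic looks like this sketch, where `call_api` is a stand-in for your real request function:

```python
import asyncio

async def call_api(user_input: str) -> str:
    await asyncio.sleep(0.01)  # placeholder for the real API round trip
    return f"response to {user_input}"

async def run_batch(inputs: list[str]) -> list[str]:
    # One warm-up request writes the shared prefix to the cache...
    await call_api(inputs[0])
    # ...then the remaining requests can run in parallel and read from it.
    return await asyncio.gather(*(call_api(i) for i in inputs))

results = asyncio.run(run_batch(["a", "b", "c"]))
print(len(results))  # 3
```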
## Multi-Turn Conversations
Prompt caching really shines in multi-turn conversations, where the system prompt stays constant and the conversation history grows with each turn. With Anthropic, place `cache_control` on the last message of the conversation so far:
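One way to implement the moving breakpoint, sketched with the content-block shape from earlier (only the newest block carries `cache_control`):

```python
def with_moving_breakpoint(history: list[dict]) -> list[dict]:
    """Rebuild messages so only the final content block is a cache breakpoint."""
    rebuilt = []
    for i, msg in enumerate(history):
        block = {"type": "text", "text": msg["content"]}
        if i == len(history) - 1:
            block["cache_control"] = {"type": "ephemeral"}  # breakpoint moves here each turn
        rebuilt.append({"role": msg["role"], "content": [block]})
    return rebuilt

history = [
    {"role": "user", "content": "Summarize clause 3."},
    {"role": "assistant", "content": "Clause 3 covers termination notice..."},
    {"role": "user", "content": "And clause 4?"},
]
messages = with_moving_breakpoint(history)
print("cache_control" in messages[-1]["content"][0])  # True
print("cache_control" in messages[0]["content"][0])   # False
```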
Each subsequent turn reuses the cached prefix of the conversation, so you only pay full price for the new turn’s tokens. In a 20-turn conversation with a 4,000-token system prompt, this saves you from reprocessing ~80,000+ cumulative tokens.
## Common Errors and Fixes
### “AttributeError: ‘Beta’ object has no attribute ‘prompt_caching’”

This happens when using outdated Anthropic SDK code. Prompt caching left beta – drop the `beta` prefix:
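The fix is a one-line change to the call path (method paths from the `anthropic` Python SDK):

```python
# Beta-era call path, removed once caching went GA (raises the AttributeError above):
#     client.beta.prompt_caching.messages.create(...)
# Current call path – same kwargs, cache_control included:
#     client.messages.create(...)

old_path = "client.beta.prompt_caching.messages.create"
new_path = old_path.replace(".beta.prompt_caching", "")
print(new_path)  # client.messages.create
```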
### Cache Misses When You Expect Hits
If `cache_read_input_tokens` stays at 0 on follow-up calls, check these:
- Prompt changed: Even a single whitespace difference breaks the cache. Ensure your cached prefix is byte-identical across requests.
- Cache expired: Default TTL is 5 minutes of inactivity. If your requests are more than 5 minutes apart, the cache is gone. Use Anthropic’s 1-hour TTL option (`"ttl": "1h"`) for slower workflows.
- Below minimum tokens: Your cached content must exceed the model’s minimum threshold (1,024 or 4,096 depending on model).
- Concurrent requests: A cache entry only becomes available after the first response starts generating. If you fire 10 parallel requests before the first one returns, none of them benefit from caching. Send one warm-up request first, wait for the response to begin, then send the batch.
- Unstable JSON key ordering: Some languages (Go, Swift) randomize dictionary key order during JSON serialization. If your tool definitions serialize differently each time, the cache breaks every request.
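The key-ordering pitfall is easy to neutralize by serializing deterministically. A Python sketch (other languages have equivalent canonical-JSON options):

```python
import json

tool_def = {
    "name": "search_docs",
    "description": "Search the documentation index.",
    "input_schema": {"type": "object", "properties": {"query": {"type": "string"}}},
}
# The same definition with a different in-memory key order.
shuffled = dict(reversed(list(tool_def.items())))

# sort_keys=True makes the serialized bytes identical regardless of insertion order,
# so the prompt prefix stays byte-for-byte stable across requests.
print(json.dumps(tool_def, sort_keys=True) == json.dumps(shuffled, sort_keys=True))  # True
print(json.dumps(tool_def) == json.dumps(shuffled))                                  # False
```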
### OpenAI Shows 0 Cached Tokens
OpenAI requires at least 1,024 prompt tokens for caching to activate. If your prompt is shorter, `cached_tokens` will always be 0. Also, OpenAI caches are not guaranteed – they’re best-effort and depend on server-side routing.
## When Prompt Caching Pays Off
Caching makes the biggest difference when:
- Large static context: System prompts, RAG documents, or tool definitions over 1,024 tokens that repeat across calls.
- Multi-turn conversations: Each turn reuses the growing conversation prefix.
- Batch processing: Running the same prompt template against many inputs (e.g., analyzing 500 legal contracts with the same instructions).
- Agent loops: Tool-using agents that make 5-20 API calls per task with identical system prompts and tool definitions.
It’s less useful for one-shot requests with unique prompts or very short system instructions under the minimum token threshold.
## Related Guides
- How to Implement Streaming Responses from LLM APIs
- How to Build Prompt Caching Strategies for Multi-Turn LLM Sessions
- How to Build Prompt Chains with Async LLM Calls and Batching
- How to Build Prompt Templates with Python F-Strings and Chat Markup
- How to Route Prompts to the Best LLM with a Semantic Router
- How to Build Prompt Versioning and Regression Testing for LLMs
- How to Build Context-Aware Prompt Routing with Embeddings
- How to Build Token-Efficient Prompt Batching with LLM APIs
- How to Build Prompt Fallback Chains with Automatic Model Switching
- How to Build Dynamic Prompt Routers with LLM Cascading