Every time you send a large system prompt, a 50-page document, or a set of tool definitions to the Claude API, you’re paying full price for tokens the model has already processed. Anthropic’s prompt caching lets you store those static prefixes and reuse them across requests. Cached reads cost just 10% of the base input token price – a 90% discount. The tradeoff is a 25% surcharge on the first request that writes the cache. For any workload where the same context is sent more than once, you come out ahead: one write plus one read costs 1.35x the base price, versus 2x for two uncached sends.
How Prompt Caching Works
You mark content blocks with "cache_control": {"type": "ephemeral"} to create cache breakpoints. On the first request, the API processes the full prompt and writes everything up to the breakpoint into cache. On subsequent requests with an identical prefix, the API reads from cache instead of reprocessing.
Key details:
- Cache hierarchy: The API caches in order – tools, then system, then messages. Changing anything earlier in the chain invalidates everything after it.
- Minimum token requirements: The cached prefix must meet a minimum size or caching silently skips. For Claude Sonnet 4 and Opus 4.1, the minimum is 1,024 tokens. For Claude Opus 4.5, it’s 4,096 tokens. Claude Haiku 3 requires 2,048 tokens.
- TTL: Cached entries last 5 minutes by default, refreshed each time they’re read. Anthropic also offers a 1-hour TTL at 2x the base input price for cache writes (instead of 1.25x). Set it with "cache_control": {"type": "ephemeral", "ttl": "1h"}.
- Breakpoint limit: You can set up to 4 cache_control breakpoints per request.
- Automatic lookback: The system checks up to 20 blocks before each breakpoint for cache hits, working backwards to find the longest matching prefix.
Caching a Large System Prompt
The most common pattern is caching a large system prompt that stays the same across requests. Pass the system prompt as an array of content blocks and add cache_control on the last block you want cached.
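A minimal sketch of this pattern with the Python SDK – the model name and prompt text are placeholder assumptions, and the environment-variable guard keeps the snippet inert when no API key is configured:

```python
import os

# Placeholder: a static prefix assumed to meet the model's minimum cacheable size.
LARGE_SYSTEM_PROMPT = "You are a support agent for Acme Corp. " * 400

request = {
    "model": "claude-sonnet-4-20250514",  # assumed model name
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": LARGE_SYSTEM_PROMPT,
            # Breakpoint: everything up to and including this block is cached.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    "messages": [{"role": "user", "content": "Summarize our refund policy."}],
}

if os.environ.get("ANTHROPIC_API_KEY"):  # only call the API when a key is set
    import anthropic

    client = anthropic.Anthropic()
    response = client.messages.create(**request)
    print(response.usage)  # first call: cache_creation_input_tokens > 0
```

Only the system blocks carry cache_control; the user message sits after the breakpoint and can vary per request.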
The first request writes the system prompt to cache. Every subsequent request with the same system prompt prefix reads from cache at 10% cost. Your user message can change freely – only the prefix up to the cache_control breakpoint needs to stay identical.
Caching Documents for Multi-Turn Conversations
In a multi-turn chat where users ask questions about a document, you want the document cached across the entire conversation. Place the document in the system prompt with cache_control, then add a second breakpoint on the last message in the conversation to incrementally cache the growing history.
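A sketch of the two-breakpoint setup (document text, model name, and conversation content are placeholders; the guard keeps the example inert offline):

```python
import os

DOCUMENT = "<full text of the 50-page document>"  # placeholder

system_blocks = [
    {"type": "text", "text": "Answer questions about the attached document."},
    {
        "type": "text",
        "text": DOCUMENT,
        "cache_control": {"type": "ephemeral"},  # breakpoint 1: the document
    },
]

conversation = [
    {"role": "user", "content": "What does section 3 cover?"},
    {"role": "assistant", "content": "Section 3 covers ..."},
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "How does that relate to section 5?",
                # breakpoint 2: caches the conversation history so far
                "cache_control": {"type": "ephemeral"},
            }
        ],
    },
]

if os.environ.get("ANTHROPIC_API_KEY"):  # only call the API when a key is set
    import anthropic

    response = anthropic.Anthropic().messages.create(
        model="claude-sonnet-4-20250514",  # assumed model name
        max_tokens=1024,
        system=system_blocks,
        messages=conversation,
    )
    print(response.usage)
```

On each new turn, move the second breakpoint to the latest user message so the cache entry grows with the history.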
On turn 2, the system prompt hits the cache (you’ll see cache_read_input_tokens populated), and the conversation up to the second breakpoint gets written to a new cache entry. On turn 3, both the system prompt and the first two turns get read from cache. Each subsequent turn only pays full price for the new user message.
Caching Tool Definitions
If you’re building an agent with a fixed set of tools, those tool schemas can eat up thousands of tokens per request. Cache them by adding cache_control to the last tool in the array. Since tools sits at the top of the cache hierarchy, this caches all tool definitions as a single prefix.
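A sketch with three hypothetical tool schemas – the tool names, fields, and model name are all illustrative assumptions:

```python
import os

tools = [
    {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "input_schema": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
    {
        "name": "search_docs",
        "description": "Search the internal documentation.",
        "input_schema": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
    {
        "name": "send_email",
        "description": "Send an email to a recipient.",
        "input_schema": {
            "type": "object",
            "properties": {"to": {"type": "string"}, "body": {"type": "string"}},
            "required": ["to", "body"],
        },
    },
]

# cache_control on the LAST tool caches the whole tools array as one prefix.
tools[-1]["cache_control"] = {"type": "ephemeral"}

if os.environ.get("ANTHROPIC_API_KEY"):  # only call the API when a key is set
    import anthropic

    response = anthropic.Anthropic().messages.create(
        model="claude-sonnet-4-20250514",  # assumed model name
        max_tokens=1024,
        tools=tools,
        messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    )
    print(response.usage)
```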
On the first call, all three tool schemas are written to cache. Every subsequent call with the same tool definitions reads them from cache. This is particularly valuable for agent loops where the same agent makes 5-20 calls per task – the tool definitions are only paid in full once.
Monitoring Cache Performance
Every API response includes a usage object with three input token fields that tell you exactly what happened with caching:
- cache_creation_input_tokens – tokens written to cache on this request (you pay 1.25x base price)
- cache_read_input_tokens – tokens read from an existing cache entry (you pay 0.1x base price)
- input_tokens – tokens after the last cache breakpoint that aren’t cached (you pay 1x base price)
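An illustration of reading these counters – the numbers are invented, and SimpleNamespace stands in for the Message object the SDK returns:

```python
from types import SimpleNamespace

# Stand-in for the response from client.messages.create(); values are made up.
response = SimpleNamespace(
    usage=SimpleNamespace(
        cache_creation_input_tokens=2048,  # written this call (1.25x price)
        cache_read_input_tokens=0,         # nothing read from cache yet
        input_tokens=37,                   # uncached suffix (1x price)
    )
)

u = response.usage
if u.cache_read_input_tokens > 0:
    print(f"cache hit: {u.cache_read_input_tokens} tokens read")
elif u.cache_creation_input_tokens > 0:
    print(f"cache written: {u.cache_creation_input_tokens} tokens")
else:
    print("no caching occurred")
```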
The total input tokens for any request is:
total input tokens = input_tokens + cache_creation_input_tokens + cache_read_input_tokens
If cache_creation_input_tokens is non-zero on a repeat call, your prefix changed and a new cache entry was written. If cache_read_input_tokens is always 0, your content isn’t being cached – check the troubleshooting section below.
Common Errors and Fixes
cache_read_input_tokens is Always 0
Your prefix isn’t matching. Check these in order:
- Prefix changed between calls. Even whitespace or key ordering differences break the cache. If you’re building the system prompt dynamically (timestamps, random IDs), the prefix changes every time. Move dynamic content after the cache breakpoint.
- Below minimum tokens. The cached prefix must meet the model’s minimum: 1,024 tokens for Sonnet 4, 4,096 for Opus 4.5. Shorter prompts are processed normally without caching, and no error is raised.
- Cache expired. The default TTL is 5 minutes of inactivity. If requests are spaced further apart, the cache evicts. Use "ttl": "1h" for slower workflows.
- Concurrent first requests. A cache entry only becomes available after the first response begins streaming. If you fire 10 parallel requests before the first one returns, they all write to cache independently. Send one warm-up request, wait for the response to start, then send the rest.
AttributeError: 'Beta' object has no attribute 'prompt_caching'
Prompt caching graduated out of beta, so the old client.beta.prompt_caching namespace no longer exists. Call client.messages.create directly – the standard Messages API accepts cache_control.
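A sketch of the migrated call (model name and prompt content are assumptions; the guard keeps it inert without an API key):

```python
import os

system = [
    {
        "type": "text",
        "text": "A long, static system prompt ...",  # placeholder content
        "cache_control": {"type": "ephemeral"},
    }
]

if os.environ.get("ANTHROPIC_API_KEY"):  # only call the API when a key is set
    import anthropic

    client = anthropic.Anthropic()
    # Old beta path (now raises AttributeError):
    #   client.beta.prompt_caching.messages.create(...)
    # Current path – messages.create accepts cache_control directly:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # assumed model name
        max_tokens=256,
        system=system,
        messages=[{"role": "user", "content": "Hello"}],
    )
    print(response.usage)
```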
Unstable JSON Key Ordering
Some languages don’t guarantee a stable dictionary key order (Go deliberately randomizes map iteration; Swift dictionaries are unordered). If your tool definitions serialize differently on each request, the byte-level prefix changes and the cache never hits. Sort your keys or use an ordered serialization format.
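The fix in Python terms – json.dumps with sort_keys=True produces byte-identical output regardless of insertion order (the schema here is a toy example):

```python
import json

schema = {"name": "get_weather", "input_schema": {"type": "object"}}
reordered = dict(reversed(list(schema.items())))  # same keys, different order

# Unsorted serialization depends on insertion order...
assert json.dumps(schema) != json.dumps(reordered)
# ...but sorted serialization is stable, so the cached prefix always matches.
assert json.dumps(schema, sort_keys=True) == json.dumps(reordered, sort_keys=True)
```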
Cache Invalidation Surprises
Changing anything higher in the cache hierarchy invalidates everything below it. If you modify tool definitions, both the system prompt cache and message cache are invalidated. If you toggle features like web search or citations, the system prompt cache breaks. Keep your tools and system prompts stable to maximize cache hits.
Related Guides
- How to Use the Anthropic Token Counting API for Cost Estimation
- How to Use the Anthropic Tool Use API for Agentic Workflows
- How to Use the Anthropic Claude Files API for Large Document Processing
- How to Use the Anthropic Multi-Turn Conversation API with Tool Use
- How to Use the Anthropic Python SDK for Claude
- How to Use the Anthropic Claude Vision API for Image Understanding
- How to Use the Stability AI API for Image and Video Generation
- How to Use the AWS Bedrock Converse API for Multi-Model Chat
- How to Use the OpenAI Realtime API for Voice Applications
- How to Use the Cerebras API for Fast LLM Inference