Every time you send a large system prompt, a 50-page document, or a set of tool definitions to the Claude API, you’re paying full price for tokens the model has already processed. Anthropic’s prompt caching lets you store those static prefixes and reuse them across requests. Cached reads cost just 10% of the base input token price – a 90% discount. The tradeoff is a 25% surcharge on the first request that writes the cache. For any workload where the same prefix is reused even once, you come out ahead: one cache write plus one read costs 1.35x the base price, versus 2x for sending the prefix twice uncached.
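
The break-even math is worth making explicit. A quick sketch of the relative input cost for a prefix that appears in n requests, using the multipliers above (1.25x for the one cache write, 0.1x per cached read):

```python
def relative_cost(n_requests: int) -> tuple[float, float]:
    """Relative input-token cost for a prefix sent in n_requests requests:
    returns (with caching, without caching). Write = 1.25x, each read = 0.1x."""
    without_cache = 1.0 * n_requests
    with_cache = 1.25 + 0.10 * (n_requests - 1)
    return with_cache, without_cache

for n in (1, 2, 5, 20):
    cached, uncached = relative_cost(n)
    print(f"{n:>2} requests: {cached:.2f}x with cache vs {uncached:.2f}x without")
```

Caching costs more only when the prefix is used exactly once (1.25x vs 1.0x); from the second request on it is cheaper, and the savings approach 90% as n grows.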

How Prompt Caching Works

You mark content blocks with "cache_control": {"type": "ephemeral"} to create cache breakpoints. On the first request, the API processes the full prompt and writes everything up to the breakpoint into cache. On subsequent requests with an identical prefix, the API reads from cache instead of reprocessing.

Key details:

  • Cache hierarchy: The API caches in order – tools, then system, then messages. Changing anything earlier in the chain invalidates everything after it.
  • Minimum token requirements: The cached prefix must meet a per-model minimum size, or the prompt is processed without caching – silently, with no error raised. For Claude Sonnet 4 and Opus 4.1, the minimum is 1,024 tokens. For Claude Opus 4.5, it’s 4,096 tokens. Claude Haiku 3 requires 2,048 tokens.
  • TTL: Cached entries last 5 minutes by default, refreshed each time they’re read. Anthropic also offers a 1-hour TTL at 2x the base input price for cache writes (instead of 1.25x). Set it with "cache_control": {"type": "ephemeral", "ttl": "1h"}.
  • Breakpoint limit: You can set up to 4 cache_control breakpoints per request.
  • Automatic lookback: The system checks up to 20 blocks before each breakpoint for cache hits, working backwards to find the longest matching prefix.
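
Putting those bullets together, a system prompt whose last block is cached on the extended 1-hour TTL looks like this (a minimal sketch – reference_text stands in for your real content, which must meet the per-model minimum):

```python
reference_text = "..."  # placeholder: your large static prefix

system = [
    {
        "type": "text",
        "text": "You answer questions about the reference material below.",
    },
    {
        "type": "text",
        "text": reference_text,
        # 1-hour TTL: the write costs 2x base instead of 1.25x, and the
        # entry survives an hour of inactivity instead of 5 minutes
        "cache_control": {"type": "ephemeral", "ttl": "1h"},
    },
]
```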

Caching a Large System Prompt

The most common pattern is caching a large system prompt that stays the same across requests. Pass the system prompt as an array of content blocks and add cache_control on the last block you want cached.

import anthropic

client = anthropic.Anthropic()

# Load a large reference document
with open("company_handbook.txt") as f:
    handbook_text = f.read()

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are an HR assistant. Answer employee questions using the company handbook below.",
        },
        {
            "type": "text",
            "text": handbook_text,
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[
        {"role": "user", "content": "What is the parental leave policy?"}
    ],
)

print(response.content[0].text)

The first request writes the system prompt to cache. Every subsequent request with the same system prompt prefix reads from cache at 10% cost. Your user message can change freely – only the prefix up to the cache_control breakpoint needs to stay identical.

Caching Documents for Multi-Turn Conversations

In a multi-turn chat where users ask questions about a document, you want the document cached across the entire conversation. Place the document in the system prompt with cache_control, then add a second breakpoint on the last message in the conversation to incrementally cache the growing history.

import anthropic

client = anthropic.Anthropic()

# Load a large document (30k+ tokens)
with open("contract.txt") as f:
    contract_text = f.read()

# Turn 1: User asks first question
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a contract analyst. Use the contract below to answer questions.",
        },
        {
            "type": "text",
            "text": contract_text,
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[
        {"role": "user", "content": "What are the payment terms?"}
    ],
)

answer_1 = response.content[0].text
print(f"Turn 1 - Cache written: {response.usage.cache_creation_input_tokens}")
print(f"Turn 1 - Cache read:    {response.usage.cache_read_input_tokens}")

# Turn 2: Follow-up question with conversation history
response_2 = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a contract analyst. Use the contract below to answer questions.",
        },
        {
            "type": "text",
            "text": contract_text,
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[
        {"role": "user", "content": "What are the payment terms?"},
        {"role": "assistant", "content": answer_1},
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Are there any late payment penalties?",
                    "cache_control": {"type": "ephemeral"},
                }
            ],
        },
    ],
)

print(f"Turn 2 - Cache written: {response_2.usage.cache_creation_input_tokens}")
print(f"Turn 2 - Cache read:    {response_2.usage.cache_read_input_tokens}")

On turn 2, the system prompt hits the cache (you’ll see cache_read_input_tokens populated), and the conversation up to the second breakpoint gets written to a new cache entry. On turn 3, both the system prompt and the first two turns get read from cache. Each subsequent turn only pays full price for the new user message.
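
One way to manage the moving breakpoint is a small helper that tags the newest message before each call (a sketch – add_breakpoint is a name introduced here, not an SDK function). It normalizes string content into block form so cache_control can be attached:

```python
import copy

def add_breakpoint(messages: list[dict]) -> list[dict]:
    """Return a copy of messages with a cache breakpoint on the final
    content block, converting string content to a content-block list."""
    messages = copy.deepcopy(messages)
    last = messages[-1]
    if isinstance(last["content"], str):
        last["content"] = [{"type": "text", "text": last["content"]}]
    last["content"][-1]["cache_control"] = {"type": "ephemeral"}
    return messages

history = [
    {"role": "user", "content": "What are the payment terms?"},
    {"role": "assistant", "content": "Net 30 from the invoice date."},
    {"role": "user", "content": "Are there any late payment penalties?"},
]
tagged = add_breakpoint(history)
```

Because of the automatic lookback described earlier, the breakpoint only needs to sit on the newest message – the system finds the longest matching cached prefix behind it.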

Caching Tool Definitions

If you’re building an agent with a fixed set of tools, those tool schemas can eat up thousands of tokens per request. Cache them by adding cache_control to the last tool in the array. Because tools come first in the cache hierarchy, this caches all tool definitions as a single prefix.

import anthropic

client = anthropic.Anthropic()

tools = [
    {
        "name": "search_knowledge_base",
        "description": "Search the internal knowledge base for relevant articles",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "The search query to find relevant articles",
                },
                "max_results": {
                    "type": "integer",
                    "description": "Maximum number of results to return (1-20)",
                },
            },
            "required": ["query"],
        },
    },
    {
        "name": "create_ticket",
        "description": "Create a support ticket in the ticketing system",
        "input_schema": {
            "type": "object",
            "properties": {
                "title": {"type": "string", "description": "Ticket title"},
                "priority": {
                    "type": "string",
                    "enum": ["low", "medium", "high", "critical"],
                    "description": "Ticket priority level",
                },
                "description": {
                    "type": "string",
                    "description": "Detailed description of the issue",
                },
                "assignee": {
                    "type": "string",
                    "description": "Email of the person to assign the ticket to",
                },
            },
            "required": ["title", "priority", "description"],
        },
    },
    {
        "name": "get_customer_info",
        "description": "Look up customer account details by email or account ID",
        "input_schema": {
            "type": "object",
            "properties": {
                "email": {"type": "string", "description": "Customer email address"},
                "account_id": {"type": "string", "description": "Customer account ID"},
            },
        },
        # Mark the last tool to cache all tool definitions
        "cache_control": {"type": "ephemeral"},
    },
]

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    tools=tools,
    system="You are a customer support agent. Use the available tools to help resolve customer issues efficiently.",
    messages=[
        {"role": "user", "content": "Customer [email protected] says they can't log in."}
    ],
)

# The model may call tools -- handle tool_use blocks
for block in response.content:
    if block.type == "tool_use":
        print(f"Tool call: {block.name}({block.input})")
    elif block.type == "text":
        print(f"Response: {block.text}")

print(f"\nCache written: {response.usage.cache_creation_input_tokens}")
print(f"Cache read:    {response.usage.cache_read_input_tokens}")

On the first call, all three tool schemas are written to cache. Every subsequent call with the same tool definitions reads them from cache. This is particularly valuable for agent loops, where the same agent makes 5-20 calls per task – you pay full price for the tool definitions only once.

Monitoring Cache Performance

Every API response includes a usage object with three input token fields that tell you exactly what happened with caching:

  • cache_creation_input_tokens – tokens written to cache on this request (you pay 1.25x base price with the default 5-minute TTL, 2x with the 1-hour TTL)
  • cache_read_input_tokens – tokens read from an existing cache entry (you pay 0.1x base price)
  • input_tokens – tokens after the last cache breakpoint that aren’t cached (you pay 1x base price)

import anthropic

client = anthropic.Anthropic()

large_context = "Detailed technical specification document... " * 500  # large text

params = {
    "model": "claude-sonnet-4-20250514",
    "max_tokens": 256,
    "system": [
        {
            "type": "text",
            "text": large_context,
            "cache_control": {"type": "ephemeral"},
        }
    ],
    "messages": [{"role": "user", "content": "Summarize the key requirements."}],
}

# First call -- writes cache
r1 = client.messages.create(**params)
print("--- Request 1 (cache write) ---")
print(f"  cache_creation_input_tokens: {r1.usage.cache_creation_input_tokens}")
print(f"  cache_read_input_tokens:     {r1.usage.cache_read_input_tokens}")
print(f"  input_tokens:                {r1.usage.input_tokens}")

# Second call -- reads cache
r2 = client.messages.create(**params)
print("\n--- Request 2 (cache hit) ---")
print(f"  cache_creation_input_tokens: {r2.usage.cache_creation_input_tokens}")
print(f"  cache_read_input_tokens:     {r2.usage.cache_read_input_tokens}")
print(f"  input_tokens:                {r2.usage.input_tokens}")

# Calculate cost savings (Claude Sonnet 4 pricing: $3/MTok base input)
base_price_per_mtok = 3.00
write_tokens = r1.usage.cache_creation_input_tokens
read_tokens = r2.usage.cache_read_input_tokens

cost_without_cache = (write_tokens + read_tokens) * base_price_per_mtok / 1_000_000
cost_with_cache = (
    write_tokens * base_price_per_mtok * 1.25 / 1_000_000  # write surcharge
    + read_tokens * base_price_per_mtok * 0.10 / 1_000_000  # read discount
)
savings_pct = (1 - cost_with_cache / cost_without_cache) * 100

print(f"\nCost without caching: ${cost_without_cache:.4f}")
print(f"Cost with caching:    ${cost_with_cache:.4f}")
print(f"Savings:              {savings_pct:.1f}%")

The total input tokens for any request is:

total = cache_creation_input_tokens + cache_read_input_tokens + input_tokens

If cache_creation_input_tokens is non-zero on a repeat call, your prefix changed and a new cache entry was written. If cache_read_input_tokens is always 0, your content isn’t being cached – check the troubleshooting section below.
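
For logging, these checks can be wrapped in a small classifier (a sketch – Usage and classify_cache are names introduced here; in real code you would pass response.usage directly):

```python
from dataclasses import dataclass

@dataclass
class Usage:  # stand-in for response.usage
    cache_creation_input_tokens: int
    cache_read_input_tokens: int
    input_tokens: int

def classify_cache(usage) -> str:
    wrote = usage.cache_creation_input_tokens > 0
    read = usage.cache_read_input_tokens > 0
    if read and not wrote:
        return "hit"          # full prefix served from cache
    if read and wrote:
        return "partial hit"  # earlier prefix read, new suffix written
    if wrote:
        return "write"        # first request, or the prefix changed
    return "miss"             # nothing cached: below minimum, or no breakpoint
```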

Common Errors and Fixes

cache_read_input_tokens is Always 0

Your prefix isn’t matching. Check these in order:

  1. Prefix changed between calls. Even whitespace or key ordering differences break the cache. If you’re building the system prompt dynamically (timestamps, random IDs), the prefix changes every time. Move dynamic content after the cache breakpoint.
  2. Below minimum tokens. The cached prefix must meet the model’s minimum: 1,024 tokens for Sonnet 4, 4,096 for Opus 4.5. Shorter prompts are processed normally without caching, and no error is raised.
  3. Cache expired. The default TTL is 5 minutes of inactivity. If requests are spaced further apart, the cache evicts. Use "ttl": "1h" for slower workflows.
  4. Concurrent first requests. A cache entry only becomes available after the first response begins streaming. If you fire 10 parallel requests before the first one returns, they all write to cache independently. Send one warm-up request, wait for the response to start, then send the rest.
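
The fix for point 4 can be sketched as a warm-up pass (send_request is a placeholder for your function that calls client.messages.create, not an SDK helper):

```python
from concurrent.futures import ThreadPoolExecutor

def warm_then_fan_out(send_request, prompts, max_workers=8):
    """Send the first request alone so it writes the cache, then
    fan the remaining requests out in parallel as cache reads."""
    results = [send_request(prompts[0])]  # cache write
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results.extend(pool.map(send_request, prompts[1:]))  # cache reads
    return results

# Demo with a stand-in for the real API call:
answers = warm_then_fan_out(lambda p: f"answer:{p}", ["q1", "q2", "q3"])
```

Waiting for the first response to complete before fanning out is the conservative choice, since the cache entry is only guaranteed available once that response has begun.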

AttributeError: ‘Beta’ object has no attribute ‘prompt_caching’

Prompt caching graduated from beta. Drop the beta prefix:

# Old (broken)
response = client.beta.prompt_caching.messages.create(**params)

# Current
response = client.messages.create(**params)

Unstable JSON Key Ordering

Some serializers don’t guarantee stable key order across runs – Swift dictionaries are unordered, and Go map iteration is deliberately randomized. If your tool definitions serialize differently on each request, the byte-level prefix changes and the cache never hits. Sort your keys or use a serialization format with deterministic ordering.
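
From Python, one way to catch this is to fingerprint the cacheable prefix with a key-sorted serialization and compare it between calls (prefix_fingerprint is a name introduced for this sketch):

```python
import hashlib
import json

def prefix_fingerprint(tools, system) -> str:
    """Deterministic hash of the cacheable prefix: sorted keys and
    compact separators make the serialization byte-stable."""
    blob = json.dumps({"tools": tools, "system": system},
                      sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode()).hexdigest()

# Same content, different key insertion order – identical fingerprint:
a = {"name": "search", "description": "Search the knowledge base"}
b = {"description": "Search the knowledge base", "name": "search"}
assert prefix_fingerprint([a], "sys") == prefix_fingerprint([b], "sys")
```

If the fingerprint changes between two calls, you know the cache will miss before you pay for another write.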

Cache Invalidation Surprises

Changing anything higher in the cache hierarchy invalidates everything below it. If you modify tool definitions, both the system prompt cache and message cache are invalidated. If you toggle features like web search or citations, the system prompt cache breaks. Keep your tools and system prompts stable to maximize cache hits.