If you’re sending the same system prompt, tool definitions, or document context on every API call, you’re paying full price for tokens the model has already seen. Prompt caching fixes this. Anthropic’s cache reads cost 10% of the base input price. OpenAI gives a 50% discount on cached tokens automatically. Either way, the savings on high-volume workloads are significant.
Here’s what you need to know to implement it on both platforms.
## How Prompt Caching Works
The core idea is simple: the API stores a prefix of your prompt after the first request. On subsequent requests with an identical prefix, the provider reuses the stored representation instead of reprocessing those tokens from scratch. You pay a reduced rate for the cached portion and get faster time-to-first-token.
There are two approaches across providers:
- Anthropic (explicit): You mark content blocks with `cache_control` to tell the API what to cache. You get fine-grained control over caching behavior with up to 4 breakpoints.
- OpenAI (automatic): Caching kicks in automatically for prompts over 1,024 tokens. No code changes needed – the system caches the longest matching prefix behind the scenes.
Both require the prompt prefix to be byte-for-byte identical across requests. Even a whitespace change breaks the cache.
## Anthropic: Explicit Cache Control
With the Anthropic API, you add `"cache_control": {"type": "ephemeral"}` to the content blocks you want cached. The cache hierarchy follows the order: `tools` -> `system` -> `messages`. Anything before the cache breakpoint gets cached as a unit.
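A minimal sketch of the request shape, with the model id and document text as placeholders; you would pass this dict to `client.messages.create(**request)` using the official `anthropic` Python SDK:

```python
# Request payload with one explicit cache breakpoint. Everything from the
# start of the prompt through the block carrying cache_control is cached
# as a single prefix.
big_document = "Section 1. Definitions. ... " * 400  # stand-in for 1,024+ tokens of static context

request = {
    "model": "claude-sonnet-4-20250514",  # assumed model id; use whichever you target
    "max_tokens": 1024,
    "system": [
        {"type": "text", "text": "You are a contract-analysis assistant."},
        {
            "type": "text",
            "text": big_document,
            "cache_control": {"type": "ephemeral"},  # cache breakpoint
        },
    ],
    "messages": [
        {"role": "user", "content": "Summarize the termination clauses."},  # varies per call
    ],
}
print(request["system"][-1]["cache_control"])
```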
The first call writes to cache and you’ll see `cache_creation_input_tokens` populated. The second call with the same prefix hits cache, and `cache_read_input_tokens` shows the reused tokens:
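A small helper makes the write/read distinction easy to log; this sketch assumes the two usage fields come back exactly as named above:

```python
def cache_status(usage: dict) -> str:
    """Classify a call from its usage block (field names per the text above)."""
    if usage.get("cache_read_input_tokens", 0) > 0:
        return "cache hit"
    if usage.get("cache_creation_input_tokens", 0) > 0:
        return "cache write"
    return "no caching"

# First call: the prefix is written to cache.
first = {"input_tokens": 42, "cache_creation_input_tokens": 50_000, "cache_read_input_tokens": 0}
# Second call with an identical prefix: tokens are read back at the reduced rate.
second = {"input_tokens": 42, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 50_000}

print(cache_status(first))   # cache write
print(cache_status(second))  # cache hit
```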
### Anthropic Pricing Math
Cache writes cost 1.25x the base input price. Cache reads cost 0.1x. For Claude Sonnet 4 at $3/MTok base input:
| Token type | Cost per MTok |
|---|---|
| Base input | $3.00 |
| Cache write | $3.75 |
| Cache read | $0.30 |
If you cache 50,000 tokens and read from cache 20 times before it expires, you pay about $0.19 for the write (50,000 tokens at $3.75/MTok) and 20 x $0.015 = $0.30 for the reads – roughly $0.49 total. Without caching, you’d pay for 21 full passes: 21 x 50,000 tokens at $3.00/MTok = $3.15. That’s an 84.5% cost reduction.
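The arithmetic generalizes to any token count and read count; a quick sketch using the rates from the table above:

```python
BASE = 3.00          # $/MTok base input (Claude Sonnet 4, per the table above)
WRITE = 1.25 * BASE  # $3.75/MTok cache write
READ = 0.10 * BASE   # $0.30/MTok cache read

def caching_cost(tokens: int, reads: int) -> tuple[float, float]:
    """Return (cost with caching, cost without) for one write plus `reads` hits."""
    mtok = tokens / 1_000_000
    with_cache = mtok * WRITE + reads * mtok * READ
    without_cache = (reads + 1) * mtok * BASE
    return with_cache, without_cache

with_cache, without = caching_cost(50_000, 20)
print(f"${with_cache:.4f} vs ${without:.2f}")    # $0.4875 vs $3.15
print(f"{1 - with_cache / without:.1%} cheaper")  # 84.5% cheaper
```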
### Minimum Token Requirements
Not every prompt qualifies for caching. The minimums vary by model:
- Claude Sonnet 4 / Opus 4.1: 1,024 tokens
- Claude Opus 4.5 / Opus 4.6: 4,096 tokens
- Claude Haiku 4.5: 4,096 tokens
- Claude Haiku 3: 2,048 tokens
If your cached prefix is shorter than the threshold, the request processes normally without caching – no error, just no cache.
## OpenAI: Automatic Caching
OpenAI’s approach requires zero code changes. Any prompt over 1,024 tokens gets automatically cached. The system stores the longest matching prefix and serves it on subsequent identical requests.
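Since nothing needs to be marked, the only code-level concern is keeping the static content first in the prompt. A sketch (model id assumed; the dict would be passed to `client.chat.completions.create(**request)` with the official `openai` SDK):

```python
system_prompt = "You are a meticulous support agent. " * 200  # pushes the prefix past 1,024 tokens

request = {
    "model": "gpt-4o",  # assumed model id
    "messages": [
        # Static content first: this is the prefix OpenAI can cache.
        {"role": "system", "content": system_prompt},
        # Per-request content last, so it never invalidates the prefix.
        {"role": "user", "content": "Where is my order?"},
    ],
}
print(request["messages"][0]["role"])  # system
```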
When caching works, `response.usage.prompt_tokens_details.cached_tokens` will be a non-zero value. Cached tokens cost 50% of the regular input price – not as steep a discount as Anthropic’s 90%, but it requires no opt-in or code changes.
### OpenAI Cache Duration
OpenAI caches are kept for 5-10 minutes of inactivity and always cleared within one hour. For sustained workloads, the cache stays warm naturally. For bursty traffic, you may need a warm-up request.
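The warm-up pattern for bursty traffic looks like this sketch, where `call_api` is a stand-in for your real request function:

```python
import asyncio

async def call_api(user_input: str) -> str:
    await asyncio.sleep(0.01)  # placeholder for the real API round trip
    return f"response to {user_input}"

async def run_batch(inputs: list[str]) -> list[str]:
    # One warm-up request writes the shared prefix to the cache...
    await call_api(inputs[0])
    # ...then the remaining requests can run in parallel and read from it.
    return await asyncio.gather(*(call_api(i) for i in inputs))

results = asyncio.run(run_batch(["a", "b", "c"]))
print(len(results))  # 3
```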
## Multi-Turn Conversations
Prompt caching really shines in multi-turn conversations, where the system prompt stays constant and the conversation history grows with each turn. With Anthropic, place `cache_control` on the last message of the conversation so far:
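One way to implement the moving breakpoint, sketched with the content-block shape from earlier (only the newest block carries `cache_control`):

```python
def with_moving_breakpoint(history: list[dict]) -> list[dict]:
    """Rebuild messages so only the final content block is a cache breakpoint."""
    rebuilt = []
    for i, msg in enumerate(history):
        block = {"type": "text", "text": msg["content"]}
        if i == len(history) - 1:
            block["cache_control"] = {"type": "ephemeral"}  # breakpoint moves here each turn
        rebuilt.append({"role": msg["role"], "content": [block]})
    return rebuilt

history = [
    {"role": "user", "content": "Summarize clause 3."},
    {"role": "assistant", "content": "Clause 3 covers termination notice..."},
    {"role": "user", "content": "And clause 4?"},
]
messages = with_moving_breakpoint(history)
print("cache_control" in messages[-1]["content"][0])  # True
print("cache_control" in messages[0]["content"][0])   # False
```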
Each subsequent turn reuses the cached prefix of the conversation, so you only pay full price for the new turn’s tokens. In a 20-turn conversation with a 4,000-token system prompt, this saves you from reprocessing ~80,000+ cumulative tokens.
## Common Errors and Fixes
### “AttributeError: ‘Beta’ object has no attribute ‘prompt_caching’”

This happens when using outdated Anthropic SDK code. Prompt caching left beta – drop the `beta` prefix:
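The fix is a one-line change to the call path (method paths from the `anthropic` Python SDK):

```python
# Beta-era call path, removed once caching went GA (raises the AttributeError above):
#     client.beta.prompt_caching.messages.create(...)
# Current call path – same kwargs, cache_control included:
#     client.messages.create(...)

old_path = "client.beta.prompt_caching.messages.create"
new_path = old_path.replace(".beta.prompt_caching", "")
print(new_path)  # client.messages.create
```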
### Cache Misses When You Expect Hits
If `cache_read_input_tokens` stays at 0 on follow-up calls, check these:
- Prompt changed: Even a single whitespace difference breaks the cache. Ensure your cached prefix is byte-identical across requests.
- Cache expired: Default TTL is 5 minutes of inactivity. If your requests are more than 5 minutes apart, the cache is gone. Use Anthropic’s 1-hour TTL option (`"ttl": "1h"`) for slower workflows.
- Below minimum tokens: Your cached content must exceed the model’s minimum threshold (1,024 or 4,096 depending on model).
- Concurrent requests: A cache entry only becomes available after the first response starts generating. If you fire 10 parallel requests before the first one returns, none of them benefit from caching. Send one warm-up request first, wait for the response to begin, then send the batch.
- Unstable JSON key ordering: Some languages (Go, Swift) randomize dictionary key order during JSON serialization. If your tool definitions serialize differently each time, the cache breaks every request.
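The key-ordering pitfall is easy to neutralize by serializing deterministically. A Python sketch (other languages have equivalent canonical-JSON options):

```python
import json

tool_def = {
    "name": "search_docs",
    "description": "Search the documentation index.",
    "input_schema": {"type": "object", "properties": {"query": {"type": "string"}}},
}
# The same definition with a different in-memory key order.
shuffled = dict(reversed(list(tool_def.items())))

# sort_keys=True makes the serialized bytes identical regardless of insertion order,
# so the prompt prefix stays byte-for-byte stable across requests.
print(json.dumps(tool_def, sort_keys=True) == json.dumps(shuffled, sort_keys=True))  # True
print(json.dumps(tool_def) == json.dumps(shuffled))                                  # False
```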
### OpenAI Shows 0 Cached Tokens
OpenAI requires at least 1,024 prompt tokens for caching to activate. If your prompt is shorter, `cached_tokens` will always be 0. Also, OpenAI caches are not guaranteed – they’re best-effort and depend on server-side routing.
## When Prompt Caching Pays Off
Caching makes the biggest difference when:
- Large static context: System prompts, RAG documents, or tool definitions over 1,024 tokens that repeat across calls.
- Multi-turn conversations: Each turn reuses the growing conversation prefix.
- Batch processing: Running the same prompt template against many inputs (e.g., analyzing 500 legal contracts with the same instructions).
- Agent loops: Tool-using agents that make 5-20 API calls per task with identical system prompts and tool definitions.
It’s less useful for one-shot requests with unique prompts or very short system instructions under the minimum token threshold.
## Related Guides
- How to Implement Streaming Responses from LLM APIs
- How to Build Prompt Caching Strategies for Multi-Turn LLM Sessions
- How to Build Prompt Chains with Async LLM Calls and Batching
- How to Build Prompt Templates with Python F-Strings and Chat Markup
- How to Route Prompts to the Best LLM with a Semantic Router
- How to Build Prompt Versioning and Regression Testing for LLMs
- How to Build Context-Aware Prompt Routing with Embeddings
- How to Build Token-Efficient Prompt Batching with LLM APIs
- How to Build Prompt Fallback Chains with Automatic Model Switching
- How to Build Dynamic Prompt Routers with LLM Cascading