Tool definitions are expensive. If you’re building an agent with 10+ tools, the JSON schemas alone can eat thousands of input tokens on every single API call. Anthropic’s token-efficient tool use feature fixes this by reducing the token overhead of tool definitions by roughly 50%. For Claude 3.7 Sonnet, you enable it with a beta header. For all Claude 4+ models, it’s built in – no header needed.
Here’s the fastest way to enable it on Claude 3.7 Sonnet:
```python
from anthropic import Anthropic

client = Anthropic()

response = client.beta.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=1024,
    betas=["token-efficient-tools-2025-02-19"],
    tools=[
        {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "input_schema": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City and state, e.g. San Francisco, CA"
                    }
                },
                "required": ["location"]
            }
        }
    ],
    messages=[{"role": "user", "content": "What's the weather in Boston?"}]
)
print(response.content)
```
On Claude 4+ models (Sonnet 4, Opus 4.1, Sonnet 4.5, Opus 4.5, Sonnet 4.6, Opus 4.6), token-efficient tool use is enabled by default. Drop the beta header entirely and use the standard client.messages.create() endpoint.
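If your codebase targets both generations, a small helper can keep that difference in one place. This is a sketch under the assumption that only `claude-3-7-sonnet-*` model IDs need the beta flag; the helper name and prefix check are illustrative, not part of the SDK:

```python
def tool_request_kwargs(model: str) -> dict:
    """Return extra request kwargs for tool use, adding the
    token-efficient-tools beta only for Claude 3.7 Sonnet."""
    kwargs = {"model": model}
    if model.startswith("claude-3-7-sonnet"):
        # 3.7 Sonnet needs the beta flag (and client.beta.messages.create)
        kwargs["betas"] = ["token-efficient-tools-2025-02-19"]
    # Claude 4+ models have token-efficient tools built in -- nothing to add
    return kwargs
```

Pass the result into `client.beta.messages.create(**kwargs, ...)` when it contains `betas`, and the standard `client.messages.create(**kwargs, ...)` otherwise.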
When you define tools in an API request, Claude receives the full JSON schema for each tool – names, descriptions, parameter types, enums, nested objects. That schema gets serialized into tokens as part of the input. With a handful of tools, it’s manageable. With 15-20 tools (common in agentic systems), you can easily burn 3,000-5,000 tokens just on definitions before the model reads a single user message.
Token-efficient tool use compresses the internal representation of those schemas. The tool definitions you send in JSON stay exactly the same. What changes is how the API encodes them for the model. Anthropic reported up to 70% reduction in output token consumption, with an average of around 14% across early users. The input token count for tool schemas drops by roughly 50%.
The important part: your code doesn’t change at all (beyond adding the beta header for 3.7 Sonnet). Same tool schemas, same request format, same response structure.
Here’s a realistic agent setup with multiple tools. This works on Claude 4+ without any beta header:
```python
from anthropic import Anthropic
import json

client = Anthropic()

tools = [
    {
        "name": "search_database",
        "description": "Search a database of products by query string",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "The search query"
                },
                "max_results": {
                    "type": "integer",
                    "description": "Maximum number of results to return (1-50)",
                    "default": 10
                },
                "category": {
                    "type": "string",
                    "enum": ["electronics", "clothing", "books", "home", "sports"],
                    "description": "Filter by product category"
                }
            },
            "required": ["query"]
        }
    },
    {
        "name": "get_product_details",
        "description": "Get detailed information about a specific product by ID",
        "input_schema": {
            "type": "object",
            "properties": {
                "product_id": {
                    "type": "string",
                    "description": "The unique product identifier"
                }
            },
            "required": ["product_id"]
        }
    },
    {
        "name": "check_inventory",
        "description": "Check real-time inventory for a product at a specific warehouse",
        "input_schema": {
            "type": "object",
            "properties": {
                "product_id": {
                    "type": "string",
                    "description": "The product identifier"
                },
                "warehouse_id": {
                    "type": "string",
                    "description": "The warehouse location code"
                }
            },
            "required": ["product_id", "warehouse_id"]
        }
    },
    {
        "name": "create_order",
        "description": "Place an order for a product",
        "input_schema": {
            "type": "object",
            "properties": {
                "product_id": {"type": "string", "description": "Product to order"},
                "quantity": {"type": "integer", "description": "Number of units"},
                "shipping_address": {"type": "string", "description": "Delivery address"}
            },
            "required": ["product_id", "quantity", "shipping_address"]
        }
    }
]

messages = [
    {"role": "user", "content": "Find me a good wireless keyboard under $50 and check if it's in stock at warehouse WH-001."}
]

# Initial request
response = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=1024,
    tools=tools,
    messages=messages
)

# Check token usage
print(f"Input tokens: {response.usage.input_tokens}")
print(f"Output tokens: {response.usage.output_tokens}")

# Handle tool calls in a loop
while response.stop_reason == "tool_use":
    # Collect all tool use blocks from the response
    tool_results = []
    for block in response.content:
        if block.type == "tool_use":
            # Simulate tool execution (replace with real implementations)
            if block.name == "search_database":
                result = json.dumps([{"id": "KB-2049", "name": "ProType Wireless K1", "price": 39.99}])
            elif block.name == "check_inventory":
                result = json.dumps({"in_stock": True, "quantity": 142})
            elif block.name == "get_product_details":
                result = json.dumps({"id": "KB-2049", "name": "ProType Wireless K1", "price": 39.99, "rating": 4.6})
            else:
                result = json.dumps({"status": "ok"})
            tool_results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": result
            })
    messages.append({"role": "assistant", "content": response.content})
    messages.append({"role": "user", "content": tool_results})
    response = client.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=1024,
        tools=tools,
        messages=messages
    )

# Final text response
for block in response.content:
    if hasattr(block, "text"):
        print(block.text)
```
Each iteration of that loop sends the full tool definitions again. With 4 tools, that’s a few hundred tokens. Scale to 15 tools with complex schemas and it adds up fast.
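The arithmetic is worth spelling out. As a rough model (the ~250-tokens-per-tool figure is an assumption for a mid-size schema, not a measured value):

```python
def schema_overhead(schema_tokens_per_tool: int, num_tools: int, iterations: int) -> int:
    """Input tokens spent just re-sending tool definitions over an agent loop."""
    return schema_tokens_per_tool * num_tools * iterations

# 15 tools at roughly 250 tokens each, over a 10-iteration loop
print(schema_overhead(250, 15, 10))  # 37500 tokens on definitions alone
```

Halving the per-tool cost halves that whole product, which is why the savings compound in loops.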
## Comparing Token Counts
To see the actual difference, use the token counting API to measure with and without the beta on Claude 3.7 Sonnet:
```python
from anthropic import Anthropic

client = Anthropic()

tools = [
    {
        "name": "search_web",
        "description": "Search the web for information",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query"},
                "num_results": {"type": "integer", "description": "Number of results"}
            },
            "required": ["query"]
        }
    },
    {
        "name": "read_url",
        "description": "Read the content of a URL",
        "input_schema": {
            "type": "object",
            "properties": {
                "url": {"type": "string", "description": "The URL to read"}
            },
            "required": ["url"]
        }
    },
    {
        "name": "run_code",
        "description": "Execute Python code in a sandbox",
        "input_schema": {
            "type": "object",
            "properties": {
                "code": {"type": "string", "description": "Python code to execute"},
                "timeout": {"type": "integer", "description": "Timeout in seconds", "default": 30}
            },
            "required": ["code"]
        }
    }
]

messages = [{"role": "user", "content": "Search for the latest Python release notes."}]

# Standard token count (without beta)
standard_count = client.messages.count_tokens(
    model="claude-3-7-sonnet-20250219",
    tools=tools,
    messages=messages
)
print(f"Standard input tokens: {standard_count.input_tokens}")

# Token-efficient count (with beta)
efficient_count = client.beta.messages.count_tokens(
    model="claude-3-7-sonnet-20250219",
    betas=["token-efficient-tools-2025-02-19"],
    tools=tools,
    messages=messages
)
print(f"Token-efficient input tokens: {efficient_count.input_tokens}")

savings = standard_count.input_tokens - efficient_count.input_tokens
pct = (savings / standard_count.input_tokens) * 100
print(f"Saved {savings} tokens ({pct:.1f}%)")
```
With 3 tools like above, you’ll typically see a 40-60% reduction in the tokens consumed by tool schemas. The more tools and the more complex their schemas, the bigger the savings.
## Combining with Prompt Caching for Maximum Savings
Token-efficient tool use shrinks the per-token cost of tool definitions. Prompt caching eliminates the cost of re-processing them entirely on repeat calls. Stack both for the best results.
The trick: add cache_control to your last tool definition. This caches all tool schemas as a single prefix. On subsequent requests within the 5-minute TTL, those tools get read from cache at 90% off the base input price.
```python
from anthropic import Anthropic

client = Anthropic()

tools = [
    {
        "name": "query_knowledge_base",
        "description": "Search the internal knowledge base for relevant articles",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query"},
                "filters": {
                    "type": "object",
                    "properties": {
                        "category": {"type": "string"},
                        "date_after": {"type": "string", "description": "ISO date"}
                    }
                }
            },
            "required": ["query"]
        }
    },
    {
        "name": "create_ticket",
        "description": "Create a support ticket",
        "input_schema": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "description": {"type": "string"},
                "priority": {"type": "string", "enum": ["low", "medium", "high", "critical"]},
                "assignee": {"type": "string"}
            },
            "required": ["title", "description", "priority"]
        },
        "cache_control": {"type": "ephemeral"}  # Caches ALL tools above this point too
    }
]

system = [
    {
        "type": "text",
        "text": "You are a support agent. Use the knowledge base to answer questions. Create tickets for issues you cannot resolve directly.",
        "cache_control": {"type": "ephemeral"}
    }
]

# First request: cache write (25% surcharge on cached tokens)
response = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=1024,
    tools=tools,
    system=system,
    messages=[{"role": "user", "content": "My dashboard is showing stale data since yesterday."}]
)
print(f"Cache write tokens: {response.usage.cache_creation_input_tokens}")
print(f"Cache read tokens: {response.usage.cache_read_input_tokens}")

# Second request within 5 minutes: cache hit (90% discount on cached tokens)
response2 = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    max_tokens=1024,
    tools=tools,
    system=system,
    messages=[{"role": "user", "content": "I can't log in to my account."}]
)
print(f"Cache write tokens: {response2.usage.cache_creation_input_tokens}")
print(f"Cache read tokens: {response2.usage.cache_read_input_tokens}")
```
On the second call, the tool definitions and system prompt are read from cache. Combined with the already-reduced token count from efficient encoding, your effective cost for tool schemas drops dramatically.
For agents making many calls per minute (think: agentic loops with 10+ iterations), this combination can cut your total input token costs by 60-80% compared to raw tool definitions without either optimization.
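To sanity-check that range, here is a toy cost model built from the numbers in this post (a ~50% schema-token reduction, a 25% cache-write surcharge, a 90% cache-read discount). Real savings depend on your schemas and cache hit rate:

```python
def effective_schema_tokens(base_tokens: int, calls: int) -> float:
    """Billed-token equivalents for tool schemas across `calls` requests,
    with token-efficient encoding plus prompt caching (toy model)."""
    efficient = base_tokens / 2           # ~50% fewer schema tokens on the wire
    first = efficient * 1.25              # first call writes the cache (25% surcharge)
    rest = efficient * (calls - 1) / 10   # cache hits billed at a 90% discount
    return first + rest

# 4,000 raw schema tokens over a 10-call loop.
# Unoptimized, the schemas would cost 4000 * 10 = 40000 token-equivalents.
print(effective_schema_tokens(4000, 10))  # 4300.0
```

Under these assumptions the schema cost drops by roughly an order of magnitude, so the 60-80% figure for total input cost is plausible once you add the uncached, per-turn message content back in.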
## Common Errors and Fixes
**error: "betas" is not supported for this model**
You’re sending betas=["token-efficient-tools-2025-02-19"] to a Claude 4+ model. Remove the beta header entirely. Token-efficient tool use is built into all Claude 4 models by default.
```python
# Wrong -- don't use betas with Claude 4+ models
response = client.beta.messages.create(
    model="claude-sonnet-4-5-20250929",
    betas=["token-efficient-tools-2025-02-19"],
    ...
)

# Correct -- Claude 4+ has it built in
response = client.messages.create(
    model="claude-sonnet-4-5-20250929",
    ...
)
```
**AttributeError: 'Beta' object has no attribute 'prompt_caching'**
You’re using the old beta prompt caching namespace. Prompt caching is GA now. Use client.messages.create() directly with cache_control blocks in your request body.
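A minimal sketch of the current request shape (the system-prompt text is a placeholder):

```python
# Old beta namespace (removed): client.beta.prompt_caching.messages.create(...)
# Current GA form: cache_control goes straight into a standard request.
request = {
    "model": "claude-sonnet-4-5-20250929",
    "max_tokens": 1024,
    "system": [{
        "type": "text",
        "text": "You are a support agent...",  # placeholder static system prompt
        "cache_control": {"type": "ephemeral"},
    }],
    "messages": [{"role": "user", "content": "Hello"}],
}
# Then call: client.messages.create(**request)
print(request["system"][0]["cache_control"])  # {'type': 'ephemeral'}
```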
**cache_creation_input_tokens is always 0 and cache_read_input_tokens is always 0**
Your cached content is too short. The minimum cacheable prompt length varies by model:
- Claude Opus 4.5/4.6: 4,096 tokens minimum
- Claude Sonnet 4/4.5/4.6, Opus 4/4.1: 1,024 tokens minimum
- Claude Haiku 4.5: 4,096 tokens minimum
If your tool definitions plus system prompt don’t meet the minimum, caching silently does nothing. Add more static content to your system prompt or accept that small tool sets won’t benefit from caching.
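A pre-flight check can make the silent failure visible: measure your prompt with the token counting API, then compare against the minimums above. The prefix-matching helper below is illustrative (minimums are taken from the list in this post, not from a lookup the SDK provides):

```python
# Minimum cacheable prompt lengths, per the list above; more specific prefixes first
CACHE_MINIMUMS = [
    ("claude-opus-4-5", 4096),
    ("claude-opus-4-6", 4096),
    ("claude-haiku-4-5", 4096),
]

def min_cacheable_tokens(model: str) -> int:
    """Minimum cacheable prompt length for a model ID (illustrative matching)."""
    for prefix, minimum in CACHE_MINIMUMS:
        if model.startswith(prefix):
            return minimum
    return 1024  # Sonnet 4.x and Opus 4/4.1, per the list above

def will_cache(model: str, measured_tokens: int) -> bool:
    """True if a measured token count meets the model's cache minimum."""
    return measured_tokens >= min_cacheable_tokens(model)

print(will_cache("claude-sonnet-4-5-20250929", 1500))  # True
print(will_cache("claude-opus-4-5", 1500))             # False
```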
**tool_use blocks returned but stop_reason is "end_turn" instead of "tool_use"**
This happens when the model generates a tool call but also decides to include text. Check response.stop_reason and iterate through response.content blocks regardless. A response can contain both text and tool_use blocks.
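In practice that means parsing content defensively. A small sketch (plain dicts stand in for the SDK's block objects, which expose the same fields as attributes):

```python
def split_blocks(content):
    """Separate text and tool_use blocks, regardless of stop_reason."""
    texts, tool_uses = [], []
    for block in content:
        if block["type"] == "text":
            texts.append(block["text"])
        elif block["type"] == "tool_use":
            tool_uses.append(block)
    return texts, tool_uses

texts, calls = split_blocks([
    {"type": "text", "text": "Let me check the weather."},
    {"type": "tool_use", "id": "tu_1", "name": "get_weather", "input": {"location": "Boston"}},
])
print(texts)       # ['Let me check the weather.']
print(len(calls))  # 1
```

Executing any returned tool calls and appending results works the same whether or not text accompanied them.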
**Token counts not decreasing on Claude 3.7 Sonnet**
Make sure you’re using client.beta.messages.create() (not client.messages.create()) when passing the betas parameter. The standard endpoint ignores the beta header.
## Which Approach to Use
For Claude 3.7 Sonnet projects: add betas=["token-efficient-tools-2025-02-19"] and use client.beta.messages.create(). Combine with cache_control on your last tool definition for repeat calls.
For Claude 4+ projects: you already have token-efficient tool use. Focus on prompt caching with cache_control blocks. That’s where the remaining savings are.
For high-volume agentic workloads: stack both optimizations, use the 1-hour cache TTL if your agent loops take longer than 5 minutes ("cache_control": {"type": "ephemeral", "ttl": "1h"}), and monitor cache_read_input_tokens in your responses to verify caching is working.