Claude can think before it answers. The extended thinking feature gives Claude a scratchpad – a chain-of-thought reasoning space where it works through problems step by step before producing a final response. You get access to both the thinking process and the answer, which is incredibly useful for debugging, auditing, and building trust in complex outputs.
The feature works through the standard Messages API. You pass a thinking parameter, set a token budget, and Claude returns two types of content blocks: thinking blocks with its reasoning and text blocks with the final answer.
Enable Extended Thinking
Install the SDK and set your API key:
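A minimal setup sketch (assumes pip and a key from the Anthropic Console):

```shell
# Install the official Anthropic Python SDK
pip install anthropic

# Set your API key so the SDK can pick it up from the environment
export ANTHROPIC_API_KEY="sk-ant-..."
```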
Here’s the simplest extended thinking request:
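A minimal request, sketched with a small helper (build_request is just for clarity, not part of the SDK; the model ID is the one used throughout this article):

```python
def build_request(prompt: str, budget_tokens: int = 10000, max_tokens: int = 16000) -> dict:
    """Assemble Messages API parameters with extended thinking enabled."""
    return {
        "model": "claude-sonnet-4-5-20250514",
        "max_tokens": max_tokens,  # total window: thinking + final answer
        "thinking": {"type": "enabled", "budget_tokens": budget_tokens},
        "messages": [{"role": "user", "content": prompt}],
    }

if __name__ == "__main__":
    from anthropic import Anthropic  # pip install anthropic

    client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    response = client.messages.create(**build_request("What is 27 * 453?"))
    for block in response.content:
        print(block.type)  # "thinking" blocks first, then "text"
```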
The thinking parameter takes two fields: type must be "enabled", and budget_tokens sets the upper limit on how many tokens Claude can spend reasoning. Claude won’t always use the full budget – it spends what the problem requires.
A few rules to keep in mind:
- budget_tokens must be less than max_tokens. The budget is a subset of the total token allocation.
- max_tokens needs to be large enough to fit both the thinking and the response. For reasoning-heavy tasks, 16000 is a good starting point.
- Extended thinking is available on claude-sonnet-4-5-20250514 and newer models.
Process Thinking and Text Blocks
The response contains a list of content blocks. Thinking blocks always come before text blocks. Here’s how to separate them cleanly:
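One way to split the two block types (a sketch: each block in response.content carries a type attribute, with the reasoning under .thinking and the answer under .text):

```python
def split_content(content):
    """Separate a response's content list into reasoning text and answer text."""
    thinking_parts, text_parts = [], []
    for block in content:
        if block.type == "thinking":
            thinking_parts.append(block.thinking)
        elif block.type == "text":
            text_parts.append(block.text)
    return "\n".join(thinking_parts), "\n".join(text_parts)

if __name__ == "__main__":
    from anthropic import Anthropic

    client = Anthropic()
    response = client.messages.create(
        model="claude-sonnet-4-5-20250514",
        max_tokens=16000,
        thinking={"type": "enabled", "budget_tokens": 8000},
        messages=[{"role": "user", "content":
            "A bat and a ball cost $1.10 together. The bat costs $1.00 "
            "more than the ball. How much does the ball cost?"}],
    )
    thinking, answer = split_content(response.content)
    print("REASONING:\n", thinking)
    print("\nANSWER:\n", answer)
```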
This is the classic cognitive reflection test. Without thinking, models often blurt out “$0.10” (wrong). With extended thinking, Claude sets up the algebra – if the ball costs x, then x + (x + 1.00) = 1.10, so 2x = 0.10, and x = 0.05. The answer is $0.05.
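The algebra is easy to verify in integer cents:

```python
ball = 5          # cents
bat = ball + 100  # the bat costs $1.00 more than the ball
assert ball + bat == 110  # together they cost $1.10
```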
You can log or store the thinking blocks separately. They’re useful for debugging when the final answer is wrong, or for showing users that the model actually worked through the problem.
Practical Example: Code Review with Extended Thinking
Extended thinking shines when you need Claude to analyze code and catch subtle bugs. The reasoning space lets it trace execution paths, check edge cases, and build a mental model of the code before responding.
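A sketch of the review request. The buggy snippet under review is illustrative, written to contain the off-by-one discussed below; the prompt wording is likewise just one reasonable choice:

```python
CODE_UNDER_REVIEW = '''
def merge(a, b):
    """Merge two sorted lists into a single sorted list."""
    result, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            result.append(a[i])
            i += 1
        else:
            result.append(b[j])
            j += 1
    result.extend(a[i:])
    result.extend(b[j:])
    return result

def find_kth_smallest(a, b, k):
    """Return the kth smallest element across both lists (k is 1-based)."""
    merged = merge(a, b)
    return merged[k]
'''

if __name__ == "__main__":
    from anthropic import Anthropic

    client = Anthropic()
    response = client.messages.create(
        model="claude-sonnet-4-5-20250514",
        max_tokens=20000,
        thinking={"type": "enabled", "budget_tokens": 12000},
        messages=[{"role": "user", "content":
            "Review this code for bugs. Trace through edge cases "
            "before answering:\n\n" + CODE_UNDER_REVIEW}],
    )
    for block in response.content:
        if block.type == "thinking":
            print("--- reasoning ---\n", block.thinking)
        elif block.type == "text":
            print("--- review ---\n", block.text)
```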
The thinking blocks will show Claude tracing through the merge logic, checking the off-by-one in find_kth_smallest (it uses 0-based indexing via merged[k], which will throw an IndexError if k equals the total length), and reasoning about whether the merge assumes sorted input. This kind of step-by-step analysis catches bugs that a quick scan misses.
Streaming with Extended Thinking
For long reasoning tasks, streaming lets you show progress as Claude thinks. The stream emits thinking events first, then text events.
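A streaming sketch. Iterating the SDK's message stream yields events; classify_event is a hypothetical helper that maps the raw events onto the kinds of UI updates described below:

```python
def classify_event(event):
    """Map a raw stream event to (kind, payload) for a UI, or None to ignore."""
    if event.type == "content_block_start":
        return ("start", event.content_block.type)      # "thinking" or "text"
    if event.type == "content_block_delta":
        if event.delta.type == "thinking_delta":
            return ("thinking", event.delta.thinking)   # reasoning chunk
        if event.delta.type == "text_delta":
            return ("text", event.delta.text)           # answer chunk
    return None

if __name__ == "__main__":
    from anthropic import Anthropic

    client = Anthropic()
    with client.messages.stream(
        model="claude-sonnet-4-5-20250514",
        max_tokens=16000,
        thinking={"type": "enabled", "budget_tokens": 10000},
        messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
    ) as stream:
        for event in stream:
            update = classify_event(event)
            if update:
                kind, payload = update
                print(f"[{kind}] {payload}", flush=True)
```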
The stream produces content_block_start events that tell you whether a new block is thinking or text. Then content_block_delta events carry the incremental content – thinking_delta for reasoning chunks and text_delta for answer chunks. This maps cleanly to a UI where you show a “thinking” spinner followed by the actual response.
Choosing the Right Budget
The budget_tokens value controls how much space Claude has to reason. Here’s how to think about it:
- 2000-5000 tokens: Simple math, straightforward logic, basic classification with explanations. Good for tasks where you want a quick sanity check.
- 5000-10000 tokens: Multi-step problems, code review, comparing options, analyzing arguments. This covers most practical use cases.
- 10000-30000 tokens: Complex mathematical proofs, lengthy code analysis, multi-constraint optimization, research synthesis.
Start with 10000 and adjust based on what you see in the thinking blocks. If Claude consistently uses less than half the budget, lower it. If the thinking gets cut off or feels rushed, bump it up.
The cost is real – thinking tokens count toward your output token usage. For high-volume workloads, profile a sample of requests to find the sweet spot between reasoning quality and cost.
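The tiers above can be encoded as a starting-point helper (the tier names and exact numbers are illustrative, to be tuned against observed thinking usage):

```python
# Starting-point budgets per task tier, following the ranges above.
BUDGET_TIERS = {
    "simple": 4000,     # basic math, classification with explanations
    "standard": 10000,  # multi-step problems, code review
    "complex": 20000,   # proofs, long code analysis, optimization
}

def thinking_config(tier: str = "standard") -> dict:
    """Build the thinking parameter for a given task tier."""
    return {"type": "enabled", "budget_tokens": BUDGET_TIERS[tier]}
```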
Common Errors and Fixes
budget_tokens must be less than max_tokens
The budget is a subset of the total output. If you set max_tokens=8000 and budget_tokens=10000, you’ll get this error. Fix: make sure max_tokens is larger than budget_tokens by enough margin for the actual response.
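A small client-side guard catches the misconfiguration before the API does (a sketch; the 1000-token margin is an arbitrary default):

```python
def validate_budget(max_tokens: int, budget_tokens: int, margin: int = 1000) -> None:
    """Fail fast if the thinking budget leaves no room for the answer."""
    if budget_tokens >= max_tokens:
        raise ValueError(
            f"budget_tokens ({budget_tokens}) must be less than max_tokens ({max_tokens})"
        )
    if max_tokens - budget_tokens < margin:
        raise ValueError(f"leave at least {margin} tokens for the final response")

validate_budget(max_tokens=16000, budget_tokens=10000)   # fine
# validate_budget(max_tokens=8000, budget_tokens=10000)  # raises ValueError
```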
thinking is not supported for this model
Extended thinking requires specific models. You’ll get this error with older Claude models or the wrong model ID. Use claude-sonnet-4-5-20250514 or check the Anthropic docs for current model support.
Empty thinking blocks
If the problem is trivial, Claude might not produce meaningful thinking content. This is normal – the model decides how much reasoning a question needs. Don’t rely on thinking blocks always being present or substantial.
Streaming event ordering issues
When streaming, thinking_delta events always come before text_delta events within a response. If you’re building a UI, initialize your state machine to expect thinking first, then transition to text on the first content_block_start event whose block type is text.
max_tokens too low for thinking + response
If Claude runs out of tokens mid-thought or mid-answer, you get a truncated response with stop_reason: "max_tokens". Check response.stop_reason – if it’s "max_tokens" instead of "end_turn", increase max_tokens.
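A retry sketch built around stop_reason; the doubling factor and single retry are arbitrary choices, and create_with_retry is a hypothetical helper, not part of the SDK:

```python
def truncated(response) -> bool:
    """True when the model ran out of tokens before finishing."""
    return response.stop_reason == "max_tokens"

def create_with_retry(client, params: dict, factor: int = 2):
    """Issue the request; if truncated, retry once with a larger token window."""
    response = client.messages.create(**params)
    if truncated(response):
        bigger = dict(params, max_tokens=params["max_tokens"] * factor)
        response = client.messages.create(**bigger)
    return response
```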
Temperature is not supported with extended thinking
You cannot set a temperature parameter when extended thinking is enabled. The API will return an error. Drop the temperature argument from your request when using thinking mode.
Related Guides
- How to Use the Anthropic Message Batches API for Async Workloads
- How to Use the Anthropic Batch API for High-Volume Processing
- How to Use the DeepSeek API for Code and Reasoning Tasks
- How to Use the Anthropic Token Streaming API for Real-Time UIs
- How to Use the xAI Grok API for Chat and Function Calling
- How to Use the Anthropic Citations API for Grounded Responses
- How to Use LiteLLM as a Universal LLM API Gateway
- How to Use the Anthropic Token Efficient Tool Use API
- How to Use the OpenRouter API for Multi-Provider LLM Access
- How to Use the Together AI API for Open-Source LLMs