Claude can think before it answers. The extended thinking feature gives Claude a scratchpad – a chain-of-thought reasoning space where it works through problems step by step before producing a final response. You get access to both the thinking process and the answer, which is incredibly useful for debugging, auditing, and building trust in complex outputs.

The feature works through the standard Messages API. You pass a thinking parameter, set a token budget, and Claude returns two types of content blocks: thinking blocks with its reasoning and text blocks with the final answer.

Enable Extended Thinking

Install the SDK and set your API key:

pip install anthropic
export ANTHROPIC_API_KEY="sk-ant-..."

Here’s the simplest extended thinking request:

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=16000,
    thinking={
        "type": "enabled",
        "budget_tokens": 10000
    },
    messages=[
        {
            "role": "user",
            "content": "How many r's are in the word 'strawberry'?"
        }
    ]
)

for block in response.content:
    if block.type == "thinking":
        print(f"Thinking:\n{block.thinking}\n")
    elif block.type == "text":
        print(f"Answer:\n{block.text}")

The thinking parameter takes two fields: type must be "enabled", and budget_tokens sets the upper limit on how many tokens Claude can spend reasoning. Claude won’t always use the full budget – it spends what the problem requires.

A few rules to keep in mind:

  • budget_tokens must be less than max_tokens. The budget is a subset of the total token allocation.
  • max_tokens needs to be large enough to fit both the thinking and the response. For reasoning-heavy tasks, 16000 is a good starting point.
  • Extended thinking is available on Claude Sonnet 3.7 and newer models, including claude-sonnet-4-5; check the Anthropic docs for the current list.
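As a quick sanity check before sending a request, the first two rules can be encoded in a small helper. This is only a sketch; check_thinking_budget is a hypothetical name, not part of the SDK:

```python
# Hypothetical helper -- not part of the anthropic SDK -- that checks
# whether a thinking budget leaves room for the final answer.
def check_thinking_budget(budget_tokens: int, max_tokens: int,
                          answer_margin: int = 2000) -> bool:
    """True if budget_tokens fits inside max_tokens with margin to spare."""
    return budget_tokens + answer_margin <= max_tokens

assert check_thinking_budget(10000, 16000)      # budget fits
assert not check_thinking_budget(10000, 8000)   # budget exceeds max_tokens
```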

Process Thinking and Text Blocks

The response contains a list of content blocks. Thinking blocks always come before text blocks. Here’s how to separate them cleanly:

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=16000,
    thinking={
        "type": "enabled",
        "budget_tokens": 8000
    },
    messages=[
        {
            "role": "user",
            "content": "A bat and a ball cost $1.10 together. The bat costs $1.00 more than the ball. How much does the ball cost?"
        }
    ]
)

thinking_text = ""
answer_text = ""

for block in response.content:
    if block.type == "thinking":
        thinking_text += block.thinking
    elif block.type == "text":
        answer_text += block.text

print(f"Input tokens: {response.usage.input_tokens}")
print(f"Total output tokens: {response.usage.output_tokens}")
print(f"\n--- Reasoning ---\n{thinking_text}")
print(f"\n--- Final Answer ---\n{answer_text}")

This is the classic cognitive reflection test. Without thinking, models often blurt out “$0.10” (wrong). With extended thinking, Claude sets up the algebra – if the ball costs x, then x + (x + 1.00) = 1.10, so 2x = 0.10, and x = 0.05. The answer is $0.05.

You can log or store the thinking blocks separately. They’re useful for debugging when the final answer is wrong, or for showing users that the model actually worked through the problem.
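One way to do that is a small helper that splits a content list into reasoning and answer strings. The Block dataclass below is only a stand-in for the SDK's content-block objects, which expose the same .type, .thinking, and .text attributes:

```python
from dataclasses import dataclass

# Stand-in for the SDK's content-block objects.
@dataclass
class Block:
    type: str
    thinking: str = ""
    text: str = ""

def split_blocks(content):
    """Separate thinking content from answer content for logging."""
    thinking = "".join(b.thinking for b in content if b.type == "thinking")
    answer = "".join(b.text for b in content if b.type == "text")
    return thinking, answer

blocks = [Block("thinking", thinking="2x = 0.10, so x = 0.05"),
          Block("text", text="The ball costs $0.05.")]
reasoning, answer = split_blocks(blocks)
assert answer == "The ball costs $0.05."
```

In a real application, `reasoning` would go to your logging or storage layer while `answer` goes to the user.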

Practical Example: Code Review with Extended Thinking

Extended thinking shines when you need Claude to analyze code and catch subtle bugs. The reasoning space lets it trace execution paths, check edge cases, and build a mental model of the code before responding.

import anthropic

client = anthropic.Anthropic()

code_to_review = """
def merge_sorted_lists(list_a, list_b):
    result = []
    i, j = 0, 0
    while i < len(list_a) and j < len(list_b):
        if list_a[i] <= list_b[j]:
            result.append(list_a[i])
            i += 1
        else:
            result.append(list_b[j])
            j += 1
    result += list_a[i:]
    result += list_b[j:]
    return result

def merge_sort(arr):
    if len(arr) <= 1:
        return arr
    mid = len(arr) // 2
    left = merge_sort(arr[:mid])
    right = merge_sort(arr[mid:])
    return merge_sorted_lists(left, right)

def find_kth_smallest(arrays, k):
    merged = []
    for arr in arrays:
        merged = merge_sorted_lists(merged, arr)
    return merged[k]
"""

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=16000,
    thinking={
        "type": "enabled",
        "budget_tokens": 10000
    },
    messages=[
        {
            "role": "user",
            "content": f"Review this Python code for bugs, edge cases, and performance issues. Be specific about line-level problems.\n\n```python\n{code_to_review}\n```"
        }
    ]
)

for block in response.content:
    if block.type == "thinking":
        print("=== Claude's Reasoning ===")
        preview = block.thinking[:500] + "..." if len(block.thinking) > 500 else block.thinking
        print(preview)
        print()
    elif block.type == "text":
        print("=== Review Results ===")
        print(block.text)

The thinking blocks will show Claude tracing through the merge logic, checking the indexing in find_kth_smallest (if k is meant to be 1-based, merged[k] is off by one, and it raises an IndexError whenever k equals the total element count), and reasoning about whether the merge assumes sorted input. This kind of step-by-step analysis catches bugs that a quick scan misses.

Streaming with Extended Thinking

For long reasoning tasks, streaming lets you show progress as Claude thinks. The stream emits thinking events first, then text events.

import anthropic

client = anthropic.Anthropic()

print("--- Thinking ---")
thinking_started = False
text_started = False

with client.messages.stream(
    model="claude-sonnet-4-5",
    max_tokens=16000,
    thinking={
        "type": "enabled",
        "budget_tokens": 10000
    },
    messages=[
        {
            "role": "user",
            "content": "Solve step by step: If a train travels 120 km in 2 hours, stops for 30 minutes, then travels 90 km in 1.5 hours, what is the average speed for the entire journey including the stop?"
        }
    ]
) as stream:
    for event in stream:
        if event.type == "content_block_start":
            if hasattr(event.content_block, "type"):
                if event.content_block.type == "thinking" and not thinking_started:
                    thinking_started = True
                elif event.content_block.type == "text" and not text_started:
                    text_started = True
                    print("\n\n--- Answer ---")

        elif event.type == "content_block_delta":
            if event.delta.type == "thinking_delta":
                print(event.delta.thinking, end="", flush=True)
            elif event.delta.type == "text_delta":
                print(event.delta.text, end="", flush=True)

print()

The stream produces content_block_start events that tell you whether a new block is thinking or text. Then content_block_delta events carry the incremental content – thinking_delta for reasoning chunks and text_delta for answer chunks. This maps cleanly to a UI where you show a “thinking” spinner followed by the actual response.
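Here is a minimal sketch of that state machine, driven by mock events instead of a live stream. The SimpleNamespace objects only mimic the shapes of content_block_start and content_block_delta events:

```python
from types import SimpleNamespace as NS

def render(events):
    """Accumulate streamed output, marking the thinking-to-text transition."""
    out, state = [], None
    for e in events:
        if e.type == "content_block_start":
            if e.content_block.type == "text" and state == "thinking":
                out.append("\n--- Answer ---\n")  # spinner off, show answer
            state = e.content_block.type
        elif e.type == "content_block_delta":
            if e.delta.type == "thinking_delta":
                out.append(e.delta.thinking)
            elif e.delta.type == "text_delta":
                out.append(e.delta.text)
    return "".join(out)

# Mock event sequence in the order the API emits it: thinking first, then text.
events = [
    NS(type="content_block_start", content_block=NS(type="thinking")),
    NS(type="content_block_delta", delta=NS(type="thinking_delta", thinking="2x = 0.10")),
    NS(type="content_block_start", content_block=NS(type="text")),
    NS(type="content_block_delta", delta=NS(type="text_delta", text="$0.05")),
]
assert render(events) == "2x = 0.10\n--- Answer ---\n$0.05"
```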

Choosing the Right Budget

The budget_tokens value controls how much space Claude has to reason. Here’s how to think about it:

  • 2000-5000 tokens: Simple math, straightforward logic, basic classification with explanations. Good for tasks where you want a quick sanity check.
  • 5000-10000 tokens: Multi-step problems, code review, comparing options, analyzing arguments. This covers most practical use cases.
  • 10000-30000 tokens: Complex mathematical proofs, lengthy code analysis, multi-constraint optimization, research synthesis.

Start with 10000 and adjust based on what you see in the thinking blocks. If Claude consistently uses less than half the budget, lower it. If the thinking gets cut off or feels rushed, bump it up.

The cost is real – thinking tokens count toward your output token usage. For high-volume workloads, profile a sample of requests to find the sweet spot between reasoning quality and cost.
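A rough way to do that profiling, assuming you have collected usage.output_tokens from a sample of requests (suggest_budget is a hypothetical helper, and the headroom factor is a guess you should tune):

```python
def suggest_budget(output_token_samples, headroom=1.25):
    """Suggest a budget_tokens value with headroom over the observed peak."""
    return int(max(output_token_samples) * headroom)

# Hypothetical usage.output_tokens values from a sample of requests.
samples = [3200, 4100, 2800, 5600]
assert suggest_budget(samples) == 7000
```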

Common Errors and Fixes

budget_tokens must be less than max_tokens

The budget is a subset of the total output. If you set max_tokens=8000 and budget_tokens=10000, you’ll get this error. Fix: make sure max_tokens is larger than budget_tokens by enough margin for the actual response.

# Wrong -- budget exceeds max_tokens
thinking={"type": "enabled", "budget_tokens": 10000}
# with max_tokens=8000

# Right -- leave room for the answer
thinking={"type": "enabled", "budget_tokens": 10000}
# with max_tokens=16000

thinking is not supported for this model

Extended thinking requires specific models. You’ll get this error with older Claude models or a wrong model ID. Use claude-sonnet-4-5 or newer, or check the Anthropic docs for current model support.

Empty thinking blocks

If the problem is trivial, Claude might not produce meaningful thinking content. This is normal – the model decides how much reasoning a question needs. Don’t rely on thinking blocks always being present or substantial.
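A defensive way to extract reasoning, then, is to default to an empty string when no thinking block appears. The SimpleNamespace list below stands in for response.content:

```python
from types import SimpleNamespace

# A response whose content has no thinking block (trivial question).
content = [SimpleNamespace(type="text", text="4")]

# Default to "" rather than assuming a thinking block exists.
thinking = next((b.thinking for b in content if b.type == "thinking"), "")
answer = next((b.text for b in content if b.type == "text"), "")

assert thinking == ""
assert answer == "4"
```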

Streaming event ordering issues

When streaming, thinking_delta events always come before text_delta events within a response. If you’re building a UI, initialize your state machine to expect thinking first, then transition to text on the first text content block start event.

max_tokens too low for thinking + response

If Claude runs out of tokens mid-thought or mid-answer, you get a truncated response with stop_reason: "max_tokens". Check response.stop_reason – if it’s "max_tokens" instead of "end_turn", increase max_tokens.

if response.stop_reason == "max_tokens":
    print("Warning: response was truncated. Increase max_tokens.")
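To make truncation recoverable, one option is to retry with a larger window. This is only a sketch; make_request stands in for a wrapper around client.messages.create that you would write yourself:

```python
def request_with_retry(make_request, max_tokens, limit=32000):
    """Retry with a doubled max_tokens while the response is truncated."""
    while True:
        response = make_request(max_tokens)
        if response["stop_reason"] != "max_tokens" or max_tokens >= limit:
            return response
        max_tokens *= 2

# Fake request for illustration: truncates until max_tokens reaches 16000.
fake = lambda mt: {"stop_reason": "end_turn" if mt >= 16000 else "max_tokens"}
assert request_with_retry(fake, 8000)["stop_reason"] == "end_turn"
```

Cap the retry (the `limit` parameter) so a pathological request can't double its way to runaway token spend.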

Temperature is not supported with extended thinking

When extended thinking is enabled, temperature cannot be changed from its default; the API returns an error if you set it. Drop the temperature argument from your request when using thinking mode.
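In the same wrong/right style as above (the `...` stands for the rest of the request arguments):

```python
# Wrong -- temperature set alongside thinking (the API rejects this)
# response = client.messages.create(
#     ..., temperature=0.7,
#     thinking={"type": "enabled", "budget_tokens": 8000},
# )

# Right -- omit temperature when thinking is enabled
# response = client.messages.create(
#     ...,
#     thinking={"type": "enabled", "budget_tokens": 8000},
# )
```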