Claude Sonnet 4.6 ships with a 1M token context window in beta: roughly 750,000 words of English text. That's enough to drop in an entire multi-repo codebase, a year's worth of financial filings, or a full clinical trial dataset and get coherent, precise answers without chunking anything.
This guide covers the exact API calls, prompt structure tricks, and cost controls you need to use it well.
Enabling the 1M Context Window
The standard context window for Sonnet 4.6 is 200K tokens. To push it to 1M, pass the beta flag context-1m-2025-08-07 on each request. You also need to be on API usage tier 4 or have a custom rate limit agreement with Anthropic; lower tiers get a permission error.
```python
import os

from anthropic import Anthropic

client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

response = client.beta.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=8192,
    messages=[
        {
            "role": "user",
            "content": "Summarize the key findings across all attached documents.",
        }
    ],
    betas=["context-1m-2025-08-07"],
)

print(response.content[0].text)
```
Note the method is client.beta.messages.create, not client.messages.create — the beta namespace is required when passing betas. If you accidentally call client.messages.create with the betas parameter, the SDK will raise a TypeError.
Count Tokens Before Sending
Before firing off a 900K-token request, count the tokens. You pay 2x input pricing for anything above 200K tokens, so you want to know exactly what you’re about to spend. The token counting API mirrors the messages API and returns an accurate count matching what billing will use.
```python
import os

from anthropic import Anthropic

client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

def load_documents(paths: list[str]) -> str:
    """Concatenate multiple text files into a single document block."""
    parts = []
    for path in paths:
        with open(path, "r", encoding="utf-8") as f:
            parts.append(f"=== {path} ===\n{f.read()}")
    return "\n\n".join(parts)

document_paths = [
    "reports/q1_2025.txt",
    "reports/q2_2025.txt",
    "reports/q3_2025.txt",
    "reports/q4_2025.txt",
]

corpus = load_documents(document_paths)

system_prompt = "You are a financial analyst. Answer questions based only on the documents provided."

messages = [
    {
        "role": "user",
        "content": f"{corpus}\n\n---\n\nWhat are the top three risks mentioned across all four quarterly reports?",
    }
]

# Count first, decide whether to proceed
token_count = client.messages.count_tokens(
    model="claude-sonnet-4-6",
    system=system_prompt,
    messages=messages,
)
input_tokens = token_count.input_tokens
print(f"Token count: {input_tokens:,}")

# Estimate cost
BASE_RATE = 3.00 / 1_000_000     # $3 per MTok for first 200K
PREMIUM_RATE = 6.00 / 1_000_000  # 2x = $6 per MTok above 200K

if input_tokens <= 200_000:
    estimated_cost = input_tokens * BASE_RATE
else:
    estimated_cost = (200_000 * BASE_RATE) + ((input_tokens - 200_000) * PREMIUM_RATE)

print(f"Estimated input cost: ${estimated_cost:.4f}")
```
This pattern lets you gate on token count before sending, log costs, and catch accidental document explosions before they hit your billing.
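As a concrete gate, you can wrap the count in a guard that refuses to send once a budget is exceeded. This is a sketch: the thresholds and helper names here are illustrative, not part of the Anthropic SDK, and the tiered rates mirror the estimate above.

```python
MAX_INPUT_TOKENS = 950_000  # illustrative budget, kept under the 1M limit
MAX_COST_USD = 5.00         # illustrative per-request spend cap

def estimate_input_cost(tokens: int) -> float:
    """Tiered estimate: $3/MTok up to 200K input tokens, $6/MTok above."""
    base = min(tokens, 200_000) * 3.00 / 1_000_000
    premium = max(tokens - 200_000, 0) * 6.00 / 1_000_000
    return base + premium

def check_budget(tokens: int) -> None:
    """Raise before sending if the request would exceed either budget."""
    if tokens > MAX_INPUT_TOKENS:
        raise ValueError(f"{tokens:,} tokens exceeds the {MAX_INPUT_TOKENS:,} budget")
    cost = estimate_input_cost(tokens)
    if cost > MAX_COST_USD:
        raise ValueError(f"Estimated ${cost:.2f} exceeds the ${MAX_COST_USD:.2f} cap")

check_budget(150_000)  # passes silently: under both limits
```

Call check_budget with the count_tokens result before the real request; a raised ValueError is far cheaper than a surprise invoice.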
What to Put Where in the Context Window
Position matters. Claude Sonnet 4.6 includes context awareness — the model is told its token budget at conversation start and receives updates after tool calls. But beyond that system-level feature, the ordering of your content still affects reasoning quality.
The general rule: documents in the middle, question at the end, system instructions at the top.
```python
import os
import pathlib

from anthropic import Anthropic

client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

def build_long_context_request(
    system_instructions: str,
    documents: dict[str, str],  # {"label": "content"}
    question: str,
    few_shot_examples: list[dict] | None = None,
) -> dict:
    """
    Build a well-structured long-context request.

    Placement strategy:
    - system: high-level role, output format, constraints
    - user[0]: few-shot examples if any (optional)
    - user[1]: document corpus, clearly delimited
    - user[2]: the actual question (last, so it's fresh in attention)
    """
    user_content_parts = []

    if few_shot_examples:
        example_text = "Here are examples of the output format I expect:\n\n"
        for ex in few_shot_examples:
            example_text += f"Q: {ex['question']}\nA: {ex['answer']}\n\n"
        user_content_parts.append(example_text.strip())

    doc_block = "DOCUMENT CORPUS\n" + "=" * 60 + "\n\n"
    for label, content in documents.items():
        doc_block += f"[START: {label}]\n{content}\n[END: {label}]\n\n"
    user_content_parts.append(doc_block.strip())

    user_content_parts.append(f"QUESTION\n{question}")

    return {
        "system": system_instructions,
        "messages": [
            {
                "role": "user",
                "content": "\n\n".join(user_content_parts),
            }
        ],
    }

# Example: analyze a large codebase
repo_files = {}
for path in pathlib.Path("./src").rglob("*.py"):
    try:
        repo_files[str(path)] = path.read_text(encoding="utf-8")
    except Exception:
        pass  # skip binary or unreadable files

request = build_long_context_request(
    system_instructions=(
        "You are a senior software engineer performing a code review. "
        "Point out security vulnerabilities, performance bottlenecks, and API misuse. "
        "Be specific: cite the file and line context."
    ),
    documents=repo_files,
    question="What are the three most critical issues in this codebase, and how should each be fixed?",
    few_shot_examples=[
        {
            "question": "What is the biggest bug?",
            "answer": "In auth/login.py, the session token is generated with random.random() (not cryptographically secure). Replace with secrets.token_hex(32).",
        }
    ],
)

response = client.beta.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=4096,
    betas=["context-1m-2025-08-07"],
    **request,
)
print(response.content[0].text)
```
The labeled delimiters ([START: filename] / [END: filename]) give Claude clear anchors to cite when it refers back to specific files in its response. Without them, long documents blur together and citations become vague.
Prompt Caching to Slash Costs
If you’re running multiple questions against the same large corpus — an analyst querying a document set repeatedly, or a QA pipeline running dozens of checks — prompt caching is the right move. Cache reads cost 10% of the base input price ($0.30/MTok vs $3.00/MTok for standard). Cache writes cost 25% more than base for the 5-minute TTL tier.
```python
import os

from anthropic import Anthropic

client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

def load_corpus(path: str) -> str:
    with open(path, "r", encoding="utf-8") as f:
        return f.read()

large_corpus = load_corpus("data/full_legal_archive.txt")

questions = [
    "Which clauses create indemnification obligations for the licensee?",
    "What are the termination conditions and notice periods?",
    "List all jurisdiction-specific carve-outs.",
]

results = []
for question in questions:
    response = client.beta.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        betas=["context-1m-2025-08-07"],
        system=[
            {
                "type": "text",
                "text": "You are a legal analyst. Answer only from the documents provided. Quote relevant clauses.",
                "cache_control": {"type": "ephemeral"},  # 5-min TTL cache
            }
        ],
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        # Mark the corpus block as cacheable
                        "type": "text",
                        "text": large_corpus,
                        "cache_control": {"type": "ephemeral"},
                    },
                    {
                        "type": "text",
                        "text": question,
                    },
                ],
            }
        ],
    )

    usage = response.usage
    results.append({
        "question": question,
        "answer": response.content[0].text,
        "cache_read_tokens": getattr(usage, "cache_read_input_tokens", 0),
        "cache_write_tokens": getattr(usage, "cache_creation_input_tokens", 0),
        "uncached_input_tokens": usage.input_tokens,
    })
    print(f"Q: {question[:60]}...")
    print(f"  Cache read: {results[-1]['cache_read_tokens']:,} | Write: {results[-1]['cache_write_tokens']:,}")
    print()
```
On the first call, the corpus is written to the cache. Every subsequent call within 5 minutes reads it back at 10% of the input price. For a 600K-token corpus at the tiered rates above, that drops per-question input cost from ~$3.00 to ~$0.18 on cache hits.
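For reference, the arithmetic behind a 600K-token corpus works out as follows. This is a back-of-envelope sketch using the tiered rates from the token-counting section, and it assumes the $0.30/MTok cache-read rate applies to the whole corpus.

```python
CORPUS_TOKENS = 600_000
BASE = 3.00 / 1_000_000        # $/token for the first 200K input tokens
PREMIUM = 6.00 / 1_000_000     # $/token above 200K
CACHE_READ = 0.30 / 1_000_000  # 10% of the base input price

uncached = 200_000 * BASE + (CORPUS_TOKENS - 200_000) * PREMIUM
cache_hit = CORPUS_TOKENS * CACHE_READ
print(f"Uncached: ${uncached:.2f} per question, cache hit: ${cache_hit:.2f}")
# Uncached: $3.00 per question, cache hit: $0.18
```

Over a dozen questions against the same corpus, caching turns ~$36 of input spend into roughly $3 for the first call plus ~$0.18 per follow-up.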
Chunking as a Fallback
Not everything fits in 1M tokens — or you might not have tier 4 access yet. For those cases, map-reduce chunking is the reliable fallback.
```python
import os

from anthropic import Anthropic

client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

def chunk_text(text: str, chunk_size: int = 150_000, overlap: int = 2_000) -> list[str]:
    """
    Split text into overlapping chunks by character count.
    The overlap preserves sentence context at boundaries.
    """
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start += chunk_size - overlap
    return chunks

def map_reduce_summarize(text: str, question: str) -> str:
    """
    Step 1 (Map): Get partial answers from each chunk.
    Step 2 (Reduce): Synthesize partial answers into a final answer.
    """
    chunks = chunk_text(text)
    partial_answers = []

    for i, chunk in enumerate(chunks):
        print(f"Processing chunk {i + 1}/{len(chunks)}...")
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=1024,
            messages=[
                {
                    "role": "user",
                    "content": (
                        f"Document section {i + 1} of {len(chunks)}:\n\n{chunk}\n\n"
                        f"Question: {question}\n\n"
                        "Answer based only on this section. "
                        "If the section doesn't contain relevant information, say 'Not found in this section.'"
                    ),
                }
            ],
        )
        partial_answers.append(response.content[0].text)

    # Reduce step
    combined = "\n\n".join(
        f"[Section {i + 1} answer]\n{ans}" for i, ans in enumerate(partial_answers)
    )
    reduce_response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        messages=[
            {
                "role": "user",
                "content": (
                    f"You have partial answers from {len(chunks)} document sections:\n\n"
                    f"{combined}\n\n"
                    f"Original question: {question}\n\n"
                    "Synthesize a single, comprehensive final answer. "
                    "Reconcile contradictions. Ignore 'Not found' sections."
                ),
            }
        ],
    )
    return reduce_response.content[0].text

# Usage
with open("data/large_report.txt", "r", encoding="utf-8") as f:
    document = f.read()

answer = map_reduce_summarize(
    document,
    "What are the main risk factors and recommended mitigations?",
)
print(answer)
```
Map-reduce is slower than single-shot (multiple API calls vs one) but has no tier requirement and works with the standard 200K window. Use it when you can’t get 1M access or when the document genuinely exceeds 1M tokens.
Common Errors and Fixes
anthropic.BadRequestError: context_length_exceeded
You’ve exceeded the 200K limit without the beta header, or exceeded 1M with it. Either enable the beta header or reduce your content. Run token counting first.
anthropic.PermissionDeniedError: feature not available for your tier
Your API key is not at usage tier 4. Check your tier in the Anthropic console. You need to advance your tier or contact Anthropic for a custom rate limit.
TypeError: create() got an unexpected keyword argument 'betas'
You called client.messages.create() instead of client.beta.messages.create(). The beta namespace is required.
Responses that seem to ignore part of the document
Long-range retrieval degrades for certain content types at the far end of very long prompts. Put the most critical excerpts near the start or near the question, not buried in the middle of a 900K-token block. Use clear delimiters and labels so the model has structural anchors.