The Quick Version

Every LLM has a context window limit. When your input exceeds it, the API returns an error or silently truncates. The fix isn’t just “use a bigger model” — it’s designing your application to handle large inputs gracefully.

Count tokens before sending requests:

pip install tiktoken openai
import tiktoken

def count_tokens(text: str, model: str = "gpt-4o") -> int:
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

document = open("long_report.txt").read()
tokens = count_tokens(document)
print(f"Document: {tokens} tokens")

# GPT-4o context: 128K tokens
# Claude 3.5: 200K tokens
# Gemini 1.5: 1M tokens
if tokens > 120_000:
    print("Need to chunk or summarize before sending")

This gives you the exact token count so you can decide whether to chunk, summarize, or truncate before hitting the API.

Token Counting for Different Providers

Each provider uses different tokenizers. Using the wrong one gives you inaccurate counts and unexpected truncation.

import tiktoken
from anthropic import Anthropic

# OpenAI models
def openai_tokens(text: str, model: str = "gpt-4o") -> int:
    enc = tiktoken.encoding_for_model(model)
    return len(enc.encode(text))

# Anthropic Claude — use their API's built-in counter
client = Anthropic()
def claude_tokens(text: str) -> int:
    response = client.messages.count_tokens(
        model="claude-sonnet-4-5-20250929",
        messages=[{"role": "user", "content": text}],
    )
    return response.input_tokens

# Quick estimate for any model (rough but fast)
def estimate_tokens(text: str) -> int:
    """~4 characters per token for English text"""
    return len(text) // 4

The 4-chars-per-token rule is a decent first approximation for English prose, but it undercounts for code and for most non-English text. For production code, always use the provider-specific tokenizer.

Chunking Strategies for Large Documents

When a document exceeds your context window, split it into overlapping chunks. The overlap prevents losing context at chunk boundaries.

from tiktoken import encoding_for_model

def chunk_text(text: str, max_tokens: int = 4000, overlap: int = 200, model: str = "gpt-4o") -> list[str]:
    if overlap >= max_tokens:
        raise ValueError("overlap must be smaller than max_tokens")
    enc = encoding_for_model(model)
    tokens = enc.encode(text)
    chunks = []
    start = 0

    while start < len(tokens):
        end = start + max_tokens
        chunks.append(enc.decode(tokens[start:end]))
        if end >= len(tokens):
            break  # last chunk; don't emit an overlap-only tail
        start = end - overlap  # slide back for overlap

    return chunks

# Process each chunk independently
document = open("annual_report.txt").read()
chunks = chunk_text(document, max_tokens=4000, overlap=200)
print(f"Split into {len(chunks)} chunks")

Choosing Chunk Size and Overlap

Smaller chunks (1000-2000 tokens) work better for retrieval — they’re more focused and match queries more precisely. Larger chunks (4000-8000 tokens) work better when you need the LLM to understand broader context, like summarizing a chapter.

For overlap, 10-15% of your chunk size is the sweet spot. Too little overlap and you lose sentences that span boundaries. Too much and you waste tokens on redundant content.
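You can derive the overlap from the chunk size instead of hard-coding it. A minimal sketch, using 12% as the midpoint of the 10-15% range above (the bounds are assumptions to tune for your data):

```python
def overlap_for(chunk_tokens: int, fraction: float = 0.12) -> int:
    """Pick an overlap of ~10-15% of the chunk size, with sane bounds."""
    overlap = int(chunk_tokens * fraction)
    # At least ~50 tokens (a sentence or two), never more than half the chunk
    return min(max(overlap, 50), chunk_tokens // 2)

# 4000-token chunks get a 480-token overlap
print(overlap_for(4000))
```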

Sliding Window for Conversations

Long conversations hit context limits fast. A sliding window keeps the most recent messages while preserving the system prompt and key context.

from openai import OpenAI
import tiktoken

client = OpenAI()

def trim_conversation(
    messages: list[dict],
    max_tokens: int = 8000,
    model: str = "gpt-4o"
) -> list[dict]:
    enc = tiktoken.encoding_for_model(model)

    # Always keep the system message
    system_msgs = [m for m in messages if m["role"] == "system"]
    other_msgs = [m for m in messages if m["role"] != "system"]

    system_tokens = sum(len(enc.encode(m["content"])) for m in system_msgs)
    budget = max_tokens - system_tokens

    # Keep messages from most recent, working backwards
    kept = []
    used = 0
    for msg in reversed(other_msgs):
        msg_tokens = len(enc.encode(msg["content"]))
        if used + msg_tokens > budget:
            break
        kept.insert(0, msg)
        used += msg_tokens

    return system_msgs + kept

# Usage
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    # ... hundreds of messages ...
]
messages.append({"role": "user", "content": "What did we discuss earlier?"})

trimmed = trim_conversation(messages, max_tokens=8000)
response = client.chat.completions.create(model="gpt-4o", messages=trimmed)

The tradeoff: the model loses memory of early conversation turns. For important context, extract key facts into the system prompt or use a summary buffer.
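One way to build that summary buffer: summarize the turns you drop and carry the result as an extra system message. A minimal sketch with the summarizer injected as a callable, so you can plug in an LLM call (the `trim_with_summary` name and message shapes are illustrative, not a library API):

```python
from typing import Callable

def trim_with_summary(
    messages: list[dict],
    keep_last: int,
    summarize: Callable[[str], str],
) -> list[dict]:
    """Drop old turns but fold them into a running summary message."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]

    if len(turns) <= keep_last:
        return system + turns

    dropped, kept = turns[:-keep_last], turns[-keep_last:]
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in dropped)
    summary = summarize(transcript)  # e.g. one chat.completions call
    memo = {"role": "system", "content": f"Summary of earlier conversation:\n{summary}"}
    return system + [memo] + kept
```

In production, re-summarize incrementally (old summary + newly dropped turns) rather than re-reading the full transcript on every request.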

Map-Reduce for Documents That Don’t Fit

When you need to process an entire large document (not just retrieve from it), map-reduce is the standard pattern. Summarize each chunk, then summarize the summaries.

from openai import OpenAI

client = OpenAI()

def map_reduce_summarize(chunks: list[str], model: str = "gpt-4o") -> str:
    # Map: summarize each chunk
    summaries = []
    for i, chunk in enumerate(chunks):
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": "Summarize this section concisely. Keep key facts and numbers."},
                {"role": "user", "content": chunk},
            ],
            max_tokens=500,
        )
        summaries.append(response.choices[0].message.content)
        print(f"Summarized chunk {i+1}/{len(chunks)}")

    # Reduce: combine summaries into final summary
    combined = "\n\n---\n\n".join(summaries)
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Combine these section summaries into one coherent summary. Preserve all key details."},
            {"role": "user", "content": combined},
        ],
        max_tokens=1000,
    )
    return response.choices[0].message.content

chunks = chunk_text(open("10k_filing.txt").read(), max_tokens=4000)
summary = map_reduce_summarize(chunks)
print(summary)

For documents under 50 chunks, this works well. Beyond that, add an intermediate reduce step to group chunks by section before the final summary.
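That intermediate step can be a simple hierarchical reduce: combine summaries in batches, then reduce the batch results, repeating until one remains. A sketch with the combiner injected (the `combine` callable stands in for the reduce prompt above; assumes a non-empty summary list):

```python
from typing import Callable

def hierarchical_reduce(
    summaries: list[str],
    combine: Callable[[list[str]], str],
    batch_size: int = 10,
) -> str:
    """Reduce in rounds: batch, combine each batch, repeat until one summary remains."""
    while len(summaries) > 1:
        summaries = [
            combine(summaries[i:i + batch_size])
            for i in range(0, len(summaries), batch_size)
        ]
    return summaries[0]
```

With 100 chunk summaries and a batch size of 10, this makes 10 combine calls in round one and a final call in round two, so no single request ever sees more than `batch_size` summaries.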

Common Errors and Fixes

InvalidRequestError: This model's maximum context length is X tokens

You sent more tokens than the model supports. Count tokens before the request and truncate or chunk as needed. Don’t just catch the error — it wastes an API call.

Summaries lose critical details

Your chunk size is too large for the summary prompt. Reduce chunk size to 2000-3000 tokens so the model can focus on each section. Also, add explicit instructions like “preserve all numbers, dates, and named entities.”

Overlap creates duplicate content in outputs

When processing chunks that overlap, the same sentences appear in multiple chunk responses. Deduplicate at the reduce step, or use a “continue from where you left off” prompt pattern instead of overlap.
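A lightweight version of that dedup: merge the chunk outputs line by line, keeping only the first occurrence of each line (normalized for whitespace and case). This is a sketch; exact-match dedup only catches the literal repeats that overlap produces, not paraphrases:

```python
def dedupe_lines(chunk_outputs: list[str]) -> str:
    """Merge chunk outputs, keeping only the first occurrence of each line."""
    seen = set()
    merged = []
    for output in chunk_outputs:
        for line in output.splitlines():
            key = " ".join(line.split()).lower()  # normalize whitespace and case
            if key and key not in seen:
                seen.add(key)
                merged.append(line)
    return "\n".join(merged)
```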

Token count doesn’t match API error

You’re counting tokens for the input text only, but the model’s limit also covers the system prompt, tool/function definitions, a few tokens of per-message chat formatting, and the response itself. Reserve at least 1000-2000 tokens for the response and any tool schemas.
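A small budget helper makes that accounting explicit. The default reservations here are assumptions to tune per application:

```python
def input_budget(
    context_limit: int,
    max_response_tokens: int = 2000,
    tool_schema_tokens: int = 0,
    safety_margin: int = 500,
) -> int:
    """Tokens left for messages after reserving room for everything else."""
    budget = context_limit - max_response_tokens - tool_schema_tokens - safety_margin
    if budget <= 0:
        raise ValueError("No room left for input; lower the reservations")
    return budget

# 128K model, 2K response, ~1.5K of tool definitions
print(input_budget(128_000, max_response_tokens=2000, tool_schema_tokens=1500))
```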

Which Strategy When

Use chunking + retrieval (RAG) when users ask questions about a large document. You only need the relevant chunks, not the whole thing.

Use sliding window for chat applications where recent context matters most and older turns can be dropped.

Use map-reduce when you need to process the entire document — summarization, extraction, or analysis that requires seeing everything.

Use a bigger context window (Gemini 1.5 Pro at 1M tokens) when the document fits and you can afford the latency and cost. Longer contexts are slower and more expensive per request, but they avoid the complexity of chunking entirely.