The Quick Version
Every LLM has a context window limit. When your input exceeds it, the API returns an error or silently truncates. The fix isn’t just “use a bigger model” — it’s designing your application to handle large inputs gracefully.
Count tokens before sending requests:
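A minimal sketch of the pre-flight count, using OpenAI's tiktoken library with a chars/4 heuristic fallback (the model name, context limit, and reserve values below are illustrative, not provider guarantees):

```python
def count_tokens(text: str, model: str = "gpt-4o") -> int:
    try:
        import tiktoken  # OpenAI's tokenizer library
        encoding = tiktoken.encoding_for_model(model)
        return len(encoding.encode(text))
    except Exception:
        # Rough heuristic: ~4 characters per token for English text
        return max(1, len(text) // 4)

def fits_in_context(text: str, limit: int = 128_000, reserve: int = 2_000) -> bool:
    # Reserve headroom for the system prompt and the model's response
    return count_tokens(text) <= limit - reserve
```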
This gives you the exact token count so you can decide whether to chunk, summarize, or truncate before hitting the API.
Token Counting for Different Providers
Each provider uses different tokenizers. Using the wrong one gives you inaccurate counts and unexpected truncation.
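A hedged sketch of provider-aware counting. The Anthropic call below is an assumption based on its server-side token-counting endpoint, and the model id is hypothetical; verify method names against each SDK's current docs before relying on them:

```python
def count_for_provider(text: str, provider: str, client=None) -> int:
    if provider == "openai":
        import tiktoken
        return len(tiktoken.get_encoding("cl100k_base").encode(text))
    if provider == "anthropic" and client is not None:
        # Assumed SDK call: Anthropic's server-side count_tokens endpoint
        resp = client.messages.count_tokens(
            model="claude-sonnet-4-5",  # hypothetical model id
            messages=[{"role": "user", "content": text}],
        )
        return resp.input_tokens
    # No tokenizer or client available: ~4 chars/token heuristic
    return max(1, len(text) // 4)
```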
The 4-chars-per-token rule is a decent first approximation. For production code, always use the provider-specific tokenizer.
Chunking Strategies for Large Documents
When a document exceeds your context window, split it into overlapping chunks. The overlap prevents losing context at chunk boundaries.
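One way to sketch this is in token space: tokenize first, chunk the token list, then decode each chunk back to text so sizes are exact. The function name and defaults here are illustrative:

```python
def chunk_with_overlap(tokens: list, chunk_size: int = 1000, overlap: int = 150) -> list:
    """Split a token list into overlapping chunks.

    The overlap repeats the tail of each chunk at the head of the next,
    so a sentence spanning a boundary appears whole in at least one chunk.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this chunk already reached the end of the input
    return chunks
```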
Choosing Chunk Size and Overlap
Smaller chunks (1000-2000 tokens) work better for retrieval — they’re more focused and match queries more precisely. Larger chunks (4000-8000 tokens) work better when you need the LLM to understand broader context, like summarizing a chapter.
For overlap, 10-15% of your chunk size is the sweet spot. Too little overlap and you lose sentences that span boundaries. Too much and you waste tokens on redundant content.
Sliding Window for Conversations
Long conversations hit context limits fast. A sliding window keeps the most recent messages while preserving the system prompt and key context.
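A minimal sketch of the window, assuming the usual role/content message dicts and a token-counting function you supply (e.g. from a tokenizer library):

```python
def sliding_window(messages: list, max_tokens: int, count) -> list:
    """Keep the system prompt plus as many recent turns as fit the budget.

    `count` maps a message's content string to its token count.
    """
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    budget = max_tokens - sum(count(m["content"]) for m in system)
    kept = []
    for msg in reversed(turns):  # walk newest-first
        cost = count(msg["content"])
        if cost > budget:
            break  # stop at the first turn that no longer fits
        kept.append(msg)
        budget -= cost
    return system + kept[::-1]  # restore chronological order
```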
The tradeoff: the model loses memory of early conversation turns. For important context, extract key facts into the system prompt or use a summary buffer.
Map-Reduce for Documents That Don’t Fit
When you need to process an entire large document (not just retrieve from it), map-reduce is the standard pattern. Summarize each chunk, then summarize the summaries.
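The pattern can be sketched as follows; `call_llm` is a placeholder for your provider's completion call, not a real SDK function:

```python
def map_reduce_summarize(chunks: list, call_llm) -> str:
    # Map: summarize each chunk independently (safe to parallelize)
    partials = [
        call_llm(
            "Summarize this section. Preserve all numbers, dates, "
            f"and named entities:\n\n{chunk}"
        )
        for chunk in chunks
    ]
    # Reduce: merge the partial summaries into a single summary
    joined = "\n\n".join(partials)
    return call_llm(
        f"Combine these section summaries into one coherent summary:\n\n{joined}"
    )
```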
For documents under 50 chunks, this works well. Beyond that, add an intermediate reduce step to group chunks by section before the final summary.
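The intermediate step can be sketched as repeated grouping: collapse summaries in fixed-size groups until a single reduce call suffices. The group size and prompts here are illustrative, and `call_llm` is again a placeholder:

```python
def hierarchical_reduce(summaries: list, call_llm, group_size: int = 10) -> str:
    # Collapse summaries in groups until one final call can see them all
    while len(summaries) > group_size:
        groups = [
            summaries[i:i + group_size]
            for i in range(0, len(summaries), group_size)
        ]
        summaries = [
            call_llm("Merge these summaries:\n\n" + "\n\n".join(g))
            for g in groups
        ]
    return call_llm("Write the final summary:\n\n" + "\n\n".join(summaries))
```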
Common Errors and Fixes
InvalidRequestError: This model's maximum context length is X tokens
You sent more tokens than the model supports. Count tokens before the request and truncate or chunk as needed. Don’t just catch the error — it wastes an API call.
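A hedged sketch of truncating in token space before the request, assuming tiktoken is available (with a chars/4 fallback when it is not):

```python
def truncate_to_limit(text: str, max_input_tokens: int) -> str:
    try:
        import tiktoken
        enc = tiktoken.get_encoding("cl100k_base")
        tokens = enc.encode(text)
        if len(tokens) <= max_input_tokens:
            return text
        # Cut at a token boundary, then decode back to text
        return enc.decode(tokens[:max_input_tokens])
    except Exception:
        # Heuristic fallback: ~4 characters per token
        return text[: max_input_tokens * 4]
```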
Summaries lose critical details
Your chunk size is too large for the summary prompt. Reduce chunk size to 2000-3000 tokens so the model can focus on each section. Also, add explicit instructions like “preserve all numbers, dates, and named entities.”
Overlap creates duplicate content in outputs
When processing chunks that overlap, the same sentences appear in multiple chunk responses. Deduplicate at the reduce step, or use a “continue from where you left off” prompt pattern instead of overlap.
Token count doesn’t match API error
You’re counting tokens for the input only, but the API includes the system prompt, function definitions, and response tokens in its limit. Reserve at least 1000-2000 tokens for the response and any tool schemas.
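The budget arithmetic is simple enough to make explicit; the function name and default reserve below are illustrative:

```python
def input_budget(context_limit: int, system_prompt_tokens: int,
                 tool_schema_tokens: int, response_reserve: int = 2_000) -> int:
    """Tokens actually available for user/document input.

    The model's context limit covers everything: system prompt, tool
    schemas, conversation history, and the generated response.
    """
    budget = (context_limit - system_prompt_tokens
              - tool_schema_tokens - response_reserve)
    if budget <= 0:
        raise ValueError("fixed overhead already exceeds the context limit")
    return budget
```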
Which Strategy When
Use chunking + retrieval (RAG) when users ask questions about a large document. You only need the relevant chunks, not the whole thing.
Use sliding window for chat applications where recent context matters most and older turns can be dropped.
Use map-reduce when you need to process the entire document — summarization, extraction, or analysis that requires seeing everything.
Use a bigger context window (Gemini 1.5 Pro at 1M tokens) when the document fits and you can afford the latency and cost. Longer contexts are slower and more expensive per request, but they avoid the complexity of chunking entirely.
Related Guides
- How to Build Multi-Turn Chatbots with Conversation Memory
- How to Use Claude Sonnet 4.6’s 1M Token Context Window for Long-Document Reasoning
- How to Route Prompts to the Best LLM with a Semantic Router
- How to Build Prompt Regression Tests with LLM-as-Judge
- How to Build Token-Efficient Prompt Batching with LLM APIs
- How to Build RAG Applications with LangChain and ChromaDB
- How to Compress Prompts and Reduce Token Usage in LLM Applications
- How to Build Retrieval-Augmented Prompts with Contextual Grounding
- How to Build Agentic RAG with Query Routing and Self-Reflection
- How to Build Automatic Prompt Optimization with DSPy