You have a 200-page contract, a quarterly earnings report, or a research paper dump, and you need a summary. You paste the whole thing into an LLM and get back a context-length error: the document is too long. Even models with 128k context windows hit their ceiling on real-world documents, and stuffing everything into one prompt degrades quality anyway – LLMs lose track of details buried in the middle of long inputs (the “lost in the middle” problem).
The fix is map-reduce summarization: split the document into chunks, summarize each chunk independently (the “map” step), then combine those summaries into a final summary (the “reduce” step). This works on documents of any length, parallelizes well, and produces better results than cramming everything into one call.
The Full Pipeline
Install the dependencies:
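Assuming the pipeline uses the OpenAI SDK for the LLM calls and tiktoken for token counting (the package names are inferred from the surrounding text, not stated in the source):

```shell
pip install openai tiktoken
```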
Here is the complete map-reduce summarizer. It counts tokens properly, chunks the document, summarizes each chunk in parallel, and collapses the results.
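A sketch of such a pipeline, assuming the OpenAI SDK, tiktoken for token-aware counting, and gpt-4o-mini as the map/reduce model – the model name, chunk size, and prompt wording are all illustrative choices, not fixed ones:

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_by_paragraphs(text, max_tokens, count_tokens):
    """Greedily pack whole paragraphs into chunks of at most max_tokens."""
    chunks, current, current_tokens = [], [], 0
    for para in text.split("\n\n"):
        n = count_tokens(para)
        if current and current_tokens + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_tokens = [], 0
        current.append(para)
        current_tokens += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


def summarize_text(client, text, model="gpt-4o-mini"):
    """One LLM call: summarize a single chunk (or a batch of summaries)."""
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": "Summarize the text in 3-5 sentences. "
                        "Preserve names, numbers, and dates."},
            {"role": "user", "content": text},
        ],
    )
    return resp.choices[0].message.content


def map_reduce_summarize(client, document, max_tokens=3000, count_tokens=None):
    if count_tokens is None:
        import tiktoken  # token-aware length function, not len()
        enc = tiktoken.get_encoding("cl100k_base")
        count_tokens = lambda s: len(enc.encode(s))
    # Map: summarize each chunk independently, in parallel.
    chunks = chunk_by_paragraphs(document, max_tokens, count_tokens)
    with ThreadPoolExecutor(max_workers=8) as pool:
        summaries = list(pool.map(lambda c: summarize_text(client, c), chunks))
    combined = "\n\n".join(summaries)
    # Reduce: if the combined summaries are still too long, recurse on them.
    if count_tokens(combined) > max_tokens:
        return map_reduce_summarize(client, combined, max_tokens, count_tokens)
    return summarize_text(client, combined)


# usage (requires OPENAI_API_KEY):
#   from openai import OpenAI
#   print(map_reduce_summarize(OpenAI(), open("document.txt").read()))
```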
Run it on a 50-page document and the first pass produces 16 chunk summaries. The combined summaries are short enough that the second pass collapses them into one final summary.
Why Not Just Use a Bigger Context Window?
Tempting, but two problems. First, even 128k-token models reject documents that exceed their limit. Second, long-context performance degrades. Research shows LLMs struggle with information in the middle of very long prompts – facts at the start and end get recalled well, but details buried at position 40k-80k get missed. Map-reduce avoids this by giving each chunk the model’s full attention.
Cost matters too. Summarizing a 50k-token document in one shot with gpt-4o costs roughly $0.13 in input tokens alone. Map-reduce with gpt-4o-mini on the same document costs about $0.01 total, because each chunk is small and the model is 20x cheaper.
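The arithmetic, using the per-million-token prices those estimates assume ($2.50 for gpt-4o input, $0.15 for gpt-4o-mini input; check current pricing before relying on them):

```python
# Back-of-envelope input-token cost for a 50k-token document.
doc_tokens = 50_000
single_shot = doc_tokens / 1e6 * 2.50   # gpt-4o, whole document in one call
map_pass = doc_tokens / 1e6 * 0.15      # gpt-4o-mini map pass (input only)
print(single_shot, map_pass)            # 0.125 vs 0.0075
```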
Map-Reduce vs. Refine
There are two main strategies for multi-chunk summarization.
Map-reduce summarizes every chunk independently, then merges the summaries. It is fast because the map step parallelizes. The downside is that each chunk summary is created without context from other chunks, so cross-references between sections can get lost.
Refine processes chunks sequentially. It summarizes chunk 1, then feeds that summary alongside chunk 2 to produce an updated summary, and so on. This preserves more context across the document, but it is strictly sequential – you cannot parallelize it, and the number of LLM calls is the same. On a 20-chunk document, refine takes roughly 4-5x longer than map-reduce.
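The refine loop can be sketched as follows, where `summarize` is a placeholder for any single-call summarization function (the prompt wording is illustrative):

```python
def refine_summarize(summarize, chunks):
    """Refine strategy: fold each chunk into a running summary, sequentially.
    `summarize(prompt)` is any callable that returns a model completion."""
    summary = summarize(f"Summarize this text:\n\n{chunks[0]}")
    for chunk in chunks[1:]:
        summary = summarize(
            "Here is the summary so far:\n\n" + summary +
            "\n\nUpdate it to also cover this new section:\n\n" + chunk
        )
    return summary
```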
Use map-reduce as the default. Switch to refine only when cross-section context is critical, like summarizing a legal contract where clause 12 modifies clause 3.
Chunking Strategies That Actually Matter
The chunking function above splits on paragraph boundaries. This works well for most documents because paragraphs are natural semantic units. Here are the options ranked by effectiveness for summarization:
Paragraph-boundary splitting (used above) – respects the author’s own structure. Best general-purpose choice.
Recursive character splitting – LangChain’s RecursiveCharacterTextSplitter tries \n\n, then \n, then spaces. Good fallback for messy text.
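A minimal from-scratch sketch of the same recursive idea (LangChain's implementation adds chunk overlap and more; this version measures length in characters, so in practice substitute a token-aware length function):

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ")):
    """Split on the coarsest separator first; recurse into oversized pieces."""
    if len(text) <= chunk_size:
        return [text]
    if not separators:
        # No separator left: hard-cut into fixed-size slices.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(piece) > chunk_size:
            chunks.extend(recursive_split(piece, chunk_size, rest))
            current = ""
        else:
            current = piece
    if current:
        chunks.append(current)
    return chunks
```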
Semantic chunking – groups sentences by embedding similarity. Higher quality but adds latency and cost from the embedding calls. Overkill for straightforward summarization.
Fixed-size splitting (every N tokens) – the worst option. It cuts mid-sentence, mid-paragraph, and produces noticeably worse summaries. Avoid it.
One important detail: always pass a token-aware length_function to your splitter. Using Python’s len() counts characters, not tokens, and you will either overflow the context window or waste capacity with chunks that are too small.
Common Errors and Fixes
Token limit exceeded despite chunking
Your chunks are sized correctly, but the system prompt plus chunk content exceeds the limit. Leave headroom – if the model has a 128k-token limit, keep the per-call total under 120k to account for the prompt and response tokens. The pipeline above uses 3k-token chunks specifically to avoid this.
Rate limit errors on parallel calls
Reduce max_workers or add exponential backoff:
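A minimal backoff wrapper (it catches bare `Exception` for brevity; in real code catch the SDK's rate-limit error specifically, e.g. the 429 error class your provider's client raises):

```python
import random
import time


def with_backoff(fn, max_retries=5, base_delay=1.0):
    """Call fn(); on failure, retry with exponentially growing sleeps
    plus random jitter to spread out the parallel workers."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)
```

Wrap each per-chunk call, e.g. `pool.map(lambda c: with_backoff(lambda: summarize(c)), chunks)`.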
Summaries are too generic
The default prompt “summarize this” produces bland output. Be specific about what matters:
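For example, a prompt for quarterly earnings reports might look like this – the wording is illustrative, not a fixed recipe:

```python
# A domain-specific system prompt: it names exactly what to preserve.
EARNINGS_PROMPT = (
    "Summarize this section of a quarterly earnings report in 3-5 sentences. "
    "Preserve every dollar amount, percentage, and date. "
    "Call out guidance changes, risks, and forward-looking statements."
)
```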
Domain-specific prompts consistently outperform generic ones. Tell the model what to preserve, and it will.
Lost information across chunks
If your document has a table of abbreviations on page 1 and uses those abbreviations throughout, the later chunks will not have that context. Fix this by prepending shared context to every chunk:
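A sketch, where `shared_context` stands for whatever front matter the later chunks need – an abbreviation table, a glossary, the parties to the contract:

```python
def prepend_shared_context(chunks, shared_context):
    """Prepend document-level context (e.g. the abbreviation table from
    page 1) to every chunk so each map call can resolve references."""
    prefix = f"Reference material for this document:\n{shared_context}\n\n---\n\n"
    return [prefix + chunk for chunk in chunks]
```

This costs extra input tokens per chunk, so keep the shared context short.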
When to Skip Map-Reduce Entirely
If your document fits within the model’s context window with room to spare, just send the whole thing. There is no benefit to chunking a 5-page memo for a model that handles 128k tokens. Check first:
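A sketch of the check; the default chars-divided-by-4 counter is a rough heuristic, so substitute a real tokenizer such as tiktoken when accuracy matters:

```python
def fits_in_context(document, context_limit=128_000, headroom=8_000,
                    count_tokens=None):
    """True when the whole document, plus prompt/response headroom,
    fits in a single call - in which case, skip map-reduce."""
    if count_tokens is None:
        count_tokens = lambda s: len(s) // 4  # rough chars-per-token guess
    return count_tokens(document) <= context_limit - headroom
```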
Map-reduce adds latency, cost, and complexity. Use it when you need it, not as a default for every document.
Related Guides
- How to Build a Text-to-SQL Pipeline with LLMs
- How to Extract Structured Data from PDFs with LLMs
- How to Classify Text with Zero-Shot and Few-Shot LLMs
- How to Build a Text Correction and Grammar Checking Pipeline
- How to Build a Document Chunking Strategy Comparison Pipeline
- How to Build a Text Style Transfer Pipeline with Transformers
- How to Build a Named Entity Recognition Pipeline with spaCy and Transformers
- How to Build a Spell Checking and Autocorrect Pipeline with Python
- How to Build a Sentiment Analysis API with Transformers and FastAPI
- How to Build a Text Classification Pipeline with SetFit