You have a 200-page contract, a quarterly earnings report, or a research paper dump, and you need a summary. You paste the whole thing into an LLM and get this back:

openai.BadRequestError: This model's maximum context length is 128000 tokens.
However, your messages resulted in 143291 tokens. Please reduce the length of the messages.

The document is too long. Even models with 128k context windows hit their ceiling on real-world documents, and stuffing everything into one prompt degrades quality anyway – LLMs lose track of details buried in the middle of long inputs (the “lost in the middle” problem).

The fix is map-reduce summarization: split the document into chunks, summarize each chunk independently (the “map” step), then combine those summaries into a final summary (the “reduce” step). This works on documents of any length, parallelizes well, and produces better results than cramming everything into one call.

The Full Pipeline

Install the dependencies:

pip install openai tiktoken

Here is the complete map-reduce summarizer. It counts tokens properly, chunks the document, summarizes each chunk in parallel, and collapses the results.

import tiktoken
from openai import OpenAI
from concurrent.futures import ThreadPoolExecutor, as_completed

client = OpenAI()
MODEL = "gpt-4o-mini"
ENCODING = tiktoken.encoding_for_model(MODEL)

# Keep chunks well under the context limit.
# gpt-4o-mini has 128k context, but smaller chunks produce better summaries.
MAX_CHUNK_TOKENS = 3000
SUMMARY_PROMPT = "Summarize the following text. Preserve key facts, numbers, and names."


def count_tokens(text: str) -> int:
    return len(ENCODING.encode(text))


def chunk_text(text: str, max_tokens: int = MAX_CHUNK_TOKENS) -> list[str]:
    """Split text into chunks that fit within the token limit.

    Uses paragraph boundaries first, falls back to sentence splitting.
    """
    paragraphs = text.split("\n\n")
    chunks = []
    current_chunk = []
    current_tokens = 0

    for para in paragraphs:
        para_tokens = count_tokens(para)

        # Single paragraph exceeds limit -- split by sentences
        if para_tokens > max_tokens:
            if current_chunk:
                chunks.append("\n\n".join(current_chunk))
                current_chunk = []
                current_tokens = 0

            sentences = para.replace(". ", ".\n").split("\n")
            for sentence in sentences:
                sent_tokens = count_tokens(sentence)
                if current_tokens + sent_tokens > max_tokens and current_chunk:
                    chunks.append(" ".join(current_chunk))
                    current_chunk = []
                    current_tokens = 0
                current_chunk.append(sentence)
                current_tokens += sent_tokens

            # Flush leftover sentences now so they are not later joined
            # with whole paragraphs by "\n\n"
            if current_chunk:
                chunks.append(" ".join(current_chunk))
                current_chunk = []
                current_tokens = 0
            continue

        if current_tokens + para_tokens > max_tokens and current_chunk:
            chunks.append("\n\n".join(current_chunk))
            current_chunk = []
            current_tokens = 0

        current_chunk.append(para)
        current_tokens += para_tokens

    if current_chunk:
        chunks.append("\n\n".join(current_chunk))

    return chunks


def summarize_chunk(chunk: str) -> str:
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": SUMMARY_PROMPT},
            {"role": "user", "content": chunk},
        ],
        temperature=0.3,
    )
    return response.choices[0].message.content


def map_reduce_summarize(text: str, max_workers: int = 5) -> str:
    chunks = chunk_text(text)
    print(f"Split into {len(chunks)} chunks")

    # Map: summarize each chunk in parallel
    summaries = [None] * len(chunks)
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_idx = {
            executor.submit(summarize_chunk, chunk): i
            for i, chunk in enumerate(chunks)
        }
        for future in as_completed(future_to_idx):
            idx = future_to_idx[future]
            summaries[idx] = future.result()

    combined = "\n\n".join(summaries)

    # Reduce: if the combined summaries are still too long for one call,
    # recurse; otherwise collapse them in a single final pass
    if count_tokens(combined) > MAX_CHUNK_TOKENS:
        return map_reduce_summarize(combined, max_workers=max_workers)

    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {
                "role": "system",
                "content": (
                    "You are given summaries of different sections of a document. "
                    "Combine them into a single coherent summary. Remove redundancy. "
                    "Preserve all key facts, numbers, and conclusions."
                ),
            },
            {"role": "user", "content": combined},
        ],
        temperature=0.3,
    )
    return response.choices[0].message.content


# Usage
with open("long_report.txt") as f:
    document = f.read()

print(f"Document length: {count_tokens(document)} tokens")
summary = map_reduce_summarize(document)
print(summary)

Run it on a 50-page document and you will see output like:

Document length: 47832 tokens
Split into 16 chunks
Document length: 2841 tokens
Split into 1 chunks

The first pass produces 16 chunk summaries. The combined summaries are short enough that the second pass collapses them into one final summary.

Why Not Just Use a Bigger Context Window?

Tempting, but there are two problems. First, even 128k-token models reject documents that exceed their limit. Second, long-context performance degrades. Research shows LLMs struggle with information in the middle of very long prompts – facts at the start and end get recalled well, but details buried at position 40k-80k get missed. Map-reduce avoids this by giving each chunk the model’s full attention.

Cost matters too. Summarizing a 50k-token document in one shot with gpt-4o costs roughly $0.13 in input tokens alone. Map-reduce with gpt-4o-mini on the same document costs about $0.01 total, because each chunk is small and the model is 20x cheaper.
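The arithmetic behind those figures is simple to check. This back-of-envelope sketch assumes published per-million-token input prices at the time of writing (the price constants and the 5% summary-size estimate are assumptions – verify against current pricing before relying on them):

```python
# Assumed USD prices per 1M input tokens -- check current rates.
GPT_4O_INPUT_PER_M = 2.50
GPT_4O_MINI_INPUT_PER_M = 0.15

doc_tokens = 50_000

# Single shot with gpt-4o: the whole document is read once as input.
single_shot_cost = doc_tokens / 1_000_000 * GPT_4O_INPUT_PER_M

# Map-reduce with gpt-4o-mini: the map step reads the document once
# across all chunks; the reduce step re-reads only the chunk summaries
# (assumed here to be ~5% of the original length).
map_cost = doc_tokens / 1_000_000 * GPT_4O_MINI_INPUT_PER_M
reduce_cost = (doc_tokens * 0.05) / 1_000_000 * GPT_4O_MINI_INPUT_PER_M

print(f"gpt-4o single shot:     ${single_shot_cost:.3f}")
print(f"gpt-4o-mini map-reduce: ${map_cost + reduce_cost:.4f}")
```

Output tokens add a little on top, but the input side dominates for summarization, so the order-of-magnitude gap holds.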

Map-Reduce vs. Refine

There are two main strategies for multi-chunk summarization.

Map-reduce summarizes every chunk independently, then merges the summaries. It is fast because the map step parallelizes. The downside is that each chunk summary is created without context from other chunks, so cross-references between sections can get lost.

Refine processes chunks sequentially. It summarizes chunk 1, then feeds that summary alongside chunk 2 to produce an updated summary, and so on. This preserves more context across the document, but it is strictly sequential – you cannot parallelize it, and the number of LLM calls is the same. On a 20-chunk document, refine takes roughly 4-5x longer than map-reduce.
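The refine loop can be sketched in a few lines. `refine_summarize` and `summarize_step` are illustrative names, not part of the pipeline above; in practice `summarize_step` would be an LLM call whose prompt contains both the running summary and the next chunk:

```python
def refine_summarize(chunks: list[str], summarize_step) -> str:
    """Fold chunks into one running summary, left to right.

    summarize_step(running_summary, chunk) -> updated summary. Injecting
    the step function keeps the control flow testable without an API key.
    """
    running = summarize_step("", chunks[0])
    for chunk in chunks[1:]:
        # Each step sees the summary-so-far, so cross-chunk references
        # survive -- but no step can start before the previous one ends.
        running = summarize_step(running, chunk)
    return running
```

The strictly sequential data dependency in that loop is exactly why refine cannot be parallelized.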

Use map-reduce as the default. Switch to refine only when cross-section context is critical, like summarizing a legal contract where clause 12 modifies clause 3.

Chunking Strategies That Actually Matter

The chunking function above splits on paragraph boundaries. This works well for most documents because paragraphs are natural semantic units. Here are the options ranked by effectiveness for summarization:

Paragraph-boundary splitting (used above) – respects the author’s own structure. Best general-purpose choice.

Recursive character splitting – LangChain’s RecursiveCharacterTextSplitter tries \n\n, then \n, then spaces. Good fallback for messy text.

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=3000,
    chunk_overlap=200,
    length_function=count_tokens,  # Use tiktoken, not len()
)
chunks = splitter.split_text(document)

Semantic chunking – groups sentences by embedding similarity. Higher quality but adds latency and cost from the embedding calls. Overkill for straightforward summarization.
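A minimal sketch of the idea, with the embedding call injected so it can run offline – the greedy adjacent-similarity strategy and the 0.7 threshold are illustrative choices, and `embed` stands in for an embedding API call:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def semantic_chunks(sentences: list[str], embed, threshold: float = 0.7) -> list[str]:
    """Start a new chunk whenever the next sentence's embedding drifts
    below `threshold` similarity from the previous sentence."""
    chunks = [[sentences[0]]]
    prev_vec = embed(sentences[0])
    for sent in sentences[1:]:
        vec = embed(sent)
        if cosine(prev_vec, vec) < threshold:
            chunks.append([])  # topic shift: start a new chunk
        chunks[-1].append(sent)
        prev_vec = vec
    return [" ".join(c) for c in chunks]
```

Every sentence costs an embedding call, which is where the extra latency and cost come from.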

Fixed-size splitting (every N tokens) – the worst option. It cuts mid-sentence, mid-paragraph, and produces noticeably worse summaries. Avoid it.

One important detail: always pass a token-aware length_function to your splitter. Using Python’s len() counts characters, not tokens, and you will either overflow the context window or waste capacity with chunks that are too small.

Common Errors and Fixes

Token limit exceeded despite chunking

openai.BadRequestError: This model's maximum context length is 128000 tokens.

Your chunks fit the limit on their own, but the system prompt plus chunk content exceeds it. Leave headroom – if the model has 128k tokens, keep chunks under 120k to account for the prompt and response tokens. The pipeline above uses 3k-token chunks specifically to avoid this.
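One way to make the headroom explicit is to compute the chunk budget instead of guessing. This helper is a sketch – the function name and the default reserve sizes are illustrative, not values from the pipeline above:

```python
def chunk_budget(context_limit: int, prompt_tokens: int,
                 max_completion_tokens: int = 1024, margin: int = 256) -> int:
    """Tokens left for document content in a single call.

    Subtracts the system prompt, the space reserved for the model's
    reply, and a small safety margin for message framing overhead.
    """
    return context_limit - prompt_tokens - max_completion_tokens - margin

# A 128k model with a 50-token system prompt leaves ~126k for content:
budget = chunk_budget(128_000, 50)
```

Size your chunks against that budget rather than the raw context limit and this error disappears.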

Rate limit errors on parallel calls

1
openai.RateLimitError: Rate limit reached for gpt-4o-mini on tokens per min (TPM)

Reduce max_workers or add exponential backoff:

import time
from openai import RateLimitError

def summarize_chunk_with_retry(chunk: str, max_retries: int = 3) -> str:
    for attempt in range(max_retries):
        try:
            return summarize_chunk(chunk)
        except RateLimitError:
            wait = 2 ** attempt
            print(f"Rate limited, waiting {wait}s...")
            time.sleep(wait)
    raise RuntimeError("Failed after max retries")

Summaries are too generic

A generic prompt like “summarize this” produces bland output. Be specific about what matters:

SUMMARY_PROMPT = (
    "Summarize the following section of a financial report. "
    "Include all specific numbers, percentages, revenue figures, "
    "and named entities. Do not generalize."
)

Domain-specific prompts consistently outperform generic ones. Tell the model what to preserve, and it will.

Lost information across chunks

If your document has a table of abbreviations on page 1 and uses those abbreviations throughout, the later chunks will not have that context. Fix this by prepending shared context to every chunk:

preamble = "Abbreviations: ROI = Return on Investment, ARR = Annual Recurring Revenue, ..."

def summarize_chunk_with_context(chunk: str, context: str = preamble) -> str:
    full_input = f"Context:\n{context}\n\nText to summarize:\n{chunk}"
    return summarize_chunk(full_input)

When to Skip Map-Reduce Entirely

If your document fits within the model’s context window with room to spare, just send the whole thing. There is no benefit to chunking a 5-page memo for a model that handles 128k tokens. Check first:

token_count = count_tokens(document)
if token_count < 100_000:  # Comfortable margin for 128k model
    summary = summarize_chunk(document)  # Single-pass, no chunking needed
else:
    summary = map_reduce_summarize(document)

Map-reduce adds latency, cost, and complexity. Use it when you need it, not as a default for every document.