Text summarization falls into two camps: extractive and abstractive. Extractive summarization picks the most important sentences straight from the source text and stitches them together. Abstractive summarization generates new sentences that capture the meaning, much like a human would write a summary from memory. Neither approach is perfect on its own. Extractive methods preserve factual accuracy but produce choppy output. Abstractive models write fluently but sometimes hallucinate details that weren’t in the original.
The best practical approach is a hybrid pipeline: use extractive summarization to cut a long document down to its key sentences, then feed that condensed text into an abstractive model for a polished final summary. This keeps the abstractive model focused on relevant content and avoids hitting token limits on long inputs.
Sumy is a lightweight Python library that implements several classic extractive summarization algorithms. Install it along with its tokenizer data:
```shell
pip install sumy rouge-score transformers torch
python -c "import nltk; nltk.download('punkt_tab')"
```
Here’s how to extract key sentences using both LexRank and LSA:
```python
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer
from sumy.summarizers.lsa import LsaSummarizer

article = """
Renewable energy sources have seen unprecedented growth over the past decade.
Solar panel installations have increased by over 400 percent since 2015, driven
by falling manufacturing costs and government incentives. Wind energy capacity
has doubled in the same period, with offshore wind farms becoming increasingly
viable. Battery storage technology has also advanced significantly, addressing
the intermittency problem that has long plagued renewable energy adoption.
Despite these gains, fossil fuels still account for roughly 60 percent of global
electricity generation. Experts argue that reaching net-zero emissions by 2050
will require annual renewable energy investment to triple from current levels.
The International Energy Agency projects that solar energy alone could supply
up to 40 percent of global electricity by 2040 if current trends continue.
Emerging technologies like green hydrogen and advanced geothermal systems may
further accelerate the transition away from carbon-intensive power sources.
"""

parser = PlaintextParser.from_string(article, Tokenizer("english"))

# LexRank: graph-based ranking of sentence importance
lexrank = LexRankSummarizer()
lexrank_summary = lexrank(parser.document, sentences_count=3)

print("=== LexRank Summary ===")
for sentence in lexrank_summary:
    print(sentence)

# LSA: Latent Semantic Analysis approach
lsa = LsaSummarizer()
lsa_summary = lsa(parser.document, sentences_count=3)

print("\n=== LSA Summary ===")
for sentence in lsa_summary:
    print(sentence)
```
LexRank builds a graph of sentence similarity and ranks sentences the way PageRank ranks web pages. LSA uses singular value decomposition to find the most semantically significant sentences. Both work well for different text types. LexRank tends to pick more diverse sentences, while LSA favors sentences that cover the dominant topics.
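To make the LexRank idea concrete, here is a toy version of the algorithm: build a sentence-similarity graph, then run the PageRank update over it until the scores settle. This sketch uses plain bag-of-words cosine similarity instead of Sumy's TF-IDF weighting, so treat it as an illustration of the mechanism, not a reimplementation of the library.

```python
import math
from collections import Counter


def cosine_sim(a: str, b: str) -> float:
    """Bag-of-words cosine similarity between two sentences."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0


def toy_lexrank(sentences, damping=0.85, iterations=30):
    """Score sentences with the PageRank update on their similarity graph."""
    n = len(sentences)
    # Similarity graph with no self-edges
    sim = [[cosine_sim(s, t) if i != j else 0.0
            for j, t in enumerate(sentences)] for i, s in enumerate(sentences)]
    # Row-normalize so each sentence distributes its score to its neighbors
    totals = [sum(row) for row in sim]
    sim = [[v / totals[i] if totals[i] else 0.0 for v in row]
           for i, row in enumerate(sim)]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [(1 - damping) / n
                  + damping * sum(scores[j] * sim[j][i] for j in range(n))
                  for i in range(n)]
    return scores


sentences = [
    "Solar power is growing fast.",
    "Solar power growth is fast worldwide.",
    "My cat sleeps all day.",
]
print(toy_lexrank(sentences))  # the two related sentences outrank the off-topic one
```

Sentences that many other sentences resemble accumulate score, which is exactly why LexRank favors central, representative sentences.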
The `sentences_count` parameter controls how many sentences to extract. For a hybrid pipeline, you want enough sentences to preserve key information but few enough to stay under the abstractive model's token limit. Three to five sentences work well for articles under 1,000 words.
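Rather than hard-coding the count, you can scale it with document length. The helper below is a rough heuristic of my own (not part of Sumy): roughly one extracted sentence per 200 words, clamped to a sensible range.

```python
def pick_sentence_count(text: str, words_per_sentence: int = 200,
                        minimum: int = 3, maximum: int = 8) -> int:
    """Heuristic: about one extracted sentence per 200 words, clamped."""
    words = len(text.split())
    return max(minimum, min(maximum, words // words_per_sentence))


print(pick_sentence_count("word " * 150))   # 3 (short text hits the floor)
print(pick_sentence_count("word " * 1200))  # 6
```

Tune `words_per_sentence` to your corpus; dense technical text usually deserves a lower divisor than news copy.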
Hugging Face Transformers ships pre-trained summarization models that you can run with a single `pipeline` call. `facebook/bart-large-cnn` is trained on CNN/DailyMail articles and produces solid news-style summaries:
```python
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

text = " ".join(str(s) for s in lexrank_summary)

result = summarizer(
    text,
    max_length=80,
    min_length=20,
    do_sample=False,
)

print("=== Abstractive Summary ===")
print(result[0]["summary_text"])
```
The `do_sample=False` flag uses beam search for deterministic output. Set `do_sample=True` with a `temperature` parameter if you want more creative summaries, though that increases hallucination risk for factual content.
If you want shorter, punchier summaries, swap in `google/pegasus-xsum` instead. It was trained on BBC article summaries that condense stories into a single sentence:
```python
xsum_summarizer = pipeline("summarization", model="google/pegasus-xsum")

xsum_result = xsum_summarizer(
    article,
    max_length=60,
    min_length=10,
    do_sample=False,
)

print("=== PEGASUS XSum Summary ===")
print(xsum_result[0]["summary_text"])
```
## Building a Hybrid Pipeline
The real win comes from chaining extractive and abstractive stages. The extractive step strips out filler and keeps only the most informative sentences. The abstractive step rewrites those sentences into a coherent, fluent summary. This is especially useful for long documents where feeding the full text to BART or PEGASUS would exceed the 1024-token input limit.
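You can check whether a document actually fits before deciding to run the extractive stage by counting tokens with the model's own tokenizer. The `fits_in_model` helper below is my own convenience wrapper, not a Transformers API:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")


def fits_in_model(text: str, limit: int = 1024) -> bool:
    """Return True if the tokenized text fits within the model's input window."""
    n_tokens = len(tokenizer(text)["input_ids"])
    return n_tokens <= limit


print(fits_in_model("A short paragraph about renewable energy."))
```

Tokenizing an over-long input only prints a warning, so the check itself is safe; it is feeding that input to the model that fails.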
```python
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lex_rank import LexRankSummarizer
from transformers import pipeline


def hybrid_summarize(
    text: str,
    extractive_sentences: int = 5,
    max_length: int = 100,
    min_length: int = 30,
) -> dict:
    """Two-stage summarization: extractive reduction then abstractive polish."""
    # Stage 1: extractive step pulls out the top sentences
    parser = PlaintextParser.from_string(text, Tokenizer("english"))
    lexrank = LexRankSummarizer()
    extracted = lexrank(parser.document, sentences_count=extractive_sentences)
    extractive_text = " ".join(str(s) for s in extracted)

    # Stage 2: abstractive step generates a fluent summary
    summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
    result = summarizer(
        extractive_text,
        max_length=max_length,
        min_length=min_length,
        do_sample=False,
    )

    return {
        "extractive": extractive_text,
        "abstractive": result[0]["summary_text"],
    }


long_document = """
The development of large language models has transformed natural language
processing research and applications. Models like GPT-4, Claude, and Gemini
can perform tasks ranging from translation to code generation with minimal
task-specific training. However, these models require enormous computational
resources to train, with estimates suggesting that training a frontier model
costs upwards of 100 million dollars. This has raised concerns about the
concentration of AI capabilities in a small number of well-funded organizations.
Open-source alternatives like LLaMA and Mistral have emerged to democratize
access, though they typically lag behind proprietary models on complex reasoning
benchmarks. Fine-tuning techniques such as LoRA and QLoRA have made it feasible
to adapt these open models to specific domains using consumer hardware. The
research community continues to explore more efficient training methods,
including mixture-of-experts architectures and distillation approaches that
transfer knowledge from large models to smaller, more deployable ones.
Meanwhile, concerns about safety, alignment, and misuse have prompted calls
for regulation and industry self-governance frameworks.
"""

output = hybrid_summarize(long_document, extractive_sentences=4, max_length=80)

print("=== Extractive Stage ===")
print(output["extractive"])
print("\n=== Final Abstractive Summary ===")
print(output["abstractive"])
```
In production, you’d load the Transformers pipeline once at startup and reuse it across calls rather than creating it inside the function. The example above keeps things self-contained for clarity.
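One simple way to do that is to memoize pipeline construction so each model loads at most once per process, for example with `functools.lru_cache` (a sketch; a long-running service might instead build the pipeline at module import or in an app-startup hook):

```python
from functools import lru_cache

from transformers import pipeline


@lru_cache(maxsize=2)
def get_summarizer(model_name: str = "facebook/bart-large-cnn"):
    """Build the pipeline once per model name; later calls return the cached object."""
    return pipeline("summarization", model=model_name)


# First call loads the model; every later call with the same name is free:
# summary = get_summarizer()(long_text, max_length=80, do_sample=False)
```

Because the cache key is the model name, you can serve BART and PEGASUS side by side without reloading either.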
## Evaluating Summary Quality with ROUGE
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures overlap between a generated summary and a reference summary. ROUGE-1 counts unigram overlap, ROUGE-2 counts bigram overlap, and ROUGE-L measures the longest common subsequence.
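It helps to see what the metric actually computes. For ROUGE-1, recall is the fraction of reference unigrams that appear in the candidate, precision is the fraction of candidate unigrams that appear in the reference, and the F-measure is their harmonic mean. A minimal hand-rolled version (no stemming, unlike the library used below):

```python
from collections import Counter


def rouge1(candidate: str, reference: str) -> dict:
    """ROUGE-1 precision/recall/F1 via clipped unigram overlap (no stemming)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # per-word counts clipped to the minimum
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


print(rouge1("solar power grew fast", "solar power grew very fast"))
# precision 1.0 (every candidate word is in the reference), recall 0.8
```

ROUGE-2 is the same computation over bigrams; ROUGE-L replaces the overlap count with the length of the longest common subsequence.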
```python
from rouge_score import rouge_scorer

reference = (
    "Renewable energy has grown rapidly, with solar up 400 percent and wind "
    "capacity doubled since 2015. Battery storage improvements help address "
    "intermittency. Reaching net-zero by 2050 requires tripling annual "
    "renewable investment."
)

# Compare extractive vs abstractive summaries against the reference
extractive_summary = (
    "Solar panel installations have increased by over 400 percent since 2015. "
    "Wind energy capacity has doubled in the same period. "
    "Battery storage technology has also advanced significantly."
)

abstractive_summary = (
    "Solar installations have surged over 400 percent since 2015, while wind "
    "energy capacity has doubled. Battery storage advances are helping address "
    "the intermittency challenge facing renewable energy adoption."
)

scorer = rouge_scorer.RougeScorer(
    ["rouge1", "rouge2", "rougeL"], use_stemmer=True
)

for label, summary in [
    ("Extractive", extractive_summary),
    ("Abstractive", abstractive_summary),
]:
    scores = scorer.score(reference, summary)
    print(f"\n=== {label} ROUGE Scores ===")
    for metric, values in scores.items():
        print(f"  {metric}: precision={values.precision:.3f} "
              f"recall={values.recall:.3f} fmeasure={values.fmeasure:.3f}")
```
Higher ROUGE-2 and ROUGE-L scores generally correlate with better summaries. Abstractive summaries often score slightly lower on ROUGE because they rephrase things, even when humans prefer them. Use ROUGE as a sanity check, not as the sole quality metric. For production systems, pair it with human evaluation or an LLM-as-judge approach.
## Common Errors and Fixes
`LookupError: Resource punkt_tab not found` – Sumy depends on NLTK's sentence tokenizer. Run `python -c "import nltk; nltk.download('punkt_tab')"` before using Sumy for the first time.
`IndexError: index out of range` from Transformers – This happens when the input text exceeds the model's maximum token length (1024 tokens for BART, 512 for PEGASUS-XSum). Use the extractive stage to trim the input, or truncate explicitly with `truncation=True` in the pipeline call.
Summaries repeat themselves – BART sometimes generates repetitive phrases. Add `no_repeat_ngram_size=3` to the pipeline call to prevent any 3-gram from appearing twice in the output:
```python
result = summarizer(
    text,
    max_length=100,
    min_length=30,
    do_sample=False,
    no_repeat_ngram_size=3,
)
```
Empty extractive output – If `sentences_count` exceeds the number of sentences in the document, Sumy returns all available sentences without error. This is fine, but be aware that asking for 10 sentences from a 3-sentence paragraph just gives you the full text back.
Slow first inference – The first call to a Transformers pipeline downloads and loads the model. For `facebook/bart-large-cnn` this is about 1.6 GB. Subsequent calls reuse the cached model. In production, load the pipeline at startup time, not per-request.