Most summarization approaches fall into two camps: extractive (copy-paste important sentences) and abstractive (generate brand new sentences that capture the meaning). Abstractive is harder but produces more natural, readable summaries. Google’s PEGASUS model was pre-trained specifically for abstractive summarization using a clever gap-sentence generation objective — it masks entire sentences during pre-training and learns to reconstruct them. The result is a model that’s remarkably good at generating concise, coherent summaries out of the box.
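To make the gap-sentence idea concrete, here is a toy sketch (not the real pre-training code). The actual PEGASUS objective selects "important" sentences by scoring each one with ROUGE against the rest of the document; this stand-in simply masks the longest sentence and treats it as the reconstruction target:

```python
def make_gsg_example(text: str, mask_token: str = "<mask>"):
    """Toy illustration of gap-sentence generation: mask one sentence
    and treat it as the sequence the model must reconstruct."""
    sentences = [s.strip() for s in text.split(". ") if s.strip()]
    # PEGASUS scores sentence importance with ROUGE against the rest of
    # the document; picking the longest sentence is a crude stand-in.
    target = max(sentences, key=len)
    masked = ". ".join(mask_token if s == target else s for s in sentences)
    return masked, target


masked_input, target = make_gsg_example(
    "PEGASUS masks sentences. The model learns to reconstruct the missing sentence. Short tail."
)
print(masked_input)  # PEGASUS masks sentences. <mask>. Short tail.
print(target)        # The model learns to reconstruct the missing sentence
```

The pre-training input is the masked document and the training target is the masked-out sentence, which is exactly the shape of the summarization task: read a document, generate a short text.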
Here’s how to build a full summarization pipeline with PEGASUS using Hugging Face Transformers.
Load PEGASUS for Summarization#
The fastest way to get started is with the pipeline API. This handles tokenization, inference, and decoding in one call.
```python
from transformers import pipeline

summarizer = pipeline(
    "summarization",
    model="google/pegasus-xsum",
    tokenizer="google/pegasus-xsum",
)

text = """
Scientists have discovered a new species of deep-sea fish in the Mariana Trench.
The fish, named Pseudoliparis swirei, was found at a depth of approximately 8,000
meters. It has a translucent body and lacks scales, adaptations that help it survive
the extreme pressure of the deep ocean. Researchers believe the species feeds on
small crustaceans found at the bottom of the trench. The discovery sheds light on
the limits of vertebrate life in the deepest parts of the ocean.
"""

result = summarizer(text, max_length=60, min_length=20, num_beams=4)
print(result[0]["summary_text"])
```
The google/pegasus-xsum checkpoint is fine-tuned on the XSum dataset, so it produces single-sentence summaries. If you need multi-sentence summaries, use google/pegasus-large or google/pegasus-cnn_dailymail instead.
For more control over generation, load the model and tokenizer directly:
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/pegasus-xsum"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

summary_ids = model.generate(
    inputs["input_ids"],
    max_length=60,
    min_length=20,
    num_beams=4,
    length_penalty=2.0,
    early_stopping=True,
)

summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(summary)
```
Loading the model manually gives you access to all generation parameters. The length_penalty controls how much the model favors longer outputs — values above 1.0 encourage longer summaries, values below 1.0 encourage shorter ones.
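The effect is easiest to see in the scoring rule itself. In Hugging Face beam search, a finished hypothesis is ranked roughly by the sum of its token log-probabilities divided by `length ** length_penalty`. The sketch below (an illustration, not the library's actual implementation) shows how the penalty flips which of two hypothetical candidates wins:

```python
def beam_score(token_logprobs: list[float], length_penalty: float) -> float:
    # Finished beam hypotheses are ranked by sum(log-probs) / len**penalty;
    # higher (less negative) is better.
    return sum(token_logprobs) / (len(token_logprobs) ** length_penalty)


short = [-0.5] * 10  # hypothetical per-token log-probs, 10 tokens
long = [-0.5] * 20   # same average token quality, 20 tokens

# With length_penalty=1.0 the two candidates tie (same per-token score).
# Above 1.0 the longer candidate wins; below 1.0 the shorter one wins.
for penalty in (0.5, 1.0, 2.0):
    print(penalty, beam_score(short, penalty), beam_score(long, penalty))
```

Because log-probabilities are negative, dividing by a larger power of the length shrinks the magnitude of long hypotheses' scores, which is why values above 1.0 favor longer outputs.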
Summarize Single Documents#
Generation parameters make a big difference in summary quality. Here are the ones you’ll tune most often:
- max_length — Maximum number of tokens in the summary. Set this based on how long you want your output.
- min_length — Prevents the model from generating very short, incomplete summaries.
- num_beams — Beam search width. Higher values explore more candidates but take longer. 4-8 is a good range.
- length_penalty — Controls length preference during beam search. Default is 1.0.
- no_repeat_ngram_size — Prevents the model from repeating the same n-gram. Set to 3 to block trigram repetition.
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/pegasus-cnn_dailymail"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

article = """
The European Space Agency's Euclid telescope has transmitted its first full-color
images of the universe, revealing unprecedented detail of distant galaxies and dark
matter structures. The images cover a patch of sky about 100 times larger than what
the Hubble Space Telescope can capture in a single shot. Scientists say the data will
help map the distribution of dark matter across billions of light-years, offering
new clues about the accelerating expansion of the universe. The Euclid mission,
launched in July 2023, is designed to survey over a third of the sky during its
six-year mission. Researchers from more than 200 institutions across 13 countries
are collaborating on the analysis.
"""

inputs = tokenizer(
    article,
    return_tensors="pt",
    truncation=True,
    max_length=1024,
)

summary_ids = model.generate(
    inputs["input_ids"],
    max_length=128,
    min_length=30,
    num_beams=5,
    length_penalty=1.5,
    no_repeat_ngram_size=3,
    early_stopping=True,
)

summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(summary)
```
The google/pegasus-cnn_dailymail checkpoint produces multi-sentence summaries and handles longer inputs (up to 1024 tokens). Pick the checkpoint that matches your output style — XSum for single-sentence, CNN/DailyMail for multi-sentence.
Handle Long Documents with Chunking#
PEGASUS has a maximum input length (512 tokens for XSum, 1024 for CNN/DailyMail). Anything beyond that gets silently truncated. For long documents, you need to split the text into chunks, summarize each one, then optionally run a second pass to combine those summaries.
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/pegasus-cnn_dailymail"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

MAX_INPUT_TOKENS = 1024


def chunk_text(text: str, max_tokens: int = MAX_INPUT_TOKENS) -> list[str]:
    """Split text into chunks that fit within the model's token limit."""
    sentences = text.replace("\n", " ").split(". ")
    chunks = []
    current_chunk = []
    current_length = 0
    for sentence in sentences:
        sentence = sentence.strip()
        if not sentence:
            continue
        token_count = len(tokenizer.encode(sentence, add_special_tokens=False))
        if current_length + token_count > max_tokens and current_chunk:
            chunks.append(". ".join(current_chunk) + ".")
            current_chunk = [sentence]
            current_length = token_count
        else:
            current_chunk.append(sentence)
            current_length += token_count
    if current_chunk:
        chunks.append(". ".join(current_chunk) + ".")
    return chunks


def summarize_text(text: str, max_length: int = 128) -> str:
    """Summarize a single chunk of text."""
    inputs = tokenizer(
        text, return_tensors="pt", truncation=True, max_length=MAX_INPUT_TOKENS
    )
    summary_ids = model.generate(
        inputs["input_ids"],
        max_length=max_length,
        min_length=30,
        num_beams=4,
        length_penalty=1.5,
        no_repeat_ngram_size=3,
        early_stopping=True,
    )
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)


def summarize_long_document(text: str) -> str:
    """Summarize a long document using a chunk-then-merge strategy."""
    chunks = chunk_text(text)
    if len(chunks) == 1:
        return summarize_text(chunks[0])
    # Summarize each chunk
    chunk_summaries = [summarize_text(chunk) for chunk in chunks]
    print(f"Generated {len(chunk_summaries)} chunk summaries")
    # Combine chunk summaries and summarize again
    combined = " ".join(chunk_summaries)
    final_summary = summarize_text(combined, max_length=200)
    return final_summary


# Example with a long document
long_document = "Your very long document text goes here. " * 200
result = summarize_long_document(long_document)
print(result)
```
This two-pass approach — summarize chunks, then summarize the summaries — works well for documents up to around 10,000 words. For truly massive documents, you might need three passes or a hierarchical strategy. Keep in mind that each pass introduces some information loss, so the tradeoff between compression and accuracy matters.
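The multi-pass idea generalizes to a simple reduce loop. The sketch below is an illustrative skeleton, not a tuned implementation: `summarize` is any text-to-summary function (for example, a helper like summarize_text above), and `group_size` controls how many summaries are merged per pass.

```python
def hierarchical_summarize(chunks, summarize, group_size=4):
    """Repeatedly summarize groups of summaries until one remains."""
    summaries = [summarize(chunk) for chunk in chunks]
    while len(summaries) > 1:
        # Merge neighboring summaries into groups, then summarize each group.
        groups = [
            " ".join(summaries[i:i + group_size])
            for i in range(0, len(summaries), group_size)
        ]
        summaries = [summarize(group) for group in groups]
    return summaries[0]
```

Each iteration shrinks the list by roughly a factor of group_size, so even very long documents converge in a few passes, at the cost of compounding information loss at every level.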
Evaluate Summary Quality with ROUGE#
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures overlap between your generated summary and a reference summary. The three scores you’ll see most often:
- ROUGE-1 — Unigram overlap (individual words)
- ROUGE-2 — Bigram overlap (two-word phrases)
- ROUGE-L — Longest common subsequence
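Before reaching for the library, it helps to see the metric computed by hand. This simplified sketch of ROUGE-1 skips the stemming and tokenization rules the real rouge-score package applies, but shows exactly what unigram precision, recall, and F-measure mean:

```python
def rouge1_f(reference: str, generated: str) -> float:
    """Simplified ROUGE-1 F-measure: unigram overlap between two texts."""
    ref_words = reference.lower().split()
    gen_words = generated.lower().split()
    # Clipped counts: each reference occurrence can be matched at most once.
    overlap = sum(min(ref_words.count(w), gen_words.count(w)) for w in set(gen_words))
    if overlap == 0:
        return 0.0
    precision = overlap / len(gen_words)  # overlapping words / generated length
    recall = overlap / len(ref_words)     # overlapping words / reference length
    return 2 * precision * recall / (precision + recall)


# 5 of 6 unigrams overlap in both directions, so F ≈ 0.833
print(rouge1_f("the cat sat on the mat", "the cat lay on the mat"))
```

ROUGE-2 is the same computation over bigrams, and ROUGE-L replaces the overlap count with the length of the longest common subsequence.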
Install the evaluation library:
```shell
pip install rouge-score
```
Then compute scores:
```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(
    ["rouge1", "rouge2", "rougeL"],
    use_stemmer=True,
)

reference = "The Euclid telescope sent its first images showing distant galaxies and dark matter."
generated = "ESA's Euclid telescope has transmitted first color images revealing galaxies and dark matter structures."

scores = scorer.score(reference, generated)
for metric, score in scores.items():
    print(f"{metric}: precision={score.precision:.3f}, recall={score.recall:.3f}, fmeasure={score.fmeasure:.3f}")
```
Output looks something like:
```
rouge1: precision=0.643, recall=0.692, fmeasure=0.667
rouge2: precision=0.308, recall=0.333, fmeasure=0.320
rougeL: precision=0.500, recall=0.538, fmeasure=0.519
```
For batch evaluation across many documents, compute average scores:
```python
from rouge_score import rouge_scorer
import numpy as np

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

references = [
    "Scientists found a new deep-sea fish species in the Mariana Trench.",
    "The telescope captured detailed images of distant galaxies.",
]
generated_summaries = [
    "A new species of fish was discovered in the deep sea.",
    "The Euclid telescope has taken its first images of galaxies.",
]

all_scores = {"rouge1": [], "rouge2": [], "rougeL": []}
for ref, gen in zip(references, generated_summaries):
    scores = scorer.score(ref, gen)
    for metric in all_scores:
        all_scores[metric].append(scores[metric].fmeasure)

for metric, values in all_scores.items():
    print(f"{metric}: {np.mean(values):.3f}")
```
A ROUGE-1 F-measure above 0.40 is decent for abstractive summarization. ROUGE-2 above 0.20 is solid. These numbers vary heavily by dataset and domain, so always compare against a baseline rather than targeting absolute thresholds.
Common Errors and Fixes#
Token indices sequence length is longer than the specified maximum sequence length
This warning means your input exceeds the model’s maximum length. The tokenizer truncates automatically when you pass truncation=True, but if you’re not using that flag, you’ll get this warning and potentially incorrect results.
Fix: Always pass truncation=True and max_length when tokenizing.
```python
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
```
RuntimeError: CUDA out of memory
PEGASUS-large has about 568M parameters. On GPUs with less than 8GB VRAM, you’ll hit memory issues during beam search because each beam holds a separate copy of the sequence.
Fix: Reduce num_beams, use a smaller batch size, or switch to CPU for inference. You can also use half-precision:
```python
model = AutoModelForSeq2SeqLM.from_pretrained("google/pegasus-xsum").half().to("cuda")
```
ValueError: Unable to create tensor, you should probably activate truncation
This happens when you pass raw text that exceeds the model’s maximum length without enabling truncation. The tokenizer can’t create a tensor larger than the model expects.
Fix: Same as above — add truncation=True to your tokenizer call. For long documents, use the chunking approach from the section above instead of relying on truncation, which cuts off content without any intelligence about what it’s dropping.
Summaries are repetitive or contain repeated phrases
PEGASUS sometimes generates repetitive output, especially with greedy decoding or low beam counts.
Fix: Set no_repeat_ngram_size=3 to block trigram repetition, and increase num_beams to give the model more candidates to choose from.