Paraphrasing is harder than it looks. You need to rewrite text so the meaning stays the same but the words change enough to be useful – for data augmentation, plagiarism removal, or generating training pairs for NLP models. Rule-based synonym swapping produces garbage. Fine-tuned seq2seq models produce real paraphrases.
Here’s how to build a paraphrasing pipeline with T5 and PEGASUS using Hugging Face Transformers, generate multiple candidates, and pick the best one.
## Quick Start with the Pipeline API
The fastest path to paraphrases is transformers.pipeline. The humarin/chatgpt_paraphraser_on_T5_base checkpoint is a T5 model fine-tuned specifically for paraphrasing.
```python
from transformers import pipeline

paraphraser = pipeline(
    "text2text-generation",
    model="humarin/chatgpt_paraphraser_on_T5_base",
)

text = "The quick brown fox jumped over the lazy dog near the riverbank."

result = paraphraser(
    f"paraphrase: {text}",
    max_length=128,
    num_beams=5,
    num_return_sequences=3,
)

for i, r in enumerate(result):
    print(f"Candidate {i+1}: {r['generated_text']}")
```
The "paraphrase: " prefix is required – this T5 model expects a task prefix to know what you’re asking for. Setting num_return_sequences=3 with beam search gives you multiple candidates to choose from.
## PEGASUS-Based Paraphrasing
PEGASUS was originally designed for summarization, but tuner007/pegasus_paraphrase is fine-tuned for paraphrasing. It tends to produce more conservative rewrites that stay closer to the original structure.
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "tuner007/pegasus_paraphrase"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

text = "Machine learning models require large amounts of labeled data to achieve high accuracy."

inputs = tokenizer(
    text,
    return_tensors="pt",
    truncation=True,
    max_length=60,
)

outputs = model.generate(
    inputs["input_ids"],
    max_length=60,
    num_beams=10,
    num_return_sequences=5,
    do_sample=True,  # required for temperature to take effect
    temperature=1.5,
)

for i, output in enumerate(outputs):
    paraphrase = tokenizer.decode(output, skip_special_tokens=True)
    print(f"Candidate {i+1}: {paraphrase}")
```
A few things to note here. num_beams=10 with num_return_sequences=5 means the model explores 10 beam-search paths and returns the top 5. temperature=1.5 increases randomness: higher values produce more diverse paraphrases but risk losing meaning. Note that temperature only takes effect when do_sample=True; with plain beam search it is ignored (recent transformers versions warn about this). Start at 1.0 and bump it up if your candidates are too similar to each other.
## Manual T5 Paraphrasing with Full Control
When you need fine-grained control over generation, skip the pipeline and work with the model directly. This lets you tune every parameter.
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

model_name = "humarin/chatgpt_paraphraser_on_T5_base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

def paraphrase(text: str, num_candidates: int = 5) -> list[str]:
    """Generate multiple paraphrase candidates for input text."""
    input_text = f"paraphrase: {text}"
    inputs = tokenizer(
        input_text,
        return_tensors="pt",
        truncation=True,
        max_length=256,
    ).to(device)
    outputs = model.generate(
        **inputs,
        max_length=256,
        num_beams=num_candidates * 2,
        num_return_sequences=num_candidates,
        do_sample=True,  # enables temperature-controlled sampling
        temperature=1.2,
        no_repeat_ngram_size=3,
        early_stopping=True,
    )
    candidates = []
    for output in outputs:
        decoded = tokenizer.decode(output, skip_special_tokens=True)
        candidates.append(decoded)
    return candidates

text = "Neural networks learn hierarchical representations of data through multiple layers of nonlinear transformations."
results = paraphrase(text, num_candidates=5)
for i, r in enumerate(results):
    print(f"{i+1}. {r}")
```
Setting num_beams to double your desired num_return_sequences gives the beam search enough room to find diverse candidates. The no_repeat_ngram_size=3 prevents the model from repeating three-word phrases, which T5 occasionally does.
## Score Candidates by Semantic Similarity
Generating five candidates is only useful if you can pick the best one. Use sentence-transformers to compute cosine similarity between the original and each paraphrase. You want high similarity (meaning preserved) but not 1.0 (that means it just copied the input).
```shell
pip install sentence-transformers
```
```python
from sentence_transformers import SentenceTransformer, util
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

# Load similarity model
sim_model = SentenceTransformer("all-MiniLM-L6-v2")

# Load paraphrase model
model_name = "humarin/chatgpt_paraphraser_on_T5_base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def paraphrase_and_rank(text: str, num_candidates: int = 5) -> list[dict]:
    """Generate paraphrases and rank them by semantic similarity."""
    input_text = f"paraphrase: {text}"
    inputs = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=256)
    outputs = model.generate(
        **inputs,
        max_length=256,
        num_beams=num_candidates * 2,
        num_return_sequences=num_candidates,
        do_sample=True,  # required for temperature to take effect
        temperature=1.3,
        no_repeat_ngram_size=3,
    )
    candidates = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

    # Compute similarity scores
    original_embedding = sim_model.encode(text, convert_to_tensor=True)
    candidate_embeddings = sim_model.encode(candidates, convert_to_tensor=True)
    similarities = util.cos_sim(original_embedding, candidate_embeddings)[0]

    # Pair candidates with scores and sort
    scored = []
    for candidate, score in zip(candidates, similarities):
        scored.append({
            "text": candidate,
            "similarity": float(score),
        })
    scored.sort(key=lambda x: x["similarity"], reverse=True)
    return scored

text = "Transfer learning allows models to apply knowledge from one task to improve performance on another."
ranked = paraphrase_and_rank(text, num_candidates=5)

print(f"Original: {text}\n")
for i, item in enumerate(ranked):
    print(f"{i+1}. [{item['similarity']:.3f}] {item['text']}")
```
In practice, paraphrases with similarity between 0.75 and 0.95 are the sweet spot. Below 0.75, the meaning has drifted too far. Above 0.95, the rewording is too minor to be useful.
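That band is easy to enforce as a post-processing step. Here is a minimal sketch; `filter_by_similarity` is a hypothetical helper (not from any library) that works on the list-of-dicts shape `paraphrase_and_rank` returns, with the rule-of-thumb thresholds above as defaults:

```python
def filter_by_similarity(
    scored: list[dict],
    low: float = 0.75,
    high: float = 0.95,
) -> list[dict]:
    """Keep candidates whose similarity falls in the useful band.

    Below `low`, the meaning has likely drifted too far; above `high`,
    the rewrite is too close to the original to be worth keeping.
    """
    return [c for c in scored if low <= c["similarity"] <= high]

# Example with the dict shape produced by paraphrase_and_rank
scored = [
    {"text": "A near-copy of the input.", "similarity": 0.97},
    {"text": "A faithful rewording.", "similarity": 0.88},
    {"text": "Something off-topic.", "similarity": 0.61},
]
kept = filter_by_similarity(scored)
print([c["text"] for c in kept])  # only the 0.88 candidate survives
```

Tune `low` and `high` per domain: technical text usually tolerates a tighter band than casual prose.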
## Batch Processing Pipeline
When you need to paraphrase hundreds or thousands of texts, process them in batches instead of one at a time.
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

model_name = "humarin/chatgpt_paraphraser_on_T5_base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

def batch_paraphrase(
    texts: list[str],
    batch_size: int = 8,
    num_beams: int = 5,
    max_length: int = 256,
) -> list[str]:
    """Paraphrase a list of texts in batches."""
    all_paraphrases = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i : i + batch_size]
        prefixed = [f"paraphrase: {t}" for t in batch]
        inputs = tokenizer(
            prefixed,
            return_tensors="pt",
            truncation=True,
            max_length=max_length,
            padding=True,
        ).to(device)
        outputs = model.generate(
            **inputs,
            max_length=max_length,
            num_beams=num_beams,
            no_repeat_ngram_size=3,
            early_stopping=True,
        )
        decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True)
        all_paraphrases.extend(decoded)
        print(f"Processed batch {i // batch_size + 1}/{(len(texts) - 1) // batch_size + 1}")
    return all_paraphrases

# Example usage
texts = [
    "Deep learning has transformed computer vision in the last decade.",
    "Natural language processing enables machines to understand human text.",
    "Reinforcement learning trains agents through trial and error.",
    "Generative models can create realistic images from text descriptions.",
    "Attention mechanisms allow models to focus on relevant parts of the input.",
]

paraphrased = batch_paraphrase(texts, batch_size=2)
for original, rewritten in zip(texts, paraphrased):
    print(f"Original: {original}")
    print(f"Paraphrased: {rewritten}")
    print()
```
Keep batch_size small enough to fit in GPU memory. For T5-base on an 8 GB GPU, batch sizes of 8-16 work fine. For PEGASUS, which is larger, drop to 4-8.
## T5 vs PEGASUS: When to Use Which
Both models produce solid paraphrases, but they have different strengths.
Use T5 (humarin/chatgpt_paraphraser_on_T5_base) when:
- You want more creative, diverse rewrites
- You’re doing data augmentation and need variety
- You want a smaller model footprint (T5-base is ~220M parameters)
Use PEGASUS (tuner007/pegasus_paraphrase) when:
- You want conservative rewrites that stay close to the original
- You’re paraphrasing formal or technical text
- You need to preserve sentence structure while changing word choice
For most use cases, start with T5. It’s faster, smaller, and produces more noticeably different paraphrases. Switch to PEGASUS if T5 changes too much.
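If you want to switch between the two without touching the rest of the pipeline, keep the checkpoint choice behind one flag. A minimal sketch; the `pick_checkpoint` helper and its `conservative` parameter are my own naming, not part of either model's API:

```python
def pick_checkpoint(conservative: bool = False) -> str:
    """Return the paraphrase checkpoint matching the guidance above.

    conservative=False -> T5: creative, diverse rewrites, smaller model.
    conservative=True  -> PEGASUS: rewrites that stay close to the original.
    """
    if conservative:
        return "tuner007/pegasus_paraphrase"
    return "humarin/chatgpt_paraphraser_on_T5_base"

print(pick_checkpoint())                   # default: start with T5
print(pick_checkpoint(conservative=True))  # PEGASUS for formal/technical text
```

The string it returns can be passed straight to `pipeline(...)` or `AutoModelForSeq2SeqLM.from_pretrained(...)` as in the sections above.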
## Common Errors and Fixes
### RuntimeError: CUDA out of memory during beam search
Beam search multiplies memory usage by the number of beams. If you’re generating 10 candidates with num_beams=20, that’s 20x the memory of greedy decoding.
Fix: Lower num_beams, reduce batch_size, or switch to CPU for smaller workloads.
```python
# Reduce beam count
outputs = model.generate(**inputs, num_beams=4, num_return_sequences=3)

# Or use half-precision
model = model.half().to("cuda")
```
### Output is identical to input text
This happens when the model doesn’t recognize the task. For the T5 paraphrase model, you must include the "paraphrase: " prefix. Without it, T5 just echoes the input.
```python
# Wrong -- missing prefix
inputs = tokenizer("Some text here", return_tensors="pt")

# Correct
inputs = tokenizer("paraphrase: Some text here", return_tensors="pt")
```
### ValueError: num_return_sequences has to be smaller or equal to num_beams
You asked for more candidates than beam search paths. If num_beams=3 and num_return_sequences=5, there are only 3 beams to return sequences from.
Fix: Set num_beams to at least num_return_sequences. Use 2x for better diversity.
```python
# This will error
model.generate(**inputs, num_beams=3, num_return_sequences=5)

# This works
model.generate(**inputs, num_beams=10, num_return_sequences=5)
```
### All paraphrase candidates look nearly identical
Low temperature and a low beam count produce near-duplicate outputs. Raise both to get more variety, and set do_sample=True so that temperature actually applies; with plain beam search it has no effect.
```python
outputs = model.generate(
    **inputs,
    num_beams=15,
    num_return_sequences=5,
    do_sample=True,   # temperature is ignored without this
    temperature=1.5,  # higher = more diverse
    no_repeat_ngram_size=3,
)
```
### Tokenizer warns "Token indices sequence length is longer than the specified maximum"
Your input text exceeds the model's maximum length. T5-base handles up to 512 tokens. The pegasus_paraphrase checkpoint is conventionally run with a 60-token cap, as in the example above.
Fix: Always set truncation=True. For longer texts, split into sentences and paraphrase each one individually.
```python
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
```
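Sentence-level chunking can be as simple as a regex split. A minimal sketch, assuming each piece is then fed to a paraphrase function like the one in the manual T5 section; the `split_sentences` helper is my own, not a library function:

```python
import re

def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter: break after ., !, or ? followed by whitespace.

    Good enough for clean prose; use nltk or spaCy for text with
    abbreviations, decimals, or quotations.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

long_text = (
    "Transformers process text in fixed-size windows. "
    "Inputs past the window are silently truncated. "
    "Splitting first keeps every sentence under the limit."
)
for sentence in split_sentences(long_text):
    print(sentence)
    # each sentence would then be paraphrased individually and
    # the outputs rejoined in order
```

Paraphrasing sentence by sentence also keeps beam search cheap, since memory scales with sequence length.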