Standard CountVectorizer and TfidfVectorizer from scikit-learn extract all n-grams within a given range. That means you get garbage candidates like “the large”, “of a”, and “running into” alongside the real keyphrases. You spend more time filtering noise than extracting signal.
KeyphraseVectorizers fixes this by using spaCy part-of-speech tags to select only candidates that match a grammatical pattern – by default, zero or more adjectives followed by one or more nouns (<J.*>*<N.*>+). The result is a document-keyphrase matrix in which every column is a grammatically valid phrase.
```bash
pip install keyphrase-vectorizers keybert spacy scikit-learn
python -m spacy download en_core_web_sm
```
The simplest approach: fit a KeyphraseCountVectorizer on your documents and pull out the keyphrases with their frequency counts.
```python
from keyphrase_vectorizers import KeyphraseCountVectorizer

docs = [
    """Transformer architectures have revolutionized natural language processing.
    Self-attention mechanisms allow models to capture long-range dependencies
    in text sequences. Pre-trained language models like BERT and GPT use
    transformer blocks with multi-head attention layers. Fine-tuning these
    models on downstream tasks achieves state-of-the-art performance across
    sentiment analysis, named entity recognition, and question answering.""",
    """Reinforcement learning agents interact with environments to maximize
    cumulative reward signals. Deep reinforcement learning combines neural
    networks with policy gradient methods. Proximal policy optimization and
    soft actor-critic algorithms have improved training stability. Model-based
    reinforcement learning reduces sample complexity by learning environment
    dynamics.""",
]

vectorizer = KeyphraseCountVectorizer()
count_matrix = vectorizer.fit_transform(docs)
keyphrases = vectorizer.get_feature_names_out()

print(f"Found {len(keyphrases)} unique keyphrases:\n")
for kp in sorted(keyphrases):
    print(f"  {kp}")
```
Output looks something like:
```text
Found 27 unique keyphrases:

  attention layers
  attention mechanisms
  cumulative reward signals
  deep reinforcement learning
  downstream tasks
  environment dynamics
  fine-tuning
  language models
  language processing
  multi-head attention layers
  named entity recognition
  natural language processing
  neural networks
  policy gradient methods
  pre-trained language models
  proximal policy optimization
  question answering
  reinforcement learning
  reinforcement learning agents
  sample complexity
  self-attention mechanisms
  sentiment analysis
  soft actor-critic algorithms
  text sequences
  training stability
  transformer architectures
  transformer blocks
```
Every extracted keyphrase is a well-formed noun phrase. No junk n-grams.
TF-IDF Scoring with KeyphraseTfidfVectorizer
Frequency counts alone don’t tell you which keyphrases are distinctive. A term that appears in every document isn’t very useful. KeyphraseTfidfVectorizer applies TF-IDF weighting so keyphrases that are specific to individual documents score higher.
```python
from keyphrase_vectorizers import KeyphraseTfidfVectorizer
import numpy as np

tfidf_vectorizer = KeyphraseTfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(docs)
feature_names = tfidf_vectorizer.get_feature_names_out()

# Show top keyphrases per document
for doc_idx in range(tfidf_matrix.shape[0]):
    scores = tfidf_matrix[doc_idx].toarray().flatten()
    top_indices = np.argsort(scores)[::-1][:5]
    print(f"\nDocument {doc_idx + 1} - Top Keyphrases:")
    for idx in top_indices:
        if scores[idx] > 0:
            print(f"  {scores[idx]:.4f}  {feature_names[idx]}")
```
This gives you ranked keyphrases per document where rare, document-specific phrases score higher than common ones.
Custom POS Patterns
The default pattern <J.*>*<N.*>+ captures adjective-noun sequences. You can customize this to extract different phrase structures.
```python
# Default: adjectives + nouns (e.g., "deep reinforcement learning")
default_vectorizer = KeyphraseCountVectorizer(
    pos_pattern="<J.*>*<N.*>+"
)

# Nouns only -- no adjective modifiers (e.g., "learning", "networks")
noun_only = KeyphraseCountVectorizer(
    pos_pattern="<N.*>+"
)

# Verbs followed by nouns (e.g., "maximize reward", "capture dependencies")
verb_noun = KeyphraseCountVectorizer(
    pos_pattern="<V.*><N.*>+"
)

# Adjectives + nouns + preposition + nouns (e.g., "attention layers of transformer")
complex_pattern = KeyphraseCountVectorizer(
    pos_pattern="<J.*>*<N.*>+<IN><N.*>+"
)

# Compare results
for name, vec in [("default", default_vectorizer), ("noun_only", noun_only),
                  ("verb_noun", verb_noun), ("complex", complex_pattern)]:
    vec.fit(docs)
    phrases = vec.get_feature_names_out()
    print(f"\n{name} ({len(phrases)} keyphrases): {list(phrases)[:5]}")
```
The POS pattern uses a simple regex-like syntax where the tags inside angle brackets match spaCy's fine-grained (Penn Treebank) POS tags:

- <N.*> – any noun tag (NN, NNS, NNP, NNPS)
- <J.*> – any adjective tag (JJ, JJR, JJS)
- <V.*> – any verb tag (VB, VBD, VBG, etc.)
- <IN> – preposition or subordinating conjunction
- * after a group means zero or more; + means one or more
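To build intuition for how a tag-level pattern like this matches, here is a simplified toy matcher of my own (not the library's implementation): it serializes the tag sequence into a string like "<JJ><NN>", wraps each angle-bracket chunk in a group so * and + apply to whole tags, and narrows . so it cannot cross tag boundaries.

```python
import re


def matches_pattern(tags: list[str], pos_pattern: str) -> bool:
    """Toy check: does a whole tag sequence match an angle-bracket pattern?

    A simplified sketch for illustration -- not how KeyphraseVectorizers
    actually implements chunking.
    """
    regex = (
        pos_pattern.replace("<", "(?:<(?:")  # open each tag as a group
        .replace(">", ")>)")                 # close the group
        .replace(".", "[^>]")                # '.' must stay inside one tag
    )
    tag_string = "".join(f"<{t}>" for t in tags)
    return re.fullmatch(regex, tag_string) is not None


# "deep reinforcement learning" is tagged JJ NN NN -> matches the default
print(matches_pattern(["JJ", "NN", "NN"], "<J.*>*<N.*>+"))  # True
# "maximize reward" is tagged VB NN -> needs the verb-noun pattern
print(matches_pattern(["VB", "NN"], "<J.*>*<N.*>+"))        # False
print(matches_pattern(["VB", "NN"], "<V.*><N.*>+"))         # True
```

The point of the group rewriting is that a naive regex would let * bind to the closing > instead of the whole tag.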
Combining with KeyBERT for Embedding-Based Ranking
Here’s where it gets powerful. KeyBERT normally generates candidates using scikit-learn’s CountVectorizer with fixed n-gram ranges. When you swap in a KeyphraseCountVectorizer, candidates are grammatically valid phrases and KeyBERT ranks them by semantic similarity to the full document. This combination is called the PatternRank approach.
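The ranking step itself is just cosine similarity between each candidate's embedding and the document's embedding. A minimal illustration with invented toy vectors (made-up numbers, not real model outputs):

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity: dot product over the product of vector norms
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy embeddings -- invented for illustration only
doc_vec = [0.9, 0.1, 0.3]
candidates = {
    "transfer learning": [0.85, 0.15, 0.35],
    "random cropping": [0.20, 0.90, 0.10],
}

ranked = sorted(candidates, key=lambda kp: cosine(candidates[kp], doc_vec), reverse=True)
print(ranked)  # ['transfer learning', 'random cropping']
```

Candidates whose embeddings point in nearly the same direction as the document embedding rank first, regardless of how often they occur.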
```python
from keybert import KeyBERT
from keyphrase_vectorizers import KeyphraseCountVectorizer

kw_model = KeyBERT(model="all-MiniLM-L6-v2")
vectorizer = KeyphraseCountVectorizer()

doc = """
Transfer learning has transformed computer vision by allowing practitioners to
fine-tune pre-trained convolutional neural networks on small domain-specific datasets.
Models like ResNet, EfficientNet, and Vision Transformers achieve high accuracy with
minimal training data. Data augmentation techniques including random cropping, color
jittering, and mixup further reduce overfitting. Feature extraction from intermediate
layers provides rich visual representations for downstream classification, object
detection, and semantic segmentation tasks.
"""

# Extract keyphrases ranked by embedding similarity
keywords = kw_model.extract_keywords(
    doc,
    vectorizer=vectorizer,
    top_n=10,
)

print("KeyBERT + KeyphraseVectorizer (PatternRank):\n")
for kw, score in keywords:
    print(f"  {score:.4f}  {kw}")
```
You can also apply MMR diversification to reduce redundancy among the top keyphrases:
```python
diverse_keywords = kw_model.extract_keywords(
    doc,
    vectorizer=vectorizer,
    use_mmr=True,
    diversity=0.6,
    top_n=10,
)

print("With MMR diversity:\n")
for kw, score in diverse_keywords:
    print(f"  {score:.4f}  {kw}")
```
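Under the hood, MMR greedily trades relevance against redundancy: each step picks the candidate maximizing (1 - diversity) * relevance - diversity * (max similarity to anything already selected). A toy sketch with invented scores (my own illustration, not KeyBERT's code):

```python
def mmr_select(relevance: dict, pair_sim: dict, diversity: float = 0.6, top_n: int = 2) -> list:
    """Greedy Maximal Marginal Relevance over precomputed similarities.

    relevance: candidate -> similarity to the document
    pair_sim: frozenset({a, b}) -> similarity between two candidates
    A toy sketch with made-up scores, not KeyBERT's implementation.
    """
    selected = [max(relevance, key=relevance.get)]  # start with the most relevant
    while len(selected) < top_n:
        best, best_score = None, float("-inf")
        for cand in relevance:
            if cand in selected:
                continue
            # Penalize candidates similar to anything already picked
            redundancy = max(pair_sim[frozenset((cand, s))] for s in selected)
            score = (1 - diversity) * relevance[cand] - diversity * redundancy
            if score > best_score:
                best, best_score = cand, score
        selected.append(best)
    return selected


relevance = {"neural networks": 0.80, "deep neural networks": 0.78, "training data": 0.60}
pair_sim = {
    frozenset(("neural networks", "deep neural networks")): 0.95,
    frozenset(("neural networks", "training data")): 0.30,
    frozenset(("deep neural networks", "training data")): 0.35,
}

# The near-duplicate "deep neural networks" loses to a less redundant phrase
print(mmr_select(relevance, pair_sim, diversity=0.6, top_n=2))
# ['neural networks', 'training data']
```

With diversity=0 the penalty vanishes and you get plain relevance ranking, duplicates and all.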
TF-IDF vs Embedding-Based Scoring
The two approaches serve different needs:
| Method | How It Scores | Best For |
|---|---|---|
| KeyphraseTfidfVectorizer | Term frequency weighted by inverse document frequency | Corpus-level analysis, finding document-specific terms across a collection |
| KeyBERT + KeyphraseCountVectorizer | Cosine similarity between keyphrase embedding and document embedding | Single-document extraction, semantic relevance ranking |
TF-IDF needs multiple documents to calculate meaningful IDF weights. If you only have one document, every keyphrase gets the same IDF score and you’re left with raw frequency counts. KeyBERT works well on individual documents because it compares keyphrase embeddings against the full document embedding.
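You can verify the single-document collapse with scikit-learn's smoothed IDF formula, idf = ln((1 + n) / (1 + df)) + 1, computed here by hand:

```python
import math


def smoothed_idf(n_docs: int, doc_freq: int) -> float:
    # scikit-learn's default (smooth_idf=True): ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


# One document: every keyphrase that appears has df = 1, so IDF is
# identical for all of them and ranking reduces to raw frequency
print(smoothed_idf(1, 1))    # 1.0

# Ten documents: a keyphrase unique to one document is weighted far
# higher than one that appears everywhere
print(smoothed_idf(10, 1))   # ~2.70
print(smoothed_idf(10, 10))  # 1.0
```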
For a collection of documents where you want to find what makes each one unique, TF-IDF is the right choice. For pulling the most representative keyphrases from a single document regardless of a corpus, use KeyBERT.
Batch Processing Pipeline
When you have hundreds or thousands of documents, build a pipeline that reuses the spaCy model and processes documents efficiently.
```python
import spacy
import numpy as np
from keybert import KeyBERT
from keyphrase_vectorizers import KeyphraseCountVectorizer, KeyphraseTfidfVectorizer

# Load spaCy once and share across vectorizers
nlp = spacy.load("en_core_web_sm")


def extract_keyphrases_batch(
    documents: list[dict],
    method: str = "keybert",
    top_n: int = 10,
) -> list[dict]:
    """
    Extract keyphrases from a list of documents.

    Args:
        documents: List of dicts with 'id' and 'text' keys.
        method: 'keybert' for embedding-based, 'tfidf' for TF-IDF scoring.
        top_n: Number of keyphrases to return per document.

    Returns:
        List of dicts with 'id' and 'keyphrases' keys.
    """
    texts = [doc["text"] for doc in documents]

    if method == "tfidf":
        vectorizer = KeyphraseTfidfVectorizer(spacy_pipeline=nlp)
        tfidf_matrix = vectorizer.fit_transform(texts)
        feature_names = vectorizer.get_feature_names_out()

        results = []
        for i, doc in enumerate(documents):
            scores = tfidf_matrix[i].toarray().flatten()
            top_indices = np.argsort(scores)[::-1][:top_n]
            keyphrases = [
                {"keyphrase": feature_names[j], "score": round(float(scores[j]), 4)}
                for j in top_indices if scores[j] > 0
            ]
            results.append({"id": doc["id"], "keyphrases": keyphrases})
        return results

    # KeyBERT method
    kw_model = KeyBERT(model="all-MiniLM-L6-v2")
    vectorizer = KeyphraseCountVectorizer(spacy_pipeline=nlp)
    all_keywords = kw_model.extract_keywords(
        texts,
        vectorizer=vectorizer,
        use_mmr=True,
        diversity=0.5,
        top_n=top_n,
    )
    # Handle single-document case (KeyBERT returns a flat list)
    if len(documents) == 1:
        all_keywords = [all_keywords]

    results = []
    for doc, keywords in zip(documents, all_keywords):
        keyphrases = [
            {"keyphrase": kw, "score": round(score, 4)}
            for kw, score in keywords
        ]
        results.append({"id": doc["id"], "keyphrases": keyphrases})
    return results


# Example usage
batch = [
    {
        "id": "doc-001",
        "text": "Graph neural networks aggregate neighborhood features through message "
                "passing layers. Graph attention networks use attention coefficients to "
                "weight neighbor contributions. Applications include molecular property "
                "prediction, social network analysis, and recommendation systems.",
    },
    {
        "id": "doc-002",
        "text": "Federated learning enables training machine learning models across "
                "decentralized devices without sharing raw data. Differential privacy "
                "mechanisms add noise to gradients to protect individual data points. "
                "Communication efficiency is improved through gradient compression and "
                "periodic model aggregation.",
    },
    {
        "id": "doc-003",
        "text": "Diffusion models generate high-quality images by learning to reverse "
                "a noise process. Denoising score matching trains the model to predict "
                "noise at each timestep. Classifier-free guidance controls the trade-off "
                "between sample quality and diversity during generation.",
    },
]

# Compare both methods
for method in ["tfidf", "keybert"]:
    print(f"\n=== {method.upper()} ===")
    results = extract_keyphrases_batch(batch, method=method, top_n=5)
    for r in results:
        print(f"\n{r['id']}:")
        for kp in r["keyphrases"]:
            print(f"  {kp['score']:.4f}  {kp['keyphrase']}")
```
Controlling Stop Words and Filtering
The vectorizers accept custom stop word lists to remove domain-specific noise:
```python
from keyphrase_vectorizers import KeyphraseCountVectorizer

# Custom stop word list. Note: passing a list replaces the default
# English list rather than extending it, so include any general stop
# words you still want filtered.
custom_stops = [
    "et al", "fig", "figure", "table", "section", "proposed method",
    "experimental results", "state art",
]

vectorizer = KeyphraseCountVectorizer(
    stop_words=custom_stops,
    lowercase=True,
    min_df=2,      # Keyphrase must appear in at least 2 documents
    max_df=None,   # No upper limit
)

vectorizer.fit(docs)
keyphrases = vectorizer.get_feature_names_out()
print(f"Filtered keyphrases: {list(keyphrases)}")
```
Setting min_df=2 removes keyphrases that only appear in a single document, which cuts down on noise when processing a corpus. You can also use max_df to filter out keyphrases that appear too frequently.
Common Errors and Fixes
OSError: [E050] Can't find model 'en_core_web_sm'
You need to download the spaCy model separately. The pip install doesn’t do it:
```bash
python -m spacy download en_core_web_sm
```
If you need more accurate POS tagging, use en_core_web_md or en_core_web_lg instead and pass it to the vectorizer:
```python
vectorizer = KeyphraseCountVectorizer(spacy_pipeline="en_core_web_lg")
```
Empty keyphrases returned
This usually means the POS pattern doesn’t match anything in your text. Check that your text actually contains the expected parts of speech. Very short texts (under 10 words) or texts with mostly verbs and prepositions won’t match the default noun-phrase pattern. Try a broader pattern:
```python
vectorizer = KeyphraseCountVectorizer(pos_pattern="<N.*>+")
```
ValueError: empty vocabulary when calling transform()
This happens when you call transform() on documents that contain no keyphrases matching the vocabulary learned during fit(). Make sure your training and test documents share similar vocabulary, or call fit_transform() on the combined corpus.
Memory issues with large corpora
The spaCy pipeline loads into memory for each vectorizer instance. When creating multiple vectorizers, share a single spaCy model:
```python
import spacy
from keyphrase_vectorizers import KeyphraseCountVectorizer, KeyphraseTfidfVectorizer

nlp = spacy.load("en_core_web_sm")

count_vec = KeyphraseCountVectorizer(spacy_pipeline=nlp)
tfidf_vec = KeyphraseTfidfVectorizer(spacy_pipeline=nlp)
```
KeyBERT returns single-word keyphrases despite using KeyphraseCountVectorizer
If your documents are very short, the vectorizer might only find single-word noun candidates. Provide more text, or adjust the pattern to be less restrictive. Also confirm you’re passing the vectorizer parameter correctly – it’s vectorizer=, not keyphrase_ngram_range= when using KeyphraseVectorizers.
Slow processing on large batches
Set workers to use multiple CPU cores for spaCy processing:
```python
vectorizer = KeyphraseCountVectorizer(workers=4)
```
Also consider using spacy_exclude to skip pipeline components you don’t need:
```python
vectorizer = KeyphraseCountVectorizer(
    spacy_exclude=["parser", "attribute_ruler", "lemmatizer", "ner", "textcat"]
)
```
This is the default, but if you’ve overridden it, make sure you’re excluding unnecessary components.