Standard CountVectorizer and TfidfVectorizer from scikit-learn extract all n-grams within a given range. That means you get garbage candidates like “the large”, “of a”, and “running into” alongside the real keyphrases. You spend more time filtering noise than extracting signal.
KeyphraseVectorizers fixes this by using spaCy part-of-speech tags to select only candidates that match a grammatical pattern – by default, zero or more adjectives followed by one or more nouns (<J.*>*<N.*>+). The result is a document-keyphrase matrix in which every column is a grammatically valid phrase.
```bash
pip install keyphrase-vectorizers keybert spacy scikit-learn
python -m spacy download en_core_web_sm
```
The simplest approach: fit a KeyphraseCountVectorizer on your documents and pull out the keyphrases with their frequency counts.
```python
from keyphrase_vectorizers import KeyphraseCountVectorizer

docs = [
    """Transformer architectures have revolutionized natural language processing.
    Self-attention mechanisms allow models to capture long-range dependencies
    in text sequences. Pre-trained language models like BERT and GPT use
    transformer blocks with multi-head attention layers. Fine-tuning these
    models on downstream tasks achieves state-of-the-art performance across
    sentiment analysis, named entity recognition, and question answering.""",
    """Reinforcement learning agents interact with environments to maximize
    cumulative reward signals. Deep reinforcement learning combines neural
    networks with policy gradient methods. Proximal policy optimization and
    soft actor-critic algorithms have improved training stability. Model-based
    reinforcement learning reduces sample complexity by learning environment
    dynamics.""",
]

vectorizer = KeyphraseCountVectorizer()
count_matrix = vectorizer.fit_transform(docs)
keyphrases = vectorizer.get_feature_names_out()

print(f"Found {len(keyphrases)} unique keyphrases:\n")
for kp in sorted(keyphrases):
    print(f"  {kp}")
```
Output looks something like:
```text
Found 27 unique keyphrases:

  attention layers
  attention mechanisms
  cumulative reward signals
  deep reinforcement learning
  downstream tasks
  environment dynamics
  fine-tuning
  language models
  language processing
  multi-head attention layers
  named entity recognition
  natural language processing
  neural networks
  policy gradient methods
  pre-trained language models
  proximal policy optimization
  question answering
  reinforcement learning
  reinforcement learning agents
  sample complexity
  self-attention mechanisms
  sentiment analysis
  soft actor-critic algorithms
  text sequences
  training stability
  transformer architectures
  transformer blocks
```
Every extracted keyphrase is a well-formed noun phrase. No junk n-grams.
TF-IDF Scoring with KeyphraseTfidfVectorizer
Frequency counts alone don’t tell you which keyphrases are distinctive. A term that appears in every document isn’t very useful. KeyphraseTfidfVectorizer applies TF-IDF weighting so keyphrases that are specific to individual documents score higher.
```python
from keyphrase_vectorizers import KeyphraseTfidfVectorizer
import numpy as np

tfidf_vectorizer = KeyphraseTfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(docs)
feature_names = tfidf_vectorizer.get_feature_names_out()

# Show top keyphrases per document
for doc_idx in range(tfidf_matrix.shape[0]):
    scores = tfidf_matrix[doc_idx].toarray().flatten()
    top_indices = np.argsort(scores)[::-1][:5]
    print(f"\nDocument {doc_idx + 1} - Top Keyphrases:")
    for idx in top_indices:
        if scores[idx] > 0:
            print(f"  {scores[idx]:.4f}  {feature_names[idx]}")
```
This gives you ranked keyphrases per document where rare, document-specific phrases score higher than common ones.
Custom POS Patterns
The default pattern <J.*>*<N.*>+ captures adjective-noun sequences. You can customize this to extract different phrase structures.
```python
# Default: adjectives + nouns (e.g., "deep reinforcement learning")
default_vectorizer = KeyphraseCountVectorizer(
    pos_pattern="<J.*>*<N.*>+"
)

# Nouns only -- no adjective modifiers (e.g., "learning", "networks")
noun_only = KeyphraseCountVectorizer(
    pos_pattern="<N.*>+"
)

# Verbs followed by nouns (e.g., "maximize reward", "capture dependencies")
verb_noun = KeyphraseCountVectorizer(
    pos_pattern="<V.*><N.*>+"
)

# Adjectives + nouns + preposition + nouns (e.g., "attention layers of transformer")
complex_pattern = KeyphraseCountVectorizer(
    pos_pattern="<J.*>*<N.*>+<IN><N.*>+"
)

# Compare results
for name, vec in [("default", default_vectorizer), ("noun_only", noun_only),
                  ("verb_noun", verb_noun), ("complex", complex_pattern)]:
    vec.fit(docs)
    phrases = vec.get_feature_names_out()
    print(f"\n{name} ({len(phrases)} keyphrases): {list(phrases)[:5]}")
```
The POS pattern uses a simple regex-like syntax where the tags inside angle brackets match spaCy's fine-grained (Penn Treebank) POS tags:

- <N.*> – any noun tag (NN, NNS, NNP, NNPS)
- <J.*> – any adjective tag (JJ, JJR, JJS)
- <V.*> – any verb tag (VB, VBD, VBG, etc.)
- <IN> – preposition or subordinating conjunction
- * after a group means zero or more; + means one or more
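To build intuition for how a tag-level pattern like this matches, here is a simplified toy matcher of my own (not the library's implementation): it serializes the tag sequence into a string like "<JJ><NN>", wraps each angle-bracket chunk in a group so * and + apply to whole tags, and narrows . so it cannot cross tag boundaries.

```python
import re


def matches_pattern(tags: list[str], pos_pattern: str) -> bool:
    """Toy check: does a whole tag sequence match an angle-bracket pattern?

    A simplified sketch for illustration -- not how KeyphraseVectorizers
    actually implements chunking.
    """
    regex = (
        pos_pattern.replace("<", "(?:<(?:")  # open each tag as a group
        .replace(">", ")>)")                 # close the group
        .replace(".", "[^>]")                # '.' must stay inside one tag
    )
    tag_string = "".join(f"<{t}>" for t in tags)
    return re.fullmatch(regex, tag_string) is not None


# "deep reinforcement learning" is tagged JJ NN NN -> matches the default
print(matches_pattern(["JJ", "NN", "NN"], "<J.*>*<N.*>+"))  # True
# "maximize reward" is tagged VB NN -> needs the verb-noun pattern
print(matches_pattern(["VB", "NN"], "<J.*>*<N.*>+"))        # False
print(matches_pattern(["VB", "NN"], "<V.*><N.*>+"))         # True
```

The point of the group rewriting is that a naive regex would let * bind to the closing > instead of the whole tag.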
Combining with KeyBERT for Embedding-Based Ranking
Here’s where it gets powerful. KeyBERT normally generates candidates using scikit-learn’s CountVectorizer with fixed n-gram ranges. When you swap in a KeyphraseCountVectorizer, candidates are grammatically valid phrases and KeyBERT ranks them by semantic similarity to the full document. This combination is called the PatternRank approach.
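The ranking step itself is just cosine similarity between each candidate's embedding and the document's embedding. A minimal illustration with invented toy vectors (made-up numbers, not real model outputs):

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity: dot product over the product of vector norms
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy embeddings -- invented for illustration only
doc_vec = [0.9, 0.1, 0.3]
candidates = {
    "transfer learning": [0.85, 0.15, 0.35],
    "random cropping": [0.20, 0.90, 0.10],
}

ranked = sorted(candidates, key=lambda kp: cosine(candidates[kp], doc_vec), reverse=True)
print(ranked)  # ['transfer learning', 'random cropping']
```

Candidates whose embeddings point in nearly the same direction as the document embedding rank first, regardless of how often they occur.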
```python
from keybert import KeyBERT
from keyphrase_vectorizers import KeyphraseCountVectorizer

kw_model = KeyBERT(model="all-MiniLM-L6-v2")
vectorizer = KeyphraseCountVectorizer()

doc = """
Transfer learning has transformed computer vision by allowing practitioners to
fine-tune pre-trained convolutional neural networks on small domain-specific datasets.
Models like ResNet, EfficientNet, and Vision Transformers achieve high accuracy with
minimal training data. Data augmentation techniques including random cropping, color
jittering, and mixup further reduce overfitting. Feature extraction from intermediate
layers provides rich visual representations for downstream classification, object
detection, and semantic segmentation tasks.
"""

# Extract keyphrases ranked by embedding similarity
keywords = kw_model.extract_keywords(
    doc,
    vectorizer=vectorizer,
    top_n=10,
)

print("KeyBERT + KeyphraseVectorizer (PatternRank):\n")
for kw, score in keywords:
    print(f"  {score:.4f}  {kw}")
```
You can also apply MMR diversification to reduce redundancy among the top keyphrases:
```python
diverse_keywords = kw_model.extract_keywords(
    doc,
    vectorizer=vectorizer,
    use_mmr=True,
    diversity=0.6,
    top_n=10,
)

print("With MMR diversity:\n")
for kw, score in diverse_keywords:
    print(f"  {score:.4f}  {kw}")
```
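Under the hood, MMR greedily trades relevance against redundancy: each step picks the candidate maximizing (1 - diversity) * relevance - diversity * (max similarity to anything already selected). A toy sketch with invented scores (my own illustration, not KeyBERT's code):

```python
def mmr_select(relevance: dict, pair_sim: dict, diversity: float = 0.6, top_n: int = 2) -> list:
    """Greedy Maximal Marginal Relevance over precomputed similarities.

    relevance: candidate -> similarity to the document
    pair_sim: frozenset({a, b}) -> similarity between two candidates
    A toy sketch with made-up scores, not KeyBERT's implementation.
    """
    selected = [max(relevance, key=relevance.get)]  # start with the most relevant
    while len(selected) < top_n:
        best, best_score = None, float("-inf")
        for cand in relevance:
            if cand in selected:
                continue
            # Penalize candidates similar to anything already picked
            redundancy = max(pair_sim[frozenset((cand, s))] for s in selected)
            score = (1 - diversity) * relevance[cand] - diversity * redundancy
            if score > best_score:
                best, best_score = cand, score
        selected.append(best)
    return selected


relevance = {"neural networks": 0.80, "deep neural networks": 0.78, "training data": 0.60}
pair_sim = {
    frozenset(("neural networks", "deep neural networks")): 0.95,
    frozenset(("neural networks", "training data")): 0.30,
    frozenset(("deep neural networks", "training data")): 0.35,
}

# The near-duplicate "deep neural networks" loses to a less redundant phrase
print(mmr_select(relevance, pair_sim, diversity=0.6, top_n=2))
# ['neural networks', 'training data']
```

With diversity=0 the penalty vanishes and you get plain relevance ranking, duplicates and all.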
TF-IDF vs Embedding-Based Scoring
The two approaches serve different needs:
| Method | How It Scores | Best For |
|---|---|---|
| KeyphraseTfidfVectorizer | Term frequency weighted by inverse document frequency | Corpus-level analysis, finding document-specific terms across a collection |
| KeyBERT + KeyphraseCountVectorizer | Cosine similarity between keyphrase embedding and document embedding | Single-document extraction, semantic relevance ranking |
TF-IDF needs multiple documents to calculate meaningful IDF weights. If you only have one document, every keyphrase gets the same IDF score and you’re left with raw frequency counts. KeyBERT works well on individual documents because it compares keyphrase embeddings against the full document embedding.
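You can verify the single-document collapse with scikit-learn's smoothed IDF formula, idf = ln((1 + n) / (1 + df)) + 1, computed here by hand:

```python
import math


def smoothed_idf(n_docs: int, doc_freq: int) -> float:
    # scikit-learn's default (smooth_idf=True): ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


# One document: every keyphrase that appears has df = 1, so IDF is
# identical for all of them and ranking reduces to raw frequency
print(smoothed_idf(1, 1))    # 1.0

# Ten documents: a keyphrase unique to one document is weighted far
# higher than one that appears everywhere
print(smoothed_idf(10, 1))   # ~2.70
print(smoothed_idf(10, 10))  # 1.0
```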
For a collection of documents where you want to find what makes each one unique, TF-IDF is the right choice. For pulling the most representative keyphrases from a single document regardless of a corpus, use KeyBERT.
Batch Processing Pipeline
When you have hundreds or thousands of documents, build a pipeline that reuses the spaCy model and processes documents efficiently.
```python
import spacy
import numpy as np
from keybert import KeyBERT
from keyphrase_vectorizers import KeyphraseCountVectorizer, KeyphraseTfidfVectorizer

# Load spaCy once and share across vectorizers
nlp = spacy.load("en_core_web_sm")


def extract_keyphrases_batch(
    documents: list[dict],
    method: str = "keybert",
    top_n: int = 10,
) -> list[dict]:
    """
    Extract keyphrases from a list of documents.

    Args:
        documents: List of dicts with 'id' and 'text' keys.
        method: 'keybert' for embedding-based, 'tfidf' for TF-IDF scoring.
        top_n: Number of keyphrases to return per document.

    Returns:
        List of dicts with 'id' and 'keyphrases' keys.
    """
    texts = [doc["text"] for doc in documents]

    if method == "tfidf":
        vectorizer = KeyphraseTfidfVectorizer(spacy_pipeline=nlp)
        tfidf_matrix = vectorizer.fit_transform(texts)
        feature_names = vectorizer.get_feature_names_out()

        results = []
        for i, doc in enumerate(documents):
            scores = tfidf_matrix[i].toarray().flatten()
            top_indices = np.argsort(scores)[::-1][:top_n]
            keyphrases = [
                {"keyphrase": feature_names[j], "score": round(float(scores[j]), 4)}
                for j in top_indices if scores[j] > 0
            ]
            results.append({"id": doc["id"], "keyphrases": keyphrases})
        return results

    # KeyBERT method
    kw_model = KeyBERT(model="all-MiniLM-L6-v2")
    vectorizer = KeyphraseCountVectorizer(spacy_pipeline=nlp)
    all_keywords = kw_model.extract_keywords(
        texts,
        vectorizer=vectorizer,
        use_mmr=True,
        diversity=0.5,
        top_n=top_n,
    )
    # Handle single-document case (KeyBERT returns a flat list)
    if len(documents) == 1:
        all_keywords = [all_keywords]

    results = []
    for doc, keywords in zip(documents, all_keywords):
        keyphrases = [
            {"keyphrase": kw, "score": round(score, 4)}
            for kw, score in keywords
        ]
        results.append({"id": doc["id"], "keyphrases": keyphrases})
    return results


# Example usage
batch = [
    {
        "id": "doc-001",
        "text": "Graph neural networks aggregate neighborhood features through message "
                "passing layers. Graph attention networks use attention coefficients to "
                "weight neighbor contributions. Applications include molecular property "
                "prediction, social network analysis, and recommendation systems.",
    },
    {
        "id": "doc-002",
        "text": "Federated learning enables training machine learning models across "
                "decentralized devices without sharing raw data. Differential privacy "
                "mechanisms add noise to gradients to protect individual data points. "
                "Communication efficiency is improved through gradient compression and "
                "periodic model aggregation.",
    },
    {
        "id": "doc-003",
        "text": "Diffusion models generate high-quality images by learning to reverse "
                "a noise process. Denoising score matching trains the model to predict "
                "noise at each timestep. Classifier-free guidance controls the trade-off "
                "between sample quality and diversity during generation.",
    },
]

# Compare both methods
for method in ["tfidf", "keybert"]:
    print(f"\n=== {method.upper()} ===")
    results = extract_keyphrases_batch(batch, method=method, top_n=5)
    for r in results:
        print(f"\n{r['id']}:")
        for kp in r["keyphrases"]:
            print(f"  {kp['score']:.4f}  {kp['keyphrase']}")
```
Controlling Stop Words and Filtering
The vectorizers accept custom stop word lists to remove domain-specific noise:
```python
from keyphrase_vectorizers import KeyphraseCountVectorizer

# Custom stop word list. Note: passing a list replaces the default
# English list rather than extending it, so include any general stop
# words you still want filtered.
custom_stops = [
    "et al", "fig", "figure", "table", "section", "proposed method",
    "experimental results", "state art",
]

vectorizer = KeyphraseCountVectorizer(
    stop_words=custom_stops,
    lowercase=True,
    min_df=2,      # Keyphrase must appear in at least 2 documents
    max_df=None,   # No upper limit
)

vectorizer.fit(docs)
keyphrases = vectorizer.get_feature_names_out()
print(f"Filtered keyphrases: {list(keyphrases)}")
```
Setting min_df=2 removes keyphrases that only appear in a single document, which cuts down on noise when processing a corpus. You can also use max_df to filter out keyphrases that appear too frequently.
Common Errors and Fixes
OSError: [E050] Can't find model 'en_core_web_sm'
You need to download the spaCy model separately. The pip install doesn’t do it:
```bash
python -m spacy download en_core_web_sm
```
If you need more accurate POS tagging, use en_core_web_md or en_core_web_lg instead and pass it to the vectorizer:
```python
vectorizer = KeyphraseCountVectorizer(spacy_pipeline="en_core_web_lg")
```
Empty keyphrases returned
This usually means the POS pattern doesn’t match anything in your text. Check that your text actually contains the expected parts of speech. Very short texts (under 10 words) or texts with mostly verbs and prepositions won’t match the default noun-phrase pattern. Try a broader pattern:
```python
vectorizer = KeyphraseCountVectorizer(pos_pattern="<N.*>+")
```
ValueError: empty vocabulary when calling transform()
This happens when you call transform() on documents that contain no keyphrases matching the vocabulary learned during fit(). Make sure your training and test documents share similar vocabulary, or call fit_transform() on the combined corpus.
Memory issues with large corpora
The spaCy pipeline loads into memory for each vectorizer instance. When creating multiple vectorizers, share a single spaCy model:
```python
import spacy
from keyphrase_vectorizers import KeyphraseCountVectorizer, KeyphraseTfidfVectorizer

nlp = spacy.load("en_core_web_sm")

count_vec = KeyphraseCountVectorizer(spacy_pipeline=nlp)
tfidf_vec = KeyphraseTfidfVectorizer(spacy_pipeline=nlp)
```
KeyBERT returns single-word keyphrases despite using KeyphraseCountVectorizer
If your documents are very short, the vectorizer might only find single-word noun candidates. Provide more text, or adjust the pattern to be less restrictive. Also confirm you’re passing the vectorizer parameter correctly – it’s vectorizer=, not keyphrase_ngram_range= when using KeyphraseVectorizers.
Slow processing on large batches
Set workers to use multiple CPU cores for spaCy processing:
```python
vectorizer = KeyphraseCountVectorizer(workers=4)
```
Also consider using spacy_exclude to skip pipeline components you don’t need:
```python
vectorizer = KeyphraseCountVectorizer(
    spacy_exclude=["parser", "attribute_ruler", "lemmatizer", "ner", "textcat"]
)
```
This is the default, but if you’ve overridden it, make sure you’re excluding unnecessary components.