The Quick Version#
KeyBERT uses sentence embeddings to find the words and phrases that best represent a document’s content. Unlike TF-IDF, which counts word frequencies, KeyBERT understands meaning — it knows that “neural network” and “deep learning model” are related, even though they share no words.
```python
from keybert import KeyBERT

kw_model = KeyBERT()

doc = """
Retrieval-Augmented Generation (RAG) combines large language models with external
knowledge bases to reduce hallucinations and provide grounded answers. The system
retrieves relevant documents using vector similarity search, then feeds them as
context to the LLM. This approach is particularly effective for enterprise
applications where accuracy and source attribution matter. Popular frameworks
like LangChain and LlamaIndex simplify building RAG pipelines with built-in
retriever and generator components.
"""

# (1, 3) n-grams so multi-word phrases can be extracted
keywords = kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 3), top_n=10)
for kw, score in keywords:
    print(f"{score:.4f} {kw}")
```
Output:

```
0.6834 retrieval augmented generation
0.5421 rag pipelines
0.5318 language models
0.5102 vector similarity search
0.4987 knowledge bases
0.4823 hallucinations
0.4756 langchain
0.4612 llamaindex
0.4521 enterprise applications
0.4389 source attribution
```
The scores represent cosine similarity between each candidate keyword and the full document embedding. Higher scores mean the keyword is more representative of the document’s content.
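The ranking step itself is simple to see in plain Python. A minimal sketch, using toy 3-dimensional vectors in place of real sentence embeddings (the vectors and candidate phrases here are illustrative only):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" for a document and two candidate keyphrases
doc_vec = [0.9, 0.4, 0.1]
candidates = {
    "retrieval augmented generation": [0.85, 0.45, 0.15],
    "enterprise applications": [0.3, 0.2, 0.9],
}

# Rank candidates by similarity to the document embedding
ranked = sorted(
    ((cosine(vec, doc_vec), kw) for kw, vec in candidates.items()),
    reverse=True,
)
for score, kw in ranked:
    print(f"{score:.4f} {kw}")
```

The candidate whose embedding points in nearly the same direction as the document embedding scores close to 1.0 and wins the ranking.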
By default, `keyphrase_ngram_range=(1, 1)`, so KeyBERT extracts single words only. Widen the range to capture multi-word phrases, and filter stop words so phrases don’t start or end with filler terms:
```python
# Extract 1-3 word phrases
keywords = kw_model.extract_keywords(
    doc,
    keyphrase_ngram_range=(1, 3),
    top_n=10,
    stop_words="english",
)
for kw, score in keywords:
    print(f"{score:.4f} {kw}")
```
Diversifying Results with MMR#
KeyBERT can return redundant keywords (“machine learning”, “machine learning models”, “learning models”). Use Maximal Marginal Relevance (MMR) to get diverse results:
```python
# MMR reduces redundancy between extracted keywords
keywords = kw_model.extract_keywords(
    doc,
    keyphrase_ngram_range=(1, 3),
    stop_words="english",
    use_mmr=True,
    diversity=0.7,  # 0 = no diversity, 1 = max diversity
    top_n=10,
)
```
A diversity of 0.5-0.7 works well for most documents. Lower values give more relevant but potentially redundant keywords. Higher values spread keywords across different subtopics.
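Under the hood, MMR is a greedy loop that trades relevance against redundancy: each pick maximizes `(1 - diversity) * relevance - diversity * similarity-to-already-picked`. A minimal sketch of that loop, with hand-picked toy similarity numbers standing in for real embedding similarities:

```python
def mmr_select(
    doc_sim: dict[str, float],  # candidate -> similarity to the document
    pair_sim: dict[tuple[str, str], float],  # candidate pair -> similarity
    diversity: float,
    top_n: int,
) -> list[str]:
    """Greedy MMR: balance relevance against similarity to prior picks."""
    def sim(a: str, b: str) -> float:
        return pair_sim.get((a, b), pair_sim.get((b, a), 0.0))

    # Start with the single most relevant candidate
    selected = [max(doc_sim, key=doc_sim.get)]
    while len(selected) < top_n:
        remaining = [c for c in doc_sim if c not in selected]
        if not remaining:
            break
        # (1 - diversity) * relevance - diversity * redundancy
        best = max(
            remaining,
            key=lambda c: (1 - diversity) * doc_sim[c]
            - diversity * max(sim(c, s) for s in selected),
        )
        selected.append(best)
    return selected

doc_sim = {
    "machine learning": 0.72,
    "machine learning models": 0.70,
    "data pipelines": 0.55,
}
pair_sim = {
    ("machine learning", "machine learning models"): 0.95,
    ("machine learning", "data pipelines"): 0.30,
    ("machine learning models", "data pipelines"): 0.35,
}

# High diversity skips the near-duplicate phrase in favor of a new subtopic
print(mmr_select(doc_sim, pair_sim, diversity=0.7, top_n=2))
```

With `diversity=0.7` the near-duplicate “machine learning models” loses to the less relevant but fresher “data pipelines”; with a low diversity like 0.1 the duplicate would win on relevance alone.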
Using Different Embedding Models#
KeyBERT’s default model works fine for general English text. For domain-specific documents, switch to a specialized embedding model:
```python
from keybert import KeyBERT
from sentence_transformers import SentenceTransformer

# Scientific/technical text
model = SentenceTransformer("allenai/scibert_scivocab_uncased")
kw_model = KeyBERT(model=model)

# Multilingual documents
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
kw_model = KeyBERT(model=model)

# Fast extraction for large batches
model = SentenceTransformer("all-MiniLM-L6-v2")
kw_model = KeyBERT(model=model)
```
You can also use OpenAI embeddings for higher quality extraction:
```python
import numpy as np
import openai
from keybert import KeyBERT
from keybert.backend import BaseEmbedder

class OpenAIBackend(BaseEmbedder):
    def __init__(self, model: str = "text-embedding-3-small"):
        self.client = openai.OpenAI()
        self.model = model

    def embed(self, documents: list[str], verbose: bool = False) -> np.ndarray:
        response = self.client.embeddings.create(model=self.model, input=documents)
        # KeyBERT expects a 2D numpy array, one row per document
        return np.array([d.embedding for d in response.data])

kw_model = KeyBERT(model=OpenAIBackend())
```
Batch Processing for Large Document Sets#
When processing thousands of documents, extract keywords in batches for efficiency:
```python
from keybert import KeyBERT

def extract_keywords_batch(
    documents: list[dict],
    kw_model: KeyBERT,
    top_n: int = 10,
) -> list[dict]:
    """Extract keywords from a batch of documents."""
    texts = [doc["text"] for doc in documents]
    # KeyBERT handles batching internally when given a list of documents
    all_keywords = kw_model.extract_keywords(
        texts,
        keyphrase_ngram_range=(1, 3),
        stop_words="english",
        use_mmr=True,
        diversity=0.5,
        top_n=top_n,
    )
    results = []
    for doc, keywords in zip(documents, all_keywords):
        results.append({
            "id": doc["id"],
            "title": doc.get("title", ""),
            "keywords": [
                {"keyword": kw, "score": round(score, 4)} for kw, score in keywords
            ],
        })
    return results

# Process a batch
documents = [
    {"id": "1", "title": "RAG Guide", "text": "Retrieval-augmented generation..."},
    {"id": "2", "title": "Fine-tuning", "text": "LoRA fine-tuning of large..."},
]
results = extract_keywords_batch(documents, kw_model)
for r in results:
    print(f"\n{r['title']}:")
    for kw in r["keywords"][:5]:
        print(f"  {kw['score']} {kw['keyword']}")
```
Comparing with TF-IDF and RAKE#
KeyBERT isn’t the only option. Here’s how the main keyword extraction methods compare:
```python
# Method 1: TF-IDF (frequency-based, no semantic understanding)
from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_keywords(documents: list[str], top_n: int = 10) -> list[list[tuple]]:
    vectorizer = TfidfVectorizer(
        max_features=1000,
        stop_words="english",
        ngram_range=(1, 3),
    )
    tfidf_matrix = vectorizer.fit_transform(documents)
    feature_names = vectorizer.get_feature_names_out()
    results = []
    for i in range(tfidf_matrix.shape[0]):
        row = tfidf_matrix[i].toarray().flatten()
        top_indices = row.argsort()[-top_n:][::-1]
        keywords = [
            (feature_names[j], round(row[j], 4)) for j in top_indices if row[j] > 0
        ]
        results.append(keywords)
    return results

# Method 2: RAKE (Rapid Automatic Keyword Extraction)
# rake_nltk needs the NLTK "stopwords" and "punkt" data downloaded first
from rake_nltk import Rake

def rake_keywords(text: str, top_n: int = 10) -> list[tuple]:
    rake = Rake()
    rake.extract_keywords_from_text(text)
    return rake.get_ranked_phrases_with_scores()[:top_n]

# Compare all three
doc = "Your document text here..."

print("KeyBERT:")
for kw, score in kw_model.extract_keywords(doc, top_n=5):
    print(f"  {score:.4f} {kw}")

print("\nTF-IDF:")
for kw, score in tfidf_keywords([doc], top_n=5)[0]:
    print(f"  {score:.4f} {kw}")

print("\nRAKE:")
for score, kw in rake_keywords(doc, top_n=5):
    print(f"  {score:.2f} {kw}")
```
| Method | Pros | Cons | Best For |
|---|---|---|---|
| KeyBERT | Semantic understanding, handles synonyms | Slower, needs GPU for large batches | Single documents, quality over speed |
| TF-IDF | Fast, works well on document collections | No semantic understanding, needs a corpus | Large document sets, corpus-level analysis |
| RAKE | No training needed, very fast | Often noisy results, no ranking by relevance | Quick extraction, preprocessing step |
Guiding Extraction with Seed Keywords#
When you know the domain, guide KeyBERT with seed keywords to focus extraction on relevant terms:
```python
# Seed keywords steer extraction toward your domain
seed_keywords = ["machine learning", "neural network", "training", "model"]

keywords = kw_model.extract_keywords(
    doc,
    seed_keywords=seed_keywords,
    keyphrase_ngram_range=(1, 3),
    top_n=10,
)
```
Seed keywords bias the similarity scoring toward terms related to your seeds. This helps when documents cover multiple topics and you want keywords from a specific angle.
Common Errors and Fixes#
Keywords are too generic (“data”, “system”, “method”)
Increase the n-gram range to `(2, 3)` so KeyBERT extracts phrases instead of single words. Also add domain-specific stop words:
```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Extend the standard English list with domain-generic filler terms,
# rather than replacing it outright
custom_stops = list(ENGLISH_STOP_WORDS) + [
    "data", "system", "method", "approach", "based", "using", "paper", "proposed"
]
keywords = kw_model.extract_keywords(doc, stop_words=custom_stops)
```
Slow on long documents
KeyBERT embeds the full document and every candidate keyword. For documents over 5000 words, truncate or split into sections and extract keywords per section, then deduplicate.
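That split-and-merge approach can be sketched as two small helpers. The word-count chunk size and the commented `kw_model` usage are assumptions for illustration, not part of KeyBERT’s API:

```python
def chunk_words(text: str, max_words: int = 1000) -> list[str]:
    """Split a long document into word-count-bounded sections."""
    words = text.split()
    return [
        " ".join(words[i : i + max_words])
        for i in range(0, len(words), max_words)
    ]

def merge_keywords(
    per_section: list[list[tuple[str, float]]], top_n: int = 10
) -> list[tuple[str, float]]:
    """Deduplicate keywords across sections, keeping each one's best score."""
    best: dict[str, float] = {}
    for keywords in per_section:
        for kw, score in keywords:
            best[kw] = max(score, best.get(kw, 0.0))
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]

# Usage with KeyBERT (kw_model as defined earlier):
# sections = chunk_words(long_doc)
# per_section = [kw_model.extract_keywords(s, top_n=10) for s in sections]
# keywords = merge_keywords(per_section)
```

Scoring per section keeps each embedding call short, and keeping the best score per keyword means a phrase that dominates one section still surfaces in the merged ranking.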
Empty results on short text
KeyBERT needs enough text to generate meaningful candidates. For tweets or short messages, lower the n-gram range to `(1, 1)` and reduce `top_n`. Below 20 words, consider using entity extraction instead.
Different results on same document
Sentence-transformer inference is normally deterministic, but GPU kernels can introduce tiny floating-point differences that reorder near-tied keywords. Set a seed with `torch.manual_seed(42)` before extraction, or run the model on CPU for fully reproducible rankings.
KeyBERT misses domain-specific jargon
The default embedding model may not understand specialized terminology. Switch to a domain-specific model (SciBERT for science, BioBERT for medical, FinBERT for finance) or use OpenAI embeddings which have broader vocabulary coverage.