The Quick Version#
KeyBERT uses sentence embeddings to find the words and phrases that best represent a document’s content. Unlike TF-IDF, which counts word frequencies, KeyBERT understands meaning — it knows that “neural network” and “deep learning model” are related, even though they share no words.
```python
from keybert import KeyBERT

kw_model = KeyBERT()

doc = """
Retrieval-Augmented Generation (RAG) combines large language models with external
knowledge bases to reduce hallucinations and provide grounded answers. The system
retrieves relevant documents using vector similarity search, then feeds them as
context to the LLM. This approach is particularly effective for enterprise
applications where accuracy and source attribution matter. Popular frameworks
like LangChain and LlamaIndex simplify building RAG pipelines with built-in
retriever and generator components.
"""

# (1, 3) n-grams so multi-word phrases can be extracted
keywords = kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 3), top_n=10)
for kw, score in keywords:
    print(f"{score:.4f} {kw}")
```
Output:

```
0.6834 retrieval augmented generation
0.5421 rag pipelines
0.5318 language models
0.5102 vector similarity search
0.4987 knowledge bases
0.4823 hallucinations
0.4756 langchain
0.4612 llamaindex
0.4521 enterprise applications
0.4389 source attribution
```
The scores represent cosine similarity between each candidate keyword and the full document embedding. Higher scores mean the keyword is more representative of the document’s content.
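The ranking step itself is simple to see in plain Python. A minimal sketch, using toy 3-dimensional vectors in place of real sentence embeddings (the vectors and candidate phrases here are illustrative only):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" for a document and two candidate keyphrases
doc_vec = [0.9, 0.4, 0.1]
candidates = {
    "retrieval augmented generation": [0.85, 0.45, 0.15],
    "enterprise applications": [0.3, 0.2, 0.9],
}

# Rank candidates by similarity to the document embedding
ranked = sorted(
    ((cosine(vec, doc_vec), kw) for kw, vec in candidates.items()),
    reverse=True,
)
for score, kw in ranked:
    print(f"{score:.4f} {kw}")
```

The candidate whose embedding points in nearly the same direction as the document embedding scores close to 1.0 and wins the ranking.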
By default, `keyphrase_ngram_range=(1, 1)`, so KeyBERT extracts single words only. Widen the range to capture multi-word phrases, and filter stop words so phrases don’t start or end with filler terms:
```python
# Extract 1-3 word phrases
keywords = kw_model.extract_keywords(
    doc,
    keyphrase_ngram_range=(1, 3),
    top_n=10,
    stop_words="english",
)
for kw, score in keywords:
    print(f"{score:.4f} {kw}")
```
Diversifying Results with MMR#
KeyBERT can return redundant keywords (“machine learning”, “machine learning models”, “learning models”). Use Maximal Marginal Relevance (MMR) to get diverse results:
```python
# MMR reduces redundancy between extracted keywords
keywords = kw_model.extract_keywords(
    doc,
    keyphrase_ngram_range=(1, 3),
    stop_words="english",
    use_mmr=True,
    diversity=0.7,  # 0 = no diversity, 1 = max diversity
    top_n=10,
)
```
A diversity of 0.5-0.7 works well for most documents. Lower values give more relevant but potentially redundant keywords. Higher values spread keywords across different subtopics.
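Under the hood, MMR is a greedy loop that trades relevance against redundancy: each pick maximizes `(1 - diversity) * relevance - diversity * similarity-to-already-picked`. A minimal sketch of that loop, with hand-picked toy similarity numbers standing in for real embedding similarities:

```python
def mmr_select(
    doc_sim: dict[str, float],  # candidate -> similarity to the document
    pair_sim: dict[tuple[str, str], float],  # candidate pair -> similarity
    diversity: float,
    top_n: int,
) -> list[str]:
    """Greedy MMR: balance relevance against similarity to prior picks."""
    def sim(a: str, b: str) -> float:
        return pair_sim.get((a, b), pair_sim.get((b, a), 0.0))

    # Start with the single most relevant candidate
    selected = [max(doc_sim, key=doc_sim.get)]
    while len(selected) < top_n:
        remaining = [c for c in doc_sim if c not in selected]
        if not remaining:
            break
        # (1 - diversity) * relevance - diversity * redundancy
        best = max(
            remaining,
            key=lambda c: (1 - diversity) * doc_sim[c]
            - diversity * max(sim(c, s) for s in selected),
        )
        selected.append(best)
    return selected

doc_sim = {
    "machine learning": 0.72,
    "machine learning models": 0.70,
    "data pipelines": 0.55,
}
pair_sim = {
    ("machine learning", "machine learning models"): 0.95,
    ("machine learning", "data pipelines"): 0.30,
    ("machine learning models", "data pipelines"): 0.35,
}

# High diversity skips the near-duplicate phrase in favor of a new subtopic
print(mmr_select(doc_sim, pair_sim, diversity=0.7, top_n=2))
```

With `diversity=0.7` the near-duplicate “machine learning models” loses to the less relevant but fresher “data pipelines”; with a low diversity like 0.1 the duplicate would win on relevance alone.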
Using Different Embedding Models#
KeyBERT’s default model works fine for general English text. For domain-specific documents, switch to a specialized embedding model:
```python
from keybert import KeyBERT
from sentence_transformers import SentenceTransformer

# Scientific/technical text
model = SentenceTransformer("allenai/scibert_scivocab_uncased")
kw_model = KeyBERT(model=model)

# Multilingual documents
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
kw_model = KeyBERT(model=model)

# Fast extraction for large batches
model = SentenceTransformer("all-MiniLM-L6-v2")
kw_model = KeyBERT(model=model)
```
You can also use OpenAI embeddings for higher quality extraction:
```python
import numpy as np
import openai
from keybert import KeyBERT
from keybert.backend import BaseEmbedder

class OpenAIBackend(BaseEmbedder):
    def __init__(self, model: str = "text-embedding-3-small"):
        self.client = openai.OpenAI()
        self.model = model

    def embed(self, documents: list[str], verbose: bool = False) -> np.ndarray:
        response = self.client.embeddings.create(model=self.model, input=documents)
        # KeyBERT expects a 2D numpy array, one row per document
        return np.array([d.embedding for d in response.data])

kw_model = KeyBERT(model=OpenAIBackend())
```
Batch Processing for Large Document Sets#
When processing thousands of documents, extract keywords in batches for efficiency:
```python
from keybert import KeyBERT

def extract_keywords_batch(
    documents: list[dict],
    kw_model: KeyBERT,
    top_n: int = 10,
) -> list[dict]:
    """Extract keywords from a batch of documents."""
    texts = [doc["text"] for doc in documents]
    # KeyBERT handles batching internally when given a list of documents
    all_keywords = kw_model.extract_keywords(
        texts,
        keyphrase_ngram_range=(1, 3),
        stop_words="english",
        use_mmr=True,
        diversity=0.5,
        top_n=top_n,
    )
    results = []
    for doc, keywords in zip(documents, all_keywords):
        results.append({
            "id": doc["id"],
            "title": doc.get("title", ""),
            "keywords": [
                {"keyword": kw, "score": round(score, 4)} for kw, score in keywords
            ],
        })
    return results

# Process a batch
documents = [
    {"id": "1", "title": "RAG Guide", "text": "Retrieval-augmented generation..."},
    {"id": "2", "title": "Fine-tuning", "text": "LoRA fine-tuning of large..."},
]
results = extract_keywords_batch(documents, kw_model)
for r in results:
    print(f"\n{r['title']}:")
    for kw in r["keywords"][:5]:
        print(f"  {kw['score']} {kw['keyword']}")
```
Comparing with TF-IDF and RAKE#
KeyBERT isn’t the only option. Here’s how the main keyword extraction methods compare:
```python
# Method 1: TF-IDF (frequency-based, no semantic understanding)
from sklearn.feature_extraction.text import TfidfVectorizer

def tfidf_keywords(documents: list[str], top_n: int = 10) -> list[list[tuple]]:
    vectorizer = TfidfVectorizer(
        max_features=1000,
        stop_words="english",
        ngram_range=(1, 3),
    )
    tfidf_matrix = vectorizer.fit_transform(documents)
    feature_names = vectorizer.get_feature_names_out()
    results = []
    for i in range(tfidf_matrix.shape[0]):
        row = tfidf_matrix[i].toarray().flatten()
        top_indices = row.argsort()[-top_n:][::-1]
        keywords = [
            (feature_names[j], round(row[j], 4)) for j in top_indices if row[j] > 0
        ]
        results.append(keywords)
    return results

# Method 2: RAKE (Rapid Automatic Keyword Extraction)
# rake_nltk needs the NLTK "stopwords" and "punkt" data downloaded first
from rake_nltk import Rake

def rake_keywords(text: str, top_n: int = 10) -> list[tuple]:
    rake = Rake()
    rake.extract_keywords_from_text(text)
    return rake.get_ranked_phrases_with_scores()[:top_n]

# Compare all three
doc = "Your document text here..."

print("KeyBERT:")
for kw, score in kw_model.extract_keywords(doc, top_n=5):
    print(f"  {score:.4f} {kw}")

print("\nTF-IDF:")
for kw, score in tfidf_keywords([doc], top_n=5)[0]:
    print(f"  {score:.4f} {kw}")

print("\nRAKE:")
for score, kw in rake_keywords(doc, top_n=5):
    print(f"  {score:.2f} {kw}")
```
| Method | Pros | Cons | Best For |
|---|---|---|---|
| KeyBERT | Semantic understanding, handles synonyms | Slower, needs GPU for large batches | Single documents, quality over speed |
| TF-IDF | Fast, works well on document collections | No semantic understanding, needs a corpus | Large document sets, corpus-level analysis |
| RAKE | No training needed, very fast | Often noisy results, no ranking by relevance | Quick extraction, preprocessing step |
Guiding Extraction with Seed Keywords#
When you know the domain, guide KeyBERT with seed keywords to focus extraction on relevant terms:
```python
# Seed keywords steer extraction toward your domain
seed_keywords = ["machine learning", "neural network", "training", "model"]

keywords = kw_model.extract_keywords(
    doc,
    seed_keywords=seed_keywords,
    keyphrase_ngram_range=(1, 3),
    top_n=10,
)
```
Seed keywords bias the similarity scoring toward terms related to your seeds. This helps when documents cover multiple topics and you want keywords from a specific angle.
Common Errors and Fixes#
Keywords are too generic (“data”, “system”, “method”)
Increase the n-gram range to `(2, 3)` so KeyBERT extracts phrases instead of single words. Also add domain-specific stop words:
```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Extend the standard English list with domain-generic filler terms,
# rather than replacing it outright
custom_stops = list(ENGLISH_STOP_WORDS) + [
    "data", "system", "method", "approach", "based", "using", "paper", "proposed"
]
keywords = kw_model.extract_keywords(doc, stop_words=custom_stops)
```
Slow on long documents
KeyBERT embeds the full document and every candidate keyword. For documents over 5000 words, truncate or split into sections and extract keywords per section, then deduplicate.
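That split-and-merge approach can be sketched as two small helpers. The word-count chunk size and the commented `kw_model` usage are assumptions for illustration, not part of KeyBERT’s API:

```python
def chunk_words(text: str, max_words: int = 1000) -> list[str]:
    """Split a long document into word-count-bounded sections."""
    words = text.split()
    return [
        " ".join(words[i : i + max_words])
        for i in range(0, len(words), max_words)
    ]

def merge_keywords(
    per_section: list[list[tuple[str, float]]], top_n: int = 10
) -> list[tuple[str, float]]:
    """Deduplicate keywords across sections, keeping each one's best score."""
    best: dict[str, float] = {}
    for keywords in per_section:
        for kw, score in keywords:
            best[kw] = max(score, best.get(kw, 0.0))
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]

# Usage with KeyBERT (kw_model as defined earlier):
# sections = chunk_words(long_doc)
# per_section = [kw_model.extract_keywords(s, top_n=10) for s in sections]
# keywords = merge_keywords(per_section)
```

Scoring per section keeps each embedding call short, and keeping the best score per keyword means a phrase that dominates one section still surfaces in the merged ranking.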
Empty results on short text
KeyBERT needs enough text to generate meaningful candidates. For tweets or short messages, lower the n-gram range to `(1, 1)` and reduce `top_n`. Below 20 words, consider using entity extraction instead.
Different results on same document
Sentence-transformer inference is normally deterministic, but GPU kernels can introduce tiny floating-point differences that reorder near-tied keywords. Set a seed with `torch.manual_seed(42)` before extraction, or run the model on CPU for fully reproducible rankings.
KeyBERT misses domain-specific jargon
The default embedding model may not understand specialized terminology. Switch to a domain-specific model (SciBERT for science, BioBERT for medical, FinBERT for finance) or use OpenAI embeddings which have broader vocabulary coverage.