BERTopic is the best topic modeling library available right now. It blows traditional LDA out of the water because it uses transformer embeddings instead of bag-of-words representations. You get coherent, human-readable topics without endless hyperparameter tuning.

Here’s the fastest path from raw text to meaningful topics:

from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)

print(topic_model.get_topic_info())

That’s it: instantiate, fit, inspect. Three lines cluster your documents into semantically meaningful groups, and get_topic_info() returns a DataFrame with topic IDs, counts, and representative words.

Installing BERTopic

Install the base package first, then add extras as needed:

pip install bertopic
pip install "bertopic[visualization]"
pip install "bertopic[flair]"

The base install pulls in sentence-transformers, hdbscan, umap-learn, and scikit-learn. If you hit issues with HDBSCAN on Apple Silicon or older Linux, install it separately first:


Use Python 3.9 or newer; I’d recommend 3.11 for its speed improvements.

Inspecting Topics

Once the model is trained, you want to dig into what it found. The API makes this straightforward:

# Get the top 10 most frequent topics
topic_model.get_topic_info().head(10)

# Look at the words defining topic 0
topic_model.get_topic(0)

# Find which documents belong to a specific topic
topic_model.get_document_info(docs).query("Topic == 3").head()

# Get representative docs for a topic
topic_model.get_representative_docs(0)

Topic -1 is the outlier bucket. BERTopic puts documents there when they don’t fit cleanly into any cluster. A large outlier topic usually means your data is noisy or you need to tune HDBSCAN’s min_cluster_size.
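Before retuning anything, it helps to quantify how big that outlier bucket actually is. A minimal stdlib sketch (the `outlier_share` helper is my own, not part of the BERTopic API):

```python
def outlier_share(topics):
    """Fraction of documents assigned to the outlier topic (-1)."""
    if not topics:
        return 0.0
    return sum(1 for t in topics if t == -1) / len(topics)

# topics as returned by fit_transform
print(outlier_share([0, 1, -1, -1, 2, 0]))  # 2 of 6 docs -> ~0.33
```

If the share is large, lowering min_cluster_size is one option; recent BERTopic versions also offer topic_model.reduce_outliers(docs, topics) to reassign outliers after the fact.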

Visualizing Topic Clusters

The built-in visualizations are genuinely useful, not just eye candy. The intertopic distance map is where you should start:

# Interactive intertopic distance map
fig = topic_model.visualize_topics()
fig.show()

# Bar chart of top words per topic
fig = topic_model.visualize_barchart(top_n_topics=10)
fig.write_html("topic_barchart.html")

# Hierarchical clustering of topics
fig = topic_model.visualize_hierarchy()
fig.show()

# Heatmap showing topic similarity
fig = topic_model.visualize_heatmap()
fig.show()

The hierarchy visualization is particularly valuable when you’re deciding how many topics to keep. It shows you which topics are semantically close and could be merged.
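Conceptually, merging is just remapping topic IDs so that siblings in the hierarchy collapse into one; BERTopic exposes this as topic_model.merge_topics(docs, topics_to_merge=[[1, 3]]). A stdlib sketch of the remapping (the helper name is mine):

```python
def merge_topic_ids(topics, groups):
    """Collapse every id in a group to the group's first id."""
    remap = {t: group[0] for group in groups for t in group}
    return [remap.get(t, t) for t in topics]

# merge topic 3 into topic 1
print(merge_topic_ids([0, 1, 2, 3, 1], [[1, 3]]))  # [0, 1, 2, 1, 1]
```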

Using Custom Embedding Models

The default all-MiniLM-L6-v2 model works fine for English text, but you should swap it out for domain-specific work. This is where BERTopic really shines over LDA – you can plug in any embedding model.

from sentence_transformers import SentenceTransformer
from bertopic import BERTopic

# Use a larger, more accurate model
embedding_model = SentenceTransformer("all-mpnet-base-v2")
topic_model = BERTopic(embedding_model=embedding_model)
topics, probs = topic_model.fit_transform(docs)

For multilingual corpora, use paraphrase-multilingual-MiniLM-L12-v2. For scientific text, allenai/specter2 works better than general-purpose models. You can also pass pre-computed embeddings directly, which saves time if you’re iterating on the clustering parameters:

from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer("all-mpnet-base-v2")
embeddings = embedding_model.encode(docs, show_progress_bar=True)

# Now iterate on parameters without re-encoding
topic_model = BERTopic(nr_topics=20)
topics, probs = topic_model.fit_transform(docs, embeddings=embeddings)

Pre-computing embeddings is the single biggest time saver when you’re experimenting. Encoding 100k documents takes minutes; clustering takes seconds.
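Since encoding dominates runtime, it is also worth caching embeddings to disk between sessions. A sketch with NumPy (the file name is arbitrary, and the random array stands in for embedding_model.encode(docs)):

```python
import os
import numpy as np

cache = "embeddings.npy"
if os.path.exists(cache):
    embeddings = np.load(cache)
else:
    # stand-in for embedding_model.encode(docs); 768 dims matches all-mpnet-base-v2
    embeddings = np.random.rand(1000, 768).astype(np.float32)
    np.save(cache, embeddings)
```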

Reducing Topics

BERTopic often generates too many fine-grained topics. You have two options to consolidate them.

Set the target number upfront:

topic_model = BERTopic(nr_topics=15)
topics, probs = topic_model.fit_transform(docs)

Or reduce after fitting, which gives you more control:

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)

# Reduce to 20 topics by merging similar ones
topic_model.reduce_topics(docs, nr_topics=20)
print(topic_model.get_topic_info())

I prefer the post-fit approach. You can inspect the full topic set first, then merge strategically. Setting nr_topics="auto" uses HDBSCAN’s built-in method to find a reasonable number, but it tends to be conservative.

Dynamic Topic Modeling Over Time

If your documents have timestamps, you can track how topics evolve. This is one of BERTopic’s killer features:

from bertopic import BERTopic

# Assume docs and timestamps are aligned lists
timestamps = ["2024-01", "2024-01", "2024-02", "2024-03", ...]  # one per doc

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)

# Calculate topic evolution over time
topics_over_time = topic_model.topics_over_time(docs, timestamps, nr_bins=12)

# Visualize it
fig = topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=10)
fig.show()

The nr_bins parameter controls temporal granularity. Set it too high and you get noise. Too low and you miss trends. Start with monthly bins and adjust from there.
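One way to pick a starting nr_bins is one bin per month of the corpus’s date range. A stdlib sketch (variable names are mine):

```python
from datetime import datetime

timestamps = ["2024-01", "2024-03", "2024-06", "2024-12"]
dates = [datetime.strptime(t, "%Y-%m") for t in timestamps]
# months spanned by the corpus
months = (max(dates).year - min(dates).year) * 12 + (max(dates).month - min(dates).month)
nr_bins = max(months, 1)  # one bin per month as a starting point
print(nr_bins)  # 11
```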

Tuning HDBSCAN and UMAP

The defaults work for most cases, but when they don’t, tune UMAP first, then HDBSCAN:

from umap import UMAP
from hdbscan import HDBSCAN
from bertopic import BERTopic

umap_model = UMAP(
    n_neighbors=15,       # higher = more global structure, lower = more local detail
    n_components=5,        # keep at 5 for clustering, drop to 2 only for viz
    min_dist=0.0,          # 0.0 is best for clustering
    metric="cosine"        # cosine works well with sentence embeddings
)

hdbscan_model = HDBSCAN(
    min_cluster_size=50,   # minimum docs per topic, raise to get fewer broader topics
    min_samples=10,        # higher = more conservative clustering
    prediction_data=True   # needed for soft clustering
)

topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
topics, probs = topic_model.fit_transform(docs)

The most impactful parameter is min_cluster_size. Set it to roughly the smallest topic size you care about. If you have 10k documents and want broad themes, try 100-200. For fine-grained subtopics, try 20-50.
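That rule of thumb can be written down as a rough heuristic. The numbers here are entirely my own, chosen only to land inside the ranges above:

```python
def suggest_min_cluster_size(n_docs, granularity="broad"):
    """Rough starting point for HDBSCAN's min_cluster_size."""
    if granularity == "broad":
        return max(n_docs // 75, 10)   # ~133 for 10k docs, inside 100-200
    return max(n_docs // 300, 5)       # ~33 for 10k docs, inside 20-50

print(suggest_min_cluster_size(10_000))          # 133
print(suggest_min_cluster_size(10_000, "fine"))  # 33
```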

Common Errors

ValueError: n_samples=X should be >= n_clusters=Y

Your dataset is too small for the default UMAP/HDBSCAN parameters. Lower min_cluster_size to 5-10, or reduce n_neighbors in UMAP:

from umap import UMAP
from hdbscan import HDBSCAN
from bertopic import BERTopic

topic_model = BERTopic(
    umap_model=UMAP(n_neighbors=5, n_components=3, min_dist=0.0),
    hdbscan_model=HDBSCAN(min_cluster_size=5)
)

ModuleNotFoundError: No module named 'hdbscan'

HDBSCAN can fail to install on some systems. Fix it with:

pip install hdbscan --no-build-isolation

All documents assigned to topic -1 (outliers)

This means HDBSCAN can’t find dense clusters. Your embeddings might be too uniform, or min_cluster_size is too high. Try lowering it, or switch to a different embedding model that creates more separation in your domain.

MemoryError on large datasets

Pre-compute embeddings in batches and pass them in. Also reduce UMAP’s n_components to 5 (the default) and consider sampling your data for the initial fit, then using transform() on the rest:

# Fit on a sample
topic_model.fit(docs[:50000], embeddings=embeddings[:50000])

# Transform the rest
new_topics, new_probs = topic_model.transform(docs[50000:], embeddings=embeddings[50000:])
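If even a single encode() call is too large, you can chunk the encoding yourself. A sketch (encode_in_batches is a hypothetical helper of mine; note that SentenceTransformer’s encode() already takes a batch_size argument, but this pattern lets you persist chunks as you go):

```python
import numpy as np

def encode_in_batches(encode, docs, batch_size=5000):
    """Encode docs chunk by chunk; persist each chunk here if memory is tight."""
    parts = [encode(docs[i:i + batch_size]) for i in range(0, len(docs), batch_size)]
    return np.vstack(parts)

# stand-in encoder; with sentence-transformers: encode_in_batches(model.encode, docs)
fake_encode = lambda batch: np.ones((len(batch), 4), dtype=np.float32)
print(encode_in_batches(fake_encode, ["doc"] * 12, batch_size=5).shape)  # (12, 4)
```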

TypeError: cannot unpack non-iterable NoneType object

You’re calling fit_transform on an empty or near-empty list. Check that your document list has actual content and isn’t full of empty strings. Filter them out before fitting.