BERTopic is one of the best topic modeling libraries available today. It comfortably outperforms traditional LDA because it clusters transformer embeddings instead of bag-of-words counts, so you get coherent, human-readable topics without endless hyperparameter tuning.
Here’s the fastest path from raw text to meaningful topics:
That’s it. Three lines to fit a model that clusters your documents into semantically meaningful groups. The output gives you a DataFrame with topic IDs, counts, and representative words.
Installing BERTopic
Install the base package first, then add extras as needed:
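For example (the `flair` extra is one of several optional embedding backends):

```shell
pip install bertopic
# Optional extras pull in alternative embedding backends, e.g.:
pip install "bertopic[flair]"
```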
The base install pulls in sentence-transformers, hdbscan, umap-learn, and scikit-learn. If you hit issues with HDBSCAN on Apple Silicon or older Linux, install it separately first:
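One common workaround, sketched here, is to get a prebuilt hdbscan before pulling in BERTopic:

```shell
# conda-forge ships prebuilt hdbscan binaries, which sidesteps compilation:
conda install -c conda-forge hdbscan
# or build it on its own with pip first:
pip install hdbscan
pip install bertopic
```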
Use Python 3.9 or newer; I’d recommend 3.11 for its speed improvements.
Inspecting Topics
Once the model is trained, you want to dig into what it found. The API makes this straightforward:
Topic -1 is the outlier bucket. BERTopic puts documents there when they don’t fit cleanly into any cluster. A large outlier topic usually means your data is noisy or you need to tune HDBSCAN’s min_cluster_size.
Visualizing Topic Clusters
The built-in visualizations are genuinely useful, not just eye candy. The intertopic distance map is where you should start:
The hierarchy visualization is particularly valuable when you’re deciding how many topics to keep. It shows you which topics are semantically close and could be merged.
Using Custom Embedding Models
The default all-MiniLM-L6-v2 model works fine for English text, but you should swap it out for domain-specific work. This is where BERTopic really shines over LDA – you can plug in any embedding model.
For multilingual corpora, use paraphrase-multilingual-MiniLM-L12-v2. For scientific text, allenai/specter2 works better than general-purpose models. You can also pass pre-computed embeddings directly, which saves time if you’re iterating on the clustering parameters:
Pre-computing embeddings is the single biggest time saver when you’re experimenting. Encoding 100k documents takes minutes; clustering takes seconds.
Reducing Topics
BERTopic often generates too many fine-grained topics. You have two options to consolidate them.
Set the target number upfront:
Or reduce after fitting, which gives you more control:
I prefer the post-fit approach. You can inspect the full topic set first, then merge strategically. Setting nr_topics="auto" uses HDBSCAN’s built-in method to find a reasonable number, but it tends to be conservative.
Dynamic Topic Modeling Over Time
If your documents have timestamps, you can track how topics evolve. This is one of BERTopic’s killer features:
The nr_bins parameter controls temporal granularity. Set it too high and you get noise. Too low and you miss trends. Start with monthly bins and adjust from there.
Tuning HDBSCAN and UMAP
The defaults work for most cases, but when they don’t, tune UMAP first, then HDBSCAN:
The most impactful parameter is min_cluster_size. Set it to roughly the smallest topic size you care about. If you have 10k documents and want broad themes, try 100-200. For fine-grained subtopics, try 20-50.
Common Errors
ValueError: n_samples=X should be >= n_clusters=Y
Your dataset is too small for the default UMAP/HDBSCAN parameters. Lower min_cluster_size to 5-10, or reduce n_neighbors in UMAP:
ModuleNotFoundError: No module named 'hdbscan'
HDBSCAN can fail to install on some systems. Fix it with:
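One sequence that usually works:

```shell
# Make sure the build toolchain is current, then install hdbscan on its own:
pip install --upgrade pip setuptools wheel
pip install hdbscan
# If compilation still fails, conda-forge ships prebuilt binaries:
conda install -c conda-forge hdbscan
```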
All documents assigned to topic -1 (outliers)
This means HDBSCAN can’t find dense clusters. Your embeddings might be too uniform, or min_cluster_size is too high. Try lowering it, or switch to a different embedding model that creates more separation in your domain.
MemoryError on large datasets
Pre-compute embeddings in batches and pass them in. Also keep UMAP's n_components at 5 (the default) rather than raising it, and consider fitting on a sample of your data, then using transform() on the rest:
TypeError: cannot unpack non-iterable NoneType object
You’re calling fit_transform on an empty or near-empty list. Check that your document list has actual content and isn’t full of empty strings. Filter them out before fitting.
Related Guides
- How to Extract Keywords and Key Phrases from Text with KeyBERT
- How to Build a RAG Pipeline with Hugging Face Transformers v5
- How to Build a Text Entailment and Contradiction Detection Pipeline
- How to Build an Extractive Question Answering System with Transformers
- How to Build a Text Summarization Pipeline with Sumy and Transformers
- How to Parse Document Layouts with LayoutLM and Transformers
- How to Build a Sentiment-Aware Search Pipeline with Embeddings
- How to Build an Abstractive Summarization Pipeline with PEGASUS
- How to Build an Aspect-Based Sentiment Analysis Pipeline
- How to Build a Text Chunking and Splitting Pipeline for RAG