Text clustering finds structure in unlabeled data. You feed in documents, and the pipeline groups them by semantic similarity – no predefined categories, no training labels. The combination of sentence-transformers for embeddings, UMAP for dimensionality reduction, and HDBSCAN for clustering is the strongest general-purpose approach available right now.
Here is the full pipeline in one shot:
That gives you cluster assignments for every document. Cluster -1 means HDBSCAN classified that document as noise – it did not fit cleanly into any group. The rest get integer labels starting from 0.
Install Dependencies
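The core stack, by PyPI package name (note that the UMAP package is published as `umap-learn`, not `umap`):

```shell
pip install sentence-transformers umap-learn hdbscan scikit-learn matplotlib
```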
A note on hdbscan: if the install fails with compilation errors, try:
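Two common workarounds – upgrade the build tooling first, or, if you use conda, pull a prebuilt binary from conda-forge and skip compilation entirely:

```shell
pip install --upgrade pip setuptools wheel
pip install hdbscan

# Or, with conda:
conda install -c conda-forge hdbscan
```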
Use Python 3.9 or newer. I recommend 3.11+ for better performance across the board.
Why HDBSCAN Beats K-Means for Text
K-Means forces you to pick the number of clusters upfront. That is already a problem – how do you know there are exactly 7 topics in your corpus? But it gets worse. K-Means assumes clusters are spherical and roughly equal-sized. Text data almost never looks like that. You get one massive cluster eating everything and several tiny ones getting split apart.
HDBSCAN solves both problems:
- No predefined cluster count. It discovers the number of clusters from the data density.
- Handles irregular shapes. It finds clusters of varying density and arbitrary geometry.
- Noise detection. Documents that do not belong anywhere get labeled `-1` instead of being forced into the nearest cluster.
- Soft clustering. You can get probability scores for cluster membership, not just hard assignments.
The tradeoff is speed. HDBSCAN is slower than K-Means on very large datasets. For anything under 500K documents, you will not notice. Above that, consider batching or subsampling for the initial fit.
Dimensionality Reduction with UMAP
Sentence embeddings from all-MiniLM-L6-v2 are 384-dimensional. Clustering directly in that space works poorly because of the curse of dimensionality – distances between points become less meaningful as dimensions grow. UMAP projects those 384 dimensions down to a lower-dimensional space while preserving the local neighborhood structure.
The key parameters:
- `n_components`: Target dimensions. Use 5-10 for clustering (not 2 – that is only good for visualization). More components preserve more information but slow down clustering.
- `n_neighbors`: Controls how UMAP balances local versus global structure. Lower values (5-10) emphasize local clusters. Higher values (30-50) capture broader patterns. Start with 15 and adjust.
- `min_dist`: Set to `0.0` for clustering. This lets UMAP pack similar points tightly together, which is exactly what HDBSCAN needs.
- `metric`: Use `"cosine"` for sentence embeddings. These vectors have meaning in their direction, not their magnitude.
Always fit UMAP with `random_state` set if you want reproducible results. UMAP is stochastic by default.
Extracting Cluster Labels Automatically
Once you have cluster assignments, you want human-readable labels for each group. The simplest approach is TF-IDF over cluster members. For each cluster, concatenate its documents and find the most distinctive terms compared to other clusters.
This prints the most distinctive terms for each cluster ID – a quick, readable topic label for every group.
The TF-IDF approach works well for topic extraction because it surfaces terms that are frequent within a cluster but rare across other clusters. If you want richer labels, look at KeyBERT or even pass the top terms through an LLM to generate a human-readable summary.
Visualizing Clusters
A 2D scatter plot is the fastest way to sanity-check your clusters. Reduce to 2 dimensions with UMAP separately from your clustering reduction – you want 5+ dimensions for good clustering, but 2 for visualization.
If your clusters overlap heavily in the 2D projection, that does not necessarily mean the clustering is bad. UMAP projections to 2D lose information. Check the cluster contents directly before changing parameters.
Common Errors and Fixes
ModuleNotFoundError: No module named 'hdbscan'
HDBSCAN has C dependencies that can fail silently. Reinstall with build isolation disabled:
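The usual incantation – a clean uninstall, then a reinstall with pip's build isolation and cache turned off (adjust flags to your environment):

```shell
pip uninstall -y hdbscan
pip install hdbscan --no-build-isolation --no-cache-dir
```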
On Apple Silicon Macs, you may also need:
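A common prerequisite is the Xcode command-line tools, which provide the C compiler the build step expects:

```shell
xcode-select --install
```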
Everything is assigned to cluster -1 (all noise)
HDBSCAN cannot find any dense regions. Three likely causes:
- `min_cluster_size` is too high. Lower it. For small datasets (under 100 documents), try `min_cluster_size=3`.
- Your embeddings do not have enough separation. Try a larger model like `all-mpnet-base-v2` or a domain-specific one.
- You skipped UMAP. Clustering raw 384-dimensional embeddings is much harder. Always reduce dimensionality first.
ValueError: n_components must be less than or equal to the number of features
You set `n_components` in UMAP higher than the number of input features or samples. If you have only 20 documents, you cannot reduce to 50 dimensions. Set `n_components` to at most `min(n_samples, n_features) - 1`.
UMAP is extremely slow on large datasets
UMAP’s default nearest-neighbor algorithm scales poorly past ~100K samples. Install pynndescent for faster approximate neighbors:
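The install is a one-liner:

```shell
pip install pynndescent
```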
UMAP will detect and use it automatically. For datasets over 500K documents, consider fitting UMAP on a random subset and then using .transform() on the rest.
TypeError: A sparse matrix was passed, but dense data is required
You passed a sparse matrix (from scikit-learn’s vectorizers) to UMAP or HDBSCAN. Convert it first with .toarray() or use sentence-transformers which returns dense arrays by default.
Clusters change every run
UMAP is stochastic by default, so set `random_state=42` (or any fixed integer) for reproducibility. HDBSCAN itself is deterministic given the same input, so fixing UMAP's seed is usually enough.
Related Guides
- How to Build a Text Embedding Pipeline with Sentence Transformers and FAISS
- How to Build a Text Similarity API with Cross-Encoders
- How to Build a Multilingual NLP Pipeline with Sentence Transformers
- How to Build a Document Chunking Strategy Comparison Pipeline
- How to Build a Sentiment-Aware Search Pipeline with Embeddings
- How to Build a Hybrid Keyword and Semantic Search Pipeline
- How to Build a Semantic Search Engine with Embeddings
- How to Build a Named Entity Recognition Pipeline with spaCy and Transformers
- How to Build a Text Paraphrase Pipeline with T5 and PEGASUS
- How to Build a Text Classification Pipeline with SetFit