Text clustering finds structure in unlabeled data. You feed in documents, and the pipeline groups them by semantic similarity – no predefined categories, no training labels. The combination of sentence-transformers for embeddings, UMAP for dimensionality reduction, and HDBSCAN for clustering is the strongest general-purpose approach available right now.
Here is the full pipeline in one shot:
That gives you cluster assignments for every document. Cluster -1 means HDBSCAN classified that document as noise – it did not fit cleanly into any group. The rest get integer labels starting from 0.
Install Dependencies
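The core stack, by PyPI package name (note that the UMAP package is published as `umap-learn`, not `umap`):

```shell
pip install sentence-transformers umap-learn hdbscan scikit-learn matplotlib
```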
A note on hdbscan: if the install fails with compilation errors, try:
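Two common workarounds – upgrade the build tooling first, or, if you use conda, pull a prebuilt binary from conda-forge and skip compilation entirely:

```shell
pip install --upgrade pip setuptools wheel
pip install hdbscan

# Or, with conda:
conda install -c conda-forge hdbscan
```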
Use Python 3.9 or newer. I recommend 3.11+ for better performance across the board.
Why HDBSCAN Beats K-Means for Text
K-Means forces you to pick the number of clusters upfront. That is already a problem – how do you know there are exactly 7 topics in your corpus? But it gets worse. K-Means assumes clusters are spherical and roughly equal-sized. Text data almost never looks like that. You get one massive cluster eating everything and several tiny ones getting split apart.
HDBSCAN solves both problems:
- No predefined cluster count. It discovers the number of clusters from the data density.
- Handles irregular shapes. It finds clusters of varying density and arbitrary geometry.
- Noise detection. Documents that do not belong anywhere get labeled `-1` instead of being forced into the nearest cluster.
- Soft clustering. You can get probability scores for cluster membership, not just hard assignments.
The tradeoff is speed. HDBSCAN is slower than K-Means on very large datasets. For anything under 500K documents, you will not notice. Above that, consider batching or subsampling for the initial fit.
Dimensionality Reduction with UMAP
Sentence embeddings from all-MiniLM-L6-v2 are 384-dimensional. Clustering directly in that space works poorly because of the curse of dimensionality – distances between points become less meaningful as dimensions grow. UMAP projects those 384 dimensions down to a lower-dimensional space while preserving the local neighborhood structure.
The key parameters:
- `n_components`: Target dimensions. Use 5-10 for clustering (not 2 – that is only good for visualization). More components preserve more information but slow down clustering.
- `n_neighbors`: Controls how UMAP balances local versus global structure. Lower values (5-10) emphasize local clusters. Higher values (30-50) capture broader patterns. Start with 15 and adjust.
- `min_dist`: Set to `0.0` for clustering. This lets UMAP pack similar points tightly together, which is exactly what HDBSCAN needs.
- `metric`: Use `"cosine"` for sentence embeddings. These vectors have meaning in their direction, not their magnitude.
Always fit UMAP with `random_state` set if you want reproducible results. UMAP is stochastic by default.
Extracting Cluster Labels Automatically
Once you have cluster assignments, you want human-readable labels for each group. The simplest approach is TF-IDF over cluster members. For each cluster, concatenate its documents and find the most distinctive terms compared to other clusters.
This prints the most distinctive terms for each cluster ID – a quick, readable topic label for every group.
The TF-IDF approach works well for topic extraction because it surfaces terms that are frequent within a cluster but rare across other clusters. If you want richer labels, look at KeyBERT or even pass the top terms through an LLM to generate a human-readable summary.
Visualizing Clusters
A 2D scatter plot is the fastest way to sanity-check your clusters. Reduce to 2 dimensions with UMAP separately from your clustering reduction – you want 5+ dimensions for good clustering, but 2 for visualization.
If your clusters overlap heavily in the 2D projection, that does not necessarily mean the clustering is bad. UMAP projections to 2D lose information. Check the cluster contents directly before changing parameters.
Common Errors and Fixes
ModuleNotFoundError: No module named 'hdbscan'
HDBSCAN has C dependencies that can fail silently. Reinstall with build isolation disabled:
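The usual incantation – a clean uninstall, then a reinstall with pip's build isolation and cache turned off (adjust flags to your environment):

```shell
pip uninstall -y hdbscan
pip install hdbscan --no-build-isolation --no-cache-dir
```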
On Apple Silicon Macs, you may also need:
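A common prerequisite is the Xcode command-line tools, which provide the C compiler the build step expects:

```shell
xcode-select --install
```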
Everything is assigned to cluster -1 (all noise)
HDBSCAN cannot find any dense regions. Three likely causes:
- `min_cluster_size` is too high. Lower it. For small datasets (under 100 documents), try `min_cluster_size=3`.
- Your embeddings do not have enough separation. Try a larger model like `all-mpnet-base-v2` or a domain-specific one.
- You skipped UMAP. Clustering raw 384-dimensional embeddings is much harder. Always reduce dimensionality first.
ValueError: n_components must be less than or equal to the number of features
You set `n_components` in UMAP higher than the number of input features or samples. If you have only 20 documents, you cannot reduce to 50 dimensions. Set `n_components` to at most `min(n_samples, n_features) - 1`.
UMAP is extremely slow on large datasets
UMAP’s default nearest-neighbor algorithm scales poorly past ~100K samples. Install pynndescent for faster approximate neighbors:
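The install is a one-liner:

```shell
pip install pynndescent
```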
UMAP will detect and use it automatically. For datasets over 500K documents, consider fitting UMAP on a random subset and then using .transform() on the rest.
TypeError: A sparse matrix was passed, but dense data is required
You passed a sparse matrix (from scikit-learn’s vectorizers) to UMAP or HDBSCAN. Convert it first with .toarray() or use sentence-transformers which returns dense arrays by default.
Clusters change every run
UMAP is stochastic by default, so set `random_state=42` (or any fixed integer) for reproducibility. HDBSCAN itself is deterministic given the same input, so fixing UMAP's seed is usually enough.
Related Guides
- How to Build a Text Embedding Pipeline with Sentence Transformers and FAISS
- How to Build a Text Similarity API with Cross-Encoders
- How to Build a Multilingual NLP Pipeline with Sentence Transformers
- How to Build a Document Chunking Strategy Comparison Pipeline
- How to Build a Sentiment-Aware Search Pipeline with Embeddings
- How to Build a Hybrid Keyword and Semantic Search Pipeline
- How to Build a Semantic Search Engine with Embeddings
- How to Build a Named Entity Recognition Pipeline with spaCy and Transformers
- How to Build a Text Paraphrase Pipeline with T5 and PEGASUS
- How to Build a Text Classification Pipeline with SetFit