The Quick Answer: Hash First, Embed Second

Text deduplication breaks down into three tiers. Exact duplicates are cheap – hash the text and compare. Near-duplicates (rephrased, reordered, minor edits) need MinHash with Locality-Sensitive Hashing. Semantic duplicates (same meaning, different words) require embeddings and cosine similarity. Stack all three for a pipeline that catches everything without burning GPU hours on trivial matches.

Here is the full tiered pipeline:

import hashlib
from datasketch import MinHash, MinHashLSH
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

documents = [
    "The cat sat on the mat.",
    "The cat sat on the mat.",               # exact duplicate
    "A cat was sitting on the mat.",          # near-duplicate
    "A feline rested upon the rug.",          # semantic duplicate
    "Python is a programming language.",      # unrelated
    "Python is a popular coding language.",   # near-duplicate of above
]

# --- Tier 1: Exact duplicate removal via hashing ---
seen_hashes = set()
unique_docs = []
for doc in documents:
    h = hashlib.sha256(doc.strip().lower().encode("utf-8")).hexdigest()
    if h not in seen_hashes:
        seen_hashes.add(h)
        unique_docs.append(doc)

print(f"After exact dedup: {len(unique_docs)} (was {len(documents)})")

# --- Tier 2: Near-duplicate detection with MinHash LSH ---
lsh = MinHashLSH(threshold=0.5, num_perm=128)
minhash_map = {}

for i, doc in enumerate(unique_docs):
    mh = MinHash(num_perm=128)
    for token in doc.lower().split():
        mh.update(token.encode("utf-8"))
    minhash_map[i] = mh
    lsh.insert(str(i), mh)

near_dup_groups = []
visited = set()
for i, mh in minhash_map.items():
    if i in visited:
        continue
    neighbors = [int(x) for x in lsh.query(mh) if int(x) != i]
    if neighbors:
        group = [i] + neighbors
        near_dup_groups.append([unique_docs[j] for j in group])
        visited.update(neighbors)

print(f"Near-duplicate groups: {near_dup_groups}")

# --- Tier 3: Semantic similarity with embeddings ---
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(unique_docs)
sim_matrix = cosine_similarity(embeddings)

semantic_threshold = 0.85
for i in range(len(unique_docs)):
    for j in range(i + 1, len(unique_docs)):
        if sim_matrix[i][j] > semantic_threshold:
            print(f"Semantic match ({sim_matrix[i][j]:.3f}): "
                  f"'{unique_docs[i][:50]}' <-> '{unique_docs[j][:50]}'")

Install the dependencies:

pip install datasketch sentence-transformers scikit-learn hdbscan

Why Hashing Comes First

Computing embeddings for millions of documents is expensive. SHA-256 hashing is essentially free by comparison. In web scrapes and user-generated content, 5-20% of documents are often byte-for-byte identical. Stripping those out before touching MinHash or a transformer model saves real time and money.

Normalize before hashing. At minimum, strip whitespace and lowercase the text. For stricter matching, remove punctuation and collapse multiple spaces:

import re

def normalize(text: str) -> str:
    text = text.lower().strip()
    text = re.sub(r"[^\w\s]", "", text)
    text = re.sub(r"\s+", " ", text)
    return text

This catches duplicates that differ only in casing, trailing spaces, or stray punctuation.
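To see the normalization pay off, hash two strings that differ only in casing, punctuation, and spacing — a small sketch using the normalize helper above:

```python
import hashlib
import re

def normalize(text: str) -> str:
    text = text.lower().strip()
    text = re.sub(r"[^\w\s]", "", text)
    text = re.sub(r"\s+", " ", text)
    return text

a = "Hello,  World! "
b = "hello world"

ha = hashlib.sha256(normalize(a).encode("utf-8")).hexdigest()
hb = hashlib.sha256(normalize(b).encode("utf-8")).hexdigest()
print(ha == hb)  # identical hashes after normalization
```

Without the normalize step, these two strings would produce different SHA-256 digests and slip past Tier 1.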

Near-Duplicate Detection with MinHash LSH

MinHash approximates the Jaccard similarity between two sets of tokens. LSH (Locality-Sensitive Hashing) makes the lookup sublinear – you don’t compare every document against every other document. The datasketch library handles both.

The two parameters that matter:

  • threshold – the Jaccard similarity cutoff. 0.5 is aggressive (catches loosely similar text). 0.8 is conservative (only near-identical). Start at 0.7 for general dedup and tune from there.
  • num_perm – the number of permutation functions. More permutations mean better approximation but more memory. 128 is a solid default. Go to 256 if you need higher precision on a large corpus.
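Under the hood, each permutation keeps the minimum hash of the token set, and the fraction of positions where two signatures agree estimates the Jaccard similarity. A from-scratch sketch (not the datasketch implementation — just the mechanics) makes this concrete:

```python
import hashlib

def minhash_signature(tokens: set, num_perm: int = 128) -> list:
    """One min-hash per seeded hash function, simulating permutations."""
    sig = []
    for seed in range(num_perm):
        sig.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{t}".encode()).digest()[:8], "big")
            for t in tokens
        ))
    return sig

def estimate_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of signature positions that match."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

a = set("the cat sat on the mat".split())
b = set("the cat sat on a mat".split())
true_j = len(a & b) / len(a | b)  # 5/6, about 0.83

est = estimate_jaccard(minhash_signature(a), minhash_signature(b))
print(f"true={true_j:.3f} estimate={est:.3f}")
```

With 128 permutations the estimate typically lands within a few hundredths of the true value, which is why 128 is a reasonable default.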

For documents longer than a sentence, shingle the text instead of splitting on whitespace. Character n-grams (shingles) catch reordering and minor edits better than word tokens:

def shingle(text: str, k: int = 5) -> set:
    """Generate character k-shingles from text."""
    text = text.lower()
    return {text[i:i+k] for i in range(len(text) - k + 1)}

mh = MinHash(num_perm=128)
for s in shingle("A cat was sitting on the mat.", k=5):
    mh.update(s.encode("utf-8"))

Character 5-shingles work well for English text. For short strings (tweets, titles), drop to 3-shingles.

Semantic Deduplication with Sentence-Transformers

MinHash misses semantic duplicates – “The server crashed” and “The machine went down” share almost no tokens but mean the same thing. This is where embeddings come in.

all-MiniLM-L6-v2 is fast and good enough for dedup. If you need higher accuracy and can afford the compute, use all-mpnet-base-v2. They produce 384-dimensional and 768-dimensional vectors, respectively.
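Whichever model you pick, the dedup decision itself reduces to cosine similarity between output vectors. A minimal numpy sketch, using toy 4-d vectors as stand-ins for real 384-d embeddings:

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy vectors standing in for embeddings of three sentences
crash = np.array([0.9, 0.1, 0.8, 0.0])   # "The server crashed"
down  = np.array([0.8, 0.2, 0.7, 0.1])   # "The machine went down"
snake = np.array([0.0, 0.9, 0.1, 0.8])   # something unrelated

print(f"crash vs down:  {cosine(crash, down):.3f}")   # high: likely duplicates
print(f"crash vs snake: {cosine(crash, snake):.3f}")  # low: unrelated
```

A pair scoring above your threshold (0.85 in the pipeline above) gets flagged; everything else passes through.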

For large corpora, computing the full pairwise cosine similarity matrix does not scale. Use FAISS for approximate nearest neighbor search instead:

import faiss

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(unique_docs, show_progress_bar=True, batch_size=256)
embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

dimension = embeddings.shape[1]
index = faiss.IndexFlatIP(dimension)  # inner product = cosine sim on normalized vectors
index.add(embeddings.astype("float32"))

# Find top-5 nearest neighbors for each document
k = 5
distances, indices = index.search(embeddings.astype("float32"), k)

# Extract duplicate pairs above threshold
semantic_pairs = []
for i in range(len(unique_docs)):
    for rank in range(1, k):  # skip self-match at rank 0
        if distances[i][rank] > 0.85:
            j = indices[i][rank]
            if i < j:  # avoid reporting both (i,j) and (j,i)
                semantic_pairs.append((i, j, distances[i][rank]))

for i, j, score in semantic_pairs:
    print(f"({score:.3f}) {unique_docs[i][:60]} <-> {unique_docs[j][:60]}")

FAISS with IndexFlatIP is exact search. For millions of documents, switch to IndexIVFFlat or IndexHNSWFlat for approximate search that is 10-100x faster with minimal accuracy loss.

Clustering Similar Texts with HDBSCAN

Sometimes you don’t want to just find pairs – you want to group all similar documents into clusters. HDBSCAN handles this well because it doesn’t require you to pick the number of clusters upfront, and it labels outliers as noise rather than forcing them into a cluster.

import hdbscan

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(unique_docs)

clusterer = hdbscan.HDBSCAN(
    min_cluster_size=2,
    min_samples=1,
    metric="euclidean",
    cluster_selection_method="eom",
)
labels = clusterer.fit_predict(embeddings)

# Group documents by cluster
from collections import defaultdict
clusters = defaultdict(list)
for idx, label in enumerate(labels):
    clusters[label].append(unique_docs[idx])

for label, docs in sorted(clusters.items()):
    status = "NOISE" if label == -1 else f"Cluster {label}"
    print(f"\n{status}:")
    for doc in docs:
        print(f"  - {doc[:80]}")

Set min_cluster_size=2 for dedup – you want even pairs of duplicates to form a cluster. The eom (Excess of Mass) selection method tends to produce more fine-grained clusters, which is what you want for dedup rather than broad topic grouping.
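To turn the clusters into a deduplicated corpus, keep one representative per cluster plus every noise point. A small sketch that works on any labels array shaped like HDBSCAN's fit_predict output (the docs and labels here are illustrative):

```python
def dedupe_from_labels(docs: list, labels: list) -> list:
    """Keep the first document of each cluster; keep all noise (-1) docs."""
    kept, seen_clusters = [], set()
    for doc, label in zip(docs, labels):
        if label == -1:               # noise: no duplicate found, always keep
            kept.append(doc)
        elif label not in seen_clusters:
            seen_clusters.add(label)  # first member becomes the representative
            kept.append(doc)
    return kept

docs = ["cat on mat", "feline on rug", "python intro", "python guide", "unrelated"]
labels = [0, 0, 1, 1, -1]  # shaped like hdbscan fit_predict output
print(dedupe_from_labels(docs, labels))
# → ['cat on mat', 'python intro', 'unrelated']
```

"First member wins" is the simplest selection rule; in production you might instead keep the longest document or the one closest to the cluster centroid.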

Scaling to Millions of Documents

The tiered approach is not just about accuracy – it is about cost. Here is the order of operations for a production pipeline:

  1. Hash dedup – O(n), negligible memory. Removes 5-20% of documents instantly.
  2. MinHash LSH – O(n) insert and sublinear query. Catches surface-level near-duplicates. Budget roughly 1KB per document for the MinHash signatures.
  3. Embedding + FAISS – O(n) for encoding (GPU-bound), sublinear for search. Catches semantic duplicates that MinHash misses.
  4. HDBSCAN clustering – O(n log n) on the remaining candidates. Groups related documents for human review or automatic selection.

Process in batches. Encode 10K-50K documents at a time on GPU, add to the FAISS index incrementally, and run HDBSCAN on candidate pairs only – not on the full corpus.

# Batch encoding example
batch_size = 10000
all_embeddings = []

for start in range(0, len(corpus), batch_size):
    batch = corpus[start:start + batch_size]
    emb = model.encode(batch, batch_size=256, show_progress_bar=True)
    all_embeddings.append(emb)

embeddings = np.vstack(all_embeddings)

Common Errors and Fixes

ValueError: num_perm must be positive – You passed num_perm=0 or a negative value to MinHash(). This happens when reading config from a file and the value comes in as a string. Cast it to int.

RuntimeError: CUDA out of memory during encoding – Reduce the batch_size in model.encode(). Start at 64 and work up. If you are encoding millions of documents, use model.encode(..., device="cpu") for the long tail and GPU for the bulk.

MinHash LSH returns no results even for obvious duplicates – Your threshold is too high relative to the actual Jaccard similarity. Shingle-based similarity is lower than you might expect – two sentences that share 80% of words might only have 0.5 Jaccard similarity on 5-character shingles. Lower the threshold or use word-level tokens.
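You can check the gap directly: compute exact Jaccard on word tokens and on 5-shingles for the same pair and compare. A quick sketch:

```python
def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)

def shingle(text: str, k: int = 5) -> set:
    text = text.lower()
    return {text[i:i+k] for i in range(len(text) - k + 1)}

s1 = "the cat sat on the mat"
s2 = "the cat sat on a mat"

word_j = jaccard(set(s1.split()), set(s2.split()))
shingle_j = jaccard(shingle(s1), shingle(s2))
print(f"word-level: {word_j:.3f}, 5-shingle: {shingle_j:.3f}")
# word-level ≈ 0.833, 5-shingle ≈ 0.478
```

One changed word near the end of the sentence disrupts a run of k overlapping shingles, so a single edit costs far more at the shingle level — exactly the mismatch that makes a threshold tuned for word tokens return nothing on shingles.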

HDBSCAN puts everything in the noise cluster (-1) – Your min_cluster_size is too large or the embeddings are too spread out. Try min_cluster_size=2 and min_samples=1. If that still does not work, reduce the embedding dimensionality with UMAP before clustering:

pip install umap-learn
import umap

reducer = umap.UMAP(n_components=10, metric="cosine")
reduced = reducer.fit_transform(embeddings)
labels = hdbscan.HDBSCAN(min_cluster_size=2).fit_predict(reduced)

datasketch is slow on very large datasets – datasketch's per-token MinHash update loop runs in Python, which limits throughput. For corpora over 10M documents, look at text-dedup or gaoya (Rust-based LSH). Alternatively, process in shards and merge results.

Cosine similarity scores are all near 1.0 – Your embeddings are not normalized, or you are comparing very short texts that collapse to similar vectors. Check with np.linalg.norm(embedding) – it should be close to 1.0 after normalization. For short texts, use a model fine-tuned on short-form data like all-MiniLM-L6-v2 rather than a passage-level model.
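The norm check is a one-liner worth running before debugging anything else — a small numpy sketch with toy unnormalized vectors:

```python
import numpy as np

emb = np.array([[3.0, 4.0], [1.0, 1.0]])  # toy unnormalized "embeddings"
print(np.linalg.norm(emb, axis=1))        # 5.0 and ~1.414: not unit length

emb_norm = emb / np.linalg.norm(emb, axis=1, keepdims=True)
print(np.linalg.norm(emb_norm, axis=1))   # both ~1.0 after normalization
```

If the norms are already 1.0 and scores still cluster near 1.0, the problem is the texts, not the math.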