The Quick Answer: Hash First, Embed Second
Text deduplication breaks down into three tiers. Exact duplicates are cheap – hash the text and compare. Near-duplicates (rephrased, reordered, minor edits) need MinHash with Locality-Sensitive Hashing. Semantic duplicates (same meaning, different words) require embeddings and cosine similarity. Stack all three for a pipeline that catches everything without burning GPU hours on trivial matches.
Here is the full tiered pipeline:
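A minimal stdlib-only sketch of the tiered flow. The helper names are illustrative, and tier 2 here uses an exact word-level Jaccard check as a stand-in — the real pipeline would use MinHash LSH and embeddings as described in the sections below:

```python
import hashlib
import re

def normalize(text: str) -> str:
    # Lowercase, strip punctuation, collapse whitespace.
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())

def exact_dedup(docs: list[str]) -> list[str]:
    # Tier 1: drop byte-for-byte duplicates via SHA-256 of normalized text.
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

def jaccard(a: str, b: str) -> float:
    # Tier 2 stand-in: exact Jaccard on word tokens. A real pipeline
    # replaces this O(n^2) loop with MinHash LSH for sublinear lookup.
    sa, sb = set(normalize(a).split()), set(normalize(b).split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def near_dedup(docs: list[str], threshold: float = 0.7) -> list[str]:
    kept = []
    for doc in docs:
        if all(jaccard(doc, k) < threshold for k in kept):
            kept.append(doc)
    return kept

docs = [
    "The server crashed at midnight.",
    "the server crashed at midnight.",        # exact duplicate after normalization
    "The server crashed at midnight again.",  # near-duplicate
    "Quarterly revenue grew by 12 percent.",
]
unique = near_dedup(exact_dedup(docs))
```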
Install the dependencies:
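One plausible install line for the libraries used in this guide (these are the common PyPI package names; swap `faiss-cpu` for `faiss-gpu` if you have a CUDA build available):

```shell
pip install datasketch sentence-transformers faiss-cpu hdbscan umap-learn
```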
Why Hashing Comes First
Computing embeddings for millions of documents is expensive. SHA-256 hashing is essentially free by comparison. In web scrapes and user-generated content, 5-20% of documents are often byte-for-byte identical. Stripping those out before touching MinHash or a transformer model saves real time and money.
Normalize before hashing. At minimum, strip whitespace and lowercase the text. For stricter matching, remove punctuation and collapse multiple spaces:
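A possible normalization-plus-hashing helper — the `strict` flag and function names are illustrative, not a standard API:

```python
import hashlib
import re

def normalize(text: str, strict: bool = False) -> str:
    """Lowercase and trim; optionally drop punctuation and collapse spaces."""
    text = text.lower().strip()
    if strict:
        text = re.sub(r"[^\w\s]", "", text)  # remove punctuation
    return re.sub(r"\s+", " ", text)         # collapse runs of whitespace

def content_hash(text: str, strict: bool = False) -> str:
    # SHA-256 over the normalized text gives a stable dedup key.
    return hashlib.sha256(normalize(text, strict).encode("utf-8")).hexdigest()

# These differ only in casing, spacing, and punctuation:
a = content_hash("Hello,   World!", strict=True)
b = content_hash("hello world", strict=True)
```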
This catches duplicates that differ only in casing, trailing spaces, or stray punctuation.
Near-Duplicate Detection with MinHash LSH
MinHash approximates the Jaccard similarity between two sets of tokens. LSH (Locality-Sensitive Hashing) makes the lookup sublinear – you don’t compare every document against every other document. The datasketch library handles both.
The two parameters that matter:
- threshold – the Jaccard similarity cutoff. 0.5 is aggressive (catches loosely similar text). 0.8 is conservative (only near-identical). Start at 0.7 for general dedup and tune from there.
- num_perm – the number of permutation functions. More permutations mean better approximation but more memory. 128 is a solid default. Go to 256 if you need higher precision on a large corpus.
For documents longer than a sentence, shingle the text instead of splitting on whitespace. Character n-grams (shingles) catch reordering and minor edits better than word tokens:
Character 5-shingles work well for English text. For short strings (tweets, titles), drop to 3-shingles.
Semantic Deduplication with Sentence-Transformers
MinHash misses semantic duplicates – “The server crashed” and “The machine went down” share almost no tokens but mean the same thing. This is where embeddings come in.
all-MiniLM-L6-v2 is fast and good enough for dedup. If you need higher accuracy and can afford the compute, use all-mpnet-base-v2. They produce 384-dimensional and 768-dimensional vectors, respectively.
For large corpora, computing the full pairwise cosine similarity matrix does not scale. Use FAISS for approximate nearest neighbor search instead:
FAISS with IndexFlatIP is exact search. For millions of documents, switch to IndexIVFFlat or IndexHNSWFlat for approximate search that is 10-100x faster with minimal accuracy loss.
Clustering Similar Texts with HDBSCAN
Sometimes you don’t want to just find pairs – you want to group all similar documents into clusters. HDBSCAN handles this well because it doesn’t require you to pick the number of clusters upfront, and it labels outliers as noise rather than forcing them into a cluster.
Set min_cluster_size=2 for dedup – you want even pairs of duplicates to form a cluster. Prefer the leaf selection method over the default eom (Excess of Mass): leaf produces more fine-grained clusters, which is what you want for dedup rather than broad topic grouping.
Scaling to Millions of Documents
The tiered approach is not just about accuracy – it is about cost. Here is the order of operations for a production pipeline:
- Hash dedup – O(n), negligible memory. Removes 5-20% of documents instantly.
- MinHash LSH – O(n) insert and sublinear query. Catches surface-level near-duplicates. Budget roughly 1KB per document for the MinHash signatures.
- Embedding + FAISS – O(n) for encoding (GPU-bound), sublinear for search. Catches semantic duplicates that MinHash misses.
- HDBSCAN clustering – O(n log n) on the remaining candidates. Groups related documents for human review or automatic selection.
Process in batches. Encode 10K-50K documents at a time on GPU, add to the FAISS index incrementally, and run HDBSCAN on candidate pairs only – not on the full corpus.
Common Errors and Fixes
ValueError: num_perm must be positive – You passed num_perm=0 or a negative value to MinHash(). This happens when reading config from a file and the value comes in as a string. Cast it to int.
RuntimeError: CUDA out of memory during encoding – Reduce the batch_size in model.encode(). Start at 64 and work up. If you are encoding millions of documents, use model.encode(..., device="cpu") for the long tail and GPU for the bulk.
MinHash LSH returns no results even for obvious duplicates – Your threshold is too high relative to the actual Jaccard similarity. Shingle-based similarity is lower than you might expect – two sentences that share 80% of words might only have 0.5 Jaccard similarity on 5-character shingles. Lower the threshold or use word-level tokens.
HDBSCAN puts everything in the noise cluster (-1) – Your min_cluster_size is too large or the embeddings are too spread out. Try min_cluster_size=2 and min_samples=1. If that still does not work, reduce the embedding dimensionality with UMAP before clustering:
datasketch is slow on very large datasets – The pure Python implementation of MinHash is not fast. For corpora over 10M documents, look at text-dedup or gaoya (Rust-based LSH). Alternatively, process in shards and merge results.
Cosine similarity scores are all near 1.0 – Your embeddings are not normalized, or you are comparing very short texts that collapse to similar vectors. Check with np.linalg.norm(embedding) – it should be close to 1.0 after normalization. For short texts, use a model fine-tuned on short-form data like all-MiniLM-L6-v2 rather than a passage-level model.
Related Guides
- How to Build a Semantic Search Engine with Embeddings
- How to Build a Sentiment-Aware Search Pipeline with Embeddings
- How to Build a Text Deduplication Pipeline with MinHash and LSH
- How to Build a Text Clustering Pipeline with Embeddings and HDBSCAN
- How to Implement Topic Modeling with BERTopic
- How to Build a Hybrid Keyword and Semantic Search Pipeline
- How to Extract Keywords and Key Phrases from Text with KeyBERT
- How to Build a Text Chunking and Splitting Pipeline for RAG
- How to Build a Multilingual NLP Pipeline with Sentence Transformers
- How to Build a RAG Pipeline with Hugging Face Transformers v5