Exact string matching catches identical duplicates, but most real-world duplicates are near-duplicates: slightly reworded paragraphs, reformatted text, or copy-paste with minor edits. Comparing every pair of documents with cosine similarity is O(n^2) and falls apart past a few thousand texts. MinHash with locality-sensitive hashing (LSH) solves this: you get approximate Jaccard similarity with sub-linear query time, which means you can deduplicate millions of documents without waiting days.
Install the one library you need:
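Assuming a standard Python environment:

```shell
pip install datasketch
```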
Here’s the fastest path to a working deduplication pipeline:
Three of the five documents are variations of the “quick brown fox” sentence, and the query surfaces all of them. The other two (completely different topics) don’t show up. That’s the whole pipeline in about 30 lines.
How MinHash and LSH Actually Work
MinHash estimates Jaccard similarity between two sets without comparing them element by element. The Jaccard similarity of sets A and B is |A intersect B| / |A union B|. MinHash applies multiple random hash functions to each set, keeping only the minimum hash value from each function. The fraction of matching minimums across two signatures approximates their Jaccard similarity.
LSH takes this further. It groups MinHash signatures into bands and hashes each band separately. If two documents share at least one identical band hash, they become candidate pairs. This turns an O(n^2) comparison problem into roughly O(n) with high probability of catching true duplicates.
The key parameters:
- num_perm (number of permutations): Higher values give more accurate similarity estimates but use more memory. 128 is a solid default. Go to 256 if you need precision.
- threshold: The Jaccard similarity cutoff for considering two documents as duplicates. 0.5 is aggressive (catches loose paraphrases), 0.8 is conservative (only near-identical text).
- Shingle size (k): Smaller k (2-3) catches more overlap but increases false positives. Larger k (5-9) is stricter. For sentences, use k=3. For full documents, use k=5 or higher.
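A shingle is just a sliding window over the text. A minimal sketch of both character-level and word-level shingling (the helper names are mine):

```python
def char_shingles(text, k=3):
    # overlapping character k-grams, lowercased for consistency
    text = text.lower()
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def word_shingles(text, k=3):
    # overlapping word k-grams; set size stays proportional to word count
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

print(sorted(char_shingles("banana")))  # ['ana', 'ban', 'nan']
print(word_shingles("the quick brown fox", 3))
```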
Building the Full Pipeline
A production pipeline needs to handle document loading, clustering duplicates together, and picking a representative from each cluster. Here’s a complete version:
The deduplicate function groups documents into clusters using LSH neighbor queries, then picks the longest document from each cluster as the representative. You could swap that selection logic for whatever makes sense in your domain – newest document, highest quality score, or the one with the most metadata.
Tuning the Threshold
Picking the right threshold matters a lot. Too low and you merge unrelated documents. Too high and you miss real duplicates. Run a quick sweep on a sample of your data:
Use the output to pick a threshold that separates your true duplicates from false positives. I find 0.5 works well for catching paraphrases and 0.7-0.8 for catching near-identical copies.
Scaling to Large Datasets
The in-memory MinHashLSH works fine up to a few million documents on a machine with 16GB+ RAM. Each MinHash signature with 128 permutations uses about 1KB. So a million documents costs roughly 1GB just for signatures.
For bigger datasets, datasketch supports Redis-backed storage:
A few tips for scale:
- Batch your inserts. Building all MinHash signatures first, then inserting into LSH, is faster than interleaving.
- Use word-level shingles for long documents. Character shingles on a 10,000-word document produce huge shingle sets. Word-level 3-grams keep the set size proportional to word count.
- Parallelize MinHash computation. The signature-building step is embarrassingly parallel. Use multiprocessing.Pool or concurrent.futures:
Common Errors and Fixes
ValueError: The num_perm of input MinHash must be equal to...
This happens when you mix MinHash signatures with different num_perm values. Every signature in the same LSH index must use the same num_perm. Fix it by passing the value explicitly everywhere:
ValueError: ... already exists in this index
You tried to insert a document ID that’s already in the LSH index. Either check before inserting or wrap in a try/except:
Empty query results when you expect matches
Usually a threshold problem. Lower the threshold or increase num_perm for better accuracy. Also check that you’re shingling consistently – if one document is lowercased and another isn’t, their shingle sets won’t overlap.
MemoryError with large datasets
Switch to Redis-backed storage (shown above) or process your data in chunks. You can also reduce num_perm to 64 at the cost of some accuracy – for most deduplication tasks the difference between 64 and 128 permutations is small.
Slow inserts with high num_perm
The LSH index partitions the signature into bands. More permutations mean more bands to hash. If insert speed matters more than recall, drop to 64 permutations and tighten the threshold slightly to compensate.
Related Guides
- How to Build a Text-to-Knowledge-Graph Pipeline with SpaCy and NetworkX
- How to Build a Text Entailment and Contradiction Detection Pipeline
- How to Build a Text Summarization Pipeline with Sumy and Transformers
- How to Build a Keyphrase Generation Pipeline with KeyphraseVectorizers
- How to Build a Text Anonymization Pipeline with Presidio and spaCy
- How to Build a Sentiment-Aware Search Pipeline with Embeddings
- How to Build an Emotion Detection Pipeline with GoEmotions and Transformers
- How to Build an Abstractive Summarization Pipeline with PEGASUS
- How to Build an Aspect-Based Sentiment Analysis Pipeline
- How to Build a Text Chunking and Splitting Pipeline for RAG