Exact string matching catches identical duplicates, but most real-world duplicates are near-duplicates: slightly reworded paragraphs, reformatted text, or copy-paste with minor edits. Comparing every pair of documents with cosine similarity is O(n^2) and falls apart past a few thousand texts. MinHash with locality-sensitive hashing (LSH) solves this: you get approximate Jaccard similarity with sub-linear query time, which means you can deduplicate millions of documents without waiting days.
Install the one library you need:
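Assuming a standard Python environment:

```shell
pip install datasketch
```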
Here’s the fastest path to a working deduplication pipeline:
Three of the five documents are variations of the “quick brown fox” sentence, and the query surfaces all of them. The other two (completely different topics) don’t show up. That’s the whole pipeline in about 30 lines.
How MinHash and LSH Actually Work
MinHash estimates Jaccard similarity between two sets without comparing them element by element. The Jaccard similarity of sets A and B is |A intersect B| / |A union B|. MinHash applies multiple random hash functions to each set, keeping only the minimum hash value from each function. The fraction of matching minimums across two signatures approximates their Jaccard similarity.
LSH takes this further. It groups MinHash signatures into bands and hashes each band separately. If two documents share at least one identical band hash, they become candidate pairs. This turns an O(n^2) comparison problem into roughly O(n) with high probability of catching true duplicates.
The key parameters:
- num_perm (number of permutations): Higher values give more accurate similarity estimates but use more memory. 128 is a solid default. Go to 256 if you need precision.
- threshold: The Jaccard similarity cutoff for considering two documents as duplicates. 0.5 is aggressive (catches loose paraphrases), 0.8 is conservative (only near-identical text).
- Shingle size (k): Smaller k (2-3) catches more overlap but increases false positives. Larger k (5-9) is stricter. For sentences, use k=3. For full documents, use k=5 or higher.
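A shingle is just a sliding window over the text. A minimal sketch of both character-level and word-level shingling (the helper names are mine):

```python
def char_shingles(text, k=3):
    # overlapping character k-grams, lowercased for consistency
    text = text.lower()
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def word_shingles(text, k=3):
    # overlapping word k-grams; set size stays proportional to word count
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

print(sorted(char_shingles("banana")))  # ['ana', 'ban', 'nan']
print(word_shingles("the quick brown fox", 3))
```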
Building the Full Pipeline
A production pipeline needs to handle document loading, clustering duplicates together, and picking a representative from each cluster. Here’s a complete version:
The deduplicate function groups documents into clusters using LSH neighbor queries, then picks the longest document from each cluster as the representative. You could swap that selection logic for whatever makes sense in your domain – newest document, highest quality score, or the one with the most metadata.
Tuning the Threshold
Picking the right threshold matters a lot. Too low and you merge unrelated documents. Too high and you miss real duplicates. Run a quick sweep on a sample of your data:
Use the output to pick a threshold that separates your true duplicates from false positives. I find 0.5 works well for catching paraphrases and 0.7-0.8 for catching near-identical copies.
Scaling to Large Datasets
The in-memory MinHashLSH works fine up to a few million documents on a machine with 16GB+ RAM. Each MinHash signature with 128 permutations uses about 1KB. So a million documents costs roughly 1GB just for signatures.
For bigger datasets, datasketch supports Redis-backed storage:
A few tips for scale:
- Batch your inserts. Building all MinHash signatures first, then inserting into LSH, is faster than interleaving.
- Use word-level shingles for long documents. Character shingles on a 10,000-word document produce huge shingle sets. Word-level 3-grams keep the set size proportional to word count.
- Parallelize MinHash computation. The signature-building step is embarrassingly parallel. Use multiprocessing.Pool or concurrent.futures:
Common Errors and Fixes
ValueError: The num_perm of input MinHash must be equal to...
This happens when you mix MinHash signatures with different num_perm values. Every signature in the same LSH index must use the same num_perm. Fix it by passing the value explicitly everywhere:
ValueError: ... already exists in this index
You tried to insert a document ID that’s already in the LSH index. Either check before inserting or wrap in a try/except:
Empty query results when you expect matches
Usually a threshold problem. Lower the threshold or increase num_perm for better accuracy. Also check that you’re shingling consistently – if one document is lowercased and another isn’t, their shingle sets won’t overlap.
MemoryError with large datasets
Switch to Redis-backed storage (shown above) or process your data in chunks. You can also reduce num_perm to 64 at the cost of some accuracy – for most deduplication tasks the difference between 64 and 128 permutations is small.
Slow inserts with high num_perm
The LSH index partitions the signature into bands. More permutations mean more bands to hash. If insert speed matters more than recall, drop to 64 permutations and tighten the threshold slightly to compensate.
Related Guides
- How to Build a Text-to-Knowledge-Graph Pipeline with SpaCy and NetworkX
- How to Build a Text Entailment and Contradiction Detection Pipeline
- How to Build a Text Summarization Pipeline with Sumy and Transformers
- How to Build a Keyphrase Generation Pipeline with KeyphraseVectorizers
- How to Build a Text Anonymization Pipeline with Presidio and spaCy
- How to Build a Sentiment-Aware Search Pipeline with Embeddings
- How to Build an Emotion Detection Pipeline with GoEmotions and Transformers
- How to Build an Abstractive Summarization Pipeline with PEGASUS
- How to Build an Aspect-Based Sentiment Analysis Pipeline
- How to Build a Text Chunking and Splitting Pipeline for RAG