Training on copyrighted text without a license is a lawsuit waiting to happen. The smart move is to scan your dataset before training starts, flag anything that looks like it came from a known copyrighted source, and generate a report you can hand to legal. Here is a practical pipeline that does exactly that using n-gram fingerprinting, MinHash/LSH for fast approximate matching, and fuzzy matching for fine-grained scoring.
The core idea: break every document into overlapping n-gram shingles, hash them into compact signatures, then compare those signatures against a reference index of known copyrighted works. Matches above a threshold get flagged.
```python
def shingle(text: str, n: int = 5) -> set[str]:
    """Break text into overlapping word n-grams."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

# Quick demo
sample = "The quick brown fox jumps over the lazy dog near the river"
shingles = shingle(sample, n=3)
print(f"Generated {len(shingles)} shingles")
print(list(shingles)[:5])
```
That gives you the building blocks. Now let’s turn it into something useful.
Building a Reference Corpus Fingerprint Index#
You need a reference set of known copyrighted works – books, articles, song lyrics, whatever your legal team cares about. Each work gets fingerprinted and stored in a MinHash index for fast lookup.
datasketch handles the heavy lifting. It implements MinHash signatures and Locality-Sensitive Hashing (LSH) so you can find near-duplicates without comparing every pair of documents.
```python
from datasketch import MinHash, MinHashLSH

def create_minhash(shingle_set: set[str], num_perm: int = 128) -> MinHash:
    """Create a MinHash signature from a set of shingles."""
    m = MinHash(num_perm=num_perm)
    for s in shingle_set:
        m.update(s.encode("utf-8"))
    return m

class CopyrightIndex:
    """Index of known copyrighted works for fast similarity search."""

    def __init__(self, threshold: float = 0.3, num_perm: int = 128):
        self.threshold = threshold
        self.num_perm = num_perm
        self.lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
        self.metadata = {}

    def add_work(self, work_id: str, text: str, title: str = "", author: str = ""):
        """Add a copyrighted work to the index."""
        shingles = shingle(text, n=5)
        mh = create_minhash(shingles, self.num_perm)
        self.lsh.insert(work_id, mh)
        self.metadata[work_id] = {
            "title": title,
            "author": author,
            "shingles": shingles,
            "minhash": mh,
        }

    def query(self, text: str) -> list[str]:
        """Find candidate matches for a text sample."""
        shingles = shingle(text, n=5)
        mh = create_minhash(shingles, self.num_perm)
        return self.lsh.query(mh)
```
The threshold parameter controls sensitivity. A value of 0.3 flags documents whose estimated Jaccard similarity with a reference work is at least about 0.3. That is deliberately low – you want to cast a wide net at this stage and refine later with fuzzy matching.
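To build intuition for where that threshold lands, here is a small self-contained sketch (re-implementing shingle from above, with invented example sentences) that computes the exact Jaccard similarity between a reference passage and a partial copy:

```python
def shingle(text: str, n: int = 5) -> set[str]:
    """Break text into overlapping word n-grams."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a: set[str], b: set[str]) -> float:
    """Exact Jaccard similarity: intersection over union."""
    return len(a & b) / len(a | b)

# The first ten words are identical; the rest diverge.
original = "the quick brown fox jumps over the lazy dog near the river bank at dawn"
partial_copy = "the quick brown fox jumps over the lazy dog near a bridge at night instead"

j = jaccard(shingle(original), shingle(partial_copy))
print(f"Exact Jaccard: {j:.3f}")  # 6 shared shingles out of 16 -> 0.375
```

A ten-word verbatim run inside a 15-word sample already clears the 0.3 threshold; longer documents need proportionally more overlapping text to reach the same score.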
Populating the Index#
Load your reference corpus and index it. This works with any text source – a folder of .txt files, a database, a CSV dump.
```python
import os

def build_index_from_directory(directory: str, threshold: float = 0.3) -> CopyrightIndex:
    """Build a copyright index from a directory of text files."""
    index = CopyrightIndex(threshold=threshold)
    for filename in os.listdir(directory):
        if not filename.endswith(".txt"):
            continue
        filepath = os.path.join(directory, filename)
        with open(filepath, "r", encoding="utf-8") as f:
            text = f.read()
        work_id = filename.replace(".txt", "")
        index.add_work(work_id, text, title=work_id)
        print(f"Indexed: {work_id} ({len(text)} chars)")
    print(f"Index built with {len(index.metadata)} works, threshold={threshold}")
    return index
```
Scanning Training Data Against the Index#
Now scan your actual training data. For each sample, query the LSH index for candidates, then compute exact Jaccard similarity against each candidate.
```python
def scan_dataset(
    samples: list[dict],
    index: CopyrightIndex,
    text_key: str = "text",
    id_key: str = "id",
) -> list[dict]:
    """Scan a list of training samples against the copyright index."""
    flags = []
    for sample in samples:
        text = sample[text_key]
        sample_id = sample.get(id_key, "unknown")
        candidates = index.query(text)
        if not candidates:
            continue
        sample_shingles = shingle(text, n=5)
        sample_mh = create_minhash(sample_shingles, index.num_perm)
        for candidate_id in candidates:
            ref = index.metadata[candidate_id]
            # Jaccard similarity estimated from the MinHash signatures
            jaccard = sample_mh.jaccard(ref["minhash"])
            flags.append({
                "sample_id": sample_id,
                "matched_work": candidate_id,
                "title": ref["title"],
                "author": ref["author"],
                "jaccard_similarity": round(jaccard, 4),
                "sample_length": len(text),
            })
    return sorted(flags, key=lambda x: x["jaccard_similarity"], reverse=True)
```
This returns a list of flagged samples ranked by similarity. The MinHash Jaccard estimate is accurate enough for triage – anything above 0.5 is worth a closer look.
Fuzzy Matching for Fine-Grained Scoring#
LSH gives you speed but not precision. For the flagged candidates, run a second pass with difflib.SequenceMatcher to get exact overlap ratios. This catches cases where someone rearranged paragraphs or swapped a few words.
```python
from difflib import SequenceMatcher

def fuzzy_score(text_a: str, text_b: str) -> dict:
    """Compute fuzzy match metrics between two texts."""
    matcher = SequenceMatcher(None, text_a.lower(), text_b.lower())
    ratio = matcher.ratio()
    blocks = matcher.get_matching_blocks()
    # Find the longest contiguous match
    longest_block = max(blocks, key=lambda b: b.size) if blocks else None
    longest_match_len = longest_block.size if longest_block else 0
    return {
        "similarity_ratio": round(ratio, 4),
        "longest_contiguous_match": longest_match_len,
        "matching_blocks": len([b for b in blocks if b.size > 0]),
    }

def refine_flags(
    flags: list[dict],
    index: CopyrightIndex,
    samples_by_id: dict[str, str],
    similarity_cutoff: float = 0.15,
) -> list[dict]:
    """Run fuzzy matching on flagged samples for precise scoring."""
    refined = []
    for flag in flags:
        sample_text = samples_by_id.get(flag["sample_id"], "")
        ref = index.metadata.get(flag["matched_work"], {})
        # Use a truncated window for fuzzy matching (SequenceMatcher is O(n^2))
        sample_window = sample_text[:5000]
        # Approximate the reference text from its stored shingles,
        # sorted so the comparison is deterministic
        ref_shingles_text = " ".join(sorted(ref.get("shingles", set())))
        ref_window = ref_shingles_text[:5000]
        scores = fuzzy_score(sample_window, ref_window)
        if scores["similarity_ratio"] >= similarity_cutoff:
            flag["fuzzy_similarity"] = scores["similarity_ratio"]
            flag["longest_contiguous_match"] = scores["longest_contiguous_match"]
            flag["matching_blocks"] = scores["matching_blocks"]
            refined.append(flag)
    return sorted(refined, key=lambda x: x["fuzzy_similarity"], reverse=True)
```
The similarity_cutoff is your knob. Start low (0.15) and adjust based on how many false positives your legal team can stomach. In practice, anything above 0.4 fuzzy similarity is almost certainly a verbatim or near-verbatim copy.
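One way to pick the cutoff is to sweep it over a small hand-labeled sample of flags and watch precision trade off against recall. A minimal sketch; the scores and verdicts below are invented for illustration:

```python
# Hypothetical labeled flags: (fuzzy_similarity, legal team's verdict)
labeled = [(0.62, True), (0.48, True), (0.35, True), (0.31, False),
           (0.22, False), (0.18, True), (0.16, False), (0.15, False)]

total_true = sum(y for _, y in labeled)
for cutoff in (0.15, 0.25, 0.40):
    kept = [(s, y) for s, y in labeled if s >= cutoff]
    precision = sum(y for _, y in kept) / len(kept)  # flagged AND real copies
    recall = sum(y for _, y in kept) / total_true    # real copies we kept
    print(f"cutoff={cutoff:.2f}: precision={precision:.2f}, recall={recall:.2f}")
```

On this toy data, raising the cutoff from 0.15 to 0.40 lifts precision from 0.50 to 1.00 but drops recall to 0.50 – the same trade your legal team makes at scale.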
Generating Compliance Reports#
Once you have refined flags, produce a structured report. JSON works for downstream tooling, but a human-readable summary helps legal review.
```python
import json
from datetime import datetime

def generate_compliance_report(
    refined_flags: list[dict],
    dataset_name: str,
    total_samples: int,
    output_path: str = "copyright_report.json",
) -> dict:
    """Generate a compliance report from flagged samples."""
    high_risk = [f for f in refined_flags if f.get("fuzzy_similarity", 0) >= 0.4]
    medium_risk = [f for f in refined_flags if 0.2 <= f.get("fuzzy_similarity", 0) < 0.4]
    low_risk = [f for f in refined_flags if f.get("fuzzy_similarity", 0) < 0.2]
    report = {
        "report_date": datetime.now().isoformat(),
        "dataset": dataset_name,
        "total_samples_scanned": total_samples,
        "total_flags": len(refined_flags),
        "risk_breakdown": {
            "high": len(high_risk),
            "medium": len(medium_risk),
            "low": len(low_risk),
        },
        "flagged_samples": refined_flags,
    }
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(report, f, indent=2)
    print(f"Report saved to {output_path}")
    print(f"  Total scanned: {total_samples}")
    print(f"  Flagged: {len(refined_flags)}")
    print(f"  High risk: {len(high_risk)}")
    print(f"  Medium risk: {len(medium_risk)}")
    print(f"  Low risk: {len(low_risk)}")
    return report
```
Putting It All Together#
Here is a full end-to-end run so you can see how the pieces fit:
```python
# 1. Build the reference index
index = CopyrightIndex(threshold=0.3, num_perm=128)
index.add_work("book-001", "This is the full text of a copyrighted novel...",
               title="Example Novel", author="Jane Author")
index.add_work("article-042", "Full text of a copyrighted research paper...",
               title="ML Survey 2025", author="Smith et al.")

# 2. Prepare training samples
training_samples = [
    {"id": "sample-0001", "text": "This is the full text of a copyrighted novel with minor edits..."},
    {"id": "sample-0002", "text": "Completely original text about training neural networks..."},
    {"id": "sample-0003", "text": "Another passage that partially overlaps with ML Survey 2025..."},
]

# 3. Scan
flags = scan_dataset(training_samples, index)
print(f"Initial flags: {len(flags)}")

# 4. Refine with fuzzy matching
samples_lookup = {s["id"]: s["text"] for s in training_samples}
refined = refine_flags(flags, index, samples_lookup, similarity_cutoff=0.15)
print(f"Refined flags: {len(refined)}")

# 5. Generate report
report = generate_compliance_report(refined, "my-training-set-v2", len(training_samples))
```
Common Errors and Fixes#
datasketch not installed#
Install the base package with pip install datasketch. If you need the Redis backend for large-scale indexes:

```shell
pip install "datasketch[redis]"
```
LSH threshold too high, missing obvious matches#
A threshold of 0.5 or above in MinHashLSH means the LSH only returns candidates with at least ~50% estimated overlap. For copyright detection you want to catch partial matches too. Start at 0.3 and lower it if you are missing known infringements in your test set.
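The shape of that trade-off comes from LSH banding: with b bands of r rows each, a candidate pair with Jaccard similarity s is returned with probability 1 - (1 - s^r)^b. The band/row splits below are illustrative only – datasketch chooses its own split from threshold and num_perm:

```python
def collision_prob(s: float, b: int, r: int) -> float:
    """Probability banded LSH returns a pair with Jaccard similarity s."""
    return 1 - (1 - s ** r) ** b

# 128 permutations split two ways: many short bands (low threshold)
# versus few long bands (high threshold).
for s in (0.2, 0.3, 0.5, 0.7):
    print(f"s={s}: 32x4 bands -> {collision_prob(s, 32, 4):.3f}, "
          f"8x16 bands -> {collision_prob(s, 8, 16):.6f}")
```

With the long-band split, even a pair at 50% similarity almost never surfaces – exactly the missed-match failure mode described above.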
SequenceMatcher is slow on long documents#
difflib.SequenceMatcher has quadratic time complexity. Never feed it full-length documents. Truncate to 5,000 characters or chunk the documents into paragraphs and match paragraph-by-paragraph. For production workloads, consider rapidfuzz as a drop-in replacement – it is written in C++ and runs 10-100x faster.
```python
from rapidfuzz.fuzz import ratio

score = ratio("text sample one", "text sample two")
print(f"Similarity: {score}%")
```
MinHash num_perm too low gives noisy estimates#
Using fewer than 128 permutations makes the Jaccard estimate unreliable. For copyright detection, 128 is a good default. Bump to 256 if you need tighter accuracy and can afford the memory. Going below 64 will produce too many false positives and false negatives to be useful.
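You can see the effect with a toy MinHash built on the stdlib. This salted-hash construction is a hypothetical stand-in for illustration, not datasketch's actual implementation:

```python
import hashlib

def minhash_sig(items: set[str], num_perm: int) -> list[bytes]:
    """One minimum per salted hash function -- a toy MinHash signature."""
    return [
        min(hashlib.blake2b(str(i).encode() + s.encode(), digest_size=8).digest()
            for s in items)
        for i in range(num_perm)
    ]

def estimate_jaccard(sig_a: list[bytes], sig_b: list[bytes]) -> float:
    """Fraction of matching signature slots estimates the Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

a = {f"shingle-{i}" for i in range(100)}
b = {f"shingle-{i}" for i in range(50, 150)}  # true Jaccard = 50/150 ~ 0.333

for num_perm in (16, 64, 128):
    est = estimate_jaccard(minhash_sig(a, num_perm), minhash_sig(b, num_perm))
    print(f"num_perm={num_perm}: estimate={est:.3f}")
```

The standard error of the estimate scales as sqrt(J(1-J)/num_perm), so at 16 permutations the estimate can miss by ten points or more, while 128 is usually within a few.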
Handling Unicode and encoding errors#
Training data scraped from the web often has mixed encodings. Normalize before shingling:
```python
import unicodedata

# NFKD handles ligatures, fullwidth forms, and non-breaking spaces,
# but not curly quotes, so map those explicitly.
QUOTE_MAP = str.maketrans({"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'})

def normalize_text(text: str) -> str:
    """Normalize unicode, replace curly quotes, and collapse whitespace."""
    text = unicodedata.normalize("NFKD", text).translate(QUOTE_MAP)
    return " ".join(text.split())
```
Call normalize_text() on both the reference works and training samples before passing them to shingle(). Skipping this step leads to missed matches where the only difference is a fancy quote character or a non-breaking space.
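A quick self-contained check with an invented sentence. Note that NFKD alone covers ligatures and non-breaking spaces but does not touch curly quotes, so this sketch maps those explicitly:

```python
import unicodedata

QUOTE_MAP = str.maketrans({"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'})

def normalize_text(text: str) -> str:
    """Normalize unicode, replace curly quotes, and collapse whitespace."""
    text = unicodedata.normalize("NFKD", text).translate(QUOTE_MAP)
    return " ".join(text.split())

clean = "it's the file in the morning"
# Same sentence with a curly apostrophe, an fi ligature, a non-breaking
# space, and doubled spaces: all common in web-scraped text.
scraped = "it\u2019s the \ufb01le in\u00a0the  morning"

print(normalize_text(clean) == normalize_text(scraped))  # True
```

Without normalization, every shingle touching one of those characters fails to match, silently deflating the Jaccard score.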