Training on copyrighted text without a license is a lawsuit waiting to happen. The smart move is to scan your dataset before training starts, flag anything that looks like it came from a known copyrighted source, and generate a report you can hand to legal. Here is a practical pipeline that does exactly that using n-gram fingerprinting, MinHash/LSH for fast approximate matching, and fuzzy matching for fine-grained scoring.

The core idea: break every document into overlapping n-gram shingles, hash them into compact signatures, then compare those signatures against a reference index of known copyrighted works. Matches above a threshold get flagged.

import hashlib
from collections import defaultdict

def shingle(text: str, n: int = 5) -> set[str]:
    """Break text into overlapping word n-grams."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

# Quick demo
sample = "The quick brown fox jumps over the lazy dog near the river"
shingles = shingle(sample, n=3)
print(f"Generated {len(shingles)} shingles")
print(list(shingles)[:5])

That gives you the building blocks. Now let’s turn it into something useful.

Building a Reference Corpus Fingerprint Index

You need a reference set of known copyrighted works – books, articles, song lyrics, whatever your legal team cares about. Each work gets fingerprinted and stored in a MinHash index for fast lookup.

datasketch handles the heavy lifting. It implements MinHash signatures and Locality-Sensitive Hashing (LSH) so you can find near-duplicates without comparing every pair of documents.

from datasketch import MinHash, MinHashLSH

def create_minhash(shingle_set: set[str], num_perm: int = 128) -> MinHash:
    """Create a MinHash signature from a set of shingles."""
    m = MinHash(num_perm=num_perm)
    for s in shingle_set:
        m.update(s.encode("utf-8"))
    return m

class CopyrightIndex:
    """Index of known copyrighted works for fast similarity search."""

    def __init__(self, threshold: float = 0.3, num_perm: int = 128):
        self.threshold = threshold
        self.num_perm = num_perm
        self.lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
        self.metadata = {}

    def add_work(self, work_id: str, text: str, title: str = "", author: str = ""):
        """Add a copyrighted work to the index."""
        shingles = shingle(text, n=5)
        mh = create_minhash(shingles, self.num_perm)
        self.lsh.insert(work_id, mh)
        self.metadata[work_id] = {
            "title": title,
            "author": author,
            "shingles": shingles,
            "minhash": mh,
        }

    def query(self, text: str) -> list[str]:
        """Find candidate matches for a text sample."""
        shingles = shingle(text, n=5)
        mh = create_minhash(shingles, self.num_perm)
        return self.lsh.query(mh)

The threshold parameter controls sensitivity. A value of 0.3 catches documents sharing roughly 30% of their shingles with a reference work. That is deliberately low – you want to cast a wide net at this stage and refine later with fuzzy matching.
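
For intuition about what that overlap looks like in practice, here is a standalone check (the passages are invented for illustration) that computes the true shingle Jaccard between a text and a one-word edit of it. The `shingle` function is repeated inline so the snippet runs on its own:

```python
def shingle(text: str, n: int = 5) -> set[str]:
    """Same shingling as above, repeated so this snippet is standalone."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def true_jaccard(a: str, b: str, n: int = 5) -> float:
    """Exact Jaccard similarity between the shingle sets of two texts."""
    sa, sb = shingle(a, n), shingle(b, n)
    return len(sa & sb) / len(sa | sb)

original = "the quick brown fox jumps over the lazy dog near the quiet river bend at dawn"
edited = "the quick brown fox jumps over the sleepy dog near the quiet river bend at dawn"

# A single changed word perturbs every shingle that covers it
print(round(true_jaccard(original, edited), 2))  # → 0.41
```

Even one changed word knocks out every 5-gram that spans it, so similarity drops fast under light editing. That is another reason to start with a low LSH threshold.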

Populating the Index

Load your reference corpus and index it. This works with any text source – a folder of .txt files, a database, a CSV dump.

import os

def build_index_from_directory(directory: str, threshold: float = 0.3) -> CopyrightIndex:
    """Build a copyright index from a directory of text files."""
    index = CopyrightIndex(threshold=threshold)
    for filename in os.listdir(directory):
        if not filename.endswith(".txt"):
            continue
        filepath = os.path.join(directory, filename)
        with open(filepath, "r", encoding="utf-8") as f:
            text = f.read()
        work_id = filename.replace(".txt", "")
        index.add_work(work_id, text, title=work_id)
        print(f"Indexed: {work_id} ({len(text)} chars)")
    print(f"Index built with {len(index.metadata)} works, threshold={threshold}")
    return index
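
The same loop adapts to a CSV dump. Here is a minimal stdlib sketch; the column names id, title, author, and text are assumptions, so match them to your actual schema:

```python
import csv
from typing import Iterable, Iterator

def iter_works(rows: Iterable[str]) -> Iterator[tuple[str, str, str, str]]:
    """Yield (work_id, text, title, author) tuples from CSV lines.

    Assumes columns named 'id', 'text', 'title', 'author'.
    """
    for row in csv.DictReader(rows):
        yield row["id"], row["text"], row.get("title", ""), row.get("author", "")

# With a real file: iter_works(open("corpus.csv", encoding="utf-8", newline=""))
demo = [
    "id,title,author,text",
    'book-001,Example Novel,Jane Author,"This is the full text of a copyrighted novel..."',
]
for work_id, text, title, author in iter_works(demo):
    print(work_id, title)  # → book-001 Example Novel
```

Feed each tuple to index.add_work(work_id, text, title=title, author=author) exactly as in the directory version.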

Scanning Training Data Against the Index

Now scan your actual training data. For each sample, query the LSH index for candidates, then estimate the Jaccard similarity against each candidate from the MinHash signatures.

def scan_dataset(
    samples: list[dict],
    index: CopyrightIndex,
    text_key: str = "text",
    id_key: str = "id",
) -> list[dict]:
    """Scan a list of training samples against the copyright index."""
    flags = []
    for sample in samples:
        text = sample[text_key]
        sample_id = sample.get(id_key, "unknown")
        candidates = index.query(text)

        if not candidates:
            continue

        sample_shingles = shingle(text, n=5)
        sample_mh = create_minhash(sample_shingles, index.num_perm)

        for candidate_id in candidates:
            ref = index.metadata[candidate_id]
            # Estimated Jaccard from the MinHash signatures (an
            # approximation of the true set overlap, good enough for triage)
            jaccard = sample_mh.jaccard(ref["minhash"])

            flags.append({
                "sample_id": sample_id,
                "matched_work": candidate_id,
                "title": ref["title"],
                "author": ref["author"],
                "jaccard_similarity": round(jaccard, 4),
                "sample_length": len(text),
            })

    return sorted(flags, key=lambda x: x["jaccard_similarity"], reverse=True)

This returns a list of flagged samples ranked by similarity. The MinHash Jaccard estimate is accurate enough for triage – anything above 0.5 is worth a closer look.

Fuzzy Matching for Fine-Grained Scoring

LSH gives you speed but not precision. For the flagged candidates, run a second pass with difflib.SequenceMatcher to get exact overlap ratios. This catches cases where someone rearranged paragraphs or swapped a few words.

from difflib import SequenceMatcher

def fuzzy_score(text_a: str, text_b: str) -> dict:
    """Compute fuzzy match metrics between two texts."""
    matcher = SequenceMatcher(None, text_a.lower(), text_b.lower())
    ratio = matcher.ratio()
    blocks = matcher.get_matching_blocks()

    # Find the longest contiguous match
    longest_block = max(blocks, key=lambda b: b.size) if blocks else None
    longest_match_len = longest_block.size if longest_block else 0

    return {
        "similarity_ratio": round(ratio, 4),
        "longest_contiguous_match": longest_match_len,
        "matching_blocks": len([b for b in blocks if b.size > 0]),
    }

def refine_flags(
    flags: list[dict],
    index: CopyrightIndex,
    samples_by_id: dict[str, str],
    similarity_cutoff: float = 0.15,
) -> list[dict]:
    """Run fuzzy matching on flagged samples for precise scoring."""
    refined = []
    for flag in flags:
        sample_text = samples_by_id.get(flag["sample_id"], "")
        ref = index.metadata.get(flag["matched_work"], {})

        # Use a truncated window for fuzzy matching (SequenceMatcher is O(n^2))
        sample_window = sample_text[:5000]
        # The index stores shingles rather than the full reference text, so
        # join them (sorted, for a deterministic ordering) as an approximate,
        # word-repeating stand-in for the reference
        ref_shingles_text = " ".join(sorted(ref.get("shingles", set())))
        ref_window = ref_shingles_text[:5000]

        scores = fuzzy_score(sample_window, ref_window)

        if scores["similarity_ratio"] >= similarity_cutoff:
            flag["fuzzy_similarity"] = scores["similarity_ratio"]
            flag["longest_contiguous_match"] = scores["longest_contiguous_match"]
            flag["matching_blocks"] = scores["matching_blocks"]
            refined.append(flag)

    return sorted(refined, key=lambda x: x["fuzzy_similarity"], reverse=True)

The similarity_cutoff is your knob. Start low (0.15) and adjust based on how many false positives your legal team can stomach. In practice, anything above 0.4 fuzzy similarity is almost certainly a verbatim or near-verbatim copy.
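
As a rough sanity check on those bands, here is a standalone difflib comparison; the passages are invented and the `similarity` helper is our own name for the same core metric fuzzy_score uses:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Case-insensitive SequenceMatcher ratio, as in fuzzy_score above."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

reference = "It was the best of times, it was the worst of times."
verbatim = "it was the best of times, it was the worst of times."
paraphrase = "Those days were at once the finest and the bleakest anyone remembered."

print(round(similarity(reference, verbatim), 2))    # → 1.0 (case-insensitive verbatim copy)
print(round(similarity(reference, paraphrase), 2))  # a genuine paraphrase scores lower
```

A verbatim copy pins the ratio at 1.0, while a true paraphrase falls well below it, which is what makes the 0.4 band a reasonable "near-verbatim" marker.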

Generating Compliance Reports

Once you have refined flags, produce a structured report. JSON works for downstream tooling, but a human-readable summary helps legal review.

import json
from datetime import datetime

def generate_compliance_report(
    refined_flags: list[dict],
    dataset_name: str,
    total_samples: int,
    output_path: str = "copyright_report.json",
) -> dict:
    """Generate a compliance report from flagged samples."""
    high_risk = [f for f in refined_flags if f.get("fuzzy_similarity", 0) >= 0.4]
    medium_risk = [f for f in refined_flags if 0.2 <= f.get("fuzzy_similarity", 0) < 0.4]
    low_risk = [f for f in refined_flags if f.get("fuzzy_similarity", 0) < 0.2]

    report = {
        "report_date": datetime.now().isoformat(),
        "dataset": dataset_name,
        "total_samples_scanned": total_samples,
        "total_flags": len(refined_flags),
        "risk_breakdown": {
            "high": len(high_risk),
            "medium": len(medium_risk),
            "low": len(low_risk),
        },
        "flagged_samples": refined_flags,
    }

    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(report, f, indent=2)

    print(f"Report saved to {output_path}")
    print(f"  Total scanned: {total_samples}")
    print(f"  Flagged: {len(refined_flags)}")
    print(f"  High risk: {len(high_risk)}")
    print(f"  Medium risk: {len(medium_risk)}")
    print(f"  Low risk: {len(low_risk)}")

    return report

Putting It All Together

Here is a full end-to-end run so you can see how the pieces fit:

# 1. Build the reference index
index = CopyrightIndex(threshold=0.3, num_perm=128)
index.add_work("book-001", "This is the full text of a copyrighted novel...", title="Example Novel", author="Jane Author")
index.add_work("article-042", "Full text of a copyrighted research paper...", title="ML Survey 2025", author="Smith et al.")

# 2. Prepare training samples
training_samples = [
    {"id": "sample-0001", "text": "This is the full text of a copyrighted novel with minor edits..."},
    {"id": "sample-0002", "text": "Completely original text about training neural networks..."},
    {"id": "sample-0003", "text": "Another passage that partially overlaps with ML Survey 2025..."},
]

# 3. Scan
flags = scan_dataset(training_samples, index)
print(f"Initial flags: {len(flags)}")

# 4. Refine with fuzzy matching
samples_lookup = {s["id"]: s["text"] for s in training_samples}
refined = refine_flags(flags, index, samples_lookup, similarity_cutoff=0.15)
print(f"Refined flags: {len(refined)}")

# 5. Generate report
report = generate_compliance_report(refined, "my-training-set-v2", len(training_samples))

Common Errors and Fixes

datasketch not installed

pip install datasketch

If you need the Redis backend for large-scale indexes:

pip install "datasketch[redis]"

LSH threshold too high, missing obvious matches

A threshold of 0.5 or above in MinHashLSH means the LSH only returns candidates with at least ~50% estimated overlap. For copyright detection you want to catch partial matches too. Start at 0.3 and lower it if you are missing known infringements in your test set.

SequenceMatcher is slow on long documents

difflib.SequenceMatcher has quadratic time complexity. Never feed it full-length documents. Truncate to 5,000 characters or chunk the documents into paragraphs and match paragraph-by-paragraph. For production workloads, consider rapidfuzz as a drop-in replacement – it is written in C++ and runs 10-100x faster.
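
The paragraph-by-paragraph idea can be sketched in a few lines of stdlib Python; `best_chunk_match` is our own helper name, not a library function:

```python
from difflib import SequenceMatcher

def best_chunk_match(sample: str, reference: str) -> float:
    """Best per-paragraph similarity, keeping each SequenceMatcher
    call small instead of quadratic in the full document length."""
    paragraphs = [p for p in reference.split("\n\n") if p.strip()]
    if not paragraphs:
        return 0.0
    return max(
        SequenceMatcher(None, sample.lower(), p.lower()).ratio()
        for p in paragraphs
    )

doc = (
    "An original opening paragraph.\n\n"
    "The quick brown fox jumps over the lazy dog.\n\n"
    "A closing paragraph about something else."
)
# High score: the sample matches one paragraph almost verbatim
print(round(best_chunk_match("the quick brown fox jumps over the lazy dog", doc), 2))
```

Taking the max over paragraphs means a single copied paragraph inside an otherwise original document still scores high, which is exactly the case a whole-document ratio would dilute.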

pip install rapidfuzz

from rapidfuzz.fuzz import ratio

score = ratio("text sample one", "text sample two")
print(f"Similarity: {score}%")

MinHash num_perm too low gives noisy estimates

Using fewer than 128 permutations makes the Jaccard estimate unreliable. For copyright detection, 128 is a good default. Bump to 256 if you need tighter accuracy and can afford the memory. Going below 64 will produce too many false positives and false negatives to be useful.
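
To see why, here is a toy MinHash built on hashlib (not the datasketch implementation) that simulates each permutation by salting the hash and estimates Jaccard as the fraction of matching per-permutation minima. More permutations gives a tighter estimate of the true value:

```python
import hashlib

def minhash_sig(shingles: set[str], num_perm: int) -> list[int]:
    """One minimum hash value per 'permutation', simulated by salting."""
    return [
        min(int(hashlib.sha1(f"{seed}:{s}".encode()).hexdigest(), 16) for s in shingles)
        for seed in range(num_perm)
    ]

def estimate_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of positions where the two signatures agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

a = {f"shingle-{i}" for i in range(100)}
b = {f"shingle-{i}" for i in range(50, 150)}  # true Jaccard = 50/150 ≈ 0.33

for num_perm in (16, 64, 256):
    est = estimate_jaccard(minhash_sig(a, num_perm), minhash_sig(b, num_perm))
    print(num_perm, round(est, 3))
```

With 16 permutations the estimate can land well away from 0.33; by 256 it is reliably close. The standard error shrinks roughly as 1/sqrt(num_perm), which is why halving num_perm does not halve the noise.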

Handling Unicode and encoding errors

Training data scraped from the web often has mixed encodings. Normalize before shingling:

import unicodedata

def normalize_text(text: str) -> str:
    """Normalize unicode and collapse whitespace."""
    text = unicodedata.normalize("NFKD", text)
    return " ".join(text.split())

Call normalize_text() on both the reference works and training samples before passing them to shingle(). Skipping this step leads to missed matches where the only difference is a non-breaking space or a compatibility character such as a ligature. Note that NFKD does not fold curly quotes into straight ones; add an explicit replacement for those if your corpus contains them.
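
A quick standalone check that normalization closes the gap (the invisible culprit here is U+00A0, a non-breaking space):

```python
import unicodedata

def normalize_text(text: str) -> str:
    """Normalize unicode and collapse whitespace, as defined above."""
    text = unicodedata.normalize("NFKD", text)
    return " ".join(text.split())

plain = "the quick brown fox"
fancy = "the quick\u00a0brown fox"  # non-breaking space instead of a regular space

print(plain == fancy)                                  # → False (raw strings differ)
print(normalize_text(plain) == normalize_text(fancy))  # → True after normalization
```

Without normalization the two strings produce different shingles and different MinHash signatures, so an otherwise verbatim copy can slip past the index.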