Why Contamination Wrecks Your Eval Numbers

If your training data contains examples from the benchmark you evaluate on, your scores are lying. The model memorized the answers instead of learning to generalize. This is data contamination, and it’s more common than you’d think – web-scraped corpora regularly contain verbatim copies of MMLU questions, GSM8K problems, and HumanEval solutions. Papers have found contamination rates of 1-10% in popular pretraining datasets.

You need two detection approaches: exact n-gram overlap catches verbatim leaks, and embedding similarity catches paraphrased or reformatted versions of benchmark items. Here’s a pipeline that does both.

Install the dependencies:

pip install sentence-transformers datasets numpy

N-Gram Overlap Detection

N-gram overlap is the fastest contamination signal. You extract n-grams from both your training data and the benchmark, then check how many benchmark n-grams appear in each training sample. A high overlap ratio means that sample is likely contaminated.

from datasets import load_dataset

def extract_ngrams(text: str, n: int = 8) -> set:
    """Extract character-level n-grams from text."""
    text = text.lower().strip()
    tokens = text.split()
    if len(tokens) < n:
        return set()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_overlap_score(train_text: str, benchmark_ngrams: set, n: int = 8) -> float:
    """Calculate fraction of a training sample's n-grams found in the benchmark set."""
    sample_ngrams = extract_ngrams(train_text, n=n)
    if not sample_ngrams:
        return 0.0
    overlap = sample_ngrams & benchmark_ngrams
    return len(overlap) / len(sample_ngrams)

# Load a benchmark dataset (using HellaSwag as an example)
benchmark = load_dataset("Rowan/hellaswag", split="validation")

# Build a set of all benchmark n-grams
all_benchmark_ngrams = set()
for example in benchmark:
    text = example["ctx"]
    all_benchmark_ngrams.update(extract_ngrams(text, n=8))

print(f"Benchmark n-gram pool: {len(all_benchmark_ngrams):,} unique 8-grams")

# Score some training samples against the benchmark
training_samples = [
    "A woman is seen standing in a kitchen. She picks up a knife and begins chopping vegetables on a cutting board.",
    "The transformer architecture uses self-attention mechanisms to process sequences in parallel rather than sequentially.",
    "A man is seen sitting on a couch. He picks up a remote and turns on the television to watch the news.",
]

for i, sample in enumerate(training_samples):
    score = ngram_overlap_score(sample, all_benchmark_ngrams, n=8)
    print(f"Sample {i}: overlap={score:.4f} {'FLAGGED' if score > 0.5 else 'clean'}")

The n-gram size matters. Using 8-grams (sequences of 8 words) is a common choice in published contamination analyses; GPT-3's contamination analysis used 13-grams, and shorter sizes appear in other work. Shorter n-grams produce false positives from common phrases. Longer n-grams miss contamination with minor edits. A score above 0.5 means more than half of the sample's 8-grams appear verbatim in the benchmark – that's almost certainly a leak.
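To see the trade-off concretely, here's a small sketch (reusing the `extract_ngrams` helper from above) that scores one near-duplicate pair at several n-gram sizes; both texts are made-up examples:

```python
def extract_ngrams(text: str, n: int = 8) -> set:
    """Extract word-level n-grams from text."""
    tokens = text.lower().strip().split()
    if len(tokens) < n:
        return set()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

# Two made-up texts that differ in exactly two words
bench = "a woman is seen standing in a kitchen she picks up a knife"
train = "a woman is seen standing in a garden she picks up a shovel"

for n in (3, 5, 8):
    train_grams = extract_ngrams(train, n)
    overlap = len(train_grams & extract_ngrams(bench, n)) / len(train_grams)
    print(f"n={n}: overlap={overlap:.3f}")
```

Two single-word edits are enough to drive the 8-gram overlap to zero while trigrams still overlap heavily – exactly the false-negative versus false-positive trade-off described above.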

Embedding Similarity for Fuzzy Contamination

N-grams miss paraphrased contamination. Someone reformats a benchmark question, changes a few words, and the n-gram check passes. Embedding similarity catches these cases by comparing semantic meaning rather than surface text.

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Encode benchmark items
benchmark_texts = [
    "What is the capital of France?",
    "Solve for x: 2x + 5 = 15",
    "Write a function that returns the nth Fibonacci number.",
]
benchmark_embeddings = model.encode(benchmark_texts, normalize_embeddings=True)

# Encode training samples to check
training_texts = [
    "Name the capital city of France.",  # paraphrase of benchmark item 0
    "The weather in Paris is mild in spring.",  # unrelated
    "Implement a Python function to compute the nth number in the Fibonacci sequence.",  # paraphrase of item 2
    "Distributed training with DeepSpeed requires proper configuration of the optimizer.",  # unrelated
]
training_embeddings = model.encode(training_texts, normalize_embeddings=True)

# Compute cosine similarity (dot product since embeddings are normalized)
similarity_matrix = training_embeddings @ benchmark_embeddings.T

THRESHOLD = 0.85

for i, text in enumerate(training_texts):
    max_sim = float(np.max(similarity_matrix[i]))
    best_match = int(np.argmax(similarity_matrix[i]))
    status = "FLAGGED" if max_sim > THRESHOLD else "clean"
    print(f"Training[{i}]: max_sim={max_sim:.3f} -> benchmark[{best_match}] [{status}]")
    if max_sim > THRESHOLD:
        print(f"  Train: {text}")
        print(f"  Bench: {benchmark_texts[best_match]}")

The threshold of 0.85 works well for all-MiniLM-L6-v2. If you use a larger model like all-mpnet-base-v2, you may need to push it to 0.88-0.90, because similarity score distributions differ between embedding models. Always calibrate on known positive/negative pairs from your specific benchmark.
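One way to do that calibration, sketched here with made-up similarity numbers (real scores would come from encoding labeled paraphrase/unrelated pairs with your chosen model), is to sweep candidate thresholds and keep the one with the best F1:

```python
import numpy as np

# Hypothetical cosine similarities from a labeled calibration set:
# pairs known to be paraphrases of benchmark items vs. unrelated pairs.
positive_sims = np.array([0.91, 0.88, 0.93, 0.86, 0.90])
negative_sims = np.array([0.41, 0.55, 0.62, 0.38, 0.70])

def best_threshold(pos: np.ndarray, neg: np.ndarray) -> float:
    """Pick the candidate threshold with the highest F1 on the calibration pairs."""
    candidates = np.unique(np.concatenate([pos, neg]))
    best_t, best_f1 = 0.0, -1.0
    for t in candidates:
        tp = np.sum(pos >= t)   # paraphrases correctly flagged
        fp = np.sum(neg >= t)   # unrelated pairs wrongly flagged
        fn = np.sum(pos < t)    # paraphrases missed
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        if f1 > best_f1:
            best_t, best_f1 = float(t), f1
    return best_t

print(f"Calibrated threshold: {best_threshold(positive_sims, negative_sims):.2f}")
```

With these toy numbers the sweep lands on the lowest similarity that still separates the two groups cleanly; on real data the distributions overlap and the F1-optimal point is a genuine trade-off.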

Combined Pipeline with Contamination Report

A real pipeline combines both signals into a single contamination score and generates a report you can review before training.

import json
from dataclasses import dataclass, asdict
from typing import Optional

import numpy as np
from sentence_transformers import SentenceTransformer
from datasets import load_dataset

@dataclass
class ContaminationResult:
    sample_index: int
    text_preview: str
    ngram_score: float
    embedding_score: float
    combined_score: float
    matched_benchmark: Optional[str]
    flagged: bool

def extract_ngrams(text: str, n: int = 8) -> set:
    text = text.lower().strip()
    tokens = text.split()
    if len(tokens) < n:
        return set()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_overlap_score(train_text: str, benchmark_ngrams: set, n: int = 8) -> float:
    sample_ngrams = extract_ngrams(train_text, n=n)
    if not sample_ngrams:
        return 0.0
    return len(sample_ngrams & benchmark_ngrams) / len(sample_ngrams)

class ContaminationDetector:
    def __init__(
        self,
        benchmark_texts: list[str],
        model_name: str = "all-MiniLM-L6-v2",
        ngram_size: int = 8,
        ngram_threshold: float = 0.5,
        embedding_threshold: float = 0.85,
        combined_threshold: float = 0.4,
    ):
        self.benchmark_texts = benchmark_texts
        self.ngram_size = ngram_size
        self.ngram_threshold = ngram_threshold
        self.embedding_threshold = embedding_threshold
        self.combined_threshold = combined_threshold

        # Build n-gram index
        self.benchmark_ngrams = set()
        for text in benchmark_texts:
            self.benchmark_ngrams.update(extract_ngrams(text, n=ngram_size))

        # Build embedding index
        self.model = SentenceTransformer(model_name)
        self.benchmark_embeddings = self.model.encode(
            benchmark_texts, normalize_embeddings=True, show_progress_bar=False
        )

    def check_samples(self, training_texts: list[str]) -> list[ContaminationResult]:
        training_embeddings = self.model.encode(
            training_texts, normalize_embeddings=True, show_progress_bar=True
        )
        sim_matrix = training_embeddings @ self.benchmark_embeddings.T

        results = []
        for i, text in enumerate(training_texts):
            ng_score = ngram_overlap_score(text, self.benchmark_ngrams, self.ngram_size)
            emb_score = float(np.max(sim_matrix[i]))
            best_match_idx = int(np.argmax(sim_matrix[i]))

            # Weighted combination: embedding similarity is the stronger signal
            combined = 0.4 * ng_score + 0.6 * emb_score
            flagged = (
                ng_score > self.ngram_threshold
                or emb_score > self.embedding_threshold
                or combined > self.combined_threshold
            )

            results.append(ContaminationResult(
                sample_index=i,
                text_preview=text[:120],
                ngram_score=round(ng_score, 4),
                embedding_score=round(emb_score, 4),
                combined_score=round(combined, 4),
                matched_benchmark=self.benchmark_texts[best_match_idx] if flagged else None,
                flagged=flagged,
            ))
        return results

    def generate_report(self, results: list[ContaminationResult]) -> dict:
        flagged = [r for r in results if r.flagged]
        report = {
            "total_samples": len(results),
            "flagged_count": len(flagged),
            "contamination_rate": round(len(flagged) / len(results) * 100, 2) if results else 0,
            "flagged_samples": [asdict(r) for r in flagged],
        }
        return report

# Usage
benchmark_data = load_dataset("Rowan/hellaswag", split="validation[:200]")
benchmark_texts = [ex["ctx"] for ex in benchmark_data]

detector = ContaminationDetector(benchmark_texts=benchmark_texts)

# Check your training data
training_data = [
    "A woman walks into a kitchen and picks up a pan to start cooking dinner.",
    "Gradient descent minimizes the loss function by iteratively updating model weights.",
    "A man is seen playing basketball. He dribbles the ball and takes a shot at the hoop.",
]

results = detector.check_samples(training_data)
report = detector.generate_report(results)

print(f"Contamination rate: {report['contamination_rate']}%")
print(f"Flagged: {report['flagged_count']} / {report['total_samples']}")

# Save the full report
with open("contamination_report.json", "w") as f:
    json.dump(report, f, indent=2)

The combined score weights embedding similarity at 60% and n-gram overlap at 40%. Embedding similarity catches more real contamination in practice, while n-gram overlap is nearly zero-false-positive for verbatim copies. A sample gets flagged if either individual signal exceeds its threshold or the combined score crosses 0.4.
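The flagging rule has a non-obvious consequence worth checking: with the combined threshold at 0.4 and a 0.6 weight on embeddings, an embedding similarity above roughly 0.67 flags a sample even with zero n-gram overlap. A standalone sketch of the same rule, using the default thresholds from the detector above:

```python
def is_flagged(ng_score: float, emb_score: float,
               ngram_threshold: float = 0.5,
               embedding_threshold: float = 0.85,
               combined_threshold: float = 0.4) -> bool:
    """The detector's flagging rule, isolated for illustration."""
    combined = 0.4 * ng_score + 0.6 * emb_score
    return (ng_score > ngram_threshold
            or emb_score > embedding_threshold
            or combined > combined_threshold)

# Edge cases worth knowing about:
print(is_flagged(0.0, 0.70))   # True: combined = 0.42, flagged with zero n-gram overlap
print(is_flagged(0.6, 0.10))   # True: n-gram score alone crosses its threshold
print(is_flagged(0.1, 0.50))   # False: combined = 0.34, all thresholds pass
```

If that 0.67 effective embedding floor is too aggressive for your data, raise `combined_threshold` rather than the embedding threshold, since the latter also guards the verbatim-paraphrase path.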

Scaling to Large Datasets

For training corpora with millions of examples, you cannot encode everything in a single batch. Process in chunks and use a FAISS index for the nearest-neighbor search instead of materializing the full similarity matrix:

import faiss

# Build a FAISS index from benchmark embeddings
dimension = benchmark_embeddings.shape[1]
index = faiss.IndexFlatIP(dimension)  # inner product = cosine sim for normalized vectors
index.add(benchmark_embeddings)

# Query in batches, collecting the top-1 similarity for each training sample
batch_size = 10000
max_sims = []
for start in range(0, len(training_embeddings), batch_size):
    batch = training_embeddings[start:start + batch_size]
    scores, indices = index.search(batch, k=1)  # top-1 nearest benchmark item
    max_sims.append(scores[:, 0])  # scores[i][0] is the max similarity for sample i

Note that IndexFlatIP is still an exact brute-force search – it is fast because it avoids materializing the full similarity matrix and uses optimized kernels, not because it is approximate. For very large benchmark sets, switch to an approximate index such as IndexIVFFlat or IndexHNSWFlat to get sub-linear query time.
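If FAISS isn't available, the same chunked top-1 search works in plain NumPy, with peak memory bounded by the batch size rather than the full similarity matrix. A sketch with random stand-in embeddings (the sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in normalized embeddings: 5,000 training rows, 1,000 benchmark rows, dim 32
bench = rng.normal(size=(1000, 32)).astype(np.float32)
bench /= np.linalg.norm(bench, axis=1, keepdims=True)
train = rng.normal(size=(5000, 32)).astype(np.float32)
train /= np.linalg.norm(train, axis=1, keepdims=True)

batch_size = 512
max_sims = np.empty(len(train), dtype=np.float32)
best_idx = np.empty(len(train), dtype=np.int64)
for start in range(0, len(train), batch_size):
    # Only a (batch_size x n_benchmark) slice is in memory at once
    sims = train[start:start + batch_size] @ bench.T
    max_sims[start:start + batch_size] = sims.max(axis=1)
    best_idx[start:start + batch_size] = sims.argmax(axis=1)

print(f"Scored {len(max_sims)} samples, top similarity {max_sims.max():.3f}")
```

This stays exact, like IndexFlatIP; it just trades FAISS's optimized kernels for zero extra dependencies.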

Common Errors and Fixes

RuntimeError: CUDA out of memory when encoding large datasets. The sentence-transformers encode method already batches internally (default batch_size=32), but large batches of long texts can still exhaust GPU memory. Lower batch_size explicitly, or move encoding to CPU:

embeddings = model.encode(texts, batch_size=64, device="cpu", normalize_embeddings=True)

Use device="cpu" when your GPU cannot hold the full batch. It’s slower but won’t crash.

Low contamination scores on known contaminated data. This usually means your n-gram size is too large. If benchmark items are short (single sentences), 8-grams might be longer than the entire text. Drop to n=5 or even n=3 for short-text benchmarks like TruthfulQA or MMLU.
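You can verify this failure mode directly: a six-word benchmark item yields no 8-grams at all, so its overlap score is always zero no matter how contaminated the data is:

```python
def extract_ngrams(text: str, n: int = 8) -> set:
    """Extract word-level n-grams from text."""
    tokens = text.lower().strip().split()
    if len(tokens) < n:
        return set()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

short = "What is the capital of France?"          # 6 tokens
print(extract_ngrams(short, n=8))                 # empty set: nothing to match at n=8
print(len(extract_ngrams(short, n=3)))            # 4 trigrams: now there is signal
```

A quick sanity check before a full run: compute the median token length of your benchmark items and make sure it comfortably exceeds your n-gram size.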

False positives from common phrases. N-gram overlap flags samples that share boilerplate text with the benchmark (e.g., “the answer is” or “which of the following”). Fix this by filtering out n-grams that appear in more than 1% of your training set before computing overlap – they’re too common to indicate contamination.

from collections import Counter

# Count n-gram frequency across training data
ngram_counts = Counter()
for text in training_texts:
    ngram_counts.update(extract_ngrams(text, n=8))

# Remove n-grams that appear in >1% of samples
total = len(training_texts)
common_ngrams = {ng for ng, count in ngram_counts.items() if count / total > 0.01}
filtered_benchmark_ngrams = all_benchmark_ngrams - common_ngrams

KeyError or missing fields when loading benchmark datasets. Different datasets on Hugging Face use different column names. Always inspect the dataset first:

ds = load_dataset("Rowan/hellaswag", split="validation[:5]")
print(ds.column_names)
print(ds[0])

Then pick the right field for your contamination check – usually the question or context field, not labels.
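If you check many benchmarks, a small helper that probes a few common field names saves repetition. The candidate list below is a guess based on frequently seen schemas – extend it for the datasets you actually use, and always verify against the printed schema:

```python
def pick_text_field(example: dict,
                    candidates: tuple = ("question", "ctx", "text", "prompt")) -> str:
    """Return the first plausible text field present in a dataset row.

    The candidate names are assumptions, not a complete list; inspect the
    dataset's column_names if none of them match.
    """
    for key in candidates:
        if key in example and isinstance(example[key], str):
            return key
    raise KeyError(f"No known text field among columns: {sorted(example)}")

print(pick_text_field({"ctx": "A woman is seen...", "label": 0}))  # -> ctx
```

Raising instead of silently falling back to a label column is deliberate: scoring labels against the benchmark would produce a meaningless contamination report.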