When your training data changes between runs, you need to know exactly what changed. Maybe a vendor silently dropped 10,000 rows. Maybe someone “fixed” a label column and introduced a typo across half the dataset. A dataset diff pipeline catches silent data corruption, tracks additions and deletions, and gives you an audit trail that makes debugging six months from now actually possible.

The approach here is straightforward: hash every row, compare hashes between versions, and produce a structured changelog. No heavyweight infrastructure required – just Python, Pandas, and hashlib.

Hash-Based Row Diffing

The core idea is to assign each row a deterministic hash based on its content, then use a primary key to match rows across two dataset versions. Rows present in only one version are additions or deletions. Rows present in both but with different hashes are modifications.

import pandas as pd
import hashlib
import json
from datetime import datetime

def compute_row_hash(row: pd.Series, columns: list[str]) -> str:
    """Hash the concatenated values of specified columns."""
    raw = "|".join(str(row[col]) for col in columns)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

# Sample dataset v1
v1 = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "text": [
        "The movie was fantastic",
        "Terrible acting throughout",
        "A masterpiece of cinema",
        "Waste of time",
        "Solid performances all around",
    ],
    "label": ["positive", "negative", "positive", "negative", "positive"],
})

# Sample dataset v2 — row 3 label changed, row 5 removed, row 6 added
v2 = pd.DataFrame({
    "id": [1, 2, 3, 4, 6],
    "text": [
        "The movie was fantastic",
        "Terrible acting throughout",
        "A masterpiece of cinema",
        "Waste of time",
        "Surprisingly good sequel",
    ],
    "label": ["positive", "negative", "negative", "negative", "positive"],
})

content_cols = ["text", "label"]

v1["_hash"] = v1.apply(compute_row_hash, axis=1, columns=content_cols)
v2["_hash"] = v2.apply(compute_row_hash, axis=1, columns=content_cols)

pk = "id"
merged = v1[[pk, "_hash"]].merge(
    v2[[pk, "_hash"]], on=pk, how="outer", suffixes=("_old", "_new")
)

added = merged[merged["_hash_old"].isna()][pk].tolist()
removed = merged[merged["_hash_new"].isna()][pk].tolist()
modified = merged[
    merged["_hash_old"].notna()
    & merged["_hash_new"].notna()
    & (merged["_hash_old"] != merged["_hash_new"])
][pk].tolist()

print(f"Added:    {added}")    # [6]
print(f"Removed:  {removed}")  # [5]
print(f"Modified: {modified}") # [3]

This runs in seconds for datasets up to a few million rows. The SHA-256 hash is overkill for most use cases (MD5 would be faster), but it avoids any collision debates during code review.

Generating Structured Changelogs

Knowing the diff is step one. You also want a persistent record – a changelog file that accumulates across versions so you can trace the full history of your dataset.

def build_changelog_entry(
    version_old: str,
    version_new: str,
    added_ids: list,
    removed_ids: list,
    modified_ids: list,
    v2_df: pd.DataFrame,
    pk: str,
    max_samples: int = 3,
) -> dict:
    """Build a structured changelog entry with sample rows."""
    entry = {
        "timestamp": datetime.utcnow().isoformat() + "Z",
        "version_old": version_old,
        "version_new": version_new,
        "stats": {
            "added": len(added_ids),
            "removed": len(removed_ids),
            "modified": len(modified_ids),
            "total_changes": len(added_ids) + len(removed_ids) + len(modified_ids),
        },
        "sample_added": (
            v2_df[v2_df[pk].isin(added_ids[:max_samples])]
            .drop(columns=["_hash"], errors="ignore")
            .to_dict(orient="records")
        ),
        "sample_removed_ids": removed_ids[:max_samples],
        "sample_modified_ids": modified_ids[:max_samples],
    }
    return entry

entry = build_changelog_entry(
    version_old="v1.0",
    version_new="v1.1",
    added_ids=added,
    removed_ids=removed,
    modified_ids=modified,
    v2_df=v2,
    pk=pk,
)

# Append to changelog file
changelog_path = "dataset_changelog.json"
try:
    with open(changelog_path, "r") as f:
        changelog = json.load(f)
except FileNotFoundError:
    changelog = []

changelog.append(entry)

with open(changelog_path, "w") as f:
    json.dump(changelog, f, indent=2)

print(json.dumps(entry, indent=2))

This gives you output like:

{
  "timestamp": "2026-02-15T10:30:00Z",
  "version_old": "v1.0",
  "version_new": "v1.1",
  "stats": {
    "added": 1,
    "removed": 1,
    "modified": 1,
    "total_changes": 3
  },
  "sample_added": [
    {"id": 6, "text": "Surprisingly good sequel", "label": "positive"}
  ],
  "sample_removed_ids": [5],
  "sample_modified_ids": [3]
}

Store this changelog alongside your dataset in version control or object storage. When something breaks downstream, you grep the changelog instead of re-diffing old snapshots.
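"Grep the changelog" can also be a few lines of Python. A minimal sketch (the `high_churn_versions` helper and its threshold are illustrative, keyed to the entry schema above):

```python
def high_churn_versions(changelog: list[dict], threshold: int = 100) -> list[str]:
    """Return versions whose total change count meets or exceeds the threshold."""
    return [
        entry["version_new"]
        for entry in changelog
        if entry["stats"]["total_changes"] >= threshold
    ]

history = [
    {"version_new": "v1.1", "stats": {"total_changes": 3}},
    {"version_new": "v1.2", "stats": {"total_changes": 250}},
]
print(high_churn_versions(history))  # ['v1.2']
```

The same pattern works for any question you can phrase over the entry fields: first version where a given id was removed, cumulative additions since a release, and so on.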

Column-Level Diff for Modified Rows

Knowing that row 3 changed is useful. Knowing that its label column flipped from "positive" to "negative" is actionable. For modified rows, you want a per-column diff.

def column_level_diff(
    v1_df: pd.DataFrame,
    v2_df: pd.DataFrame,
    modified_ids: list,
    pk: str,
    compare_cols: list[str],
) -> list[dict]:
    """For each modified row, report which columns changed and the old/new values."""
    diffs = []
    v1_indexed = v1_df.set_index(pk)
    v2_indexed = v2_df.set_index(pk)

    for row_id in modified_ids:
        old_row = v1_indexed.loc[row_id]
        new_row = v2_indexed.loc[row_id]
        changes = {}
        for col in compare_cols:
            if old_row[col] != new_row[col]:
                changes[col] = {"old": old_row[col], "new": new_row[col]}
        if changes:
            diffs.append({"id": row_id, "changes": changes})

    return diffs

diffs = column_level_diff(v1, v2, modified, pk="id", compare_cols=content_cols)
for d in diffs:
    print(d)
# {'id': 3, 'changes': {'label': {'old': 'positive', 'new': 'negative'}}}

This is where you catch the sneaky bugs. A label flip on a single row is invisible in aggregate stats but shows up immediately in a column-level diff. Pipe these diffs into your alerting system or CI checks – if more than N% of labels changed between versions, block the pipeline and flag it for review.
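A sketch of such a CI gate, built on the `column_level_diff` output (the `check_label_drift` name and the 5% default are illustrative, not a standard):

```python
def check_label_drift(diffs: list[dict], total_rows: int, max_fraction: float = 0.05) -> bool:
    """True if the share of rows with a changed 'label' exceeds max_fraction."""
    label_changes = sum(1 for d in diffs if "label" in d["changes"])
    return label_changes / total_rows > max_fraction

example_diffs = [{"id": 3, "changes": {"label": {"old": "positive", "new": "negative"}}}]
print(check_label_drift(example_diffs, total_rows=5))  # True: 20% of labels flipped
```

In CI, a `True` here would fail the job and attach the offending diffs to the build log for review.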

Common Errors and Fixes

Memory issues with large datasets. If your dataset has tens of millions of rows, loading both versions into memory at once will blow up. Process in chunks instead: read both CSVs with pd.read_csv(..., chunksize=100_000), compute hashes per chunk, then merge the hash tables. The hash table itself (just primary key + hash string) is much smaller than the full data.
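A sketch of that chunked approach, assuming CSV inputs (the `hash_table_from_csv` helper is illustrative; it reproduces the same `"|"`-joined encoding as `compute_row_hash`, just vectorized per chunk):

```python
import hashlib

import pandas as pd

def hash_table_from_csv(path: str, pk: str, content_cols: list[str],
                        chunksize: int = 100_000) -> pd.DataFrame:
    """Stream a CSV in chunks and return only (pk, hash) pairs, never the full data."""
    pieces = []
    for chunk in pd.read_csv(path, chunksize=chunksize):
        # Same "|"-joined string encoding as compute_row_hash, applied row-wise
        raw = chunk[content_cols].astype(str).agg("|".join, axis=1)
        hashes = raw.map(lambda s: hashlib.sha256(s.encode("utf-8")).hexdigest())
        pieces.append(pd.DataFrame({pk: chunk[pk], "_hash": hashes}))
    return pd.concat(pieces, ignore_index=True)
```

Merge the two resulting hash tables exactly as in the in-memory version; even at tens of millions of rows, a table of keys plus 64-character digests stays manageable.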

Floating point comparison issues. Hashing floats is a trap. The computation 0.1 + 0.2 yields 0.30000000000000004, which hashes differently from 0.3 even though the gap is pure floating-point noise. Round numeric columns before hashing:

def compute_row_hash_safe(row: pd.Series, columns: list[str], float_precision: int = 6) -> str:
    parts = []
    for col in columns:
        val = row[col]
        if isinstance(val, float):
            val = round(val, float_precision)
        parts.append(str(val))
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()

Missing primary key. If your dataset does not have a natural primary key, you can fall back on the row index – but a mere reordering (like a shuffle) then pairs each position with a different row, so unchanged rows show up as "modified," and any length change spills into spurious additions and removals. A better approach is to hash all columns to create a composite key, then diff based on presence or absence of that full-row hash.
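A sketch of that keyless fallback (the `keyless_diff` name is illustrative; note that without a stable key, a modified row is indistinguishable from one removal plus one addition):

```python
import hashlib

import pandas as pd

def keyless_diff(v1_df: pd.DataFrame, v2_df: pd.DataFrame, columns: list[str]) -> dict:
    """Compare two key-less frames as multisets of full-row hashes."""
    def row_hashes(df: pd.DataFrame) -> pd.Series:
        raw = df[columns].astype(str).agg("|".join, axis=1)
        return raw.map(lambda s: hashlib.sha256(s.encode("utf-8")).hexdigest())

    counts_old = row_hashes(v1_df).value_counts()
    counts_new = row_hashes(v2_df).value_counts()
    delta = counts_new.sub(counts_old, fill_value=0)  # multiset difference of hash counts
    return {
        "added": int(delta[delta > 0].sum()),
        "removed": int(-delta[delta < 0].sum()),
    }
```

Counting hashes rather than collecting them in a set means duplicate rows are handled correctly: dropping one of two identical rows registers as a removal.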

Encoding issues. Non-ASCII characters in text columns can cause inconsistent hashes across platforms if you are not explicit about encoding. Always use .encode("utf-8") as shown above, and normalize Unicode with unicodedata.normalize("NFC", text) before hashing if your data contains accented characters or non-Latin scripts.
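A minimal sketch of that normalization step (the `normalized_hash` wrapper is illustrative):

```python
import hashlib
import unicodedata

def normalized_hash(text: str) -> str:
    """NFC-normalize before hashing so composed and decomposed forms hash identically."""
    nfc = unicodedata.normalize("NFC", text)
    return hashlib.sha256(nfc.encode("utf-8")).hexdigest()

# "é" as one code point vs. "e" plus a combining accent: same text, different bytes
composed = "caf\u00e9"
decomposed = "cafe\u0301"
print(normalized_hash(composed) == normalized_hash(decomposed))  # True
```

Without the normalization the two spellings hash differently, and a dataset exported from a different OS or tool can appear fully "modified" when nothing changed.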