The Fast Path: Exact and Near-Duplicate Removal

Duplicate training examples bias your model toward repeated patterns and inflate metrics during evaluation. The fix depends on what kind of duplicates you have. Exact duplicates are trivial with pandas. Near-duplicates – rephrased text, slightly altered records – require fuzzy hashing.

Here is a complete pipeline that handles both cases on a text dataset:

import pandas as pd
from datasketch import MinHash, MinHashLSH

# Load your dataset
df = pd.read_csv("training_data.csv")
print(f"Starting rows: {len(df)}")

# Step 1: Drop exact duplicates
df = df.drop_duplicates(subset=["text"], keep="first")
print(f"After exact dedup: {len(df)}")

# Step 2: Near-duplicate detection with MinHash LSH
lsh = MinHashLSH(threshold=0.8, num_perm=128)
minhashes = {}
duplicates = set()

for idx, text in enumerate(df["text"].values):
    mh = MinHash(num_perm=128)
    for word in text.lower().split():
        mh.update(word.encode("utf-8"))

    # Query for near-duplicates before inserting
    result = lsh.query(mh)
    if result:
        duplicates.add(idx)
    else:
        lsh.insert(str(idx), mh)
        minhashes[idx] = mh

df_clean = df.drop(df.index[list(duplicates)])
print(f"After near-dedup: {len(df_clean)}")

Install datasketch first:

pip install datasketch

The threshold=0.8 means two documents with 80% or more Jaccard similarity count as near-duplicates. For training data, 0.7-0.8 works well. Push it to 0.9 if you only want to catch near-identical copies.
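To build intuition for what that threshold measures, here is a minimal sketch of Jaccard similarity computed on word sets (the `jaccard` helper is illustrative, not part of datasketch):

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the word sets of two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

s1 = "the quick brown fox jumps over the lazy dog"
s2 = "the quick brown fox leaps over the lazy dog"
print(f"{jaccard(s1, s2):.2f}")  # 7 shared words / 9 total words -> 0.78
```

A one-word rewrite already drops similarity to 0.78, so at `threshold=0.8` this pair would survive dedup, while at 0.75 it would be flagged.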

Handling Missing Values Without Destroying Signal

Missing values in ML datasets are not just an annoyance – they crash your training loop or silently degrade predictions. The right fix depends on the column type and how much is missing.

import pandas as pd
import numpy as np

df = pd.read_csv("features.csv")

# Check what you're dealing with
missing = df.isnull().sum()
missing_pct = (missing / len(df)) * 100
print(missing_pct[missing_pct > 0].sort_values(ascending=False))

# Rule of thumb: drop columns missing more than 40% of values
threshold = 0.4
sparse_cols = missing_pct[missing_pct > threshold * 100].index.tolist()
df = df.drop(columns=sparse_cols)
print(f"Dropped sparse columns: {sparse_cols}")

# Numerical columns: fill with median (robust to outliers)
num_cols = df.select_dtypes(include=[np.number]).columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# Categorical columns: fill with mode
cat_cols = df.select_dtypes(include=["object", "category"]).columns
for col in cat_cols:
    df[col] = df[col].fillna(df[col].mode()[0])

# Verify
assert df.isnull().sum().sum() == 0, "Still have missing values"

Do not blindly use mean imputation on numerical columns. If your feature has outliers (and ML features usually do), the mean gets pulled toward those extremes. Median is almost always the safer default.
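A quick demonstration of why, using a toy series with one data-entry error:

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 990])  # 990 is a data-entry error

print(s.mean())    # 174.67 -- dragged far from the typical value
print(s.median())  # 12.0   -- unaffected by the single outlier
```

Imputing with 174.67 would plant an implausible value in every previously missing cell; the median stays representative of the bulk of the data.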

A common error you will hit with fillna on categorical columns:

IndexError: index 0 is out of bounds for axis 0 with size 0

This happens when a column is entirely null, so mode() returns an empty Series. Guard against it:

for col in cat_cols:
    mode_val = df[col].mode()
    df[col] = df[col].fillna(mode_val[0] if len(mode_val) > 0 else "unknown")

Detecting and Removing Outliers

Outliers in feature columns shift learned decision boundaries. The IQR method catches the obvious ones without assuming your data is normally distributed:

def remove_outliers_iqr(df, columns, factor=1.5):
    """Remove rows with outliers in any of the specified columns."""
    mask = pd.Series(True, index=df.index)
    for col in columns:
        q1 = df[col].quantile(0.25)
        q3 = df[col].quantile(0.75)
        iqr = q3 - q1
        lower = q1 - factor * iqr
        upper = q3 + factor * iqr
        mask &= df[col].between(lower, upper)

    removed = len(df) - mask.sum()
    print(f"Removing {removed} outlier rows ({removed/len(df)*100:.1f}%)")
    return df[mask]

feature_cols = ["price", "quantity", "rating"]
df = remove_outliers_iqr(df, feature_cols)

Be careful with the factor parameter. The conventional 1.5 is already fairly aggressive: on perfectly normal data it flags roughly 0.7% of values per column, and because the function masks across every column, the removals compound. Use 3.0 if you only want to catch extreme outliers that are almost certainly data errors.
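One way to calibrate factor before touching real data is to measure how much of a known-clean distribution it would discard. A sketch on synthetic standard-normal data (the percentages will vary slightly with the seed):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
clean = pd.Series(rng.normal(size=100_000))  # no real outliers by construction

for factor in (1.5, 3.0):
    q1, q3 = clean.quantile(0.25), clean.quantile(0.75)
    iqr = q3 - q1
    flagged = ~clean.between(q1 - factor * iqr, q3 + factor * iqr)
    print(f"factor={factor}: flags {flagged.mean():.2%} of clean normal data")
```

Expect roughly 0.7% flagged at 1.5 and essentially nothing at 3.0; anything your chosen factor flags on clean synthetic data is collateral damage on real data too.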

Scaling Up: text-dedup for Large Corpora

For datasets over a few hundred thousand rows, the per-row loop above gets slow. The text-dedup library wraps MinHash, SimHash, and suffix array methods into a CLI and Python API purpose-built for deduplicating large text corpora like those used for LLM training:

pip install text-dedup

# MinHash dedup on a Hugging Face dataset
python -m text_dedup.minhash \
    --path "your_dataset" \
    --name "default" \
    --split "train" \
    --output "deduped_output" \
    --column "text" \
    --threshold 0.75 \
    --num_perm 256

text-dedup works directly with Hugging Face datasets and handles sharding, parallel processing, and memory management. On a single machine, it can deduplicate tens of millions of documents. For exact substring dedup (catching copy-pasted paragraphs rather than whole-document duplicates), use the suffix array method:

python -m text_dedup.suffix_array \
    --path "your_dataset" \
    --name "default" \
    --split "train" \
    --output "deduped_output" \
    --column "text" \
    --google_repo_path "/path/to/deduplicate-text-datasets"

This is the same approach Google Research used to deduplicate C4 and other pretraining corpora.

Fixing Data Types and Inconsistent Formats

Type mismatches silently corrupt your features. A “price” column read as strings because one row has a dollar sign, or dates stored as mixed formats, will either crash your model or produce garbage features:

# Force numeric conversion, coercing bad values to NaN
df["price"] = pd.to_numeric(df["price"].str.replace(r"[$,]", "", regex=True), errors="coerce")

# Standardize string columns
df["category"] = df["category"].str.strip().str.lower()

# Spot the damage
coerced_nulls = df["price"].isnull().sum()
print(f"Coerced {coerced_nulls} non-numeric price values to NaN")

The errors="coerce" flag is essential. Without it, pd.to_numeric raises a ValueError on the first non-numeric string and your whole pipeline stops. With it, bad values become NaN and you can handle them with the imputation code from earlier.
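A self-contained illustration of the coercion behavior on a few representative messy values:

```python
import pandas as pd

raw = pd.Series(["$1,200", "850", "N/A", "$99.50"])

# Strip currency formatting, then coerce anything non-numeric to NaN
clean = pd.to_numeric(raw.str.replace(r"[$,]", "", regex=True), errors="coerce")

print(clean.tolist())         # [1200.0, 850.0, nan, 99.5]
print(clean.isnull().sum())   # 1 -- only "N/A" failed conversion
```

Always print the coerced-NaN count after a conversion like this: a large number means the column has a formatting problem you have not handled yet, not a few genuinely bad rows.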

Common Pitfalls

Deduplicating after train/test split. Always deduplicate the full dataset first, then split. If the same record ends up in both train and test, your evaluation metrics are lying.
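The safe ordering, sketched with a plain pandas split so the example stays dependency-free (substitute your own splitter, e.g. scikit-learn's, in real pipelines):

```python
import pandas as pd

df = pd.DataFrame({"text": ["a", "b", "a", "c", "d"],
                   "label": [0, 1, 0, 1, 0]})

# 1. Deduplicate the FULL dataset first
df = df.drop_duplicates(subset=["text"], keep="first")

# 2. Only then split -- no record can now appear on both sides
train = df.sample(frac=0.8, random_state=42)
test = df.drop(train.index)

assert not set(train["text"]) & set(test["text"])
```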

Using drop_duplicates on float columns. Floating-point comparison is unreliable. Two values that should be identical might differ by 1e-15 due to precision. Round first:

df["score"] = df["score"].round(6)
df = df.drop_duplicates(subset=["score", "user_id"])

Ignoring label-text mismatches. Duplicate text with different labels is a data quality issue, not just a dedup problem. Find and flag them:

conflicts = df.groupby("text")["label"].nunique()
conflicts = conflicts[conflicts > 1]
print(f"Found {len(conflicts)} texts with conflicting labels")

These conflicting records need manual review or majority-vote resolution – dropping them outright can remove valid edge cases.
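If you opt for majority vote, one way to sketch the resolution (keeping the most common label per text; note that `mode()` returns labels in sorted order, so exact ties resolve to the smaller label):

```python
import pandas as pd

df = pd.DataFrame({
    "text":  ["hi", "hi", "hi", "bye", "bye"],
    "label": [1, 1, 0, 0, 1],
})

# Most common label per text; ties fall back to the first mode value
majority = df.groupby("text")["label"].agg(lambda s: s.mode().iloc[0])
df_resolved = majority.reset_index()
print(df_resolved)  # "hi" resolves to 1 (2-vs-1); "bye" is a tie
```

Tied texts like "bye" above are exactly the ones worth routing to manual review rather than letting an arbitrary tiebreak decide.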