Profile First, Clean Second

Most data cleaning is reactive – someone finds a bug, you patch it. A better approach: profile the dataset programmatically, extract what is broken, and apply rule-based fixes automatically. That way cleaning is reproducible and auditable.

Here is the full stack: ydata-profiling generates the profile, pandas handles transformations, and scipy catches outliers. Install everything first:

pip install ydata-profiling pandas scipy

Now generate a profile report from a messy dataset:

import pandas as pd
import numpy as np
from ydata_profiling import ProfileReport

# Create a realistic messy dataset
np.random.seed(42)
n = 500

df = pd.DataFrame({
    "customer_id": range(1, n + 1),
    "age": np.random.choice([25, 30, 35, 40, None, 999, -5, 45, 50, 28], size=n),
    "income": np.concatenate([
        np.random.normal(55000, 15000, n - 10),
        [1e9, -500, None, None, None, 0, 0, 999999, 55000, 55000]
    ]),
    "signup_date": np.random.choice(
        ["2024-01-15", "2024-02-20", "not_a_date", None, "2024-13-45", "2024-06-10"],
        size=n
    ),
    "email": [f"user{i}@example.com" for i in range(1, n + 1)],
    "category": np.random.choice(["A", "B", "C", None, "UNKNOWN", ""], size=n),
})

# Inject exact duplicates
df = pd.concat([df, df.sample(20, random_state=42)], ignore_index=True)

print(f"Raw dataset: {df.shape[0]} rows, {df.shape[1]} columns")

# Generate profile report
profile = ProfileReport(df, title="Data Quality Report", explorative=True)
profile.to_file("profile_report.html")
print("Profile report saved to profile_report.html")

Open that HTML file and you will see every column broken down: missing counts, distributions, correlations, duplicate rows, and alerts. But the real power is extracting those alerts programmatically.

Extract Cleaning Recommendations from the Profile

The profile report object has a structured description you can query in code. No need to eyeball the HTML – pull the stats directly:

description = profile.get_description()

# Check alerts raised by the profiler
for alert in description.alerts:
    print(f"Alert: {alert.alert_type.name} on column '{alert.column_name}'")

# Extract per-column stats
variables = description.variables
for col_name, stats in variables.items():
    n_missing = stats.get("n_missing", 0)
    p_missing = stats.get("p_missing", 0)
    col_type = stats.get("type", "unknown")
    n_distinct = stats.get("n_distinct", 0)

    if p_missing > 0:
        print(f"  {col_name}: {p_missing:.1%} missing ({n_missing} values)")
    if n_distinct is not None and n_distinct < 10:
        print(f"  {col_name}: low cardinality ({n_distinct} distinct values)")

The alerts list tells you which columns have high missing rates, high cardinality, constant values, or strong correlations. You can build your cleaning rules directly from these signals instead of hardcoding column names.
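As a sketch of that idea, here is one way to turn alerts into a cleaning plan. It assumes each alert exposes `alert_type.name` and `column_name` as in the loop above; the alert names and action labels in the map are illustrative, not an exhaustive list of what the profiler emits.

```python
# Illustrative mapping from profiler alert names to cleaning actions.
# The keys here are examples; check your profiler version's alert names.
ALERT_ACTIONS = {
    "MISSING": "impute",
    "HIGH_CARDINALITY": "review",
    "CONSTANT": "drop_column",
    "DUPLICATES": "deduplicate",
}

def rules_from_alerts(alerts):
    """Build a {column: [actions]} plan from profiler alerts."""
    plan = {}
    for alert in alerts:
        action = ALERT_ACTIONS.get(alert.alert_type.name)
        if action and alert.column_name:
            plan.setdefault(alert.column_name, []).append(action)
    return plan
```

The payoff is that renaming a column or adding a new one never requires touching the cleaning code, only rerunning the profile.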

Build the Auto-Cleaning Pipeline

Now wire it together. This pipeline handles the four most common data quality problems: missing values, outliers, type coercion failures, and duplicates.

from scipy import stats as scipy_stats

def auto_clean(df, z_threshold=3.0, iqr_multiplier=1.5):
    """Profile-driven auto-cleaning pipeline."""
    report = {}
    df_clean = df.copy()

    # --- Step 1: Deduplicate ---
    before_rows = len(df_clean)
    df_clean = df_clean.drop_duplicates()
    removed_dupes = before_rows - len(df_clean)
    report["duplicates_removed"] = removed_dupes
    print(f"Removed {removed_dupes} duplicate rows")

    # --- Step 2: Type coercion ---
    # Force numeric columns that got mixed with strings
    for col in df_clean.select_dtypes(include=["object"]).columns:
        converted = pd.to_numeric(df_clean[col], errors="coerce")
        non_null_ratio = converted.notna().sum() / len(df_clean)
        if non_null_ratio > 0.5:
            df_clean[col] = converted
            print(f"Coerced '{col}' to numeric ({non_null_ratio:.0%} parseable)")

    # Coerce date-like columns (format="mixed" requires pandas >= 2.0)
    for col in df_clean.select_dtypes(include=["object"]).columns:
        converted = pd.to_datetime(df_clean[col], errors="coerce", format="mixed")
        non_null_ratio = converted.notna().sum() / len(df_clean)
        if non_null_ratio > 0.5:
            df_clean[col] = converted
            print(f"Coerced '{col}' to datetime ({non_null_ratio:.0%} parseable)")

    # --- Step 3: Outlier detection and capping ---
    numeric_cols = df_clean.select_dtypes(include=[np.number]).columns
    report["outliers_capped"] = {}

    for col in numeric_cols:
        series = df_clean[col].dropna()
        if len(series) < 10:
            continue

        # IQR method
        q1 = series.quantile(0.25)
        q3 = series.quantile(0.75)
        iqr = q3 - q1
        lower_bound = q1 - iqr_multiplier * iqr
        upper_bound = q3 + iqr_multiplier * iqr

        # Z-score method (flag extreme outliers)
        z_scores = np.abs(scipy_stats.zscore(series, nan_policy="omit"))
        z_outlier_count = int((z_scores > z_threshold).sum())

        # Cap using IQR bounds
        before_outliers = int(
            ((df_clean[col] < lower_bound) | (df_clean[col] > upper_bound)).sum()
        )
        df_clean[col] = df_clean[col].clip(lower=lower_bound, upper=upper_bound)

        if before_outliers > 0:
            report["outliers_capped"][col] = {
                "iqr_capped": before_outliers,
                "z_score_flagged": z_outlier_count,
                "bounds": [round(lower_bound, 2), round(upper_bound, 2)],
            }
            print(
                f"Capped {before_outliers} outliers in '{col}' "
                f"to [{lower_bound:.2f}, {upper_bound:.2f}]"
            )

    # --- Step 4: Impute missing values ---
    report["imputed"] = {}
    for col in numeric_cols:
        n_missing = int(df_clean[col].isna().sum())
        if n_missing > 0:
            median_val = df_clean[col].median()
            df_clean[col] = df_clean[col].fillna(median_val)
            report["imputed"][col] = {"count": n_missing, "strategy": "median", "value": round(median_val, 2)}
            print(f"Imputed {n_missing} missing values in '{col}' with median ({median_val:.2f})")

    for col in df_clean.select_dtypes(include=["object", "category"]).columns:
        n_missing = int(df_clean[col].isna().sum())
        empty_count = int((df_clean[col] == "").sum()) if df_clean[col].dtype == "object" else 0
        total_missing = n_missing + empty_count
        if total_missing > 0:
            mode_val = df_clean[col].replace("", pd.NA).mode()
            if len(mode_val) > 0:
                fill_val = mode_val.iloc[0]
                df_clean[col] = df_clean[col].replace("", pd.NA).fillna(fill_val)
                report["imputed"][col] = {"count": total_missing, "strategy": "mode", "value": fill_val}
                print(f"Imputed {total_missing} missing/empty values in '{col}' with mode ('{fill_val}')")

    return df_clean, report

# Run the pipeline
df_cleaned, cleaning_report = auto_clean(df)

The pipeline caps outliers using IQR bounds rather than removing rows. Removing rows throws away signal. Capping preserves the record while neutralizing the extreme value. For numeric imputation, median beats mean because it is resistant to the very outliers you just capped.
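A quick standalone illustration of that point, with a made-up salary series: one extreme value drags the mean by orders of magnitude while the median barely moves.

```python
import pandas as pd

# One billionaire in a salary column: the mean explodes, the median holds.
s = pd.Series([50_000, 52_000, 55_000, 58_000, 1_000_000_000])
print(f"mean:   {s.mean():,.0f}")    # dominated by the outlier
print(f"median: {s.median():,.0f}")  # stays near the typical value
```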

Generate a Before/After Comparison

After cleaning, you want proof that it worked. This comparison function shows what changed:

def compare_before_after(df_before, df_after, report):
    """Print a before/after summary of the cleaning."""
    print("=" * 60)
    print("DATA CLEANING SUMMARY")
    print("=" * 60)
    print(f"Rows: {len(df_before)} -> {len(df_after)} "
          f"({len(df_before) - len(df_after)} removed)")
    print(f"Duplicates removed: {report['duplicates_removed']}")
    print()

    # Missing value comparison
    print("Missing Values (before -> after):")
    for col in df_before.columns:
        before_missing = df_before[col].isna().sum()
        after_missing = df_after[col].isna().sum() if col in df_after.columns else 0
        if before_missing > 0 or after_missing > 0:
            print(f"  {col}: {before_missing} -> {after_missing}")
    print()

    # Numeric range comparison
    print("Numeric Ranges (before -> after):")
    for col in df_before.select_dtypes(include=[np.number]).columns:
        if col in df_after.columns:
            b_min, b_max = df_before[col].min(), df_before[col].max()
            a_min, a_max = df_after[col].min(), df_after[col].max()
            if b_min != a_min or b_max != a_max:
                print(f"  {col}: [{b_min:.2f}, {b_max:.2f}] -> [{a_min:.2f}, {a_max:.2f}]")

    # Outlier details
    if report.get("outliers_capped"):
        print()
        print("Outlier Capping Details:")
        for col, details in report["outliers_capped"].items():
            print(f"  {col}: {details['iqr_capped']} capped (IQR), "
                  f"{details['z_score_flagged']} flagged (z-score), "
                  f"bounds={details['bounds']}")

compare_before_after(df, df_cleaned, cleaning_report)

This gives you a plain-text audit trail. Pipe it to a log file, attach it to your data versioning system, or print it in your notebook. The key is having a record of every transformation applied.
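One simple way to persist that record is to dump the report dict to JSON with a timestamp. This is a sketch: `cleaning_report` is the dict returned by `auto_clean` above, and `default=str` is there so non-serializable values (e.g. timestamps or NumPy scalars) degrade to strings instead of raising.

```python
import json
from datetime import datetime, timezone

def save_cleaning_log(report, path="cleaning_log.json"):
    """Write a cleaning report to JSON with a UTC run timestamp."""
    entry = {
        "run_at": datetime.now(timezone.utc).isoformat(),
        "report": report,
    }
    with open(path, "w") as f:
        json.dump(entry, f, indent=2, default=str)
    return path
```

Each run appends a dated artifact you can diff against previous runs or attach to a data-versioning commit.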

Common Errors and Fixes

TypeError: ProfileReport() got an unexpected keyword argument 'explorative'

This happens on older versions of ydata-profiling that predate the explorative parameter. Upgrade:

pip install --upgrade ydata-profiling

If you are stuck on an older version, drop the explorative=True argument – the default report still covers the essentials.

ValueError: could not convert string to float: 'not_a_date'

This fires when you pass mixed-type columns directly to scipy’s zscore. The auto-clean pipeline avoids this by running type coercion before outlier detection, and by using nan_policy="omit" in the zscore call. Make sure you coerce first, detect second.
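A minimal standalone reproduction of the safe order: coerce the mixed column to numeric so unparseable strings become NaN, then compute z-scores on the clean values.

```python
import numpy as np
import pandas as pd
from scipy import stats

mixed = pd.Series([10, 20, "not_a_date", 30, None])

# Coerce first: bad strings become NaN instead of crashing zscore.
numeric = pd.to_numeric(mixed, errors="coerce")
z = np.abs(stats.zscore(numeric.dropna()))
print(z)  # z-scores for the three parseable values only
```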

SettingWithCopyWarning: A value is trying to be set on a copy of a slice

Pandas throws this when you modify a filtered view instead of the original DataFrame. The pipeline uses df_clean = df.copy() at the top to avoid this. If you see it in your own code, add an explicit .copy() call when you slice.
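For example, this standalone sketch shows the risky pattern and the fix side by side:

```python
import pandas as pd

df = pd.DataFrame({"age": [25, -5, 40], "income": [50_000, 60_000, 70_000]})

# Risky: df[df["age"] > 0] may return a view, so assigning to it
# can trigger SettingWithCopyWarning and silently not stick.
# Safe: take an explicit copy before modifying the slice.
subset = df[df["age"] > 0].copy()
subset["age_bucket"] = subset["age"] // 10 * 10
```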

ImportError: cannot import name 'ProfileReport' from 'pandas_profiling'

The library was renamed from pandas-profiling to ydata-profiling. The import changed too:

# Old (broken)
from pandas_profiling import ProfileReport

# New (correct)
from ydata_profiling import ProfileReport

Uninstall the old package if it is still lingering:

pip uninstall pandas-profiling && pip install ydata-profiling

OutOfMemoryError on large datasets

ydata-profiling loads everything into memory for correlation computation. For datasets over 1M rows, use the minimal mode:

profile = ProfileReport(df, minimal=True, title="Minimal Report")

This skips expensive correlation and interaction calculations while keeping the core column statistics you need for automated cleaning.
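If even minimal mode is too heavy, profiling a random sample is a common workaround. This is a sketch with an assumed 100k-row cap; the resulting stats are approximate, but usually close enough to drive the same cleaning rules.

```python
import pandas as pd

def profile_sample(df, max_rows=100_000, random_state=42):
    """Return a bounded random sample of df for approximate profiling."""
    if len(df) <= max_rows:
        return df
    return df.sample(n=max_rows, random_state=random_state)

# Then profile the sample instead of the full frame:
# profile = ProfileReport(profile_sample(df), minimal=True)
```

Fix the random seed so repeated profiling runs see the same sample and produce comparable reports.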