Biased training data produces biased models. It’s that simple. You can have the fanciest architecture in the world, but if your dataset over-represents one demographic, under-labels a class, or has drifted features between train and test splits, your model will learn the wrong patterns. Catching these problems before training saves you weeks of debugging mysterious performance gaps in production.
This pipeline runs four automated checks on any tabular dataset: class imbalance, demographic skew, label noise, and feature distribution drift. Everything uses pandas, scipy, sklearn, and matplotlib — no exotic dependencies.
## Check Class Distribution
The first thing to check in any classification dataset is whether your classes are balanced. A 95/5 split between positive and negative labels will train a model that just predicts the majority class and looks accurate doing it.
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Create a sample dataset with class imbalance
np.random.seed(42)
n_samples = 2000
data = pd.DataFrame({
    "feature_1": np.random.randn(n_samples),
    "feature_2": np.random.randn(n_samples),
    "age": np.random.choice([20, 30, 40, 50, 60], size=n_samples),
    "gender": np.random.choice(["M", "F"], size=n_samples, p=[0.7, 0.3]),
    "label": np.random.choice([0, 1, 2], size=n_samples, p=[0.80, 0.15, 0.05]),
})

# Compute class frequencies
class_counts = data["label"].value_counts().sort_index()
class_ratios = class_counts / len(data)
print("Class distribution:")
print(class_counts)
print(f"\nImbalance ratio (max/min): {class_counts.max() / class_counts.min():.1f}x")

# Flag severe imbalance (threshold: 10x ratio between largest and smallest class)
imbalance_ratio = class_counts.max() / class_counts.min()
if imbalance_ratio > 10:
    print(f"WARNING: Severe class imbalance detected ({imbalance_ratio:.1f}x)")
elif imbalance_ratio > 5:
    print(f"NOTICE: Moderate class imbalance ({imbalance_ratio:.1f}x)")
else:
    print("Class balance looks reasonable.")

# Visualize
fig, ax = plt.subplots(figsize=(8, 4))
bars = ax.bar(class_counts.index.astype(str), class_counts.values,
              color=["#2ecc71", "#e74c3c", "#3498db"])
ax.set_xlabel("Class Label")
ax.set_ylabel("Count")
ax.set_title("Class Distribution")
for bar, count in zip(bars, class_counts.values):
    ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 10,
            str(count), ha="center", va="bottom", fontweight="bold")
plt.tight_layout()
plt.savefig("class_distribution.png", dpi=150)
plt.show()
```
That `imbalance_ratio` is your early warning system. Anything above 10x deserves attention — consider oversampling the minority class, using class weights in your loss function, or collecting more data for the underrepresented classes.
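Class weights are often the lightest-touch fix of the three. A minimal sketch using scikit-learn's `compute_class_weight` on labels drawn the same way as above (the `weight_map` name is mine, not part of the pipeline):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Labels drawn with the same 80/15/5 split used in the pipeline above
np.random.seed(42)
labels = np.random.choice([0, 1, 2], size=2000, p=[0.80, 0.15, 0.05])

# "balanced" weights are inversely proportional to class frequency:
# n_samples / (n_classes * count_per_class)
classes = np.unique(labels)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=labels)
weight_map = dict(zip(classes, weights))
print(weight_map)  # rare classes get larger weights
```

Most scikit-learn classifiers accept this directly, e.g. `RandomForestClassifier(class_weight=weight_map)`, so the loss pays proportionally more attention to the minority classes without touching the data itself.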
## Detect Demographic Skew
Class imbalance is one problem. Demographic skew is a different beast — it means protected attributes like age or gender are unevenly distributed across your target labels. A chi-squared test of independence tells you whether two categorical variables are statistically independent.
```python
from scipy.stats import chi2_contingency

def check_demographic_skew(df, protected_col, label_col, alpha=0.05):
    """Run chi-squared test for independence between a protected attribute and label."""
    contingency = pd.crosstab(df[protected_col], df[label_col])
    chi2, p_value, dof, expected = chi2_contingency(contingency)
    print(f"\n--- {protected_col} vs {label_col} ---")
    print(f"Contingency table:\n{contingency}")
    print(f"Chi-squared: {chi2:.4f}, p-value: {p_value:.6f}, dof: {dof}")
    if p_value < alpha:
        print(f"BIAS DETECTED: {protected_col} and {label_col} are NOT independent (p={p_value:.6f})")
    else:
        print(f"OK: No significant association found (p={p_value:.6f})")
    return {"attribute": protected_col, "chi2": chi2, "p_value": p_value, "biased": p_value < alpha}

# Check both protected attributes
results = []
for col in ["gender", "age"]:
    result = check_demographic_skew(data, col, "label")
    results.append(result)

# Summary report
bias_report = pd.DataFrame(results)
print("\n=== Demographic Skew Summary ===")
print(bias_report.to_string(index=False))
A p-value below 0.05 means that, if the attribute and label were truly independent, a skew this large would show up less than 5% of the time by chance. When a protected attribute is statistically associated with the label, investigate whether a real-world phenomenon justifies it or whether it's a data-collection artifact. For example, if gender predicts loan approval in your dataset, that's a red flag worth digging into.
Watch out for small sample sizes — chi-squared tests become unreliable when expected cell counts drop below 5. `chi2_contingency` returns the expected frequencies (the `expected` array unpacked above), so you can verify this manually.
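The opposite problem is large samples: with thousands of rows, the test flags even trivial associations as significant, so it helps to pair the p-value with an effect size. A minimal sketch of Cramér's V (the `cramers_v` helper is mine, not part of scipy), tried on a deliberately skewed synthetic pair:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(series_a, series_b):
    """Effect size for association between two categoricals: 0 = none, 1 = perfect."""
    contingency = pd.crosstab(series_a, series_b)
    chi2, _, _, _ = chi2_contingency(contingency)
    n = contingency.to_numpy().sum()
    k = min(contingency.shape) - 1  # degrees available in the smaller dimension
    return float(np.sqrt(chi2 / (n * k)))

# Strongly associated pair: label rates differ sharply by gender
rng = np.random.default_rng(0)
gender = rng.choice(["M", "F"], size=1000)
label = np.where(gender == "M",
                 rng.choice([0, 1], 1000, p=[0.9, 0.1]),
                 rng.choice([0, 1], 1000, p=[0.3, 0.7]))
print(cramers_v(pd.Series(gender), pd.Series(label)))  # well above 0, strong association
```

A rule of thumb is that values under roughly 0.1 are negligible even when the p-value is tiny.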
## Find Label Noise with Cross-Validation
Label noise is sneaky. Mislabeled examples look normal in summary statistics but quietly degrade your model. The trick is to use cross-validation predictions: if a well-tuned classifier is confident an example belongs to class A but it’s labeled class B, that label is probably wrong.
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

# Prepare features and labels
X = data[["feature_1", "feature_2", "age"]].copy()
X["gender_encoded"] = data["gender"].map({"M": 0, "F": 1})
y = data["label"].values

# Get out-of-fold predicted probabilities
clf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
proba = cross_val_predict(clf, X, y, cv=5, method="predict_proba")

# For each sample, get the probability assigned to its actual label
true_label_proba = proba[np.arange(len(y)), y]

# Flag samples where the model is very confident the label is wrong
# (low probability on the assigned label = likely mislabeled)
noise_threshold = 0.1
suspect_mask = true_label_proba < noise_threshold
suspect_indices = np.where(suspect_mask)[0]

print(f"Total samples: {len(y)}")
print(f"Suspected mislabeled: {len(suspect_indices)} ({100 * len(suspect_indices) / len(y):.1f}%)")
print("\nTop 10 most suspicious samples:")
suspect_df = data.iloc[suspect_indices].copy()
suspect_df["true_label_prob"] = true_label_proba[suspect_indices]
suspect_df["predicted_class"] = proba[suspect_indices].argmax(axis=1)
print(suspect_df.nsmallest(10, "true_label_prob")[
    ["feature_1", "feature_2", "label", "predicted_class", "true_label_prob"]
].to_string(index=True))
```
The `noise_threshold` of 0.1 means you're flagging samples where the model assigns less than 10% probability to the given label. Tune this based on your tolerance. For high-stakes datasets (medical, financial), manually review every flagged example. For large noisy datasets, removing the bottom 1-2% of confidence scores often improves model performance noticeably.
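That "remove the bottom 1-2%" idea is easy to test end to end. A minimal self-contained sketch on synthetic data with deliberately flipped labels (not the pipeline's `data` DataFrame), comparing cross-validated accuracy before and after pruning:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict, cross_val_score

# Synthetic stand-in: two informative features, with 5% of labels flipped
rng = np.random.default_rng(42)
n = 1000
X = rng.normal(size=(n, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
flip = rng.random(n) < 0.05
y[flip] = 1 - y[flip]

clf = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)

# Out-of-fold probability each sample's own label receives
proba = cross_val_predict(clf, X, y, cv=5, method="predict_proba")
true_label_proba = proba[np.arange(n), y]

# Drop the bottom 2% of label confidence and compare CV accuracy
cutoff = np.percentile(true_label_proba, 2)
keep = true_label_proba > cutoff
before = cross_val_score(clf, X, y, cv=5).mean()
after = cross_val_score(clf, X[keep], y[keep], cv=5).mean()
print(f"CV accuracy before pruning: {before:.3f}, after: {after:.3f}")
```

Because the flipped examples sit on the wrong side of a clean decision boundary, they dominate the low-confidence tail, so most of what gets pruned really is mislabeled.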
This approach works best with a reasonably expressive model. Random forests are a good default because they handle mixed feature types, and the out-of-fold setup guarantees the model never scores a label it was trained on, so memorized noise can't hide itself.
## Measure Feature Distribution Drift
When your train and test sets have different feature distributions, your model’s test performance won’t reflect real-world behavior. The Kolmogorov-Smirnov (KS) test compares two distributions and tells you if they’re statistically different.
```python
from scipy.stats import ks_2samp
from sklearn.model_selection import train_test_split

# Split into train/test
train_df, test_df = train_test_split(data, test_size=0.3, random_state=42)

# Intentionally inject drift into test set for demonstration
test_df = test_df.copy()
test_df["feature_1"] = test_df["feature_1"] + 0.8  # shift the mean

numerical_features = ["feature_1", "feature_2", "age"]

def detect_drift(train, test, features, alpha=0.05):
    """Run KS tests on each numerical feature between train and test."""
    drift_results = []
    for feat in features:
        stat, p_value = ks_2samp(train[feat].dropna(), test[feat].dropna())
        drifted = p_value < alpha
        drift_results.append({
            "feature": feat,
            "ks_statistic": round(stat, 4),
            "p_value": round(p_value, 6),
            "drifted": drifted,
        })
        if drifted:
            print(f"DRIFT: {feat} (KS={stat:.4f}, p={p_value:.6f})")
        else:
            print(f"OK: {feat} (KS={stat:.4f}, p={p_value:.6f})")
    return pd.DataFrame(drift_results)

print("=== Feature Distribution Drift (Train vs Test) ===")
drift_report = detect_drift(train_df, test_df, numerical_features)

# Visualize drifted features
drifted_features = drift_report[drift_report["drifted"]]["feature"].tolist()
if drifted_features:
    fig, axes = plt.subplots(1, len(drifted_features), figsize=(6 * len(drifted_features), 4))
    if len(drifted_features) == 1:
        axes = [axes]
    for ax, feat in zip(axes, drifted_features):
        ax.hist(train_df[feat], bins=40, alpha=0.6, label="Train", density=True)
        ax.hist(test_df[feat], bins=40, alpha=0.6, label="Test", density=True)
        ax.set_title(f"{feat} (drifted)")
        ax.legend()
    plt.tight_layout()
    plt.savefig("drift_comparison.png", dpi=150)
    plt.show()

print(f"\n{len(drifted_features)} of {len(numerical_features)} features show significant drift.")
```
The KS statistic ranges from 0 to 1. Higher values mean the distributions are more different. A p-value below 0.05 means the difference is statistically significant. In the example above, feature_1 was intentionally shifted by 0.8 standard deviations, so the KS test flags it immediately.
For categorical features, use the chi-squared test instead of KS. Compare the frequency distributions of each category between train and test. The pattern is the same — build a contingency table and call chi2_contingency.
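A minimal sketch of that categorical variant (the `categorical_drift` helper name is mine), demonstrated on a synthetic feature whose test split over-samples one category:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def categorical_drift(train_col, test_col, alpha=0.05):
    """Chi-squared test on category frequencies between two splits."""
    # Build a table of counts: one column per split, one row per category
    counts = pd.concat([
        train_col.value_counts().rename("train"),
        test_col.value_counts().rename("test"),
    ], axis=1).fillna(0)
    chi2, p_value, dof, _ = chi2_contingency(counts.T)
    return {"chi2": chi2, "p_value": p_value, "drifted": p_value < alpha}

# Example: category "C" is much more common in the test split
rng = np.random.default_rng(7)
train = pd.Series(rng.choice(["A", "B", "C"], 1400, p=[0.5, 0.3, 0.2]))
test = pd.Series(rng.choice(["A", "B", "C"], 600, p=[0.2, 0.3, 0.5]))
print(categorical_drift(train, test))
```

The `concat` aligns counts by category name, so a category that appears in only one split still gets a (zero-filled) row instead of silently shifting the table.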
## Common Errors and Fixes
### ValueError: Found input variables with inconsistent numbers of samples
This happens when your feature matrix and label array have different lengths. Usually caused by dropping NaN rows from features but not from labels, or vice versa.
```python
# Wrong: separate dropna calls can produce mismatched lengths
X = data[["feature_1", "feature_2"]].dropna()
y = data["label"].dropna()

# Right: drop rows with NaN from the full dataframe first
clean_data = data.dropna(subset=["feature_1", "feature_2", "label"])
X = clean_data[["feature_1", "feature_2"]]
y = clean_data["label"]
```
### ValueError: Expected 2D array, got 1D array instead
sklearn expects features as a 2D array. If you pass a single column as a Series, you get this error.
```python
# Wrong
clf.fit(data["feature_1"], y)

# Right: use double brackets to keep it as a DataFrame (2D)
clf.fit(data[["feature_1"]], y)
```
### LinAlgError: singular matrix, or very small chi-squared p-values on tiny datasets
Chi-squared tests break down when expected cell frequencies are below 5. This happens with small datasets or rare categories.
```python
# Check expected frequencies before trusting chi-squared results
contingency = pd.crosstab(data["gender"], data["label"])
chi2, p_value, dof, expected = chi2_contingency(contingency)

# Warn if any expected frequency is below 5
if (expected < 5).any():
    print("WARNING: Expected frequencies below 5 detected. "
          "Chi-squared results may be unreliable. "
          "Consider Fisher's exact test or merging rare categories.")
```
If you hit this, either merge rare categories or switch to Fisher's exact test (`scipy.stats.fisher_exact`) for 2x2 tables. For larger tables with sparse cells, consider a permutation-based test instead.
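A minimal sketch of that Fisher fallback on a small 2x2 table (the counts are made up for illustration), showing why the chi-squared result is shaky there:

```python
import numpy as np
from scipy.stats import chi2_contingency, fisher_exact

# Tiny 2x2 table, e.g. a protected attribute vs a binary label
table = np.array([[8, 2],
                  [1, 9]])

# Chi-squared expected counts fall below 5 here, so its p-value is unreliable
chi2, p_chi2, dof, expected = chi2_contingency(table)
print("min expected count:", expected.min())

# Fisher's exact test is valid at any sample size (2x2 tables only)
odds_ratio, p_fisher = fisher_exact(table)
print(f"Fisher p-value: {p_fisher:.4f}, odds ratio: {odds_ratio:.2f}")
```

Fisher computes the exact hypergeometric probability of the table instead of relying on a large-sample approximation, which is why it has no minimum-count requirement.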