Aggregate accuracy hides problems. Your model might hit 92% overall but fail badly on older users, rare categories, or edge-case feature combinations. The fix is straightforward: slice your data by meaningful features, evaluate each slice independently, and flag the weak segments before they cause production failures.

Here is a full pipeline that creates sample data, builds stratified splits, and evaluates per-slice performance:

import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score

# Create a realistic sample dataset
np.random.seed(42)
n = 2000

df = pd.DataFrame({
    "age": np.random.randint(18, 75, n),
    "income": np.random.normal(55000, 15000, n).astype(int),
    "category": np.random.choice(["electronics", "clothing", "food", "services"], n, p=[0.4, 0.3, 0.2, 0.1]),
    "region": np.random.choice(["north", "south", "east", "west"], n),
})

# Binary target with class imbalance
df["target"] = ((df["income"] > 60000).astype(int) + (df["age"] > 50).astype(int) + np.random.binomial(1, 0.2, n)) >= 2
df["target"] = df["target"].astype(int)

print(f"Dataset shape: {df.shape}")
print(f"Target distribution:\n{df['target'].value_counts(normalize=True)}")

Stratified Train/Test Splits That Preserve Class Distributions

A naive random split can skew class ratios, especially for small or imbalanced datasets. StratifiedShuffleSplit keeps the train and test class distributions as close to the overall distribution as integer rounding allows.

from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.preprocessing import LabelEncoder

# Encode categorical features for modeling
le_cat = LabelEncoder()
le_reg = LabelEncoder()
df["category_enc"] = le_cat.fit_transform(df["category"])
df["region_enc"] = le_reg.fit_transform(df["region"])

feature_cols = ["age", "income", "category_enc", "region_enc"]
X = df[feature_cols].values
y = df["target"].values

# Stratified split preserves target class ratios
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(sss.split(X, y))

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

# Verify the distribution is preserved
train_ratio = y_train.mean()
test_ratio = y_test.mean()
print(f"Train positive rate: {train_ratio:.3f}")
print(f"Test positive rate:  {test_ratio:.3f}")
print(f"Difference:          {abs(train_ratio - test_ratio):.4f}")

The difference between train and test positive rates should be near zero. With a simple random split on a small dataset, you can easily get 2-3% drift.
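To see the drift for yourself, here is a quick standalone comparison on a small synthetic imbalanced target (the sizes and seed here are arbitrary):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
y_small = rng.binomial(1, 0.15, 200)  # small, imbalanced binary target

# Plain random split: the positive rate can drift between train and test
tr, te = train_test_split(y_small, test_size=0.2, random_state=7)
print(f"random:     train={tr.mean():.3f} test={te.mean():.3f}")

# Stratified split: the rates match by construction (up to rounding)
tr, te = train_test_split(y_small, test_size=0.2, random_state=7, stratify=y_small)
print(f"stratified: train={tr.mean():.3f} test={te.mean():.3f}")
```
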

Slicing Datasets by Feature Values

Slicing means grouping your test set by feature values and evaluating each group separately. This is where hidden failures surface. A model might nail the “electronics” category but stumble on “services” where the training data is sparse.

# Train a model
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Build a test dataframe with predictions
test_df = df.iloc[test_idx].copy()
test_df["pred"] = y_pred

# Slice by category
print("\n--- Per-Category Performance ---")
print(f"{'Category':<15} {'Count':>6} {'Accuracy':>10} {'F1':>8}")
print("-" * 42)

for cat in sorted(test_df["category"].unique()):
    mask = test_df["category"] == cat
    slice_true = test_df.loc[mask, "target"]
    slice_pred = test_df.loc[mask, "pred"]
    acc = accuracy_score(slice_true, slice_pred)
    f1 = f1_score(slice_true, slice_pred, zero_division=0)
    print(f"{cat:<15} {mask.sum():>6} {acc:>10.3f} {f1:>8.3f}")

# Slice by age group
test_df["age_group"] = pd.cut(test_df["age"], bins=[17, 30, 45, 60, 75], labels=["18-30", "31-45", "46-60", "61-75"])

print("\n--- Per-Age-Group Performance ---")
print(f"{'Age Group':<15} {'Count':>6} {'Accuracy':>10} {'F1':>8}")
print("-" * 42)

for grp in test_df["age_group"].cat.categories:
    mask = test_df["age_group"] == grp
    slice_true = test_df.loc[mask, "target"]
    slice_pred = test_df.loc[mask, "pred"]
    acc = accuracy_score(slice_true, slice_pred)
    f1 = f1_score(slice_true, slice_pred, zero_division=0)
    print(f"{grp:<15} {mask.sum():>6} {acc:>10.3f} {f1:>8.3f}")

The output gives you a clear table of where your model is strong and where it is weak. Any slice with F1 more than 10 points below the overall score deserves investigation.
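The overall score that serves as the baseline is simply the metric computed on the full test set, so it is worth printing next to the slice tables. A minimal sketch with stand-in label arrays (substitute your own y_test and y_pred):

```python
from sklearn.metrics import accuracy_score, f1_score

# Stand-ins for the y_test / y_pred arrays from the pipeline above
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 0, 1, 1]

overall_acc = accuracy_score(y_true, y_pred)
overall_f1 = f1_score(y_true, y_pred)
print(f"Overall accuracy: {overall_acc:.3f}")
print(f"Overall F1:       {overall_f1:.3f}")

# Rule of thumb: a slice F1 more than 0.10 below overall warrants a look
print(f"Investigate any slice with F1 below {overall_f1 - 0.10:.3f}")
```
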

Automatic Underperforming Slice Detection

Manually scanning tables works for a handful of slices. When you have dozens of feature combinations, automate the detection:

from itertools import product

def detect_weak_slices(test_df, slice_columns, metric_fn, threshold_drop=0.10):
    """Find slices where performance drops below overall - threshold_drop."""
    overall = metric_fn(test_df["target"], test_df["pred"])
    weak_slices = []

    for col in slice_columns:
        for val in test_df[col].unique():
            mask = test_df[col] == val
            if mask.sum() < 10:  # skip tiny slices
                continue
            score = metric_fn(test_df.loc[mask, "target"], test_df.loc[mask, "pred"])
            if score < overall - threshold_drop:
                weak_slices.append({
                    "column": col,
                    "value": val,
                    "count": mask.sum(),
                    "score": round(score, 3),
                    "overall": round(overall, 3),
                    "gap": round(overall - score, 3),
                })

    if not weak_slices:  # avoid a KeyError on sort when nothing is flagged
        return pd.DataFrame(columns=["column", "value", "count", "score", "overall", "gap"])
    return pd.DataFrame(weak_slices).sort_values("gap", ascending=False)

weak = detect_weak_slices(
    test_df,
    slice_columns=["category", "age_group", "region"],
    metric_fn=lambda y_true, y_pred: f1_score(y_true, y_pred, zero_division=0),
    threshold_drop=0.05,
)

if len(weak) > 0:
    print("\nUnderperforming slices detected:")
    print(weak.to_string(index=False))
else:
    print("\nNo underperforming slices found (all within 0.05 of overall F1)")

Set threshold_drop based on your tolerance; note that it is an absolute drop in the metric, not a percentage. For safety-critical applications, 0.03 is a reasonable bar. For recommendation systems, 0.10 might be fine.

Stratified Cross-Validation with Slice-Aware Evaluation

Single train/test splits can be noisy. Stratified k-fold gives you more stable per-slice estimates by averaging across multiple folds:

from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_results = []

for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    clf = RandomForestClassifier(n_estimators=100, random_state=42)
    clf.fit(X[train_idx], y[train_idx])
    preds = clf.predict(X[test_idx])

    fold_df = df.iloc[test_idx].copy()
    fold_df["pred"] = preds
    fold_df["fold"] = fold

    fold_results.append(fold_df)

all_folds = pd.concat(fold_results, ignore_index=True)

# Aggregate per-slice performance across all folds
print("\n--- Cross-Validated Per-Category Performance ---")
print(f"{'Category':<15} {'Mean Acc':>10} {'Std Acc':>10} {'Mean F1':>10} {'Std F1':>10}")
print("-" * 58)

for cat in sorted(all_folds["category"].unique()):
    mask = all_folds["category"] == cat
    cat_df = all_folds[mask]

    fold_accs = []
    fold_f1s = []
    for fold in range(5):
        fold_mask = cat_df["fold"] == fold
        if fold_mask.sum() == 0:
            continue
        fold_accs.append(accuracy_score(cat_df.loc[fold_mask, "target"], cat_df.loc[fold_mask, "pred"]))
        fold_f1s.append(f1_score(cat_df.loc[fold_mask, "target"], cat_df.loc[fold_mask, "pred"], zero_division=0))

    print(f"{cat:<15} {np.mean(fold_accs):>10.3f} {np.std(fold_accs):>10.3f} {np.mean(fold_f1s):>10.3f} {np.std(fold_f1s):>10.3f}")

High standard deviation for a specific slice means your model’s performance on that segment is unstable – it depends heavily on which examples land in train vs test. That slice needs more training data or a feature rethink.
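A small sketch of turning that rule of thumb into an automated check, assuming the per-fold F1 scores have been collected into a dict shaped like the loop above produces (the slice names and the 0.05 threshold here are made up):

```python
import numpy as np

# Hypothetical per-fold F1 scores per slice
fold_f1_by_slice = {
    "electronics": [0.91, 0.90, 0.92, 0.89, 0.91],
    "services":    [0.55, 0.80, 0.40, 0.75, 0.62],  # swings wildly fold to fold
}

std_threshold = 0.05  # assumed tolerance; tune for your problem
for name, scores in fold_f1_by_slice.items():
    std = float(np.std(scores))
    status = "UNSTABLE" if std > std_threshold else "ok"
    print(f"{name:<12} mean={np.mean(scores):.3f} std={std:.3f} {status}")
```
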

Common Errors and Fixes

ValueError: The least populated class in y has only 1 member.

This happens when StratifiedKFold or StratifiedShuffleSplit encounters a class with fewer members than the number of splits. Fix it by reducing n_splits or merging rare classes:

# Check minimum class count before splitting
min_class_count = pd.Series(y).value_counts().min()
safe_splits = min(5, min_class_count)
skf = StratifiedKFold(n_splits=safe_splits, shuffle=True, random_state=42)
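The other fix the error message suggests, merging rare classes, might look like this. It is a sketch with toy labels; note that lumping rare classes into a single bucket only helps if their combined count reaches n_splits:

```python
import pandas as pd

# Toy labels: "c" and "d" are each too rare to stratify across 3 folds,
# but merged together they form a bucket large enough
y_raw = pd.Series(["a"] * 5 + ["b"] * 4 + ["c", "c", "d", "d"])

n_splits = 3
counts = y_raw.value_counts()
rare = counts[counts < n_splits].index  # classes too small to stratify on

# Lump every rare class into a single "other" bucket
y_merged = y_raw.where(~y_raw.isin(rare), "other")
print(y_merged.value_counts())
```
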

UndefinedMetricWarning: F-score is ill-defined and being set to 0.0

You get this when a slice has zero positive or zero predicted positive samples. The zero_division=0 parameter suppresses it, but the real fix is recognizing that slices with fewer than 10-20 samples give unreliable metrics. Filter them out:

# Only evaluate slices with enough samples
min_samples = 20
for cat in test_df["category"].unique():
    mask = test_df["category"] == cat
    if mask.sum() < min_samples:
        print(f"Skipping {cat} ({mask.sum()} samples) -- too few for reliable metrics")
        continue
    # ... compute metrics

KeyError when slicing by a binned column after pd.cut

If your bin edges do not cover every value in the test set, some rows get NaN for the binned column, and downstream lookups on those groups fail. Make sure the edges span the full value range:

# Wrong: misses age 18 if left edge is 18
pd.cut(df["age"], bins=[18, 30, 45, 60, 75])

# Right: use 17 as left edge to include 18
pd.cut(df["age"], bins=[17, 30, 45, 60, 75], labels=["18-30", "31-45", "46-60", "61-75"])
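An alternative is pd.cut's include_lowest parameter, which closes the first interval on the left so you can keep 18 as the edge:

```python
import pandas as pd

ages = pd.Series([18, 25, 45, 74])

# include_lowest=True makes the first bin [18, 30] instead of (18, 30],
# so age 18 is binned rather than becoming NaN
groups = pd.cut(ages, bins=[18, 30, 45, 60, 75],
                labels=["18-30", "31-45", "46-60", "61-75"],
                include_lowest=True)
print(groups.isna().sum())  # 0
```
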

Stratified split does not preserve multi-feature distributions.

StratifiedShuffleSplit only stratifies on the target label. If you need to preserve the joint distribution of target + another feature, create a composite stratification key:

# Stratify on target AND category jointly
df["strat_key"] = df["target"].astype(str) + "_" + df["category"]
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(sss.split(df[feature_cols], df["strat_key"]))

This gives each target/category combination proportional representation in both train and test sets, not just the target class overall.
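A quick sanity check on a standalone synthetic frame (the names mirror the snippet above, but the data here is made up):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "category": rng.choice(["electronics", "clothing", "food"], 600),
    "target": rng.binomial(1, 0.3, 600),
})
demo["strat_key"] = demo["target"].astype(str) + "_" + demo["category"]

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(sss.split(demo, demo["strat_key"]))

# Category shares should now match closely between train and test
train_share = demo.iloc[train_idx]["category"].value_counts(normalize=True)
test_share = demo.iloc[test_idx]["category"].value_counts(normalize=True)
print(f"Max category share gap: {(train_share - test_share).abs().max():.4f}")
```
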