## The Quick Version
When 95% of your data is one class and 5% is another, standard training produces a model that predicts the majority class for everything and still gets 95% accuracy. The model is useless for the rare class you actually care about (fraud, defects, disease).
The fastest fix is class weights — tell the loss function to penalize mistakes on rare classes more heavily:
```shell
pip install scikit-learn imbalanced-learn
```
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Create an imbalanced dataset: 95% class 0, 5% class 1
X, y = make_classification(
    n_samples=10000, n_features=20, n_informative=10,
    weights=[0.95, 0.05], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Without balancing — model ignores the minority class
clf_baseline = RandomForestClassifier(random_state=42)
clf_baseline.fit(X_train, y_train)
print("WITHOUT class weights:")
print(classification_report(y_test, clf_baseline.predict(X_test)))

# With class weights — model pays attention to the minority class
clf_balanced = RandomForestClassifier(class_weight="balanced", random_state=42)
clf_balanced.fit(X_train, y_train)
print("WITH class weights:")
print(classification_report(y_test, clf_balanced.predict(X_test)))
```
`class_weight="balanced"` automatically sets weights inversely proportional to class frequencies. The minority class gets ~19x the weight of the majority class (95/5). This alone typically doubles or triples recall on the minority class.
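You can verify the exact weights sklearn derives with `compute_class_weight`, a quick sanity check using the same 95/5 split:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# 95/5 label distribution, as in the dataset above
y = np.array([0] * 9500 + [1] * 500)

# "balanced" weight = n_samples / (n_classes * count_of_class)
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
print(weights)                  # [0.526..., 10.0]
print(weights[1] / weights[0])  # 19.0 — exactly the 95/5 ratio
```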
## SMOTE: Synthetic Minority Oversampling
SMOTE creates synthetic examples of the minority class by interpolating between existing samples. It gives the model more diverse examples to learn from without just duplicating the same few points.
```python
from imblearn.over_sampling import SMOTE
from collections import Counter

print(f"Before SMOTE: {Counter(y_train)}")
# Before SMOTE: Counter({0: 7600, 1: 400})

smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
print(f"After SMOTE: {Counter(y_resampled)}")
# After SMOTE: Counter({0: 7600, 1: 7600})

clf_smote = RandomForestClassifier(random_state=42)
clf_smote.fit(X_resampled, y_resampled)
print(classification_report(y_test, clf_smote.predict(X_test)))
```
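The interpolation step itself is simple. Here is a minimal NumPy sketch (not imblearn's actual implementation) of how one synthetic point is generated: pick a minority sample, pick one of its k nearest minority neighbors, and step a random fraction of the way toward it:

```python
import numpy as np

rng = np.random.default_rng(42)

def smote_one_sample(X_minority: np.ndarray, k: int = 5) -> np.ndarray:
    """Generate one synthetic minority sample by interpolation."""
    i = rng.integers(len(X_minority))
    x = X_minority[i]
    # Find the k nearest minority neighbors of x (excluding x itself)
    dists = np.linalg.norm(X_minority - x, axis=1)
    neighbor_idx = np.argsort(dists)[1 : k + 1]
    neighbor = X_minority[rng.choice(neighbor_idx)]
    # Interpolate a random fraction of the way toward the neighbor
    return x + rng.random() * (neighbor - x)

X_min = rng.normal(size=(40, 2))
synthetic = smote_one_sample(X_min)
print(synthetic.shape)  # (2,)
```

Because new points lie on line segments between existing minority samples, they stay inside the minority region rather than duplicating points exactly.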
## SMOTE Variants for Different Data Types
```python
from imblearn.over_sampling import SMOTENC, ADASYN, BorderlineSMOTE

# SMOTE-NC: handles datasets with both numerical and categorical features.
# categorical_features lists the categorical column indices
# (illustrative here, since this dataset is all numeric)
smotenc = SMOTENC(categorical_features=[0, 5, 12], random_state=42)
X_res, y_res = smotenc.fit_resample(X_train, y_train)

# BorderlineSMOTE: only generates samples near the decision boundary
# More targeted than regular SMOTE — focuses where it matters
border_smote = BorderlineSMOTE(random_state=42)
X_res, y_res = border_smote.fit_resample(X_train, y_train)

# ADASYN: generates more samples in regions where the model struggles
adasyn = ADASYN(random_state=42)
X_res, y_res = adasyn.fit_resample(X_train, y_train)
```
`BorderlineSMOTE` is usually the best choice — it generates synthetic samples where they’re most needed (near the class boundary) instead of uniformly across the minority class.
## Undersampling the Majority Class
Instead of creating more minority samples, remove majority samples. This works well when you have plenty of data and the majority class has redundant examples.
```python
from imblearn.under_sampling import RandomUnderSampler, TomekLinks
from imblearn.combine import SMOTETomek

# Random undersampling — fast but loses information
rus = RandomUnderSampler(random_state=42)
X_under, y_under = rus.fit_resample(X_train, y_train)
print(f"After undersampling: {Counter(y_under)}")
# After undersampling: Counter({0: 400, 1: 400})

# Tomek Links — drops the majority member of each cross-class
# nearest-neighbor pair, cleaning up the decision boundary
tomek = TomekLinks()
X_clean, y_clean = tomek.fit_resample(X_train, y_train)

# Best of both: SMOTE + Tomek Links
smt = SMOTETomek(random_state=42)
X_combined, y_combined = smt.fit_resample(X_train, y_train)
```
The downside of undersampling: you throw away potentially useful data. Here, the 7,600 majority samples in the training set shrink to 400, discarding about 95% of the majority class. Use this when you have millions of samples and can afford to discard most of the majority.
## Class Weights in Deep Learning
For PyTorch models, compute class weights and pass them to the loss function:
```python
import torch
import torch.nn as nn
import numpy as np

# Compute weights from training labels
class_counts = np.bincount(y_train)
total = len(y_train)
weights = torch.tensor(
    [total / (len(class_counts) * c) for c in class_counts],
    dtype=torch.float32,
)
print(f"Class weights: {weights}")
# Class weights: tensor([0.5263, 10.0000])

# Use in CrossEntropyLoss (weights must live on the model's device)
criterion = nn.CrossEntropyLoss(weight=weights.cuda())

# Or for binary classification with BCEWithLogitsLoss
pos_weight = torch.tensor([class_counts[0] / class_counts[1]])
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight.cuda())
```
## Focal Loss for Hard Examples
Focal loss down-weights easy examples and focuses training on hard, misclassified samples. It’s especially effective for extreme imbalance (1:100 or worse):
```python
class FocalLoss(nn.Module):
    def __init__(self, alpha: float = 0.25, gamma: float = 2.0):
        super().__init__()
        self.alpha = alpha
        self.gamma = gamma

    def forward(self, inputs: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        bce_loss = nn.functional.binary_cross_entropy_with_logits(
            inputs, targets, reduction="none"
        )
        probs = torch.sigmoid(inputs)
        # p_t is the model's probability for the true class
        p_t = probs * targets + (1 - probs) * (1 - targets)
        focal_weight = self.alpha * (1 - p_t) ** self.gamma
        return (focal_weight * bce_loss).mean()

criterion = FocalLoss(alpha=0.25, gamma=2.0)
```
`gamma=2.0` is the standard starting point. Higher `gamma` focuses more aggressively on hard examples. `alpha` balances positive vs. negative classes — set it to the minority class frequency (e.g., 0.05 for 5% minority).
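A quick back-of-envelope check makes the effect of `gamma` concrete: the modulating factor `(1 - p_t) ** gamma` collapses for examples the model already classifies confidently.

```python
# Focal modulating factor (1 - p_t) ** gamma, ignoring alpha
gamma = 2.0
for p_t in [0.5, 0.9, 0.99]:
    print(f"p_t={p_t}: weight={(1 - p_t) ** gamma:.4f}")
# p_t=0.5:  weight=0.2500  (hard, coin-flip example)
# p_t=0.99: weight=0.0001  (easy example, 2500x smaller contribution)
```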
## Evaluation: Stop Using Accuracy
Accuracy is meaningless for imbalanced data. A model that always predicts “not fraud” gets 99.5% accuracy on a dataset with 0.5% fraud. Use these metrics instead:
```python
from sklearn.metrics import (
    precision_recall_curve, average_precision_score,
    f1_score, roc_auc_score, confusion_matrix,
)
import numpy as np

y_pred = clf_balanced.predict(X_test)
y_proba = clf_balanced.predict_proba(X_test)[:, 1]

print(f"F1 (minority): {f1_score(y_test, y_pred):.3f}")
print(f"AUC-ROC: {roc_auc_score(y_test, y_proba):.3f}")
print(f"Average Precision: {average_precision_score(y_test, y_proba):.3f}")

cm = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")
print(f"  TN={cm[0][0]}, FP={cm[0][1]}")
print(f"  FN={cm[1][0]}, TP={cm[1][1]}")

# Precision-Recall curve is more informative than ROC for imbalanced data
precisions, recalls, thresholds = precision_recall_curve(y_test, y_proba)
at_high_recall = recalls >= 0.9
print(f"\nBest precision at >=90% recall: {precisions[at_high_recall].max():.3f}")
```
Average Precision (area under the precision-recall curve) is the single best metric for imbalanced classification. Unlike AUC-ROC, it doesn’t get inflated by true negatives.
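A quick way to see the difference: score a classifier with random outputs on 5%-positive labels. AUC-ROC sits near its usual 0.5 baseline, while Average Precision collapses to the positive prevalence (~0.05), correctly reflecting zero skill:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
y_true = (rng.random(100_000) < 0.05).astype(int)  # ~5% positives
y_scores = rng.random(100_000)                     # random "probabilities"

print(f"AUC-ROC: {roc_auc_score(y_true, y_scores):.3f}")                       # ~0.5
print(f"Average Precision: {average_precision_score(y_true, y_scores):.3f}")   # ~0.05
```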
## Common Errors and Fixes
**SMOTE on the full dataset before train/test split**
Always split first, then apply SMOTE to the training set only. Applying SMOTE before splitting leaks information from the test set into synthetic training samples, inflating your metrics.
```python
# WRONG — SMOTE before the split leaks test information
X_res, y_res = smote.fit_resample(X, y)
X_train, X_test, y_train, y_test = train_test_split(X_res, y_res)

# RIGHT — split first, then resample the training set only
X_train, X_test, y_train, y_test = train_test_split(X, y)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)
```
**Model has high recall but terrible precision**
You’ve over-corrected. The model flags everything as the minority class to avoid missing any. Reduce class weights or use a less aggressive resampling ratio (e.g., 1:2 instead of 1:1).
**SMOTE creates noise in high-dimensional data**
With many features, SMOTE interpolates in sparse regions and creates unrealistic samples. Reduce dimensionality first (PCA to 20-50 features), then apply SMOTE.
**Cross-validation scores vary wildly**
Use stratified k-fold to maintain class ratios in each fold. Apply resampling inside each fold, not before splitting:
```python
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.model_selection import cross_val_score, StratifiedKFold

# The pipeline re-applies SMOTE inside each training fold
pipeline = ImbPipeline([
    ("smote", SMOTE(random_state=42)),
    ("clf", RandomForestClassifier(random_state=42)),
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring="average_precision")
print(f"AP scores: {scores.mean():.3f} +/- {scores.std():.3f}")