## The Quick Version
When 95% of your data is one class and 5% is another, standard training produces a model that predicts the majority class for everything and still gets 95% accuracy. The model is useless for the rare class you actually care about (fraud, defects, disease).
The fastest fix is class weights — tell the loss function to penalize mistakes on rare classes more heavily:
```shell
pip install scikit-learn imbalanced-learn
```
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Create an imbalanced dataset: 95% class 0, 5% class 1
X, y = make_classification(
    n_samples=10000, n_features=20, n_informative=10,
    weights=[0.95, 0.05], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Without balancing — model ignores the minority class
clf_baseline = RandomForestClassifier(random_state=42)
clf_baseline.fit(X_train, y_train)
print("WITHOUT class weights:")
print(classification_report(y_test, clf_baseline.predict(X_test)))

# With class weights — model pays attention to the minority class
clf_balanced = RandomForestClassifier(class_weight="balanced", random_state=42)
clf_balanced.fit(X_train, y_train)
print("WITH class weights:")
print(classification_report(y_test, clf_balanced.predict(X_test)))
```
`class_weight="balanced"` automatically sets weights inversely proportional to class frequencies. The minority class gets ~19x the weight of the majority class (95/5). This alone typically doubles or triples recall on the minority class.
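You can verify the exact weights sklearn derives with `compute_class_weight`, a quick sanity check using the same 95/5 split:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# 95/5 label distribution, as in the dataset above
y = np.array([0] * 9500 + [1] * 500)

# "balanced" weight = n_samples / (n_classes * count_of_class)
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
print(weights)                  # [0.526..., 10.0]
print(weights[1] / weights[0])  # 19.0 — exactly the 95/5 ratio
```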
## SMOTE: Synthetic Minority Oversampling
SMOTE creates synthetic examples of the minority class by interpolating between existing samples. It gives the model more diverse examples to learn from without just duplicating the same few points.
```python
from imblearn.over_sampling import SMOTE
from collections import Counter

print(f"Before SMOTE: {Counter(y_train)}")
# Before SMOTE: Counter({0: 7600, 1: 400})

smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
print(f"After SMOTE: {Counter(y_resampled)}")
# After SMOTE: Counter({0: 7600, 1: 7600})

clf_smote = RandomForestClassifier(random_state=42)
clf_smote.fit(X_resampled, y_resampled)
print(classification_report(y_test, clf_smote.predict(X_test)))
```
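The interpolation step itself is simple. Here is a minimal NumPy sketch (not imblearn's actual implementation) of how one synthetic point is generated: pick a minority sample, pick one of its k nearest minority neighbors, and step a random fraction of the way toward it:

```python
import numpy as np

rng = np.random.default_rng(42)

def smote_one_sample(X_minority: np.ndarray, k: int = 5) -> np.ndarray:
    """Generate one synthetic minority sample by interpolation."""
    i = rng.integers(len(X_minority))
    x = X_minority[i]
    # Find the k nearest minority neighbors of x (excluding x itself)
    dists = np.linalg.norm(X_minority - x, axis=1)
    neighbor_idx = np.argsort(dists)[1 : k + 1]
    neighbor = X_minority[rng.choice(neighbor_idx)]
    # Interpolate a random fraction of the way toward the neighbor
    return x + rng.random() * (neighbor - x)

X_min = rng.normal(size=(40, 2))
synthetic = smote_one_sample(X_min)
print(synthetic.shape)  # (2,)
```

Because new points lie on line segments between existing minority samples, they stay inside the minority region rather than duplicating points exactly.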
## SMOTE Variants for Different Data Types
```python
from imblearn.over_sampling import SMOTENC, ADASYN, BorderlineSMOTE

# SMOTE-NC: handles datasets with both numerical and categorical features.
# categorical_features lists the categorical column indices
# (illustrative here, since this dataset is all numeric)
smotenc = SMOTENC(categorical_features=[0, 5, 12], random_state=42)
X_res, y_res = smotenc.fit_resample(X_train, y_train)

# BorderlineSMOTE: only generates samples near the decision boundary
# More targeted than regular SMOTE — focuses where it matters
border_smote = BorderlineSMOTE(random_state=42)
X_res, y_res = border_smote.fit_resample(X_train, y_train)

# ADASYN: generates more samples in regions where the model struggles
adasyn = ADASYN(random_state=42)
X_res, y_res = adasyn.fit_resample(X_train, y_train)
```
`BorderlineSMOTE` is usually the best choice — it generates synthetic samples where they’re most needed (near the class boundary) instead of uniformly across the minority class.
## Undersampling the Majority Class
Instead of creating more minority samples, remove majority samples. This works well when you have plenty of data and the majority class has redundant examples.
```python
from imblearn.under_sampling import RandomUnderSampler, TomekLinks
from imblearn.combine import SMOTETomek

# Random undersampling — fast but loses information
rus = RandomUnderSampler(random_state=42)
X_under, y_under = rus.fit_resample(X_train, y_train)
print(f"After undersampling: {Counter(y_under)}")
# After undersampling: Counter({0: 400, 1: 400})

# Tomek Links — drops the majority member of each cross-class
# nearest-neighbor pair, cleaning up the decision boundary
tomek = TomekLinks()
X_clean, y_clean = tomek.fit_resample(X_train, y_train)

# Best of both: SMOTE + Tomek Links
smt = SMOTETomek(random_state=42)
X_combined, y_combined = smt.fit_resample(X_train, y_train)
```
The downside of undersampling: you throw away potentially useful data. Here, the 7,600 majority samples in the training set shrink to 400, discarding about 95% of the majority class. Use this when you have millions of samples and can afford to discard most of the majority.
## Class Weights in Deep Learning
For PyTorch models, compute class weights and pass them to the loss function:
```python
import torch
import torch.nn as nn
import numpy as np

# Compute weights from training labels
class_counts = np.bincount(y_train)
total = len(y_train)
weights = torch.tensor(
    [total / (len(class_counts) * c) for c in class_counts],
    dtype=torch.float32,
)
print(f"Class weights: {weights}")
# Class weights: tensor([0.5263, 10.0000])

# Use in CrossEntropyLoss (weights must live on the model's device)
criterion = nn.CrossEntropyLoss(weight=weights.cuda())

# Or for binary classification with BCEWithLogitsLoss
pos_weight = torch.tensor([class_counts[0] / class_counts[1]])
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight.cuda())
```
## Focal Loss for Hard Examples
Focal loss down-weights easy examples and focuses training on hard, misclassified samples. It’s especially effective for extreme imbalance (1:100 or worse):
```python
class FocalLoss(nn.Module):
    def __init__(self, alpha: float = 0.25, gamma: float = 2.0):
        super().__init__()
        self.alpha = alpha
        self.gamma = gamma

    def forward(self, inputs: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
        bce_loss = nn.functional.binary_cross_entropy_with_logits(
            inputs, targets, reduction="none"
        )
        probs = torch.sigmoid(inputs)
        # p_t is the model's probability for the true class
        p_t = probs * targets + (1 - probs) * (1 - targets)
        focal_weight = self.alpha * (1 - p_t) ** self.gamma
        return (focal_weight * bce_loss).mean()

criterion = FocalLoss(alpha=0.25, gamma=2.0)
```
`gamma=2.0` is the standard starting point. Higher `gamma` focuses more aggressively on hard examples. `alpha` balances positive vs. negative classes — set it to the minority class frequency (e.g., 0.05 for 5% minority).
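A quick back-of-envelope check makes the effect of `gamma` concrete: the modulating factor `(1 - p_t) ** gamma` collapses for examples the model already classifies confidently.

```python
# Focal modulating factor (1 - p_t) ** gamma, ignoring alpha
gamma = 2.0
for p_t in [0.5, 0.9, 0.99]:
    print(f"p_t={p_t}: weight={(1 - p_t) ** gamma:.4f}")
# p_t=0.5:  weight=0.2500  (hard, coin-flip example)
# p_t=0.99: weight=0.0001  (easy example, 2500x smaller contribution)
```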
## Evaluation: Stop Using Accuracy
Accuracy is meaningless for imbalanced data. A model that always predicts “not fraud” gets 99.5% accuracy on a dataset with 0.5% fraud. Use these metrics instead:
```python
from sklearn.metrics import (
    precision_recall_curve, average_precision_score,
    f1_score, roc_auc_score, confusion_matrix,
)
import numpy as np

y_pred = clf_balanced.predict(X_test)
y_proba = clf_balanced.predict_proba(X_test)[:, 1]

print(f"F1 (minority): {f1_score(y_test, y_pred):.3f}")
print(f"AUC-ROC: {roc_auc_score(y_test, y_proba):.3f}")
print(f"Average Precision: {average_precision_score(y_test, y_proba):.3f}")

cm = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")
print(f"  TN={cm[0][0]}, FP={cm[0][1]}")
print(f"  FN={cm[1][0]}, TP={cm[1][1]}")

# Precision-Recall curve is more informative than ROC for imbalanced data
precisions, recalls, thresholds = precision_recall_curve(y_test, y_proba)
at_high_recall = recalls >= 0.9
print(f"\nBest precision at >=90% recall: {precisions[at_high_recall].max():.3f}")
```
Average Precision (area under the precision-recall curve) is the single best metric for imbalanced classification. Unlike AUC-ROC, it doesn’t get inflated by true negatives.
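A quick way to see the difference: score a classifier with random outputs on 5%-positive labels. AUC-ROC sits near its usual 0.5 baseline, while Average Precision collapses to the positive prevalence (~0.05), correctly reflecting zero skill:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
y_true = (rng.random(100_000) < 0.05).astype(int)  # ~5% positives
y_scores = rng.random(100_000)                     # random "probabilities"

print(f"AUC-ROC: {roc_auc_score(y_true, y_scores):.3f}")                       # ~0.5
print(f"Average Precision: {average_precision_score(y_true, y_scores):.3f}")   # ~0.05
```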
## Common Errors and Fixes
**SMOTE on the full dataset before train/test split**
Always split first, then apply SMOTE to the training set only. Applying SMOTE before splitting leaks information from the test set into synthetic training samples, inflating your metrics.
```python
# WRONG — SMOTE before the split leaks test information
X_res, y_res = smote.fit_resample(X, y)
X_train, X_test, y_train, y_test = train_test_split(X_res, y_res)

# RIGHT — split first, then resample the training set only
X_train, X_test, y_train, y_test = train_test_split(X, y)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)
```
**Model has high recall but terrible precision**
You’ve over-corrected. The model flags everything as the minority class to avoid missing any. Reduce class weights or use a less aggressive resampling ratio (e.g., 1:2 instead of 1:1).
**SMOTE creates noise in high-dimensional data**
With many features, SMOTE interpolates in sparse regions and creates unrealistic samples. Reduce dimensionality first (PCA to 20-50 features), then apply SMOTE.
**Cross-validation scores vary wildly**
Use stratified k-fold to maintain class ratios in each fold. Apply resampling inside each fold, not before splitting:
```python
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.model_selection import cross_val_score, StratifiedKFold

# The pipeline re-applies SMOTE inside each training fold
pipeline = ImbPipeline([
    ("smote", SMOTE(random_state=42)),
    ("clf", RandomForestClassifier(random_state=42)),
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring="average_precision")
print(f"AP scores: {scores.mean():.3f} +/- {scores.std():.3f}")