Your model has great accuracy, but it performs worse for one demographic group than another. That’s a fairness problem, and accuracy alone won’t catch it. Fairlearn gives you the tools to measure exactly where bias shows up and algorithms to reduce it.
Install fairlearn alongside scikit-learn:
```bash
pip install fairlearn scikit-learn pandas
```
## Load Data and Train a Baseline
We’ll use the Adult Census dataset, a classic benchmark in fairness research. The task is to predict whether someone earns over $50K/year, and the sex column is the sensitive feature we’ll audit.
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from fairlearn.metrics import MetricFrame, demographic_parity_difference, equalized_odds_difference
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Load Adult Census data
from sklearn.datasets import fetch_openml

data = fetch_openml(data_id=1590, as_frame=True)
X = data.data
y = (data.target == ">50K").astype(int)

# The sensitive feature
sensitive = X["sex"]

# Drop non-numeric columns for simplicity
X_numeric = X.select_dtypes(include=["number"]).copy()

X_train, X_test, y_train, y_test, sens_train, sens_test = train_test_split(
    X_numeric, y, sensitive, test_size=0.3, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train a baseline logistic regression
baseline = LogisticRegression(max_iter=1000, random_state=42)
baseline.fit(X_train_scaled, y_train)
y_pred_baseline = baseline.predict(X_test_scaled)

print(f"Baseline accuracy: {accuracy_score(y_test, y_pred_baseline):.3f}")
```
This gives you a working model. Now the question: is it fair?
## Measure Fairness with MetricFrame
MetricFrame breaks down any sklearn metric by sensitive group. You’ll immediately see if your model treats groups differently.
```python
metrics = {
    "accuracy": accuracy_score,
    "balanced_accuracy": balanced_accuracy_score,
}

mf = MetricFrame(
    metrics=metrics,
    y_true=y_test,
    y_pred=y_pred_baseline,
    sensitive_features=sens_test,
)

print("=== Per-group metrics ===")
print(mf.by_group)
print()

print("=== Differences (max - min across groups) ===")
print(mf.difference())
print()

dp_diff = demographic_parity_difference(
    y_test, y_pred_baseline, sensitive_features=sens_test
)
eo_diff = equalized_odds_difference(
    y_test, y_pred_baseline, sensitive_features=sens_test
)
print(f"Demographic parity difference: {dp_diff:.3f}")
print(f"Equalized odds difference: {eo_diff:.3f}")
```
What the numbers mean:
- Demographic parity difference measures the gap in positive prediction rates between groups. A value of 0 means both groups get positive predictions at the same rate. On this dataset, you’ll typically see values around 0.15-0.20, meaning one group is predicted to earn >$50K much more often.
- Equalized odds difference measures the gap in true positive and false positive rates. It tells you whether the model’s errors are distributed unevenly.
Any value above 0.05-0.10 deserves attention.
## Mitigate Bias with ExponentiatedGradient
Fairlearn’s ExponentiatedGradient retrains your model while enforcing a fairness constraint. It’s a reduction-based approach: it converts the fairness problem into a sequence of cost-sensitive classification problems.
```python
from fairlearn.reductions import ExponentiatedGradient, DemographicParity

constraint = DemographicParity()
mitigator = ExponentiatedGradient(
    estimator=LogisticRegression(max_iter=1000, random_state=42),
    constraints=constraint,
)
mitigator.fit(X_train_scaled, y_train, sensitive_features=sens_train)
y_pred_mitigated = mitigator.predict(X_test_scaled)

# Compare before and after
dp_after = demographic_parity_difference(
    y_test, y_pred_mitigated, sensitive_features=sens_test
)
eo_after = equalized_odds_difference(
    y_test, y_pred_mitigated, sensitive_features=sens_test
)
acc_after = accuracy_score(y_test, y_pred_mitigated)

print("=== Before Mitigation ===")
print(f"Accuracy: {accuracy_score(y_test, y_pred_baseline):.3f}")
print(f"Demographic parity diff: {dp_diff:.3f}")
print(f"Equalized odds diff: {eo_diff:.3f}")
print()
print("=== After Mitigation (ExponentiatedGradient) ===")
print(f"Accuracy: {acc_after:.3f}")
print(f"Demographic parity diff: {dp_after:.3f}")
print(f"Equalized odds diff: {eo_after:.3f}")
```
You’ll typically see the demographic parity difference drop from roughly 0.18 to under 0.02, at the cost of a percentage point or two of accuracy. That’s the fairness-accuracy tradeoff; whether it’s acceptable depends on your application, but for a gap that large it usually is.
## Post-Processing with ThresholdOptimizer
If you can’t retrain your model (maybe it’s already deployed, or training is expensive), ThresholdOptimizer adjusts prediction thresholds per group to equalize outcomes after the fact.
```python
from fairlearn.postprocessing import ThresholdOptimizer

postprocessor = ThresholdOptimizer(
    estimator=baseline,
    constraints="demographic_parity",
    objective="balanced_accuracy_score",
    prefit=True,
)

# Fit the per-group thresholds on training data,
# then evaluate on the held-out test set
postprocessor.fit(X_train_scaled, y_train, sensitive_features=sens_train)
y_pred_post = postprocessor.predict(X_test_scaled, sensitive_features=sens_test)

dp_post = demographic_parity_difference(
    y_test, y_pred_post, sensitive_features=sens_test
)
print(f"ThresholdOptimizer demographic parity diff: {dp_post:.3f}")
print(f"ThresholdOptimizer accuracy: {accuracy_score(y_test, y_pred_post):.3f}")
```
ThresholdOptimizer works with any classifier that exposes scores (predict_proba or decision_function). Set prefit=True when passing an already-trained model, and note that its predict method requires sensitive_features at prediction time.
## Build a Full Pipeline
Here’s a compact end-to-end script that loads data, trains, audits, mitigates, and compares:
```python
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from fairlearn.metrics import MetricFrame, demographic_parity_difference
from fairlearn.reductions import ExponentiatedGradient, EqualizedOdds

# Data
data = fetch_openml(data_id=1590, as_frame=True)
X = data.data.select_dtypes(include=["number"]).copy()
y = (data.target == ">50K").astype(int)
sensitive = data.data["sex"]

X_train, X_test, y_train, y_test, s_train, s_test = train_test_split(
    X, y, sensitive, test_size=0.3, random_state=42
)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# Baseline
gbc = GradientBoostingClassifier(n_estimators=100, random_state=42)
gbc.fit(X_train_s, y_train)
y_base = gbc.predict(X_test_s)

# Mitigated (equalized odds this time)
mitigator = ExponentiatedGradient(
    estimator=GradientBoostingClassifier(n_estimators=100, random_state=42),
    constraints=EqualizedOdds(),
)
mitigator.fit(X_train_s, y_train, sensitive_features=s_train)
y_fair = mitigator.predict(X_test_s)

# Report
for label, preds in [("Baseline", y_base), ("Mitigated", y_fair)]:
    mf = MetricFrame(
        metrics={"accuracy": accuracy_score},
        y_true=y_test,
        y_pred=preds,
        sensitive_features=s_test,
    )
    dp = demographic_parity_difference(y_test, preds, sensitive_features=s_test)
    print(f"{label}:")
    print(f"  Overall accuracy: {mf.overall['accuracy']:.3f}")
    print(f"  Per-group accuracy: {dict(mf.by_group['accuracy'])}")
    print(f"  Demographic parity diff: {dp:.3f}")
    print()
```
This gives you a clear before/after comparison. EqualizedOdds as the constraint forces the model to equalize both true positive and false positive rates across groups — a stricter requirement than demographic parity.
## Common Errors and Fixes
ValueError: sensitive_features has X samples, but y has Y samples
The sensitive feature array and your labels must have the same length and align row-by-row. This usually happens when you forget to split the sensitive features along with X and y:
```python
# Wrong: using the full sensitive array with test labels
y_pred = model.predict(X_test)
dp = demographic_parity_difference(y_test, y_pred, sensitive_features=sensitive)  # mismatched lengths

# Right: split sensitive features alongside X and y
X_train, X_test, y_train, y_test, s_train, s_test = train_test_split(
    X, y, sensitive, test_size=0.3, random_state=42
)
dp = demographic_parity_difference(y_test, y_pred, sensitive_features=s_test)
```
UserWarning: No data for group ... from MetricFrame
Your sensitive feature column contains NaN values. Drop or impute them before splitting:
```python
mask = sensitive.notna()
X = X[mask]
y = y[mask]
sensitive = sensitive[mask]
```
ThresholdOptimizer raises NotFittedError with prefit=True
You passed an unfitted estimator but told Fairlearn it’s already fitted. Either train first or remove the flag:
```python
# Option 1: train first, then pass prefit=True
model.fit(X_train, y_train)
opt = ThresholdOptimizer(estimator=model, constraints="demographic_parity", prefit=True)

# Option 2: let ThresholdOptimizer train the model itself
opt = ThresholdOptimizer(estimator=model, constraints="demographic_parity", prefit=False)
opt.fit(X_train, y_train, sensitive_features=s_train)
```
ExponentiatedGradient converges slowly or times out
The default runs for 50 iterations. For large datasets or complex models, increase max_iter or use a simpler base estimator:
```python
mitigator = ExponentiatedGradient(
    estimator=LogisticRegression(max_iter=1000),
    constraints=DemographicParity(),
    max_iter=100,  # more iterations for convergence
    eps=0.05,      # relax the constraint (default is 0.01) to ease convergence
)
```
Reducing eps tightens the fairness constraint but makes convergence harder. Start with the default of 0.01, raise it if convergence is slow, and lower it only if you need tighter guarantees.