Your model has great accuracy, but it performs worse for one demographic group than another. That’s a fairness problem, and accuracy alone won’t catch it. Fairlearn gives you the tools to measure exactly where bias shows up and algorithms to reduce it.

Install fairlearn alongside scikit-learn:

pip install fairlearn scikit-learn pandas

Load Data and Train a Baseline

We’ll use the Adult Census dataset, a classic benchmark in fairness research. The task is to predict whether someone earns over $50K/year, and the sex column is the sensitive feature we want to audit.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from fairlearn.metrics import MetricFrame, demographic_parity_difference, equalized_odds_difference
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Load Adult Census data
from sklearn.datasets import fetch_openml
data = fetch_openml(data_id=1590, as_frame=True)
X = data.data
y = (data.target == ">50K").astype(int)

# The sensitive feature
sensitive = X["sex"]

# Drop non-numeric columns for simplicity
X_numeric = X.select_dtypes(include=["number"]).copy()

X_train, X_test, y_train, y_test, sens_train, sens_test = train_test_split(
    X_numeric, y, sensitive, test_size=0.3, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train a baseline logistic regression
baseline = LogisticRegression(max_iter=1000, random_state=42)
baseline.fit(X_train_scaled, y_train)
y_pred_baseline = baseline.predict(X_test_scaled)

print(f"Baseline accuracy: {accuracy_score(y_test, y_pred_baseline):.3f}")

This gives you a working model. Now the question: is it fair?

Measure Fairness with MetricFrame

MetricFrame breaks down any sklearn metric by sensitive group. You’ll immediately see if your model treats groups differently.

metrics = {
    "accuracy": accuracy_score,
    "balanced_accuracy": balanced_accuracy_score,
}

mf = MetricFrame(
    metrics=metrics,
    y_true=y_test,
    y_pred=y_pred_baseline,
    sensitive_features=sens_test,
)

print("=== Per-group metrics ===")
print(mf.by_group)
print()
print("=== Differences (max - min across groups) ===")
print(mf.difference())
print()

dp_diff = demographic_parity_difference(
    y_test, y_pred_baseline, sensitive_features=sens_test
)
eo_diff = equalized_odds_difference(
    y_test, y_pred_baseline, sensitive_features=sens_test
)
print(f"Demographic parity difference: {dp_diff:.3f}")
print(f"Equalized odds difference:     {eo_diff:.3f}")

What the numbers mean:

  • Demographic parity difference measures the gap in positive prediction rates between groups. A value of 0 means both groups get positive predictions at the same rate. On this dataset, you’ll typically see values around 0.15-0.20, meaning one group is predicted to earn >$50K much more often.
  • Equalized odds difference measures the gap in true positive and false positive rates. It tells you whether the model’s errors are distributed unevenly.

As a rough rule of thumb, any value above 0.05-0.10 deserves attention, though the acceptable threshold depends on your application.
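The demographic parity computation is simple enough to do by hand, which helps demystify the metric. A minimal sketch with made-up toy predictions (numpy only; the values are hypothetical, not from the Adult dataset):

```python
import numpy as np

# Hypothetical toy predictions (not the Adult dataset) for two groups
y_pred = np.array([1, 0, 1, 1, 0, 0, 1, 0])
groups = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])

# Selection rate: fraction of positive predictions within each group
rates = {str(g): float(y_pred[groups == g].mean()) for g in np.unique(groups)}

# Demographic parity difference: largest gap in selection rates
dp_diff = max(rates.values()) - min(rates.values())

print(rates)    # {'A': 0.75, 'B': 0.25}
print(dp_diff)  # 0.5
```

This is exactly what `demographic_parity_difference` computes: per-group selection rates, then the max-minus-min gap.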

Mitigate Bias with ExponentiatedGradient

Fairlearn’s ExponentiatedGradient retrains your model while enforcing a fairness constraint. It’s a reduction-based approach: it converts the fairness problem into a sequence of cost-sensitive classification problems.

from fairlearn.reductions import ExponentiatedGradient, DemographicParity

constraint = DemographicParity()
mitigator = ExponentiatedGradient(
    estimator=LogisticRegression(max_iter=1000, random_state=42),
    constraints=constraint,
)

mitigator.fit(X_train_scaled, y_train, sensitive_features=sens_train)
y_pred_mitigated = mitigator.predict(X_test_scaled)

# Compare before and after
dp_after = demographic_parity_difference(
    y_test, y_pred_mitigated, sensitive_features=sens_test
)
eo_after = equalized_odds_difference(
    y_test, y_pred_mitigated, sensitive_features=sens_test
)
acc_after = accuracy_score(y_test, y_pred_mitigated)

print("=== Before Mitigation ===")
print(f"Accuracy:                   {accuracy_score(y_test, y_pred_baseline):.3f}")
print(f"Demographic parity diff:    {dp_diff:.3f}")
print(f"Equalized odds diff:        {eo_diff:.3f}")
print()
print("=== After Mitigation (ExponentiatedGradient) ===")
print(f"Accuracy:                   {acc_after:.3f}")
print(f"Demographic parity diff:    {dp_after:.3f}")
print(f"Equalized odds diff:        {eo_after:.3f}")

You’ll typically see the demographic parity difference drop from ~0.18 to under 0.02, at the cost of a few percentage points of accuracy. That’s the fairness-accuracy tradeoff, and it’s usually worth it.
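To see the tradeoff mechanically, here is a toy numpy sketch (synthetic scores and a hypothetical setup, not the Adult pipeline above): forcing equal selection rates via per-group score thresholds changes some predictions that a single global threshold would have made, which is where the accuracy cost comes from.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000
group = rng.integers(0, 2, n)          # synthetic binary sensitive feature
scores = rng.normal(loc=group * 0.8)   # group 1 scores are shifted upward
y_true = (scores + rng.normal(scale=0.7, size=n) > 0.4).astype(int)

# One global threshold: selection rates differ sharply by group
pred_global = (scores > 0.4).astype(int)
rates = [pred_global[group == g].mean() for g in (0, 1)]
acc_global = (pred_global == y_true).mean()

# Per-group thresholds chosen so each group is selected at the same overall rate
target = pred_global.mean()
pred_fair = np.zeros(n, dtype=int)
for g in (0, 1):
    thr = np.quantile(scores[group == g], 1 - target)
    pred_fair[group == g] = (scores[group == g] > thr).astype(int)
rates_fair = [pred_fair[group == g].mean() for g in (0, 1)]
acc_fair = (pred_fair == y_true).mean()

print(f"global threshold: rates {rates[0]:.2f} vs {rates[1]:.2f}, accuracy {acc_global:.3f}")
print(f"per-group:        rates {rates_fair[0]:.2f} vs {rates_fair[1]:.2f}, accuracy {acc_fair:.3f}")
```

ExponentiatedGradient navigates this same tension, but by reweighting training examples rather than moving thresholds after the fact.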

Post-Processing with ThresholdOptimizer

If you can’t retrain your model (maybe it’s already deployed, or training is expensive), ThresholdOptimizer adjusts prediction thresholds per group to equalize outcomes after the fact.

from fairlearn.postprocessing import ThresholdOptimizer

postprocessor = ThresholdOptimizer(
    estimator=baseline,
    constraints="demographic_parity",
    objective="balanced_accuracy_score",
    prefit=True,
)

# Fit thresholds on training data (ideally a separate validation split),
# then evaluate on the held-out test set
postprocessor.fit(X_train_scaled, y_train, sensitive_features=sens_train)
y_pred_post = postprocessor.predict(X_test_scaled, sensitive_features=sens_test)

dp_post = demographic_parity_difference(
    y_test, y_pred_post, sensitive_features=sens_test
)
print(f"ThresholdOptimizer demographic parity diff: {dp_post:.3f}")
print(f"ThresholdOptimizer accuracy:                {accuracy_score(y_test, y_pred_post):.3f}")

ThresholdOptimizer works on any classifier that outputs probabilities. Set prefit=True when passing an already-trained model.

Build a Full Pipeline

Here’s a compact end-to-end script that loads data, trains, audits, mitigates, and compares:

import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from fairlearn.metrics import MetricFrame, demographic_parity_difference
from fairlearn.reductions import ExponentiatedGradient, EqualizedOdds

# Data
data = fetch_openml(data_id=1590, as_frame=True)
X = data.data.select_dtypes(include=["number"]).copy()
y = (data.target == ">50K").astype(int)
sensitive = data.data["sex"]

X_train, X_test, y_train, y_test, s_train, s_test = train_test_split(
    X, y, sensitive, test_size=0.3, random_state=42
)

scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# Baseline
gbc = GradientBoostingClassifier(n_estimators=100, random_state=42)
gbc.fit(X_train_s, y_train)
y_base = gbc.predict(X_test_s)

# Mitigated (equalized odds this time)
mitigator = ExponentiatedGradient(
    estimator=GradientBoostingClassifier(n_estimators=100, random_state=42),
    constraints=EqualizedOdds(),
)
mitigator.fit(X_train_s, y_train, sensitive_features=s_train)
y_fair = mitigator.predict(X_test_s)

# Report
for label, preds in [("Baseline", y_base), ("Mitigated", y_fair)]:
    mf = MetricFrame(
        metrics={"accuracy": accuracy_score},
        y_true=y_test,
        y_pred=preds,
        sensitive_features=s_test,
    )
    dp = demographic_parity_difference(y_test, preds, sensitive_features=s_test)
    print(f"{label}:")
    print(f"  Overall accuracy:          {mf.overall['accuracy']:.3f}")
    print(f"  Per-group accuracy:        {dict(mf.by_group['accuracy'])}")
    print(f"  Demographic parity diff:   {dp:.3f}")
    print()

This gives you a clear before/after comparison. EqualizedOdds as the constraint forces the model to equalize both true positive and false positive rates across groups — a stricter requirement than demographic parity.
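You can check what equalized odds constrains by hand: it compares the true positive rate and false positive rate per group. A self-contained sketch with made-up toy labels (numpy only):

```python
import numpy as np

def tpr_fpr(y_true, y_pred):
    # True positive rate and false positive rate from confusion counts
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    return tp / (tp + fn), fp / (fp + tn)

# Hypothetical toy labels, four samples per group
y_true = np.array([1, 1, 0, 0, 1, 1, 0, 0])
y_pred = np.array([1, 0, 0, 0, 1, 1, 1, 0])
groups = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])

results = {}
for g in ("A", "B"):
    m = groups == g
    results[g] = tpr_fpr(y_true[m], y_pred[m])
    print(f"group {g}: TPR={results[g][0]:.2f} FPR={results[g][1]:.2f}")

# Equalized odds difference = the larger of the TPR gap and the FPR gap
eo = max(abs(results["A"][0] - results["B"][0]),
         abs(results["A"][1] - results["B"][1]))
print(f"equalized odds difference: {eo:.2f}")  # 0.50
```

A perfectly calibrated constraint would drive both gaps toward zero; demographic parity, by contrast, ignores y_true entirely.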

Common Errors and Fixes

ValueError: sensitive_features has X samples, but y has Y samples

The sensitive feature array and your labels must have the same length and align row-by-row. This usually happens when you forget to split the sensitive features along with X and y:

# Wrong: using the full sensitive array with test labels
y_pred = model.predict(X_test)
dp = demographic_parity_difference(y_test, y_pred, sensitive_features=sensitive)  # mismatched lengths

# Right: split sensitive features alongside X and y
X_train, X_test, y_train, y_test, s_train, s_test = train_test_split(
    X, y, sensitive, test_size=0.3, random_state=42
)
dp = demographic_parity_difference(y_test, y_pred, sensitive_features=s_test)

UserWarning: No data for group ... from MetricFrame

Your sensitive feature column contains NaN values. Drop or impute them before splitting:

mask = sensitive.notna()
X = X[mask]
y = y[mask]
sensitive = sensitive[mask]

ThresholdOptimizer raises NotFittedError with prefit=True

You passed an unfitted estimator but told Fairlearn it’s already fitted. Either train first or remove the flag:

# Option 1: train first, then pass prefit=True
model.fit(X_train, y_train)
opt = ThresholdOptimizer(estimator=model, constraints="demographic_parity", prefit=True)

# Option 2: let ThresholdOptimizer train the model itself
opt = ThresholdOptimizer(estimator=model, constraints="demographic_parity", prefit=False)
opt.fit(X_train, y_train, sensitive_features=s_train)

ExponentiatedGradient converges slowly or times out

The default runs for 50 iterations. For large datasets or complex models, increase max_iter or use a simpler base estimator:

mitigator = ExponentiatedGradient(
    estimator=LogisticRegression(max_iter=1000),
    constraints=DemographicParity(),
    max_iter=100,  # more iterations for convergence (default is 50)
    eps=0.02,      # relax the constraint slightly (default is 0.01)
)

A smaller eps tightens the fairness constraint but makes convergence harder; a larger one relaxes it and converges faster. Start with the default of 0.01 and decrease it only if you need tighter guarantees.