Why Membership Inference Matters
A membership inference attack answers a simple question: was this specific data point used to train the model? If an attacker can figure that out, they learn something about your training set. For medical models, that means knowing someone’s health data was included. For financial models, it leaks who was in the dataset. This is a real privacy risk, and regulators care about it.
The good news: you can run these attacks yourself as an audit. If your own model is vulnerable, fix it before someone else exploits it.
Here’s the core idea in code. Train a classifier, then check if it’s more confident on training data than on unseen data:
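A minimal sketch, assuming scikit-learn on a synthetic dataset (the data and the random forest are placeholders for your own model):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Confidence = the highest predicted class probability for each example
train_conf = model.predict_proba(X_train).max(axis=1)
test_conf = model.predict_proba(X_test).max(axis=1)

gap = train_conf.mean() - test_conf.mean()
print(f"member confidence:     {train_conf.mean():.3f}")
print(f"non-member confidence: {test_conf.mean():.3f}")
print(f"confidence gap:        {gap:.3f}")
```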
If the confidence gap is large, the model has memorized its training data. A perfectly generalizing model would show similar confidence on both sets. Most real models don’t generalize perfectly, and that gap is what an attacker exploits.
The Shadow Model Approach
The shadow model technique was introduced by Shokri et al. and remains the most practical method for membership inference. You train multiple “shadow” models that mimic the target model’s behavior. Since you control the shadow models, you know exactly which data points were in their training sets. That gives you labeled data to train an attack classifier.
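A sketch of the shadow pipeline, assuming scikit-learn throughout. The `attack_features` helper and its feature layout (sorted class probabilities plus a correctness flag) are illustrative choices, not the only option:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def attack_features(model, X, y):
    """Sorted class probabilities plus a correctness flag (pred == true_label)."""
    probs = np.sort(model.predict_proba(X), axis=1)[:, ::-1]  # highest first
    correct = (model.predict(X) == y).astype(float)[:, None]
    return np.hstack([probs, correct])

def build_shadow_dataset(X_pool, y_pool, n_shadows=5, seed=0):
    """Label 1 = member of a shadow model's training set, 0 = non-member."""
    rng = np.random.RandomState(seed)
    feats, labels = [], []
    for i in range(n_shadows):
        # Each shadow gets its own member / non-member split of the pool
        X_in, X_out, y_in, y_out = train_test_split(
            X_pool, y_pool, test_size=0.5, random_state=rng.randint(10**6))
        shadow = RandomForestClassifier(random_state=i).fit(X_in, y_in)
        feats.append(attack_features(shadow, X_in, y_in))    # members
        labels.append(np.ones(len(X_in)))
        feats.append(attack_features(shadow, X_out, y_out))  # non-members
        labels.append(np.zeros(len(X_out)))
        # The shadow model is discarded here; only its features survive
    return np.vstack(feats), np.concatenate(labels)
```

Note that each shadow model goes out of scope at the end of its loop iteration, so memory stays flat no matter how many shadows you train.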
Each shadow model produces labeled examples: “this confidence vector came from a member” or “this came from a non-member.” Stack them all together and you have a training set for a binary classifier that learns the confidence patterns of memorization.
Five shadow models is a reasonable starting point. More shadows give you a better attack classifier, but the returns diminish after about 10.
Running the Attack Against the Target Model
Now use the trained attack model to audit the actual target model:
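A sketch of the audit step. It reuses the same hypothetical `attack_features` helper (sorted probabilities plus a correctness flag), and the 0-to-1 vulnerability score is assumed here to be `2 * (attack accuracy - 0.5)`, clipped at zero, so random guessing scores 0 and perfect inference scores 1:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def attack_features(model, X, y):
    """Sorted class probabilities plus a correctness flag (pred == true_label)."""
    probs = np.sort(model.predict_proba(X), axis=1)[:, ::-1]
    correct = (model.predict(X) == y).astype(float)[:, None]
    return np.hstack([probs, correct])

def audit_target(target, members, nonmembers, shadow_X, shadow_y):
    """members / nonmembers are (X, y) pairs known to be inside / outside
    the target's training set. shadow_X / shadow_y come from the shadows."""
    attack_clf = LogisticRegression(max_iter=1000).fit(shadow_X, shadow_y)
    X_m, y_m = members
    X_n, y_n = nonmembers
    X_eval = np.vstack([attack_features(target, X_m, y_m),
                        attack_features(target, X_n, y_n)])
    y_eval = np.concatenate([np.ones(len(X_m)), np.zeros(len(X_n))])
    acc = attack_clf.score(X_eval, y_eval)
    vulnerability = max(0.0, 2 * (acc - 0.5))  # 0 = random, 1 = perfect
    return acc, vulnerability
```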
The vulnerability score ranges from 0 (no leakage, equivalent to random guessing) to 1 (perfect membership inference). Anything above 0.1 deserves attention. Above 0.3, your model is seriously leaking membership information.
Interpreting the Results
- Attack accuracy around 50%: Your model generalizes well. Membership inference fails because the model treats training and non-training data similarly.
- Attack accuracy 55-65%: Moderate vulnerability. Common for well-tuned models on moderately sized datasets. Worth monitoring.
- Attack accuracy above 70%: High vulnerability. The model is overfitting and leaking membership. Apply mitigations immediately.
- Attack accuracy above 85%: Severe. The model is essentially memorizing its training data. Don’t deploy this without heavy regularization.
Mitigation Strategies
Once you’ve measured the vulnerability, here’s how to reduce it. These are ranked by ease of implementation:
L2 Regularization – The simplest fix. Penalizing large weights reduces overfitting, which directly reduces the confidence gap between members and non-members.
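A sketch in scikit-learn. Note the parameter naming: LogisticRegression uses C (inverse strength, so smaller means a stronger penalty), while MLPClassifier uses alpha (direct strength, so larger means a stronger penalty). The values shown are illustrative starting points, not tuned recommendations.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

# Smaller C = stronger L2 penalty on the weights
strong_l2_logreg = LogisticRegression(penalty="l2", C=0.1, max_iter=1000)

# Larger alpha = stronger L2 penalty on the weights
strong_l2_mlp = MLPClassifier(alpha=1e-2, max_iter=500, random_state=0)
```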
Early Stopping – Stop training before the model memorizes. For neural networks, monitor validation loss and stop when it stops improving. For tree-based models, limit depth and number of estimators.
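Both variants sketched in scikit-learn; the hyperparameter values are assumed starting points, not tuned recommendations:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

# Neural net: hold out 10% as validation, stop after 10 stagnant epochs
nn = MLPClassifier(early_stopping=True, validation_fraction=0.1,
                   n_iter_no_change=10, random_state=0)

# Trees: cap capacity directly instead of stopping a training loop
forest = RandomForestClassifier(max_depth=8, n_estimators=100, random_state=0)
```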
Label Smoothing – Instead of hard 0/1 labels, use 0.1/0.9. This prevents the model from becoming too confident on any single example, which directly defeats the confidence-based signal attackers rely on.
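The transformation itself is a one-liner; this numpy sketch shows it for binary labels. Frameworks such as Keras expose the same idea through a label_smoothing argument on their cross-entropy losses.

```python
import numpy as np

def smooth_labels(y, eps=0.2):
    """Map hard 0/1 labels to eps/2 and 1 - eps/2 (0.1 and 0.9 for eps=0.2)."""
    return y * (1 - eps) + eps / 2
```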
Differential Privacy – The gold standard. Add calibrated noise to gradients during training so that no single training example can significantly influence the model. Tools like Opacus (PyTorch) and TensorFlow Privacy make this practical. The tradeoff is accuracy – expect 2-5% accuracy loss depending on your privacy budget.
Prediction Perturbation – Add small random noise to output probabilities at inference time. This is a band-aid, not a fix. It reduces the attacker’s signal without addressing the root cause of overfitting.
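A sketch of the idea: add Gaussian noise to the probability vector at inference time, then renormalize so it still sums to one. The noise scale sigma is an assumed knob to tune against your accuracy budget.

```python
import numpy as np

def perturb_predictions(probs, sigma=0.05, seed=None):
    """probs: (n_samples, n_classes) array from predict_proba."""
    rng = np.random.default_rng(seed)
    noisy = np.clip(probs + rng.normal(0.0, sigma, probs.shape), 1e-6, None)
    return noisy / noisy.sum(axis=1, keepdims=True)  # rows sum to 1 again
```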
My recommendation: start with regularization and early stopping. They’re free in terms of complexity and often sufficient. Only reach for differential privacy if you’re handling genuinely sensitive data and need formal guarantees.
Common Errors and Fixes
ValueError: Found input variables with inconsistent numbers of samples – Your stacked feature matrix and label vector have different lengths, usually because a member or non-member feature array was appended without its matching labels. Make sure both sets of features come from predict_proba on the same model, and check that you didn’t accidentally swap train/test splits.
Attack accuracy is exactly 50% – This either means your model generalizes perfectly (unlikely) or your attack classifier is broken. Check that the shadow dataset has balanced labels. Print np.bincount(shadow_labels) to verify roughly equal counts of 0s and 1s.
Attack accuracy is below 50% – Your attack model is worse than random, which means the labels are likely flipped somewhere. Double-check that label 1 means “member” consistently in both the shadow dataset and the evaluation.
predict_proba returns different shapes for different calls – This happens when the model hasn’t seen all classes during training in a particular shadow split. Fix it by ensuring each shadow split has examples of all classes, or by using stratified splitting:
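A sketch with scikit-learn: passing stratify keeps every class represented in both halves of each shadow split, so predict_proba always returns one column per class. The synthetic pool here stands in for your shadow data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X_pool, y_pool = make_classification(n_samples=300, n_classes=3,
                                     n_informative=6, random_state=0)

# stratify=y_pool preserves the class proportions in both halves
X_in, X_out, y_in, y_out = train_test_split(
    X_pool, y_pool, test_size=0.5, stratify=y_pool, random_state=0)
```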
Out of memory with many shadow models – Don’t store all shadow model objects. Build the attack features incrementally and discard each shadow model after extracting probabilities. The build_shadow_dataset function above already does this correctly.
Confidence values are all very close to 1.0 – Your target model is extremely overfit. This actually makes the attack easier, but means the confidence-based features are less discriminative. Add the correctness feature (pred == true_label) as shown above – it helps the attack classifier distinguish members when confidence alone is saturated.
Related Guides
- How to Build Adversarial Test Suites for ML Models
- How to Build Automated Prompt Leakage Detection for LLM Apps
- How to Build Automated Jailbreak Detection for LLM Applications
- How to Build Watermark Detection for AI-Generated Images
- How to Build Prompt Injection Detection for LLM Apps
- How to Build Adversarial Robustness Testing for Vision Models
- How to Build Automated Toxicity Detection for User-Generated Content
- How to Build Automated Stereotype Detection for LLM Outputs
- How to Build Copyright Detection for AI Training Data
- How to Build Automated Fairness Testing for LLM-Generated Content