Wrap Your Training Loop with PrivacyEngine
Opacus lets you train any PyTorch model with differential privacy by wrapping your existing optimizer, model, and data loader. The core idea behind DP-SGD is simple: clip each sample’s gradient to a fixed norm, average them, and add calibrated Gaussian noise before the parameter update. That noise is what gives you a mathematical privacy guarantee.
Install Opacus and its dependencies:
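A typical setup, assuming a pip-based environment (package names are the standard ones on PyPI):

```shell
pip install opacus torch torchvision
```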
Here’s a minimal working example that trains a CNN on MNIST with differential privacy:
Three changes turn a standard PyTorch training loop into a differentially private one: instantiating PrivacyEngine, calling make_private, and querying get_epsilon to track your budget.
Understanding Epsilon, Delta, and the Privacy Budget
The privacy guarantee is expressed as (epsilon, delta)-differential privacy. Epsilon measures how much any single training example can influence the model’s output. Delta is the probability the guarantee fails entirely.
Practical guidelines for choosing these values:
- Delta should be less than 1/N where N is the training set size. For MNIST (60,000 samples), delta=1e-5 is standard.
- Epsilon under 1 means strong privacy but usually a significant accuracy loss. Epsilon between 1 and 10 is a reasonable range for production models. Epsilon above 50 provides only weak guarantees.
- Noise multiplier directly controls how much noise gets added per step. Higher values give better (lower) epsilon but hurt convergence.
You can also target a specific epsilon and let Opacus calculate the required noise multiplier:
This is often the better approach – you pick your privacy target and Opacus figures out the noise.
Fixing Incompatible Layers
Not every PyTorch module works with Opacus. BatchNorm is the most common offender because it computes statistics across the batch, meaning one sample’s normalized value depends on other samples in the same batch. That violates the per-sample isolation that DP requires.
You’ll hit a validation error the moment you try to make a model containing BatchNorm private.
Opacus provides ModuleValidator to detect and auto-fix these issues:
ModuleValidator.fix() replaces BatchNorm with GroupNorm automatically. The critical mistake people make is calling fix() after creating the optimizer. Since fix() swaps out layers, the optimizer’s parameter references become stale. Always fix first, then create your optimizer.
Tuning for Better Accuracy
DP training will always sacrifice some accuracy compared to non-private training. On CIFAR-10, expect roughly 60-65% accuracy with reasonable privacy guarantees versus 76%+ without DP. That said, bad hyperparameters make the gap much worse than it needs to be.
The order of tuning importance:
- max_grad_norm – Start with 1.0, then sweep [0.1, 0.5, 1.0, 1.5, 2.0]. Train without noise first (set noise_multiplier=0.0) and watch what fraction of gradients get clipped. You want roughly 50-80% of gradients clipped. Too low and you’re destroying information; too high and the clipping does nothing useful.
- noise_multiplier – Lower means less noise and better accuracy but weaker privacy. Start at 0.1 (barely private) to sanity-check that your pipeline works, then increase.
- Learning rate – DP training usually needs a lower learning rate than standard training because gradients are noisier. If you train non-privately at 1e-3, try 5e-4 or 1e-4 with DP.
- Batch size – Larger batches help because the noise is fixed per batch, so more samples mean a better signal-to-noise ratio. But larger batches also consume more privacy budget per step, so train for fewer epochs.
Handling Memory Pressure
Per-sample gradient computation is the most expensive part of DP training. Opacus needs to store an individual gradient for every sample in the batch rather than just the averaged gradient. For large models, this blows up memory fast.
Use BatchMemoryManager to decouple the logical batch size (for privacy accounting) from the physical batch size (what fits in GPU memory):
This accumulates gradients across multiple physical steps before applying the noisy update, giving you the privacy benefits of large batches without running out of VRAM.
Common Pitfalls
Reusing a spent privacy budget. Every call to optimizer.step() consumes privacy budget. If you save a checkpoint and resume training, you need to account for the epsilon already spent. Opacus tracks this via privacy_engine.accountant, but if you restart from scratch, that state is lost.
Forgetting to set delta. If you never call get_epsilon(delta=...), you’re not actually checking your privacy guarantee. Log epsilon every epoch and set a hard limit – stop training if epsilon exceeds your threshold.
Worrying that evaluation spends budget. DP protects the training data, not the test data. Your test accuracy is still a valid metric and doesn’t consume any privacy budget. Only gradient computations on the training set count.
Using dropout with DP. Dropout works fine with Opacus, but be aware it interacts with the noise – the effective noise per active parameter increases when some are dropped. This isn’t a bug, but it changes the accuracy/privacy tradeoff in ways that are hard to predict analytically. Test empirically.
Related Guides
- How to Build Differential Privacy Testing for LLM Training Data
- How to Implement Federated Learning for Privacy-Preserving ML
- How to Build Adversarial Robustness Testing for Vision Models
- How to Detect and Mitigate Bias in ML Models
- How to Build Automated Fairness Testing for LLM-Generated Content
- How to Build Adversarial Test Suites for ML Models
- How to Detect AI-Generated Text with Watermarking
- How to Build Model Cards and Document AI Systems Responsibly
- How to Implement AI Audit Logging and Compliance Tracking
- How to Build Automated Age and Content Gating for AI Applications