The Quick Setup
A model that hits 95% accuracy on clean test data can drop to 5% when an attacker adds imperceptible noise to the input images. You need to find this out before deployment, not after. The Adversarial Robustness Toolbox (ART) from IBM wraps your PyTorch model and throws well-studied attacks at it so you can measure exactly how fragile it is.
```shell
pip install adversarial-robustness-toolbox torch torchvision numpy matplotlib
```
```python
import torch
import numpy as np
from torchvision.models import resnet50, ResNet50_Weights
from art.estimators.classification import PyTorchClassifier
from art.attacks.evasion import FastGradientMethod

# Load pretrained ResNet-50
model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2)
model.eval()

# Wrap with ART's PyTorchClassifier
criterion = torch.nn.CrossEntropyLoss()
classifier = PyTorchClassifier(
    model=model,
    clip_values=(0.0, 1.0),
    loss=criterion,
    optimizer=None,  # not needed for inference-only testing
    input_shape=(3, 224, 224),
    nb_classes=1000,
)

# Create a test image (normally you'd load real images)
# Random image normalized to [0, 1]
x_test = np.random.rand(1, 3, 224, 224).astype(np.float32)

predictions = classifier.predict(x_test)
original_class = np.argmax(predictions, axis=1)[0]
original_confidence = predictions[0][original_class]
print(f"Original prediction: class {original_class}, confidence: {original_confidence:.4f}")

# Run FGSM attack with epsilon=0.03
attack = FastGradientMethod(estimator=classifier, eps=0.03)
x_adv = attack.generate(x=x_test)

adv_predictions = classifier.predict(x_adv)
adv_class = np.argmax(adv_predictions, axis=1)[0]
adv_confidence = adv_predictions[0][adv_class]
print(f"Adversarial prediction: class {adv_class}, confidence: {adv_confidence:.4f}")
print(f"Prediction flipped: {original_class != adv_class}")
```
That’s the core loop. Wrap your model, pick an attack, generate adversarial examples, measure the damage. Everything below builds on this pattern.
Running FGSM and PGD Attacks at Multiple Epsilon Values
FGSM (Fast Gradient Sign Method) is a single-step attack – fast but not the strongest. PGD (Projected Gradient Descent) iterates multiple steps and finds more effective perturbations within the same epsilon budget. You should test both.
Epsilon controls how much the attacker can change each pixel. For images normalized to [0, 1], common test budgets are:
- eps=0.01 – barely visible, should not fool a robust model
- eps=0.03 – subtle noise, this is the standard benchmark threshold
- eps=0.1 – noticeable artifacts, most models fail here
- eps=0.3 – clearly distorted, useful as a sanity check upper bound
```python
import numpy as np
from torchvision.models import resnet50, ResNet50_Weights
from torchvision import transforms
from PIL import Image
import torch
from art.estimators.classification import PyTorchClassifier
from art.attacks.evasion import FastGradientMethod, ProjectedGradientDescent

# Load model
model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2)
model.eval()

criterion = torch.nn.CrossEntropyLoss()
classifier = PyTorchClassifier(
    model=model,
    clip_values=(0.0, 1.0),
    loss=criterion,
    optimizer=None,
    input_shape=(3, 224, 224),
    nb_classes=1000,
)

# Load and preprocess a batch of test images
# Skip ImageNet mean/std normalization -- ART expects inputs in the
# [0, 1] range declared by clip_values
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),  # converts to [0, 1]
])

# For demo: create synthetic test batch (replace with real images)
n_samples = 50
x_test = np.random.rand(n_samples, 3, 224, 224).astype(np.float32)
y_pred = classifier.predict(x_test)
y_test = np.argmax(y_pred, axis=1)

epsilons = [0.01, 0.03, 0.05, 0.1, 0.3]

print("Attack | Epsilon | Accuracy | Avg L-inf Perturbation")
print("-" * 62)

for eps in epsilons:
    # FGSM -- single step
    fgsm = FastGradientMethod(estimator=classifier, eps=eps)
    x_fgsm = fgsm.generate(x=x_test)
    fgsm_preds = np.argmax(classifier.predict(x_fgsm), axis=1)
    fgsm_acc = np.mean(fgsm_preds == y_test) * 100
    fgsm_pert = np.max(np.abs(x_fgsm - x_test))

    # PGD -- 40 steps, step size = eps/4
    pgd = ProjectedGradientDescent(
        estimator=classifier,
        eps=eps,
        eps_step=eps / 4,
        max_iter=40,
        num_random_init=1,
    )
    x_pgd = pgd.generate(x=x_test)
    pgd_preds = np.argmax(classifier.predict(x_pgd), axis=1)
    pgd_acc = np.mean(pgd_preds == y_test) * 100
    pgd_pert = np.max(np.abs(x_pgd - x_test))

    print(f"FGSM | {eps:.2f} | {fgsm_acc:5.1f}% | {fgsm_pert:.4f}")
    print(f"PGD (40-step) | {eps:.2f} | {pgd_acc:5.1f}% | {pgd_pert:.4f}")
```
Typical results on a standard ResNet-50 against ImageNet validation samples look roughly like this:
| Attack | eps=0.01 | eps=0.03 | eps=0.1 |
|---|---|---|---|
| FGSM | ~52% acc | ~25% acc | ~3% acc |
| PGD-40 | ~38% acc | ~8% acc | ~0% acc |
PGD consistently does more damage at the same epsilon because it optimizes the perturbation over multiple iterations instead of taking a single gradient step. If your model survives PGD at eps=0.03, it has meaningful robustness. If FGSM alone drops it to single digits at eps=0.01, you have a serious problem.
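To make the difference concrete, here is a minimal, simplified sketch of the L-inf PGD update loop in plain PyTorch. This is illustration only, not ART's implementation: `pgd_sketch` is a hypothetical helper, and ART's version adds batching, random restarts, and targeted variants.

```python
import torch

def pgd_sketch(model, loss_fn, x, y, eps=0.03, eps_step=0.0075, max_iter=40):
    """Repeat FGSM-style steps, projecting back into the eps-ball after each."""
    x_adv = x.clone().detach()
    for _ in range(max_iter):
        x_adv.requires_grad_(True)
        loss = loss_fn(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + eps_step * grad.sign()         # ascent step on the loss
            x_adv = x + torch.clamp(x_adv - x, -eps, eps)  # project into the eps-ball
            x_adv = torch.clamp(x_adv, 0.0, 1.0)           # keep valid pixel range
    return x_adv.detach()
```

With `max_iter=1` and `eps_step=eps` this collapses to FGSM; the extra iterations are what let PGD find stronger perturbations inside the same budget.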
Visualizing Original vs Adversarial Images
Numbers alone do not tell the full story. Seeing the adversarial perturbations helps you understand what the attack is exploiting and whether the perturbation budget is reasonable.
```python
import matplotlib.pyplot as plt
import numpy as np
from art.estimators.classification import PyTorchClassifier
from art.attacks.evasion import ProjectedGradientDescent
from torchvision.models import resnet50, ResNet50_Weights
import torch

# Setup (reuse classifier from above)
model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2)
model.eval()
classifier = PyTorchClassifier(
    model=model,
    clip_values=(0.0, 1.0),
    loss=torch.nn.CrossEntropyLoss(),
    optimizer=None,
    input_shape=(3, 224, 224),
    nb_classes=1000,
)

# Single test image
x_test = np.random.rand(1, 3, 224, 224).astype(np.float32)
y_original = np.argmax(classifier.predict(x_test), axis=1)

# Generate adversarial example
pgd = ProjectedGradientDescent(
    estimator=classifier, eps=0.03, eps_step=0.007, max_iter=40
)
x_adv = pgd.generate(x=x_test)
y_adv = np.argmax(classifier.predict(x_adv), axis=1)

# Compute perturbation (rescaled to [0, 1] below for visibility)
perturbation = x_adv - x_test

fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Original image -- transpose from (C, H, W) to (H, W, C)
axes[0].imshow(np.clip(x_test[0].transpose(1, 2, 0), 0, 1))
axes[0].set_title(f"Original (class {y_original[0]})")
axes[0].axis("off")

# Perturbation min-max rescaled for visibility
pert_display = perturbation[0].transpose(1, 2, 0)
pert_display = (pert_display - pert_display.min()) / (pert_display.max() - pert_display.min())
axes[1].imshow(pert_display)
axes[1].set_title(f"Perturbation (rescaled, L-inf={np.max(np.abs(perturbation)):.4f})")
axes[1].axis("off")

# Adversarial image
axes[2].imshow(np.clip(x_adv[0].transpose(1, 2, 0), 0, 1))
axes[2].set_title(f"Adversarial (class {y_adv[0]})")
axes[2].axis("off")

plt.tight_layout()
plt.savefig("adversarial_comparison.png", dpi=150, bbox_inches="tight")
plt.show()

print(f"Max pixel change: {np.max(np.abs(perturbation)):.4f}")
print(f"Mean pixel change: {np.mean(np.abs(perturbation)):.4f}")
print(f"Prediction changed: {y_original[0] != y_adv[0]}")
```
At eps=0.03, the original and adversarial images look identical to the human eye. The perturbation panel shows structured noise that concentrates on edges and high-frequency regions – the attack exploits the features the model relies on most heavily.
Adversarial Training as Defense
The most effective defense against adversarial attacks is adversarial training: you generate adversarial examples during training and include them in each batch. This forces the model to learn features that are robust to small perturbations rather than relying on brittle patterns.
ART provides AdversarialTrainer that handles this automatically. You give it a classifier and an attack, and it augments each training batch with adversarial examples.
```python
import torch
import torch.nn as nn
import numpy as np
from torchvision.models import resnet50, ResNet50_Weights
from art.estimators.classification import PyTorchClassifier
from art.attacks.evasion import ProjectedGradientDescent
from art.defences.trainer import AdversarialTrainer

# Load pretrained model for fine-tuning
model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2)
model.train()

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

# Wrap model -- optimizer is required for training
classifier = PyTorchClassifier(
    model=model,
    clip_values=(0.0, 1.0),
    loss=criterion,
    optimizer=optimizer,
    input_shape=(3, 224, 224),
    nb_classes=1000,
)

# Define the attack used during training
# Use weaker PGD (7 steps) to keep training tractable
pgd_attack = ProjectedGradientDescent(
    estimator=classifier,
    eps=0.03,
    eps_step=0.007,
    max_iter=7,
    num_random_init=1,
)

# Create adversarial trainer
# ratio=0.5 means half of each batch is adversarial, half is clean
adv_trainer = AdversarialTrainer(
    classifier=classifier,
    attacks=pgd_attack,
    ratio=0.5,
)

# Prepare training data (replace with your real dataset)
n_train = 500
x_train = np.random.rand(n_train, 3, 224, 224).astype(np.float32)
y_train_logits = classifier.predict(x_train)
y_train = np.eye(1000)[np.argmax(y_train_logits, axis=1)].astype(np.float32)

# Run adversarial training
adv_trainer.fit(
    x=x_train,
    y=y_train,
    batch_size=16,
    nb_epochs=10,
)

# Evaluate robustness after adversarial training
n_eval = 100
x_eval = np.random.rand(n_eval, 3, 224, 224).astype(np.float32)
y_eval = np.argmax(classifier.predict(x_eval), axis=1)

# Test with PGD attack
pgd_eval = ProjectedGradientDescent(
    estimator=classifier, eps=0.03, eps_step=0.007, max_iter=40
)
x_eval_adv = pgd_eval.generate(x=x_eval)
y_eval_adv = np.argmax(classifier.predict(x_eval_adv), axis=1)

robust_acc = np.mean(y_eval_adv == y_eval) * 100
clean_acc_check = np.mean(np.argmax(classifier.predict(x_eval), axis=1) == y_eval) * 100
print(f"Clean accuracy after adv training: {clean_acc_check:.1f}%")
print(f"Robust accuracy (PGD eps=0.03): {robust_acc:.1f}%")
```
A few things to know about adversarial training. It is expensive – each training step requires generating adversarial examples, which means running the attack’s forward and backward passes on top of normal training. Using 7-step PGD instead of 40-step PGD during training is a common compromise; it is roughly 80% as effective at much lower cost. Expect a small drop in clean accuracy (2-5%) as a tradeoff for significantly higher robust accuracy. On ImageNet-scale models, adversarial training at eps=0.03 typically improves PGD-40 robust accuracy from under 10% to around 30-45%.
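The cost argument can be made concrete with a back-of-envelope estimate (an approximation that ignores data loading and attack bookkeeping): each PGD step costs roughly one forward plus one backward pass, the same as a training step, so k-step adversarial training runs at about 1 + k times the cost of standard training. The helper name below is illustrative, not an ART API.

```python
def adv_training_cost_multiplier(pgd_steps: int) -> int:
    """Rough relative cost of adversarial training vs standard training:
    one normal train step plus pgd_steps attack forward/backward passes."""
    return 1 + pgd_steps

print(adv_training_cost_multiplier(7))   # 8  -- roughly 8x standard training
print(adv_training_cost_multiplier(40))  # 41 -- why 7-step PGD is the common compromise
```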
Putting It All Together: A Robustness Report
Here is a complete testing pipeline that wraps everything into a reusable function and outputs a structured report.
```python
import json
import numpy as np
from datetime import datetime
from art.estimators.classification import PyTorchClassifier
from art.attacks.evasion import FastGradientMethod, ProjectedGradientDescent
from torchvision.models import resnet50, ResNet50_Weights
import torch


def run_robustness_audit(
    classifier: PyTorchClassifier,
    x_test: np.ndarray,
    y_test: np.ndarray,
    epsilons: list[float] | None = None,
    pgd_steps: int = 40,
) -> dict:
    """Run a full adversarial robustness audit on a classifier."""
    if epsilons is None:
        epsilons = [0.005, 0.01, 0.03, 0.05, 0.1]

    # Clean accuracy baseline
    clean_preds = np.argmax(classifier.predict(x_test), axis=1)
    clean_acc = np.mean(clean_preds == y_test) * 100

    results = {
        "model": "ResNet-50 (ImageNet V2 weights)",
        "timestamp": datetime.now().isoformat(),
        "n_samples": len(x_test),
        "clean_accuracy": round(clean_acc, 2),
        "attacks": [],
    }

    for eps in epsilons:
        # FGSM
        fgsm = FastGradientMethod(estimator=classifier, eps=eps)
        x_fgsm = fgsm.generate(x=x_test)
        fgsm_acc = np.mean(np.argmax(classifier.predict(x_fgsm), axis=1) == y_test) * 100

        # PGD
        pgd = ProjectedGradientDescent(
            estimator=classifier,
            eps=eps,
            eps_step=eps / 4,
            max_iter=pgd_steps,
            num_random_init=1,
        )
        x_pgd = pgd.generate(x=x_test)
        pgd_acc = np.mean(np.argmax(classifier.predict(x_pgd), axis=1) == y_test) * 100

        results["attacks"].append({
            "epsilon": eps,
            "fgsm_accuracy": round(fgsm_acc, 2),
            "fgsm_accuracy_drop": round(clean_acc - fgsm_acc, 2),
            "pgd_accuracy": round(pgd_acc, 2),
            "pgd_accuracy_drop": round(clean_acc - pgd_acc, 2),
            "pgd_steps": pgd_steps,
        })

    # Save report
    filename = f"robustness_report_{datetime.now().strftime('%Y%m%d_%H%M%S')}.json"
    with open(filename, "w") as f:
        json.dump(results, f, indent=2)
    print(f"Report saved to {filename}")

    return results


# Usage
model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2)
model.eval()
classifier = PyTorchClassifier(
    model=model,
    clip_values=(0.0, 1.0),
    loss=torch.nn.CrossEntropyLoss(),
    optimizer=None,
    input_shape=(3, 224, 224),
    nb_classes=1000,
)

# Load your test set (replace with real ImageNet validation samples)
n_test = 100
x_test = np.random.rand(n_test, 3, 224, 224).astype(np.float32)
y_test = np.argmax(classifier.predict(x_test), axis=1)

report = run_robustness_audit(classifier, x_test, y_test)

# Print summary
print(f"\nClean accuracy: {report['clean_accuracy']:.1f}%")
for attack in report["attacks"]:
    print(f"eps={attack['epsilon']:.3f} | FGSM: {attack['fgsm_accuracy']:.1f}% | PGD-40: {attack['pgd_accuracy']:.1f}%")
```
Run this before every model deployment. If PGD accuracy at eps=0.03 drops below your threshold (a reasonable starting point is 30% for non-critical applications, 50%+ for safety-critical ones), the model fails the robustness gate and needs adversarial training or architectural changes.
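That gate check can be automated on top of the report dict that `run_robustness_audit` returns. The helper name and the default threshold below are illustrative; the 30% figure is the non-critical starting point suggested above, not a universal constant.

```python
def passes_robustness_gate(report: dict, eps: float = 0.03,
                           min_pgd_accuracy: float = 30.0) -> bool:
    """Return True if PGD accuracy at the given epsilon meets the threshold."""
    for attack in report["attacks"]:
        if abs(attack["epsilon"] - eps) < 1e-9:
            return attack["pgd_accuracy"] >= min_pgd_accuracy
    raise ValueError(f"No audit result for epsilon={eps}")

# Stub report illustrating a failing model:
report = {"attacks": [{"epsilon": 0.03, "pgd_accuracy": 8.0}]}
print(passes_robustness_gate(report))  # False -- needs adversarial training
```

Wiring this into CI as a hard failure makes the robustness gate as routine as a unit-test run.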
Common Errors and Fixes
ValueError: The input shape (3, 224, 224) does not match the expected shape – ART expects numpy arrays in NCHW format (batch, channels, height, width) with values in the range specified by clip_values. If you pass a PIL image or a tensor with values in [0, 255], the attack silently produces garbage. Always convert to float32 in [0, 1]:
```python
x = np.array(image).astype(np.float32) / 255.0
x = x.transpose(2, 0, 1)       # HWC to CHW
x = np.expand_dims(x, axis=0)  # add batch dimension
```
RuntimeError: expected scalar type Float but found Double – ART passes numpy arrays to PyTorch. If your input is float64, PyTorch rejects it. Cast explicitly: x_test = x_test.astype(np.float32).
PGD attack runs extremely slowly on CPU – PGD with 40 iterations on ResNet-50 runs about 50x slower on CPU than GPU. Move the model to GPU before wrapping it with ART:
```python
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
classifier = PyTorchClassifier(
    model=model,
    clip_values=(0.0, 1.0),
    loss=criterion,
    optimizer=None,
    input_shape=(3, 224, 224),
    nb_classes=1000,
    device_type="gpu",  # tells ART to use CUDA
)
```
TypeError: 'NoneType' object is not callable when using AdversarialTrainer – You forgot to pass an optimizer to PyTorchClassifier. Inference-only testing works without an optimizer, but adversarial training requires one. Pass optimizer=torch.optim.SGD(model.parameters(), lr=0.001).
Clean accuracy drops significantly after adversarial training – This is expected but should be bounded. If clean accuracy drops more than 5-8%, reduce the ratio parameter in AdversarialTrainer from 0.5 to 0.3, or lower the training epsilon. You are trading clean performance for robustness – find the balance that fits your use case.
OutOfMemoryError when generating adversarial examples for large batches – PGD stores intermediate gradients. Reduce batch size to 8 or 4 for attack generation, or use batch_size parameter in the attack constructor: ProjectedGradientDescent(estimator=classifier, eps=0.03, eps_step=0.007, max_iter=40, batch_size=8).