The Quick Version

Adversarial testing finds inputs that break your model — subtle changes to images that flip predictions, text rephrasing that confuses classifiers, or edge cases that expose blind spots. Building these test suites before deployment catches failures that standard test sets miss.

pip install foolbox torch torchvision numpy
import torch
import foolbox as fb
from torchvision.models import resnet50, ResNet50_Weights

# Perturbation budgets to test, smallest (least visible) first.
# Defined once so the attack call and the report loop can never drift apart.
EPSILONS = [0.01, 0.03, 0.1]

# Load a pretrained ImageNet classifier in inference mode.
model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V2).eval()
# Foolbox applies the normalization itself; axis=-3 marks the channel
# dimension of NCHW inputs.
preprocessing = dict(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225], axis=-3)

fmodel = fb.PyTorchModel(model, bounds=(0, 1), preprocessing=preprocessing)

# Create a test image (normally you'd load real images)
images = torch.rand(1, 3, 224, 224)  # batch of 1 image
labels = torch.tensor([963])          # original class prediction

# Run FGSM attack — adds tiny perturbation to flip the prediction.
# With a list of epsilons, foolbox returns one entry per epsilon in
# `clipped` (adversarial images) and `is_adv` (success flags).
attack = fb.attacks.FGSM()
raw, clipped, is_adv = attack(fmodel, images, labels, epsilons=EPSILONS)

for eps, adv, success in zip(EPSILONS, clipped, is_adv):
    if success.item():
        perturbation = (adv - images).abs().max().item()
        print(f"eps={eps}: Attack succeeded (max perturbation: {perturbation:.4f})")
    else:
        print(f"eps={eps}: Model survived")

Foolbox evaluates each perturbation budget (epsilon) you pass and reports which ones flip the model’s prediction. If your model flips its prediction with a perturbation of 0.01 (invisible to humans), it’s vulnerable.

Building a Text Adversarial Suite

For NLP models, adversarial testing means finding rephrasings, typos, and character swaps that change predictions without changing meaning.

import random
import string

class TextAdversarialSuite:
    """Generate adversarial text variants to test NLP model robustness.

    Each attack method queries ``model_fn`` on the original text and on
    perturbed variants, and reports whether the predicted label flipped.
    """

    def __init__(self, model_fn):
        """model_fn: function that takes text and returns a prediction dict.

        The dict must contain at least a "label" key; flip checks compare
        labels only, never scores.
        """
        self.model_fn = model_fn
        self.results = []  # reserved for callers that accumulate runs

    def typo_attack(self, text: str, n_variants: int = 10) -> list[dict]:
        """Introduce random typos and check if prediction changes.

        Each variant swaps one interior character of one randomly chosen
        word for a random lowercase letter. Words of length <= 2 are left
        untouched, so some variants may equal the original text.
        Returns an empty list for empty/whitespace-only input.
        """
        if not text.split():
            # Without this guard, random.randint(0, -1) below raises
            # ValueError when the text contains no words.
            return []

        original_pred = self.model_fn(text)
        variants = []

        for _ in range(n_variants):
            words = text.split()
            idx = random.randint(0, len(words) - 1)
            word = list(words[idx])
            if len(word) > 2:
                # Skip position 0: first-letter typos change the word too
                # drastically to count as a "subtle" perturbation.
                pos = random.randint(1, len(word) - 1)
                word[pos] = random.choice(string.ascii_lowercase)
            words[idx] = "".join(word)
            variant = " ".join(words)

            new_pred = self.model_fn(variant)
            variants.append({
                "original": text,
                "variant": variant,
                "original_prediction": original_pred,
                "new_prediction": new_pred,
                "flipped": original_pred["label"] != new_pred["label"],
            })

        return variants

    def synonym_attack(self, text: str, replacements: dict[str, list[str]]) -> list[dict]:
        """Replace words with synonyms and check if prediction changes.

        ``replacements`` maps a word to candidate synonyms. Matching is
        case-insensitive, but the substitution itself uses exact-case
        str.replace, so a word appearing only with different casing yields
        a variant identical to the original text.
        """
        original_pred = self.model_fn(text)
        variants = []

        for word, synonyms in replacements.items():
            if word.lower() in text.lower():
                for syn in synonyms:
                    variant = text.replace(word, syn)
                    new_pred = self.model_fn(variant)
                    variants.append({
                        "original": text,
                        "variant": variant,
                        # Readable "from -> to" label for reports.
                        "replacement": f"{word} -> {syn}",
                        "flipped": original_pred["label"] != new_pred["label"],
                    })

        return variants

    def case_attack(self, text: str) -> list[dict]:
        """Test sensitivity to capitalization changes."""
        original_pred = self.model_fn(text)
        variants = [
            ("lowercase", text.lower()),
            ("uppercase", text.upper()),
            ("title_case", text.title()),
            ("random_case", "".join(
                c.upper() if random.random() > 0.5 else c.lower() for c in text
            )),
        ]

        return [
            {
                "transform": name,
                "variant": var,
                "flipped": original_pred["label"] != self.model_fn(var)["label"],
            }
            for name, var in variants
        ]

# Usage with any model
def my_sentiment_model(text: str) -> dict:
    """Placeholder classifier — swap in your real model's inference call."""
    # Your actual model inference here
    return {"label": "positive", "score": 0.92}

suite = TextAdversarialSuite(my_sentiment_model)
results = suite.typo_attack("This product is absolutely wonderful and exceeded my expectations")

# Summing the booleans counts how many variants flipped the label.
flipped = sum(r["flipped"] for r in results)
print(f"Typo attack: {flipped}/{len(results)} predictions flipped")

Systematic Edge Case Generation

Beyond random perturbations, test specific categories of edge cases that commonly break models:

from dataclasses import dataclass

@dataclass
class TestCase:
    """One edge-case probe: an input plus the behavior we expect from the model."""

    # Grouping label used in reports (e.g. "empty_input", "unicode").
    category: str
    # The literal text fed to the model.
    input_text: str
    # Human-readable description of acceptable model behavior.
    expected_behavior: str

def generate_edge_cases(base_text: str) -> list[TestCase]:
    """Generate systematic edge cases from a base input.

    The cases are grouped into four families: length extremes, special
    characters, formatting variations, and semantic edge cases.
    """
    length_extremes = [
        TestCase("empty_input", "", "Should handle gracefully, not crash"),
        TestCase("single_char", "a", "Should return low confidence"),
        TestCase("very_long", base_text * 100, "Should truncate or handle, not OOM"),
    ]
    special_characters = [
        TestCase("unicode", "This is gr\u00e9at \u2764\ufe0f", "Should handle unicode"),
        TestCase("html_injection", "<script>alert('xss')</script> good product",
                 "Should not execute, classify normally"),
        TestCase("sql_injection", "'; DROP TABLE users; -- great item",
                 "Should classify normally"),
        TestCase("null_bytes", "Great product\x00hidden text", "Should handle null bytes"),
    ]
    formatting_variations = [
        TestCase("extra_spaces", "This   is    really     good", "Same as normalized"),
        TestCase("newlines", "This is\nreally\ngood", "Same as single line"),
        TestCase("tabs", "This\tis\treally\tgood", "Same as spaces"),
    ]
    semantic_edge_cases = [
        TestCase("negation", f"Not {base_text.lower()}", "Should detect negation"),
        TestCase("sarcasm", "Oh sure, this is just AMAZING /s", "Challenging — note in results"),
        TestCase("mixed_sentiment", "The food was great but the service was terrible",
                 "Should handle mixed signals"),
        TestCase("neutral", "The package arrived on Tuesday.", "Should be neutral/low confidence"),
    ]
    return length_extremes + special_characters + formatting_variations + semantic_edge_cases

def run_edge_case_suite(model_fn, base_text: str) -> dict:
    """Run all edge cases and report results.

    Returns a summary dict with "passed"/"failed"/"errors" counters and a
    "details" list (one entry per case). "failed" is currently always 0 —
    it is kept in the schema for consumers that expect the key.
    """
    summary = {"passed": 0, "failed": 0, "errors": 0, "details": []}

    for case in generate_edge_cases(base_text):
        try:
            pred = model_fn(case.input_text)
        except Exception as exc:
            # Any crash is itself a finding: record it rather than abort.
            summary["details"].append({
                "category": case.category,
                "error": str(exc),
                "status": "error",
            })
            summary["errors"] += 1
        else:
            summary["details"].append({
                "category": case.category,
                "input_preview": case.input_text[:50],
                "prediction": pred,
                "expected": case.expected_behavior,
                "status": "ok",
            })
            summary["passed"] += 1

    return summary

report = run_edge_case_suite(my_sentiment_model, "This product is wonderful")
print(f"Passed: {report['passed']}, Errors: {report['errors']}")
# Surface only the crashing cases — those are the actionable findings.
failures = [d for d in report["details"] if d["status"] == "error"]
for detail in failures:
    print(f"  FAIL [{detail['category']}]: {detail['error']}")

Image Robustness Testing

Test how your vision model handles real-world degradation — not just adversarial attacks, but practical issues like blur, noise, and compression.

import torch
import torchvision.transforms as T
from PIL import Image, ImageFilter
import numpy as np

def image_robustness_suite(model, image: Image.Image, true_label: int) -> dict:
    """Test a vision model against common image corruptions.

    Applies each corruption to `image`, runs the model on the standard
    256-resize / 224-center-crop pipeline, and records the predicted class,
    its softmax confidence, and whether it matches `true_label`.
    A corruption that raises is recorded as an {"error": ...} entry.
    """
    to_tensor = T.Compose([T.Resize(256), T.CenterCrop(224), T.ToTensor()])

    corruption_fns = {
        "original": lambda img: img,
        "gaussian_blur": lambda img: img.filter(ImageFilter.GaussianBlur(radius=3)),
        "jpeg_quality_10": lambda img: _jpeg_compress(img, quality=10),
        "brightness_high": lambda img: T.functional.adjust_brightness(img, 2.0),
        "brightness_low": lambda img: T.functional.adjust_brightness(img, 0.3),
        "rotation_15": lambda img: img.rotate(15),
        "horizontal_flip": lambda img: img.transpose(Image.FLIP_LEFT_RIGHT),
        "center_crop_50pct": lambda img: T.CenterCrop(min(img.size) // 2)(img),
        "gaussian_noise": lambda img: _add_noise(img, std=0.1),
    }

    model.eval()
    outcomes = {}

    for name, corrupt in corruption_fns.items():
        try:
            batch = to_tensor(corrupt(image)).unsqueeze(0)

            with torch.no_grad():
                logits = model(batch)

            top_class = logits.argmax(dim=1).item()
            top_prob = torch.softmax(logits, dim=1).max().item()

            outcomes[name] = {
                "prediction": top_class,
                "confidence": round(top_prob, 4),
                "correct": top_class == true_label,
            }
        except Exception as exc:
            outcomes[name] = {"error": str(exc)}

    return outcomes

def _jpeg_compress(img: Image.Image, quality: int) -> Image.Image:
    """Round-trip the image through an in-memory JPEG at the given quality."""
    from io import BytesIO

    stream = BytesIO()
    img.save(stream, format="JPEG", quality=quality)
    stream.seek(0)
    # Decode and force RGB so downstream transforms see a consistent mode.
    return Image.open(stream).convert("RGB")

def _add_noise(img: Image.Image, std: float) -> Image.Image:
    """Add zero-mean Gaussian noise (stddev `std` in [0, 1] units) to the image."""
    pixels = np.asarray(img, dtype=np.float32) / 255.0
    perturbed = pixels + np.random.normal(0, std, pixels.shape)
    # Clip back into valid range before rescaling to 8-bit.
    rescaled = np.clip(perturbed, 0, 1) * 255
    return Image.fromarray(rescaled.astype(np.uint8))

Generating a Test Report

Combine all tests into a single report that tracks model robustness over time:

import json
from datetime import datetime

def generate_robustness_report(
    model_name: str,
    text_results: dict,
    image_results: dict = None,
) -> dict:
    """Generate a structured robustness report."""
    report = {
        "model": model_name,
        "timestamp": datetime.now().isoformat(),
        "text_robustness": {
            "total_tests": text_results["passed"] + text_results["errors"],
            "passed": text_results["passed"],
            "errors": text_results["errors"],
            "error_rate": text_results["errors"] / max(
                text_results["passed"] + text_results["errors"], 1
            ),
            "details": text_results["details"],
        },
    }

    if image_results:
        correct = sum(1 for r in image_results.values() if r.get("correct"))
        total = len(image_results)
        report["image_robustness"] = {
            "total_corruptions": total,
            "survived": correct,
            "survival_rate": correct / max(total, 1),
            "details": image_results,
        }

    return report

# NOTE(review): `report` below is the edge-case suite result from earlier in
# the article, reused here both as the text_results argument and (rebound)
# as the final report — confirm that is intentional before copying.
out_path = f"robustness_report_{datetime.now().strftime('%Y%m%d')}.json"
report = generate_robustness_report("sentiment-v2", report)
with open(out_path, "w") as fh:
    json.dump(report, fh, indent=2)
survival = report.get('image_robustness', {}).get('survival_rate', 'N/A')
print(f"Report saved. Survival rate: {survival}")

Common Errors and Fixes

Adversarial attacks take forever on large models

Use attack budgets. Set steps=20 on iterative attacks instead of letting them run until convergence. For quick screening, FGSM (single-step) is 100x faster than PGD (multi-step) and catches the most obvious vulnerabilities.

Model crashes on edge case inputs instead of handling them

This is the most important finding. Wrap inference in try/except and add input validation. Any input that crashes your model is a security risk — attackers will find it.

False sense of security from passing adversarial tests

Passing known attacks doesn’t mean the model is robust — just that it resists those specific attacks. New attacks are published monthly. Treat adversarial testing as a continuous process, not a one-time gate.

Generated adversarial examples look obviously wrong

If your adversarial perturbation is visible to humans, decrease epsilon. The point is to find inputs that look normal to humans but fool the model. Set epsilon below the perceptual threshold (0.01-0.03 for normalized images).

Testing takes too long for CI/CD

Create a “smoke test” subset: 10-20 critical edge cases that run in under 30 seconds. Save the full suite (1000+ cases) for nightly or weekly runs. Tag test results with model version so you can track regressions.