## The Quick Version
Visual anomaly detection finds things that look “wrong” in images — manufacturing defects, damaged goods, unusual patterns. The key insight: you don’t need labeled examples of every possible defect. You train on “normal” images only, and the model flags anything that doesn’t match the learned normal distribution.
```shell
pip install torch torchvision anomalib
```
```python
from anomalib.data import MVTec
from anomalib.models import Patchcore
from anomalib.engine import Engine

# Download MVTec dataset (standard anomaly detection benchmark)
datamodule = MVTec(
    root="./datasets",
    category="bottle",  # bottles, cables, capsules, etc.
    image_size=(256, 256),
    train_batch_size=32,
)

# PatchCore: state-of-the-art, no gradient training needed (memory bank approach)
model = Patchcore()
engine = Engine()
engine.fit(model=model, datamodule=datamodule)

# Test on anomalous images
results = engine.test(model=model, datamodule=datamodule)
print(f"Image-level AUROC: {results[0]['image_AUROC']:.3f}")
print(f"Pixel-level AUROC: {results[0]['pixel_AUROC']:.3f}")
```
PatchCore typically hits 99%+ AUROC on MVTec without any training — it builds a memory bank of normal patch features during the fit phase and flags patches that differ at test time. That makes it ideal for manufacturing inspection where you have plenty of good samples but few defect examples.
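The memory-bank idea itself is simple enough to sketch without anomalib: store feature vectors from normal samples, then score a query by its distance to the nearest stored feature. A minimal numpy sketch — random vectors stand in for real CNN patch features, and this omits the mid-level feature extraction and coreset subsampling PatchCore actually uses:

```python
import numpy as np

rng = np.random.default_rng(0)

# Memory bank: feature vectors collected from normal patches
# (random vectors stand in for real CNN features here)
memory_bank = rng.normal(0, 1, size=(500, 64))

def knn_anomaly_score(feature: np.ndarray, bank: np.ndarray) -> float:
    """Distance to the nearest normal feature = anomaly score."""
    distances = np.linalg.norm(bank - feature, axis=1)
    return float(distances.min())

# A feature near the bank scores low; one far from it scores high
normal_like = memory_bank[0] + rng.normal(0, 0.01, size=64)
anomalous = rng.normal(10, 1, size=64)  # far from the normal cluster

print(knn_anomaly_score(normal_like, memory_bank))  # small
print(knn_anomaly_score(anomalous, memory_bank))    # large
```

The "fit" phase is just filling the bank; scoring is a nearest-neighbor lookup, which is why no gradient training is required.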
## Building an Autoencoder for Anomaly Detection
For custom datasets where you want more control, an autoencoder learns to reconstruct normal images. Anomalous images reconstruct poorly, and the reconstruction error becomes your anomaly score.
```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import transforms, datasets

class AnomalyAutoencoder(nn.Module):
    def __init__(self, latent_dim: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1),    # 256 -> 128
            nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1),   # 128 -> 64
            nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1),  # 64 -> 32
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(128 * 32 * 32, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128 * 32 * 32),
            nn.Unflatten(1, (128, 32, 32)),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1),  # 32 -> 64
            nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1),   # 64 -> 128
            nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),    # 128 -> 256
            nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z)

# Train on NORMAL images only
transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.ToTensor(),
])
normal_dataset = datasets.ImageFolder("data/train/good/", transform=transform)
train_loader = DataLoader(normal_dataset, batch_size=32, shuffle=True)

model = AnomalyAutoencoder().cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

for epoch in range(50):
    total_loss = 0
    for images, _ in train_loader:
        images = images.cuda()
        reconstructed = model(images)
        loss = criterion(reconstructed, images)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"Epoch {epoch+1}: Loss = {total_loss / len(train_loader):.6f}")
```
## Scoring New Images
```python
from PIL import Image

def anomaly_score(model, image_tensor: torch.Tensor) -> tuple[float, torch.Tensor]:
    """Compute anomaly score and pixel-level error map."""
    model.eval()
    with torch.no_grad():
        image = image_tensor.unsqueeze(0).cuda()
        reconstructed = model(image)
        error_map = (image - reconstructed).pow(2).mean(dim=1)  # per-pixel MSE
        score = error_map.mean().item()
    return score, error_map.squeeze().cpu()

# Score a test image (convert to RGB in case the PNG has an alpha channel)
test_image = transform(Image.open("test_sample.png").convert("RGB"))
score, error_map = anomaly_score(model, test_image)
print(f"Anomaly score: {score:.6f}")

# Set threshold from a validation set of normal images
threshold = 0.005  # tune this on your data
if score > threshold:
    print("ANOMALY DETECTED")
```
The error map shows you exactly where the anomaly is — pixels with high reconstruction error are likely defective regions.
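To turn the error map into a binary defect mask, threshold it — a fixed percentile is one reasonable heuristic. A small numpy sketch with a synthetic error map (the percentile value is an assumption to tune on your data):

```python
import numpy as np

def defect_mask(error_map: np.ndarray, percentile: float = 99.0) -> np.ndarray:
    """Binary mask of pixels whose reconstruction error exceeds the given percentile."""
    threshold = np.percentile(error_map, percentile)
    return error_map > threshold

# Synthetic 64x64 error map with a small high-error "defect" patch
error_map = np.full((64, 64), 0.001)
error_map[10:14, 20:24] = 0.05  # 16 defective pixels

mask = defect_mask(error_map)
print(mask.sum(), "pixels flagged")  # 16: only the defect patch exceeds the threshold
```

The mask can then be overlaid on the input image to highlight the defective region for an operator.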
## Zero-Shot Anomaly Detection with CLIP
CLIP can detect anomalies without any training on your specific domain. You compare images against text descriptions of “normal” and “defective” to get an anomaly score.
```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def clip_anomaly_score(image_path: str, normal_desc: str, anomaly_desc: str) -> dict:
    """Score how anomalous an image is using CLIP text-image similarity."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    text = clip.tokenize([normal_desc, anomaly_desc]).to(device)
    with torch.no_grad():
        image_features = model.encode_image(image)
        text_features = model.encode_text(text)
        # Cosine similarity
        image_features /= image_features.norm(dim=-1, keepdim=True)
        text_features /= text_features.norm(dim=-1, keepdim=True)
        similarities = (image_features @ text_features.T).squeeze()
    normal_sim = similarities[0].item()
    anomaly_sim = similarities[1].item()
    return {
        "normal_score": normal_sim,
        "anomaly_score": anomaly_sim,
        "is_anomaly": anomaly_sim > normal_sim,
        "confidence": abs(anomaly_sim - normal_sim),
    }

# Example: manufacturing inspection
result = clip_anomaly_score(
    "test_bottle.png",
    normal_desc="a photo of a normal bottle without defects",
    anomaly_desc="a photo of a bottle with cracks, dents, or damage",
)
print(f"Anomaly: {result['is_anomaly']} (confidence: {result['confidence']:.3f})")
```
CLIP-based detection works surprisingly well for obvious defects (cracks, missing parts, wrong colors). It struggles with subtle anomalies that require understanding fine-grained texture differences.
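CLIP's decision is also sensitive to prompt wording, so averaging similarities over several phrasings of "normal" and "defective" usually stabilizes it. A sketch of just the ensembling step, assuming you have already computed one similarity per prompt (the prompt wordings and values below are illustrative):

```python
def ensemble_decision(normal_sims: list[float], anomaly_sims: list[float]) -> dict:
    """Average per-prompt similarities before comparing normal vs. anomalous."""
    normal_score = sum(normal_sims) / len(normal_sims)
    anomaly_score = sum(anomaly_sims) / len(anomaly_sims)
    return {
        "normal_score": normal_score,
        "anomaly_score": anomaly_score,
        "is_anomaly": anomaly_score > normal_score,
        "confidence": abs(anomaly_score - normal_score),
    }

# Similarities from prompts like "a photo of a flawless bottle",
# "an intact bottle" vs. "a cracked bottle", "a dented bottle"
result = ensemble_decision([0.27, 0.25, 0.26], [0.22, 0.24, 0.21])
print(result["is_anomaly"])  # False: the normal prompts win on average
```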
## Setting Thresholds in Production
The hardest part of anomaly detection isn’t the model — it’s choosing the right threshold. Too sensitive and you get false alarms. Too lenient and you miss defects.
```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def find_optimal_threshold(normal_scores: list[float], anomaly_scores: list[float]) -> float:
    """Find the threshold that maximizes F1 score."""
    scores = np.array(normal_scores + anomaly_scores)
    labels = np.array([0] * len(normal_scores) + [1] * len(anomaly_scores))
    precisions, recalls, thresholds = precision_recall_curve(labels, scores)
    # precision_recall_curve returns one more precision/recall entry than
    # thresholds; drop the last so the best F1 index maps back to a threshold
    f1_scores = 2 * (precisions[:-1] * recalls[:-1]) / (precisions[:-1] + recalls[:-1] + 1e-10)
    best_idx = np.argmax(f1_scores)
    return float(thresholds[best_idx])

# Collect scores from your validation set
normal_scores = [anomaly_score(model, img)[0] for img in normal_validation_images]
anomaly_scores = [anomaly_score(model, img)[0] for img in anomaly_validation_images]

threshold = find_optimal_threshold(normal_scores, anomaly_scores)
print(f"Optimal threshold: {threshold:.6f}")
```
In practice, run the model on 100+ normal images to establish a baseline score distribution. Setting the threshold at the 99th percentile of those scores gives you roughly a 1% false positive rate on images that match the training distribution.
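The percentile rule is a one-liner with numpy — the scores below are synthetic stand-ins for the scores you would collect from your own validation set:

```python
import numpy as np

rng = np.random.default_rng(42)
# Stand-in for anomaly scores from 200 normal validation images
normal_scores = rng.normal(loc=0.002, scale=0.0005, size=200)

# Flag anything above the 99th percentile of normal scores (~1% FPR)
threshold = np.percentile(normal_scores, 99)
print(f"Threshold: {threshold:.6f}")
print(f"Normal images flagged: {(normal_scores > threshold).sum()} / 200")  # 2 / 200
```

Unlike `find_optimal_threshold`, this needs no labeled defect examples at all, which is often the situation in early production.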
## Common Errors and Fixes
**High false positive rate on normal images**
Your training set doesn’t cover enough normal variation. Add more diverse normal examples — different lighting, angles, positions. The model should see every kind of “normal” during training.
**Model detects anomalies but can’t localize them**
PatchCore and autoencoder error maps give pixel-level localization out of the box. If using CLIP, add GradCAM visualization to highlight which image regions drive the anomaly classification.
**Anomalib fit is slow**
PatchCore builds a memory bank from all training features. For large datasets, use `coreset_sampling_ratio=0.1` to subsample:
```python
model = Patchcore(coreset_sampling_ratio=0.1)
```
**All scores are very close together**
Your normal and anomalous images are too similar for the model’s resolution. Increase image size, use a deeper backbone (ResNet-50 instead of ResNet-18), or try a different approach like EfficientAD.
**Model works on benchmark but fails on real data**
Benchmark datasets have clean, centered images. Real production images have varying backgrounds, rotations, and scales. Add preprocessing to normalize your inputs — crop to the region of interest, standardize orientation, and match the lighting conditions used during training.
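A minimal preprocessing sketch with Pillow: crop to a fixed region of interest, then resize to the training resolution. The ROI box is a hypothetical value; in practice you would measure it from your camera setup or detect it per image:

```python
from PIL import Image

TRAIN_SIZE = (256, 256)
ROI_BOX = (100, 50, 500, 450)  # (left, top, right, bottom); hypothetical fixed ROI

def preprocess(image: Image.Image) -> Image.Image:
    """Crop to the region of interest and resize to match training inputs."""
    return image.crop(ROI_BOX).resize(TRAIN_SIZE)

# Synthetic stand-in for a raw 640x480 production frame
raw = Image.new("RGB", (640, 480), color=(120, 120, 120))
processed = preprocess(raw)
print(processed.size)  # (256, 256)
```

The key point is that the exact same preprocessing must run at training and inference time; any mismatch shifts the score distribution and invalidates your threshold.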