## The Quick Version
Visual anomaly detection finds things that look “wrong” in images — manufacturing defects, damaged goods, unusual patterns. The key insight: you don’t need labeled examples of every possible defect. You train on “normal” images only, and the model flags anything that doesn’t match the learned normal distribution.
```shell
pip install torch torchvision anomalib
```
```python
from anomalib.data import MVTec
from anomalib.models import Patchcore
from anomalib.engine import Engine

# Download MVTec dataset (standard anomaly detection benchmark)
datamodule = MVTec(
    root="./datasets",
    category="bottle",  # bottles, cables, capsules, etc.
    image_size=(256, 256),
    train_batch_size=32,
)

# PatchCore: state-of-the-art, no gradient training needed (memory bank approach)
model = Patchcore()
engine = Engine()
engine.fit(model=model, datamodule=datamodule)

# Test on anomalous images
results = engine.test(model=model, datamodule=datamodule)
print(f"Image-level AUROC: {results[0]['image_AUROC']:.3f}")
print(f"Pixel-level AUROC: {results[0]['pixel_AUROC']:.3f}")
```
PatchCore typically hits 99%+ AUROC on MVTec without any training — it builds a memory bank of normal patch features during the fit phase and flags patches that differ at test time. That makes it ideal for manufacturing inspection where you have plenty of good samples but few defect examples.
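The memory-bank idea itself is simple enough to sketch without anomalib: store feature vectors from normal samples, then score a query by its distance to the nearest stored feature. A minimal numpy sketch — random vectors stand in for real CNN patch features, and this omits the mid-level feature extraction and coreset subsampling PatchCore actually uses:

```python
import numpy as np

rng = np.random.default_rng(0)

# Memory bank: feature vectors collected from normal patches
# (random vectors stand in for real CNN features here)
memory_bank = rng.normal(0, 1, size=(500, 64))

def knn_anomaly_score(feature: np.ndarray, bank: np.ndarray) -> float:
    """Distance to the nearest normal feature = anomaly score."""
    distances = np.linalg.norm(bank - feature, axis=1)
    return float(distances.min())

# A feature near the bank scores low; one far from it scores high
normal_like = memory_bank[0] + rng.normal(0, 0.01, size=64)
anomalous = rng.normal(10, 1, size=64)  # far from the normal cluster

print(knn_anomaly_score(normal_like, memory_bank))  # small
print(knn_anomaly_score(anomalous, memory_bank))    # large
```

The "fit" phase is just filling the bank; scoring is a nearest-neighbor lookup, which is why no gradient training is required.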
## Building an Autoencoder for Anomaly Detection
For custom datasets where you want more control, an autoencoder learns to reconstruct normal images. Anomalous images reconstruct poorly, and the reconstruction error becomes your anomaly score.
```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import transforms, datasets

class AnomalyAutoencoder(nn.Module):
    def __init__(self, latent_dim: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1),    # 256 -> 128
            nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1),   # 128 -> 64
            nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1),  # 64 -> 32
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(128 * 32 * 32, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 128 * 32 * 32),
            nn.Unflatten(1, (128, 32, 32)),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1),  # 32 -> 64
            nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1),   # 64 -> 128
            nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),    # 128 -> 256
            nn.Sigmoid(),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z)

# Train on NORMAL images only
transform = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.ToTensor(),
])
normal_dataset = datasets.ImageFolder("data/train/good/", transform=transform)
train_loader = DataLoader(normal_dataset, batch_size=32, shuffle=True)

model = AnomalyAutoencoder().cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

for epoch in range(50):
    total_loss = 0
    for images, _ in train_loader:
        images = images.cuda()
        reconstructed = model(images)
        loss = criterion(reconstructed, images)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"Epoch {epoch+1}: Loss = {total_loss / len(train_loader):.6f}")
```
## Scoring New Images
```python
from PIL import Image

def anomaly_score(model, image_tensor: torch.Tensor) -> tuple[float, torch.Tensor]:
    """Compute anomaly score and pixel-level error map."""
    model.eval()
    with torch.no_grad():
        image = image_tensor.unsqueeze(0).cuda()
        reconstructed = model(image)
        error_map = (image - reconstructed).pow(2).mean(dim=1)  # per-pixel MSE
        score = error_map.mean().item()
    return score, error_map.squeeze().cpu()

# Score a test image (convert to RGB in case the PNG has an alpha channel)
test_image = transform(Image.open("test_sample.png").convert("RGB"))
score, error_map = anomaly_score(model, test_image)
print(f"Anomaly score: {score:.6f}")

# Set threshold from a validation set of normal images
threshold = 0.005  # tune this on your data
if score > threshold:
    print("ANOMALY DETECTED")
```
The error map shows you exactly where the anomaly is — pixels with high reconstruction error are likely defective regions.
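To turn the error map into a binary defect mask, threshold it — a fixed percentile is one reasonable heuristic. A small numpy sketch with a synthetic error map (the percentile value is an assumption to tune on your data):

```python
import numpy as np

def defect_mask(error_map: np.ndarray, percentile: float = 99.0) -> np.ndarray:
    """Binary mask of pixels whose reconstruction error exceeds the given percentile."""
    threshold = np.percentile(error_map, percentile)
    return error_map > threshold

# Synthetic 64x64 error map with a small high-error "defect" patch
error_map = np.full((64, 64), 0.001)
error_map[10:14, 20:24] = 0.05  # 16 defective pixels

mask = defect_mask(error_map)
print(mask.sum(), "pixels flagged")  # 16: only the defect patch exceeds the threshold
```

The mask can then be overlaid on the input image to highlight the defective region for an operator.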
## Zero-Shot Anomaly Detection with CLIP
CLIP can detect anomalies without any training on your specific domain. You compare images against text descriptions of “normal” and “defective” to get an anomaly score.
```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def clip_anomaly_score(image_path: str, normal_desc: str, anomaly_desc: str) -> dict:
    """Score how anomalous an image is using CLIP text-image similarity."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    text = clip.tokenize([normal_desc, anomaly_desc]).to(device)
    with torch.no_grad():
        image_features = model.encode_image(image)
        text_features = model.encode_text(text)
        # Cosine similarity
        image_features /= image_features.norm(dim=-1, keepdim=True)
        text_features /= text_features.norm(dim=-1, keepdim=True)
        similarities = (image_features @ text_features.T).squeeze()
    normal_sim = similarities[0].item()
    anomaly_sim = similarities[1].item()
    return {
        "normal_score": normal_sim,
        "anomaly_score": anomaly_sim,
        "is_anomaly": anomaly_sim > normal_sim,
        "confidence": abs(anomaly_sim - normal_sim),
    }

# Example: manufacturing inspection
result = clip_anomaly_score(
    "test_bottle.png",
    normal_desc="a photo of a normal bottle without defects",
    anomaly_desc="a photo of a bottle with cracks, dents, or damage",
)
print(f"Anomaly: {result['is_anomaly']} (confidence: {result['confidence']:.3f})")
```
CLIP-based detection works surprisingly well for obvious defects (cracks, missing parts, wrong colors). It struggles with subtle anomalies that require understanding fine-grained texture differences.
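CLIP's decision is also sensitive to prompt wording, so averaging similarities over several phrasings of "normal" and "defective" usually stabilizes it. A sketch of just the ensembling step, assuming you have already computed one similarity per prompt (the prompt wordings and values below are illustrative):

```python
def ensemble_decision(normal_sims: list[float], anomaly_sims: list[float]) -> dict:
    """Average per-prompt similarities before comparing normal vs. anomalous."""
    normal_score = sum(normal_sims) / len(normal_sims)
    anomaly_score = sum(anomaly_sims) / len(anomaly_sims)
    return {
        "normal_score": normal_score,
        "anomaly_score": anomaly_score,
        "is_anomaly": anomaly_score > normal_score,
        "confidence": abs(anomaly_score - normal_score),
    }

# Similarities from prompts like "a photo of a flawless bottle",
# "an intact bottle" vs. "a cracked bottle", "a dented bottle"
result = ensemble_decision([0.27, 0.25, 0.26], [0.22, 0.24, 0.21])
print(result["is_anomaly"])  # False: the normal prompts win on average
```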
## Setting Thresholds in Production
The hardest part of anomaly detection isn’t the model — it’s choosing the right threshold. Too sensitive and you get false alarms. Too lenient and you miss defects.
```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def find_optimal_threshold(normal_scores: list[float], anomaly_scores: list[float]) -> float:
    """Find the threshold that maximizes F1 score."""
    scores = np.array(normal_scores + anomaly_scores)
    labels = np.array([0] * len(normal_scores) + [1] * len(anomaly_scores))
    precisions, recalls, thresholds = precision_recall_curve(labels, scores)
    # precision_recall_curve returns one more precision/recall entry than
    # thresholds; drop the last so the best F1 index maps back to a threshold
    f1_scores = 2 * (precisions[:-1] * recalls[:-1]) / (precisions[:-1] + recalls[:-1] + 1e-10)
    best_idx = np.argmax(f1_scores)
    return float(thresholds[best_idx])

# Collect scores from your validation set
normal_scores = [anomaly_score(model, img)[0] for img in normal_validation_images]
anomaly_scores = [anomaly_score(model, img)[0] for img in anomaly_validation_images]

threshold = find_optimal_threshold(normal_scores, anomaly_scores)
print(f"Optimal threshold: {threshold:.6f}")
```
In practice, run the model on 100+ normal images to establish a baseline score distribution. Setting the threshold at the 99th percentile of those scores gives you roughly a 1% false positive rate on images that match the training distribution.
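The percentile rule is a one-liner with numpy — the scores below are synthetic stand-ins for the scores you would collect from your own validation set:

```python
import numpy as np

rng = np.random.default_rng(42)
# Stand-in for anomaly scores from 200 normal validation images
normal_scores = rng.normal(loc=0.002, scale=0.0005, size=200)

# Flag anything above the 99th percentile of normal scores (~1% FPR)
threshold = np.percentile(normal_scores, 99)
print(f"Threshold: {threshold:.6f}")
print(f"Normal images flagged: {(normal_scores > threshold).sum()} / 200")  # 2 / 200
```

Unlike `find_optimal_threshold`, this needs no labeled defect examples at all, which is often the situation in early production.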
## Common Errors and Fixes
**High false positive rate on normal images**
Your training set doesn’t cover enough normal variation. Add more diverse normal examples — different lighting, angles, positions. The model should see every kind of “normal” during training.
**Model detects anomalies but can’t localize them**
PatchCore and autoencoder error maps give pixel-level localization out of the box. If using CLIP, add GradCAM visualization to highlight which image regions drive the anomaly classification.
**Anomalib fit is slow**
PatchCore builds a memory bank from all training features. For large datasets, use `coreset_sampling_ratio=0.1` to subsample:
```python
model = Patchcore(coreset_sampling_ratio=0.1)
```
**All scores are very close together**
Your normal and anomalous images are too similar for the model’s resolution. Increase image size, use a deeper backbone (ResNet-50 instead of ResNet-18), or try a different approach like EfficientAD.
**Model works on benchmark but fails on real data**
Benchmark datasets have clean, centered images. Real production images have varying backgrounds, rotations, and scales. Add preprocessing to normalize your inputs — crop to the region of interest, standardize orientation, and match the lighting conditions used during training.
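A minimal preprocessing sketch with Pillow: crop to a fixed region of interest, then resize to the training resolution. The ROI box is a hypothetical value; in practice you would measure it from your camera setup or detect it per image:

```python
from PIL import Image

TRAIN_SIZE = (256, 256)
ROI_BOX = (100, 50, 500, 450)  # (left, top, right, bottom); hypothetical fixed ROI

def preprocess(image: Image.Image) -> Image.Image:
    """Crop to the region of interest and resize to match training inputs."""
    return image.crop(ROI_BOX).resize(TRAIN_SIZE)

# Synthetic stand-in for a raw 640x480 production frame
raw = Image.new("RGB", (640, 480), color=(120, 120, 120))
processed = preprocess(raw)
print(processed.size)  # (256, 256)
```

The key point is that the exact same preprocessing must run at training and inference time; any mismatch shifts the score distribution and invalidates your threshold.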