Model compression is how you take a 100MB model and ship a 15MB version that runs 3x faster on CPU. The two most effective techniques are pruning (removing unnecessary weights) and quantization (reducing weight precision). They work even better together.

This guide walks through building a compression pipeline on ResNet-50 using PyTorch’s built-in pruning and quantization APIs. You’ll apply pruning first, then layer on post-training quantization, and measure exactly how much you save.

Setting Up the Baseline

Start with a pretrained ResNet-50 and measure its original size and inference speed. You need these numbers to know if compression actually helped.

import torch
import torchvision.models as models
import os
import time

# Load pretrained ResNet-50
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
model.eval()

# Save and measure original size
torch.save(model.state_dict(), "resnet50_original.pth")
original_size = os.path.getsize("resnet50_original.pth") / (1024 * 1024)
print(f"Original model size: {original_size:.1f} MB")

# Benchmark inference
dummy_input = torch.randn(1, 3, 224, 224)

def benchmark(model, input_tensor, runs=100):
    # Warmup
    for _ in range(10):
        with torch.no_grad():
            model(input_tensor)
    # Timed runs
    start = time.perf_counter()
    for _ in range(runs):
        with torch.no_grad():
            model(input_tensor)
    elapsed = (time.perf_counter() - start) / runs * 1000
    return elapsed

original_latency = benchmark(model, dummy_input)
print(f"Original inference: {original_latency:.1f} ms per batch")

The saved file comes out at roughly 97.8 MB regardless of hardware; on a typical CPU, inference runs around 80-120 ms per image. These are your targets to beat.
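As a sanity check on the size number, fp32 footprint is roughly parameter count × 4 bytes: ResNet-50’s ~25.6M parameters give ~102 MB, i.e. ~97.5 MiB (os.path.getsize divided by 1024² reports MiB). A minimal sketch of the arithmetic on a small stand-in model:

```python
import torch
import torch.nn as nn

# Rule of thumb: fp32 size on disk ~= parameter count x 4 bytes.
# ResNet-50: ~25.6M params x 4 B ~= 102 MB ~= 97.5 MiB.
model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))

n_params = sum(p.numel() for p in model.parameters())
est_mib = n_params * 4 / (1024 * 1024)
print(f"{n_params} params -> ~{est_mib:.2f} MiB as fp32")
```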

Unstructured Pruning with Magnitude

Unstructured pruning zeros out individual weights based on their magnitude. Small weights contribute little to the output, so removing them barely affects accuracy.

PyTorch’s torch.nn.utils.prune handles this cleanly:

import torch.nn.utils.prune as prune

def apply_unstructured_pruning(model, amount=0.3):
    """Prune a fraction of weights by magnitude in all Conv2d and Linear layers."""
    for name, module in model.named_modules():
        if isinstance(module, (torch.nn.Conv2d, torch.nn.Linear)):
            prune.l1_unstructured(module, name="weight", amount=amount)
    return model

# Apply to a fresh copy
pruned_model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
pruned_model.eval()
pruned_model = apply_unstructured_pruning(pruned_model, amount=0.3)

The amount=0.3 parameter zeros 30% of the weights in each layer. With some fine-tuning to recover accuracy, you can often push this to 50-60% on ResNet-50 before accuracy drops noticeably.
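To see what the pruning call actually does, here’s a minimal sketch on a toy Linear layer (sizes are illustrative, not from the guide): the module gains a weight_orig parameter and a weight_mask buffer, and weight becomes their product.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(8, 4)  # 32 weights
prune.l1_unstructured(layer, name="weight", amount=0.5)

# The trainable parameter is now weight_orig; weight_mask is a buffer of
# 0s and 1s; layer.weight is recomputed as weight_orig * weight_mask.
print(sorted(n for n, _ in layer.named_parameters()))
print(int((layer.weight == 0).sum()), "of", layer.weight.nelement(), "weights zeroed")
```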

Measuring Sparsity

After pruning, verify the sparsity actually matches what you asked for:

def measure_sparsity(model):
    total_params = 0
    zero_params = 0
    for name, module in model.named_modules():
        if isinstance(module, (torch.nn.Conv2d, torch.nn.Linear)):
            weight = module.weight
            total_params += weight.nelement()
            zero_params += (weight == 0).sum().item()
    sparsity = 100.0 * zero_params / total_params
    return sparsity

sparsity = measure_sparsity(pruned_model)
print(f"Model sparsity: {sparsity:.1f}%")  # Should be ~30%

A word of caution: unstructured sparsity doesn’t automatically shrink the saved model file. The zeros still take up space in a dense tensor. You need sparse storage formats or quantization on top to realize actual size savings.
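To make that concrete, here’s a small sketch (independent of the model above) showing that a mostly-zero dense tensor serializes at full size, while converting it to a sparse COO tensor with to_sparse() stores only the nonzero values plus their indices:

```python
import io
import torch

dense = torch.zeros(1000, 1000)
dense[:100, :100] = torch.randn(100, 100)  # 99% zeros

def saved_bytes(t):
    buf = io.BytesIO()
    torch.save(t, buf)
    return buf.getbuffer().nbytes

dense_bytes = saved_bytes(dense)               # ~4 MB: zeros stored explicitly
sparse_bytes = saved_bytes(dense.to_sparse())  # nonzero values + indices only
print(f"dense: {dense_bytes/1e6:.2f} MB, sparse: {sparse_bytes/1e6:.2f} MB")
```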

Structured Pruning for Real Speedups

Structured pruning removes entire filters or channels instead of individual weights. Note that PyTorch’s prune.ln_structured only zeros the selected channels; tensor shapes stay the same until you rebuild the affected layers (for example with a dedicated library such as torch-pruning). Once the layers are physically shrunk, you get real memory and speed improvements without needing sparse hardware.

def apply_structured_pruning(model, amount=0.2):
    """Zero a fraction of output channels in Conv2d layers by L2 norm."""
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Conv2d):
            prune.ln_structured(module, name="weight", amount=amount, n=2, dim=0)
    return model

structured_model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
structured_model.eval()
structured_model = apply_structured_pruning(structured_model, amount=0.2)

The dim=0 argument targets output channels. Setting n=2 uses the L2 norm to decide which channels matter least. After structured pruning, you should make the pruning permanent before moving on:

def make_pruning_permanent(model):
    """Remove pruning reparameterization, baking masks into weights."""
    for name, module in model.named_modules():
        if isinstance(module, (torch.nn.Conv2d, torch.nn.Linear)):
            try:
                prune.remove(module, "weight")
            except ValueError:
                pass
    return model

structured_model = make_pruning_permanent(structured_model)

Calling prune.remove() collapses the mask and the original weight into a single tensor. Without this step, the module still carries the weight_orig parameter, the weight_mask buffer, and a forward pre-hook, and downstream steps like quantization and state-dict serialization can trip over the reparameterization.
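As a minimal sketch (toy Conv2d, not part of the pipeline) of what ln_structured and prune.remove do to a module:

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

conv = nn.Conv2d(3, 10, kernel_size=3)
prune.ln_structured(conv, name="weight", amount=0.2, n=2, dim=0)

# 20% of 10 output channels -> the 2 channels with the smallest L2 norm
# are zeroed in their entirety; the tensor shape itself is unchanged.
zeroed = [i for i in range(10) if torch.all(conv.weight[i] == 0)]
print("zeroed channels:", zeroed, "shape:", tuple(conv.weight.shape))
print("before remove:", sorted(conv.state_dict().keys()))

prune.remove(conv, "weight")  # bake weight_orig * weight_mask into weight
print("after remove: ", sorted(conv.state_dict().keys()))
```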

Dynamic Quantization

Dynamic quantization converts weights to INT8 ahead of time, when you call quantize_dynamic, and quantizes activations on the fly during inference. It’s the simplest form of quantization and works well for Linear-heavy models.

from torch.ao.quantization import quantize_dynamic

dynamic_quant_model = quantize_dynamic(
    structured_model,
    {torch.nn.Linear},
    dtype=torch.qint8,
)

torch.save(dynamic_quant_model.state_dict(), "resnet50_dynamic_quant.pth")
dq_size = os.path.getsize("resnet50_dynamic_quant.pth") / (1024 * 1024)
print(f"Dynamic quantized size: {dq_size:.1f} MB")

Dynamic quantization works best on models dominated by fully connected layers (like transformers). For CNNs like ResNet, static quantization gives better results because it also handles Conv2d.
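The effect is easy to see on a Linear-heavy toy model (sizes here are made up for illustration): quantize_dynamic swaps each nn.Linear for a dynamically quantized counterpart with packed INT8 weights, cutting its stored size roughly 4x while inputs and outputs stay float32.

```python
import io
import torch
import torch.nn as nn
from torch.ao.quantization import quantize_dynamic

fp32 = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10))
int8 = quantize_dynamic(fp32, {nn.Linear}, dtype=torch.qint8)

def size_mb(m):
    buf = io.BytesIO()
    torch.save(m.state_dict(), buf)
    return buf.getbuffer().nbytes / 1e6

fp32_mb, int8_mb = size_mb(fp32), size_mb(int8)
print(f"fp32: {fp32_mb:.2f} MB, int8: {int8_mb:.2f} MB")

# Outputs remain float32; only the weights (and transient activation
# quantization inside the op) are INT8
out = int8(torch.randn(2, 1024))
print(out.shape, out.dtype)
```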

Static Quantization with Calibration

Static quantization observes real data flowing through the network to determine the optimal scale and zero-point for each layer. This handles both weights and activations, giving the best compression for CNNs.
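The scale and zero-point math itself is plain affine quantization. A hand-rolled sketch (a min/max observer over the quint8 range, purely illustrative of what the real observers compute):

```python
import torch

x = torch.randn(100) * 3  # stand-in for observed activations
qmin, qmax = 0, 255       # quint8 range

# An observer records min/max; scale and zero_point follow from them
scale = (x.max() - x.min()).item() / (qmax - qmin)
zero_point = qmin - round(x.min().item() / scale)

q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)
x_hat = (q - zero_point) * scale  # dequantize
print(f"max abs error: {(x - x_hat).abs().max().item():.4f} (scale={scale:.4f})")
```

The reconstruction error is bounded by roughly half the scale, which is why tight calibration ranges matter: a smaller observed min/max spread means a smaller scale and less rounding error.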

from torch.ao.quantization import (
    get_default_qconfig, prepare, convert, fuse_modules, QuantWrapper
)

# Start fresh with a pruned model
static_model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
static_model.eval()
static_model = apply_structured_pruning(static_model, amount=0.2)
static_model = make_pruning_permanent(static_model)

# Fuse common patterns for better quantization (Conv+BN+ReLU)
# Fuse the first block manually as an example
fused_model = fuse_modules(
    static_model,
    [["conv1", "bn1", "relu"]],
    inplace=False,
)

# Eager-mode static quantization needs quant/dequant stubs at the model
# boundary; QuantWrapper wraps the whole network with them
wrapped_model = QuantWrapper(fused_model)

# Set quantization config on the wrapper so it propagates to all children
wrapped_model.qconfig = get_default_qconfig("x86")

# Prepare for calibration (inserts observers)
prepared_model = prepare(wrapped_model)

# Calibration: run representative data through the model
calibration_data = torch.randn(32, 3, 224, 224)
with torch.no_grad():
    for i in range(0, 32, 8):
        prepared_model(calibration_data[i:i+8])

# Convert to quantized model
quantized_model = convert(prepared_model)

torch.save(quantized_model.state_dict(), "resnet50_static_quant.pth")
sq_size = os.path.getsize("resnet50_static_quant.pth") / (1024 * 1024)
print(f"Static quantized size: {sq_size:.1f} MB")

In production, replace the random calibration data with a sample of your actual dataset; around 100-500 representative samples is enough for stable calibration. Random data works for demonstration but gives suboptimal quantization ranges. Note also that eager-mode quantization of a stock torchvision ResNet has rough edges (the residual additions need FloatFunctional, and every Conv+BN pair should be fused); torchvision ships a quantization-ready variant, torchvision.models.quantization.resnet50, that handles this for you.

The Full Compression Pipeline

Here’s the complete pipeline that chains pruning and quantization together:

import torch
import torch.nn.utils.prune as prune
import torchvision.models as models
from torch.ao.quantization import quantize_dynamic
import os

def compression_pipeline(prune_amount=0.3, quantize_layers=None):
    if quantize_layers is None:
        quantize_layers = {torch.nn.Linear}

    # Step 1: Load pretrained model
    model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
    model.eval()

    # Step 2: Apply unstructured pruning
    for name, module in model.named_modules():
        if isinstance(module, (torch.nn.Conv2d, torch.nn.Linear)):
            prune.l1_unstructured(module, name="weight", amount=prune_amount)

    # Step 3: Make pruning permanent
    for name, module in model.named_modules():
        if isinstance(module, (torch.nn.Conv2d, torch.nn.Linear)):
            try:
                prune.remove(module, "weight")
            except ValueError:
                pass

    # Step 4: Apply dynamic quantization
    compressed = quantize_dynamic(model, quantize_layers, dtype=torch.qint8)

    return compressed

compressed_model = compression_pipeline(prune_amount=0.4)

# Save and measure
torch.save(compressed_model.state_dict(), "resnet50_compressed.pth")
compressed_size = os.path.getsize("resnet50_compressed.pth") / (1024 * 1024)
print(f"Compressed model size: {compressed_size:.1f} MB")

# Benchmark
compressed_latency = benchmark(compressed_model, torch.randn(1, 3, 224, 224))
print(f"Compressed inference: {compressed_latency:.1f} ms per batch")

With 40% pruning plus dynamic INT8 quantization, expect the Linear layers to shrink to roughly one quarter of their original size. Keep in mind, though, that ResNet-50 has a single Linear layer (the 2048×1000 classifier, roughly 8 MB of the ~98 MB total), and the Conv2d layers remain float32 in dynamic mode, so overall savings from dynamic quantization alone are only a few percent. Static quantization pushes this much further since it covers all layer types.

Comparing Results

Here’s what you can expect from each technique on ResNet-50:

| Technique | Size (MB) | Size Reduction | CPU Latency Change |
| --- | --- | --- | --- |
| Original | ~97.8 | baseline | baseline |
| Unstructured Pruning 30% | ~97.8 | ~0% (dense storage) | ~0% |
| Structured Pruning 20% | ~97.8 | ~0% (masks only) | ~0% until layers are rebuilt |
| Dynamic Quantization | ~91-92 | ~6% (fc layer only) | ~0-5% faster |
| Static Quantization | ~24-26 | ~73-75% | ~2-3x faster |
| Pruning + Static Quant | ~23-25 | ~74-76% | ~2-3x faster |

The big win comes from static quantization. Pruning on its own doesn’t reduce file size unless you use sparse formats. But pruning plus quantization is the right play: pruning removes the least important weights first, so the remaining weights quantize with less accuracy loss.

Common Errors and Fixes

RuntimeError: Trying to backward through the graph a second time

This can show up when you fine-tune after pruning and something holds on to a weight tensor computed during an earlier forward pass. If you don’t need the masks to stay dynamic during fine-tuning, call make_pruning_permanent() (which runs prune.remove() on each module) before the fine-tuning step.

NotImplementedError: Could not run 'quantized::linear'

Quantized ops aren’t supported on GPU. Move your model to CPU before quantization:

# This error means you tried to run quantized inference on CUDA
# Fix: ensure model and input are on CPU
model = model.cpu()
input_tensor = input_tensor.cpu()

AttributeError: module has no attribute 'qconfig'

You need to set qconfig on the model before calling prepare(). This is easy to forget:

# Wrong: calling prepare without qconfig
prepared = prepare(model)  # Fails

# Right: set qconfig first
model.qconfig = get_default_qconfig("x86")
prepared = prepare(model)

KeyError when loading a quantized state dict

Quantized models have different parameter names than their float counterparts. You can’t load a quantized state dict into a non-quantized model. Always save and load using the same model architecture:

# Save the full model, not just state_dict, for quantized models
torch.save(quantized_model, "resnet50_quantized_full.pth")
loaded = torch.load("resnet50_quantized_full.pth", weights_only=False)

Fusion errors with fuse_modules

Module fusion requires that the listed modules be of fusable types (Conv+BN, Conv+BN+ReLU, Linear+ReLU, and so on) and run consecutively in the forward pass. Modules nested inside blocks need their full dotted path:

# For layers inside layer1[0], use the dotted path
fuse_modules(model, [["layer1.0.conv1", "layer1.0.bn1", "layer1.0.relu"]])
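To extend that beyond one block, you can generate the fusion list programmatically. A sketch for torchvision’s ResNet-50 layout (3/4/6/3 Bottleneck blocks; only Conv+BN pairs inside blocks, since each Bottleneck reuses a single ReLU module, and the downsample branches are skipped here for brevity):

```python
def resnet50_conv_bn_pairs():
    """Dotted-path fusion groups for torchvision's ResNet-50."""
    pairs = [["conv1", "bn1", "relu"]]  # the stem has its own ReLU
    blocks = {"layer1": 3, "layer2": 4, "layer3": 6, "layer4": 3}
    for layer, n in blocks.items():
        for b in range(n):
            p = f"{layer}.{b}"
            pairs += [[f"{p}.conv1", f"{p}.bn1"],
                      [f"{p}.conv2", f"{p}.bn2"],
                      [f"{p}.conv3", f"{p}.bn3"]]
    return pairs

pairs = resnet50_conv_bn_pairs()
print(len(pairs), "fusion groups, e.g.", pairs[1])
```

Pass the result straight to fuse_modules(model, pairs, inplace=False) on an eval-mode model.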