Quick GPU Benchmark with PyTorch

Before you spend hours tuning your training pipeline, you need to know what your GPU can actually do. Raw spec sheets tell part of the story. Real benchmarks under ML workloads tell the rest.

Here’s the fastest way to get a baseline. This script measures memory bandwidth and compute throughput in under 30 seconds:

import torch

def measure_memory_bandwidth(size_gb=1.0):
    """Measure GPU memory bandwidth with large tensor copies."""
    num_elements = int(size_gb * 1e9 / 4)  # FP32 = 4 bytes
    a = torch.randn(num_elements, device="cuda", dtype=torch.float32)
    b = torch.empty_like(a)

    # Warm up
    b.copy_(a)
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    iterations = 20
    start.record()
    for _ in range(iterations):
        b.copy_(a)
    end.record()
    torch.cuda.synchronize()

    elapsed_ms = start.elapsed_time(end)
    elapsed_s = elapsed_ms / 1000.0
    # Each copy reads + writes = 2x data movement
    total_bytes = 2 * a.nelement() * a.element_size() * iterations
    bandwidth_gbs = (total_bytes / elapsed_s) / 1e9

    print(f"Memory Bandwidth: {bandwidth_gbs:.1f} GB/s")
    print(f"  Tensor size: {size_gb:.1f} GB, Iterations: {iterations}")
    print(f"  Total time: {elapsed_ms:.1f} ms")
    return bandwidth_gbs

def measure_gemm_tflops(m=4096, n=4096, k=4096, dtype=torch.float32):
    """Measure matrix multiplication throughput in TFLOPS."""
    a = torch.randn(m, k, device="cuda", dtype=dtype)
    b = torch.randn(k, n, device="cuda", dtype=dtype)

    # Warm up
    torch.mm(a, b)
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)

    iterations = 50
    start.record()
    for _ in range(iterations):
        torch.mm(a, b)
    end.record()
    torch.cuda.synchronize()

    elapsed_ms = start.elapsed_time(end)
    elapsed_s = elapsed_ms / 1000.0
    # FLOPs for matmul: 2 * M * N * K per iteration
    total_flops = 2 * m * n * k * iterations
    tflops = (total_flops / elapsed_s) / 1e12

    dtype_name = str(dtype).split(".")[-1]
    print(f"GEMM {dtype_name} ({m}x{k} @ {k}x{n}): {tflops:.2f} TFLOPS")
    print(f"  Time per matmul: {elapsed_ms / iterations:.2f} ms")
    return tflops

if __name__ == "__main__":
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"CUDA: {torch.version.cuda}")
    print(f"PyTorch: {torch.__version__}\n")

    measure_memory_bandwidth(size_gb=1.0)
    print()
    measure_gemm_tflops(dtype=torch.float32)
    measure_gemm_tflops(dtype=torch.float16)
    measure_gemm_tflops(dtype=torch.bfloat16)

Run it and you’ll get numbers like these on an A100 80GB:

GPU: NVIDIA A100-SXM4-80GB
CUDA: 12.4
PyTorch: 2.5.1

Memory Bandwidth: 1842.3 GB/s
  Tensor size: 1.0 GB, Iterations: 20
  Total time: 21.7 ms

GEMM float32 (4096x4096 @ 4096x4096): 19.12 TFLOPS
  Time per matmul: 7.18 ms
GEMM float16 (4096x4096 @ 4096x4096): 142.87 TFLOPS
  Time per matmul: 0.96 ms
GEMM bfloat16 (4096x4096 @ 4096x4096): 139.45 TFLOPS
  Time per matmul: 0.98 ms

Those FP16 numbers sit at roughly 46% of the A100's theoretical tensor-core peak of 312 TFLOPS. That is normal for this benchmark: achieved GEMM throughput depends heavily on matrix size and library version, and well-tuned kernels at larger sizes can get considerably closer to peak. If you're seeing less than 40% of peak, something is wrong with your CUDA setup.
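To make that comparison concrete, a small helper (hypothetical, with spec-sheet peak values hard-coded) turns a measured TFLOPS figure into a fraction of theoretical peak:

```python
# Hypothetical helper: express measured GEMM throughput as a fraction of a
# card's theoretical peak. Peak values are published spec-sheet numbers.
PEAK_TFLOPS = {
    ("A100", "float16"): 312.0,  # FP16 tensor-core peak
    ("A100", "float32"): 19.5,   # FP32 (non-tensor-core) peak
}

def peak_efficiency(measured_tflops, gpu, dtype):
    """Fraction of theoretical peak achieved by a measured TFLOPS figure."""
    return measured_tflops / PEAK_TFLOPS[(gpu, dtype)]

# The FP16 run above measured 142.87 TFLOPS:
print(f"{peak_efficiency(142.87, 'A100', 'float16'):.0%}")  # -> 46%
```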

Profiling Model Inference Latency

Microbenchmarks tell you about raw hardware capability. But what matters is how fast your actual models run. Here’s how to measure inference latency with proper warm-up and statistical rigor:

import torch
import torchvision.models as models
import numpy as np

def benchmark_inference(model_name="resnet50", batch_size=32, iterations=100, warmup=10):
    """Benchmark model inference with CUDA event timing."""
    model = getattr(models, model_name)(weights=None).cuda().eval()
    dummy_input = torch.randn(batch_size, 3, 224, 224, device="cuda")

    # Warm up — critical for accurate results
    with torch.no_grad():
        for _ in range(warmup):
            model(dummy_input)
    torch.cuda.synchronize()

    latencies = []
    with torch.no_grad():
        for _ in range(iterations):
            start = torch.cuda.Event(enable_timing=True)
            end = torch.cuda.Event(enable_timing=True)

            start.record()
            model(dummy_input)
            end.record()
            torch.cuda.synchronize()

            latencies.append(start.elapsed_time(end))

    latencies = np.array(latencies)
    throughput = (batch_size * 1000.0) / np.mean(latencies)  # images/sec

    print(f"Model: {model_name} | Batch size: {batch_size}")
    print(f"  Mean latency:  {np.mean(latencies):.2f} ms")
    print(f"  P50 latency:   {np.percentile(latencies, 50):.2f} ms")
    print(f"  P95 latency:   {np.percentile(latencies, 95):.2f} ms")
    print(f"  P99 latency:   {np.percentile(latencies, 99):.2f} ms")
    print(f"  Std dev:       {np.std(latencies):.2f} ms")
    print(f"  Throughput:    {throughput:.0f} images/sec")

    return {"mean": np.mean(latencies), "p95": np.percentile(latencies, 95), "throughput": throughput}

if __name__ == "__main__":
    for bs in [1, 8, 32, 64]:
        benchmark_inference("resnet50", batch_size=bs)
        print()

The warm-up phase matters more than you’d think. The first few iterations trigger CUDA kernel compilation, memory allocation, and cuDNN autotuning. Skip warm-up and your P99 latency will be wildly inflated.
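You can see this in the raw numbers: keep the warm-up samples instead of discarding them and the first few iterations dwarf steady state. A small helper (hypothetical, pure Python) estimates how many leading iterations a benchmark should throw away:

```python
import statistics

def warmup_length(latencies_ms, factor=1.5):
    """Count leading samples slower than factor * median of the last half.

    The median of the second half of the series is taken as steady state;
    the 1.5x factor is an arbitrary but serviceable threshold.
    """
    steady = statistics.median(latencies_ms[len(latencies_ms) // 2:])
    n = 0
    for t in latencies_ms:
        if t <= factor * steady:
            break
        n += 1
    return n

# First three iterations dominated by kernel compilation and autotuning:
print(warmup_length([250.0, 80.0, 30.0, 10.1, 9.9, 10.0, 10.2, 9.8]))  # -> 3
```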

Measuring Training Throughput

Inference benchmarks are useful, but training throughput is where the money is. This measures samples per second during a real training loop with ResNet-50:

import torch
import torch.nn as nn
import torchvision.models as models
import time

def benchmark_training(batch_size=64, steps=50, warmup_steps=5, use_amp=False):
    """Measure training throughput in samples/sec."""
    model = models.resnet50(weights=None).cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    criterion = nn.CrossEntropyLoss()
    scaler = torch.amp.GradScaler("cuda") if use_amp else None

    model.train()
    total_samples = 0
    start_time = None

    for step in range(warmup_steps + steps):
        images = torch.randn(batch_size, 3, 224, 224, device="cuda")
        labels = torch.randint(0, 1000, (batch_size,), device="cuda")

        if step == warmup_steps:
            torch.cuda.synchronize()
            start_time = time.perf_counter()

        optimizer.zero_grad()

        if use_amp:
            with torch.amp.autocast("cuda"):
                output = model(images)
                loss = criterion(output, labels)
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()
        else:
            output = model(images)
            loss = criterion(output, labels)
            loss.backward()
            optimizer.step()

        if step >= warmup_steps:
            total_samples += batch_size

    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start_time
    throughput = total_samples / elapsed

    precision = "AMP (FP16)" if use_amp else "FP32"
    print(f"Training: ResNet-50 | {precision} | Batch: {batch_size}")
    print(f"  Throughput: {throughput:.1f} samples/sec")
    print(f"  Time per step: {(elapsed / steps) * 1000:.1f} ms")
    print(f"  Total steps: {steps}, Elapsed: {elapsed:.2f}s")
    return throughput

if __name__ == "__main__":
    fp32 = benchmark_training(batch_size=64, use_amp=False)
    print()
    amp = benchmark_training(batch_size=64, use_amp=True)
    print(f"\nAMP speedup: {amp / fp32:.2f}x")

On an A100, you should see roughly 1.5-2x throughput improvement with AMP enabled. On older GPUs such as the V100, the gap is usually smaller: first-generation tensor cores offer a lower FP16-to-FP32 throughput ratio, so there is less headroom for AMP to exploit.

Power and Thermal Monitoring with nvidia-smi

GPU benchmarks without power data are incomplete. You need to know watts per TFLOP to compare GPUs fairly. Run this alongside your benchmarks:

# Log GPU stats every second to CSV
nvidia-smi --query-gpu=timestamp,name,temperature.gpu,power.draw,utilization.gpu,utilization.memory,memory.used,memory.total,clocks.current.sm,clocks.current.memory \
  --format=csv,nounits -l 1 -f gpu_stats.csv

Start this in a separate terminal before running your benchmark. Then parse the results:

import pandas as pd

df = pd.read_csv("gpu_stats.csv", skipinitialspace=True)
df.columns = df.columns.str.strip()

print("GPU Thermal & Power Summary")
print(f"  GPU: {df['name'].iloc[0]}")
print(f"  Avg Temperature: {df['temperature.gpu'].mean():.1f} C")
print(f"  Max Temperature: {df['temperature.gpu'].max():.0f} C")
print(f"  Avg Power Draw:  {df['power.draw [W]'].mean():.1f} W")
print(f"  Max Power Draw:  {df['power.draw [W]'].max():.1f} W")
print(f"  Avg GPU Util:    {df['utilization.gpu [%]'].mean():.1f}%")
print(f"  Peak Memory:     {df['memory.used [MiB]'].max():.0f} MiB / {df['memory.total [MiB]'].iloc[0]} MiB")

If your GPU consistently hits the thermal throttle point (typically 83-85C on data center cards), your benchmark results will degrade over time. Watch for SM clock drops in the CSV data – that’s throttling in action.
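Spotting those clock drops can be automated: scan the clocks.current.sm column for samples that fall well below the peak observed clock. A minimal sketch (pure Python; the 10% tolerance is an assumption, tune it to your card):

```python
def find_throttle_events(sm_clocks_mhz, tolerance=0.10):
    """Return indices of samples where the SM clock dropped more than
    `tolerance` below the peak clock observed in the log."""
    peak = max(sm_clocks_mhz)
    floor = peak * (1.0 - tolerance)
    return [i for i, clk in enumerate(sm_clocks_mhz) if clk < floor]

# Synthetic SM clock samples (MHz): a thermal dip at indices 3-4
clocks = [1410, 1410, 1395, 1200, 1185, 1410]
print(find_throttle_events(clocks))  # -> [3, 4]
```

Feed it the clocks.current.sm column from the CSV log; a burst of indices midway through a benchmark is throttling in action.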

Using torch.profiler for Detailed Analysis

When you need to dig deeper than wall-clock timing, torch.profiler breaks down exactly where time goes:

import torch
import torchvision.models as models
from torch.profiler import profile, record_function, ProfilerActivity

model = models.resnet50(weights=None).cuda().eval()
inputs = torch.randn(32, 3, 224, 224, device="cuda")

# Warm up
with torch.no_grad():
    for _ in range(5):
        model(inputs)

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    profile_memory=True,
    with_stack=True,
) as prof:
    with record_function("resnet50_inference"):
        with torch.no_grad():
            model(inputs)

# Print top CUDA kernels by time
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=15))

# Export Chrome trace for visual inspection
prof.export_chrome_trace("resnet50_trace.json")
print("\nTrace saved to resnet50_trace.json")
print("Open in chrome://tracing or https://ui.perfetto.dev")

The Chrome trace export is incredibly useful. Open it in Perfetto and you can see every CUDA kernel, memory transfer, and CPU operation on a timeline. Look for gaps between kernels – those are launch overhead or synchronization stalls.
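Gap-hunting can be automated too. Chrome trace events carry a start timestamp ("ts") and a duration ("dur"), both in microseconds; assuming GPU kernel events are tagged with cat == "kernel" (true for recent PyTorch/Kineto exports, but worth verifying against your own trace), a short script can list the suspicious gaps:

```python
import json

def kernel_gaps(events, min_gap_us=5.0):
    """Return gaps (in µs) between consecutive kernel events that exceed
    min_gap_us. Large gaps suggest launch overhead or sync stalls."""
    kernels = sorted(
        (e for e in events if e.get("cat") == "kernel"),
        key=lambda e: e["ts"],
    )
    return [
        cur["ts"] - (prev["ts"] + prev["dur"])
        for prev, cur in zip(kernels, kernels[1:])
        if cur["ts"] - (prev["ts"] + prev["dur"]) > min_gap_us
    ]

# Usage with the trace exported above:
# events = json.load(open("resnet50_trace.json"))["traceEvents"]
# print(sorted(kernel_gaps(events), reverse=True)[:10])  # ten largest gaps
```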

Comparing Results Across GPUs

Here’s a reference table for what you should roughly expect from common ML GPUs on the benchmarks above (ResNet-50 training, batch size 64):

| GPU       | Memory BW (GB/s) | FP32 TFLOPS | FP16 TFLOPS | Train imgs/sec (FP32) | Train imgs/sec (AMP) |
|-----------|------------------|-------------|-------------|-----------------------|----------------------|
| RTX 3090  | ~936             | ~12         | ~70         | ~450                  | ~780                 |
| RTX 4090  | ~1008            | ~18         | ~165        | ~620                  | ~1100                |
| A100 80GB | ~2039            | ~19         | ~140        | ~780                  | ~1350                |
| H100 SXM  | ~3350            | ~60         | ~380        | ~1400                 | ~2800                |

These are ballpark figures from real benchmarks, not theoretical peaks. Your numbers will vary based on driver version, CUDA version, cooling, and power limits. If you’re within 70-80% of these numbers, your setup is healthy.
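That health check is easy to automate. A sketch (reference numbers copied from the table above; the 70% threshold is the rule of thumb just stated):

```python
# Reference figures from the comparison table; add rows for other GPUs.
REFERENCE = {
    "A100 80GB": {"mem_bw_gbs": 2039, "fp32_tflops": 19, "fp16_tflops": 140},
}

def is_healthy(gpu, measured, threshold=0.70):
    """True if every measured metric reaches threshold * reference value."""
    ref = REFERENCE[gpu]
    return all(measured[k] >= threshold * ref[k] for k in measured)

# Using the sample run from earlier in the article:
print(is_healthy("A100 80GB", {"mem_bw_gbs": 1842.3, "fp16_tflops": 142.87}))  # -> True
```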

Common Errors and Fixes

“CUDA out of memory” during GEMM benchmark

Large matrix sizes at FP32 eat memory fast. A 16384x16384 FP32 matrix is 1 GB. Two of them plus the output is 3 GB before you even start. Reduce the matrix size or use FP16:

# Use smaller matrices for GPUs with less VRAM
measure_gemm_tflops(m=2048, n=2048, k=2048, dtype=torch.float16)
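The arithmetic generalizes into a quick estimator (a sketch; it ignores cuBLAS workspace and allocator overhead, so treat the result as a lower bound):

```python
def gemm_memory_gb(m, n, k, bytes_per_element=4):
    """Approximate VRAM needed for one torch.mm call: A (m*k), B (k*n),
    and the output (m*n), all in the same dtype."""
    elements = m * k + k * n + m * n
    return elements * bytes_per_element / 1e9

print(f"{gemm_memory_gb(16384, 16384, 16384):.1f} GB at FP32")  # -> 3.2 GB at FP32
```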

Unexpectedly low TFLOPS

Check that tensor cores are actually being used. The cuBLAS kernels PyTorch dispatches to hit the tensor-core fast path reliably only when dimensions are multiples of 8 (FP16) or 16 (INT8). Use matrix sizes like 4096 or 8192, not odd sizes like 4097 or 4100.
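A tiny helper (hypothetical) rounds an awkward dimension up to a tensor-core-friendly size before benchmarking:

```python
def pad_to_multiple(dim, multiple=8):
    """Round dim up to the nearest multiple (8 for FP16, 16 for INT8)."""
    return ((dim + multiple - 1) // multiple) * multiple

print(pad_to_multiple(4100))      # -> 4104
print(pad_to_multiple(4096))      # -> 4096 (already aligned)
print(pad_to_multiple(4100, 16))  # -> 4112 (INT8 alignment)
```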

Also verify your GPU clock speeds aren’t locked to a low power state:

# Check current clocks
nvidia-smi -q -d CLOCK

# Set persistence mode (requires root)
sudo nvidia-smi -pm 1

# Lock clocks to a fixed frequency in MHz; list valid values with
# nvidia-smi -q -d SUPPORTED_CLOCKS
sudo nvidia-smi --lock-gpu-clocks=<min_mhz>,<max_mhz>

torch.cuda.Event shows 0.0 ms timing

You forgot torch.cuda.synchronize() after end.record(). CUDA events are asynchronous. Without synchronization, you’re querying the event before it actually fires.

Variance in benchmark results

GPU boost clocks fluctuate with temperature and power. For stable results, lock the GPU clocks, ensure adequate cooling, and run at least 50 iterations after warm-up. Also close any other GPU processes – even a desktop compositor uses some GPU resources.
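When reporting results, robust statistics help too: median and interquartile range shrug off the occasional boost-clock excursion that would skew a mean and standard deviation. A minimal sketch:

```python
import statistics

def summarize_latencies(samples_ms):
    """Return (median, interquartile range) of repeated timings in ms."""
    q1, med, q3 = statistics.quantiles(samples_ms, n=4)
    return med, q3 - q1

# One outlier at 9.8 ms barely moves the median or the IQR:
med, iqr = summarize_latencies([7.1, 7.2, 7.2, 9.8, 7.2, 7.3, 7.1, 7.2])
print(f"median {med:.2f} ms, IQR {iqr:.2f} ms")
```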

“RuntimeError: No CUDA GPUs are available”

Check your driver and CUDA toolkit match your PyTorch build:

python -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"
nvidia-smi  # Check driver version

PyTorch CUDA 12.4 needs driver 550+ on Linux. If the versions mismatch, reinstall PyTorch with the correct CUDA version from the PyTorch website.