# Quick GPU Benchmark with PyTorch
Before you spend hours tuning your training pipeline, you need to know what your GPU can actually do. Raw spec sheets tell part of the story. Real benchmarks under ML workloads tell the rest.
Here’s the fastest way to get a baseline. This script measures memory bandwidth and compute throughput in under 30 seconds:
```python
import torch


def measure_memory_bandwidth(size_gb=1.0):
    """Measure GPU memory bandwidth with large tensor copies."""
    num_elements = int(size_gb * 1e9 / 4)  # FP32 = 4 bytes
    a = torch.randn(num_elements, device="cuda", dtype=torch.float32)
    b = torch.empty_like(a)

    # Warm up
    b.copy_(a)
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    iterations = 20

    start.record()
    for _ in range(iterations):
        b.copy_(a)
    end.record()
    torch.cuda.synchronize()

    elapsed_ms = start.elapsed_time(end)
    elapsed_s = elapsed_ms / 1000.0

    # Each copy reads + writes = 2x data movement
    total_bytes = 2 * a.nelement() * a.element_size() * iterations
    bandwidth_gbs = (total_bytes / elapsed_s) / 1e9

    print(f"Memory Bandwidth: {bandwidth_gbs:.1f} GB/s")
    print(f"  Tensor size: {size_gb:.1f} GB, Iterations: {iterations}")
    print(f"  Total time: {elapsed_ms:.1f} ms")
    return bandwidth_gbs


def measure_gemm_tflops(m=4096, n=4096, k=4096, dtype=torch.float32):
    """Measure matrix multiplication throughput in TFLOPS."""
    a = torch.randn(m, k, device="cuda", dtype=dtype)
    b = torch.randn(k, n, device="cuda", dtype=dtype)

    # Warm up
    torch.mm(a, b)
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    iterations = 50

    start.record()
    for _ in range(iterations):
        torch.mm(a, b)
    end.record()
    torch.cuda.synchronize()

    elapsed_ms = start.elapsed_time(end)
    elapsed_s = elapsed_ms / 1000.0

    # FLOPs for matmul: 2 * M * N * K per iteration
    total_flops = 2 * m * n * k * iterations
    tflops = (total_flops / elapsed_s) / 1e12

    dtype_name = str(dtype).split(".")[-1]
    print(f"GEMM {dtype_name} ({m}x{k} @ {k}x{n}): {tflops:.2f} TFLOPS")
    print(f"  Time per matmul: {elapsed_ms / iterations:.2f} ms")
    return tflops


if __name__ == "__main__":
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"CUDA: {torch.version.cuda}")
    print(f"PyTorch: {torch.__version__}\n")
    measure_memory_bandwidth(size_gb=1.0)
    print()
    measure_gemm_tflops(dtype=torch.float32)
    measure_gemm_tflops(dtype=torch.float16)
    measure_gemm_tflops(dtype=torch.bfloat16)
```
Run it and you’ll get numbers like these on an A100 80GB:
```text
GPU: NVIDIA A100-SXM4-80GB
CUDA: 12.4
PyTorch: 2.5.1

Memory Bandwidth: 1842.3 GB/s
  Tensor size: 1.0 GB, Iterations: 20
  Total time: 21.7 ms

GEMM float32 (4096x4096 @ 4096x4096): 19.12 TFLOPS
  Time per matmul: 7.18 ms
GEMM float16 (4096x4096 @ 4096x4096): 142.87 TFLOPS
  Time per matmul: 0.96 ms
GEMM bfloat16 (4096x4096 @ 4096x4096): 139.45 TFLOPS
  Time per matmul: 0.98 ms
```
Compare the FP16 figure against the A100’s theoretical tensor-core peak of 312 TFLOPS: a plain torch.mm benchmark won’t reach it, but if you’re seeing less than 40% of peak, something is wrong with your CUDA setup.
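To put measured numbers in context, a small helper can express throughput as a fraction of theoretical peak. This is illustrative only; the peak figures are NVIDIA’s published A100 (SXM) datasheet numbers, and `efficiency` is a hypothetical helper, not part of the script above:

```python
# Datasheet peaks for the A100 SXM, in TFLOPS. "float32" here means
# classic FP32 without TF32 tensor cores.
A100_PEAKS_TFLOPS = {"float32": 19.5, "tf32": 156.0, "float16": 312.0, "bfloat16": 312.0}


def efficiency(measured_tflops, dtype_name, peaks=A100_PEAKS_TFLOPS):
    """Measured throughput as a fraction of theoretical peak."""
    return measured_tflops / peaks[dtype_name]


print(f"FP16 efficiency: {efficiency(142.87, 'float16'):.0%}")
```

Note that the FP32 figure above sits right at the A100’s 19.5 TFLOPS non-tensor-core peak. Setting `torch.backends.cuda.matmul.allow_tf32 = True` lets FP32 matmuls dispatch to TF32 tensor cores (156 TFLOPS peak) at slightly reduced precision.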
## Profiling Model Inference Latency
Microbenchmarks tell you about raw hardware capability. But what matters is how fast your actual models run. Here’s how to measure inference latency with proper warm-up and statistical rigor:
```python
import numpy as np
import torch
import torchvision.models as models


def benchmark_inference(model_name="resnet50", batch_size=32, iterations=100, warmup=10):
    """Benchmark model inference with CUDA event timing."""
    model = getattr(models, model_name)(weights=None).cuda().eval()
    dummy_input = torch.randn(batch_size, 3, 224, 224, device="cuda")

    # Warm up — critical for accurate results
    with torch.no_grad():
        for _ in range(warmup):
            model(dummy_input)
    torch.cuda.synchronize()

    latencies = []
    with torch.no_grad():
        for _ in range(iterations):
            start = torch.cuda.Event(enable_timing=True)
            end = torch.cuda.Event(enable_timing=True)
            start.record()
            model(dummy_input)
            end.record()
            torch.cuda.synchronize()
            latencies.append(start.elapsed_time(end))

    latencies = np.array(latencies)
    throughput = (batch_size * 1000.0) / np.mean(latencies)  # images/sec

    print(f"Model: {model_name} | Batch size: {batch_size}")
    print(f"  Mean latency: {np.mean(latencies):.2f} ms")
    print(f"  P50 latency: {np.percentile(latencies, 50):.2f} ms")
    print(f"  P95 latency: {np.percentile(latencies, 95):.2f} ms")
    print(f"  P99 latency: {np.percentile(latencies, 99):.2f} ms")
    print(f"  Std dev: {np.std(latencies):.2f} ms")
    print(f"  Throughput: {throughput:.0f} images/sec")
    return {
        "mean": np.mean(latencies),
        "p95": np.percentile(latencies, 95),
        "throughput": throughput,
    }


if __name__ == "__main__":
    for bs in [1, 8, 32, 64]:
        benchmark_inference("resnet50", batch_size=bs)
        print()
```
The warm-up phase matters more than you’d think. The first few iterations trigger CUDA kernel compilation, memory allocation, and cuDNN autotuning. Skip warm-up and your P99 latency will be wildly inflated.
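The effect on tail statistics is easy to demonstrate with synthetic numbers. This sketch is purely illustrative; the latencies are made up, with the first two samples standing in for the slow compile-and-autotune iterations:

```python
# Synthetic latencies in ms: the first two "iterations" are slow
# (kernel compile, allocation, cuDNN autotune), the rest are steady-state.
latencies = [250.0, 180.0] + [10.0] * 98


def p99(samples):
    """Nearest-rank 99th percentile."""
    s = sorted(samples)
    return s[int(0.99 * (len(s) - 1))]


print(f"P99 with warm-up included: {p99(latencies):.1f} ms")
print(f"P99 after dropping warm-up: {p99(latencies[2:]):.1f} ms")
```

Just two bad samples out of a hundred are enough to push P99 from the steady-state value up to the warm-up cost, which is exactly why the benchmark above discards the warm-up iterations before recording.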
## Measuring Training Throughput
Inference benchmarks are useful, but training throughput is where the money is. This measures samples per second during a real training loop with ResNet-50:
```python
import time

import torch
import torch.nn as nn
import torchvision.models as models


def benchmark_training(batch_size=64, steps=50, warmup_steps=5, use_amp=False):
    """Measure training throughput in samples/sec."""
    model = models.resnet50(weights=None).cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    criterion = nn.CrossEntropyLoss()
    scaler = torch.amp.GradScaler("cuda") if use_amp else None

    model.train()
    total_samples = 0
    start_time = None

    for step in range(warmup_steps + steps):
        images = torch.randn(batch_size, 3, 224, 224, device="cuda")
        labels = torch.randint(0, 1000, (batch_size,), device="cuda")

        if step == warmup_steps:
            torch.cuda.synchronize()
            start_time = time.perf_counter()

        optimizer.zero_grad()
        if use_amp:
            with torch.amp.autocast("cuda"):
                output = model(images)
                loss = criterion(output, labels)
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()
        else:
            output = model(images)
            loss = criterion(output, labels)
            loss.backward()
            optimizer.step()

        if step >= warmup_steps:
            total_samples += batch_size

    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start_time
    throughput = total_samples / elapsed

    precision = "AMP (FP16)" if use_amp else "FP32"
    print(f"Training: ResNet-50 | {precision} | Batch: {batch_size}")
    print(f"  Throughput: {throughput:.1f} samples/sec")
    print(f"  Time per step: {(elapsed / steps) * 1000:.1f} ms")
    print(f"  Total steps: {steps}, Elapsed: {elapsed:.2f}s")
    return throughput


if __name__ == "__main__":
    fp32 = benchmark_training(batch_size=64, use_amp=False)
    print()
    amp = benchmark_training(batch_size=64, use_amp=True)
    print(f"\nAMP speedup: {amp / fp32:.2f}x")
```
On an A100, you should see roughly a 1.5-2x throughput improvement with AMP enabled. On older GPUs like the V100, the gap is usually smaller because their first-generation tensor cores are less efficient under mixed precision.
## Power and Thermal Monitoring with nvidia-smi
GPU benchmarks without power data are incomplete. You need to know watts per TFLOP to compare GPUs fairly. Run this alongside your benchmarks:
```bash
# Log GPU stats every second to CSV
nvidia-smi --query-gpu=timestamp,name,temperature.gpu,power.draw,utilization.gpu,utilization.memory,memory.used,memory.total,clocks.current.sm,clocks.current.memory \
    --format=csv,nounits -l 1 -f gpu_stats.csv
```
Start this in a separate terminal before running your benchmark. Then parse the results:
Start this in a separate terminal before running your benchmark. Then parse the results:

```python
import pandas as pd

df = pd.read_csv("gpu_stats.csv", skipinitialspace=True)
df.columns = df.columns.str.strip()

print("GPU Thermal & Power Summary")
print(f"  GPU: {df['name'].iloc[0]}")
print(f"  Avg Temperature: {df['temperature.gpu'].mean():.1f} C")
print(f"  Max Temperature: {df['temperature.gpu'].max():.0f} C")
print(f"  Avg Power Draw: {df['power.draw [W]'].mean():.1f} W")
print(f"  Max Power Draw: {df['power.draw [W]'].max():.1f} W")
print(f"  Avg GPU Util: {df['utilization.gpu [%]'].mean():.1f}%")
print(f"  Peak Memory: {df['memory.used [MiB]'].max():.0f} MiB / {df['memory.total [MiB]'].iloc[0]} MiB")
```
If your GPU consistently hits the thermal throttle point (typically 83-85C on data center cards), your benchmark results will degrade over time. Watch for SM clock drops in the CSV data – that’s throttling in action.
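Checking the log for those clock drops can be automated. A sketch, with assumptions worth flagging: the column name matches the query string above, and the 10% drop threshold is an arbitrary heuristic, not an NVIDIA-defined cutoff:

```python
import csv


def throttled_samples(sm_clocks_mhz, drop_fraction=0.10):
    """Count samples whose SM clock sits more than drop_fraction below the peak observed."""
    peak = max(sm_clocks_mhz)
    return sum(1 for c in sm_clocks_mhz if c < peak * (1 - drop_fraction))


def throttled_from_csv(path="gpu_stats.csv"):
    """Read the nvidia-smi log and count throttled samples."""
    with open(path, newline="") as f:
        rows = csv.DictReader(f, skipinitialspace=True)
        clocks = [float(r["clocks.current.sm [MHz]"]) for r in rows]
    return throttled_samples(clocks)
```

If more than a handful of samples come back throttled, fix cooling or power limits before trusting any throughput numbers.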
## Using torch.profiler for Detailed Analysis
When you need to dig deeper than wall-clock timing, torch.profiler breaks down exactly where time goes:
```python
import torch
import torchvision.models as models
from torch.profiler import ProfilerActivity, profile, record_function

model = models.resnet50(weights=None).cuda().eval()
inputs = torch.randn(32, 3, 224, 224, device="cuda")

# Warm up
with torch.no_grad():
    for _ in range(5):
        model(inputs)

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    profile_memory=True,
    with_stack=True,
) as prof:
    with record_function("resnet50_inference"):
        with torch.no_grad():
            model(inputs)

# Print top CUDA kernels by time
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=15))

# Export Chrome trace for visual inspection
prof.export_chrome_trace("resnet50_trace.json")
print("\nTrace saved to resnet50_trace.json")
print("Open in chrome://tracing or https://ui.perfetto.dev")
```
The Chrome trace export is incredibly useful. Open it in Perfetto and you can see every CUDA kernel, memory transfer, and CPU operation on a timeline. Look for gaps between kernels – those are launch overhead or synchronization stalls.
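Those gaps can also be totaled programmatically from the exported JSON. This is a simplified sketch: the `ts`/`dur` fields (microseconds) come from the Chrome trace event format, and a real analysis would first filter events to a single CUDA stream to be meaningful:

```python
import json  # for loading the exported trace, as in the usage comment below


def total_gap_us(events):
    """Sum idle time between consecutive complete ('ph' == 'X') trace events."""
    spans = sorted((e["ts"], e["ts"] + e["dur"]) for e in events if e.get("ph") == "X")
    gap_us = 0.0
    prev_end = None
    for start, end in spans:
        if prev_end is not None and start > prev_end:
            gap_us += start - prev_end
        prev_end = end if prev_end is None else max(prev_end, end)
    return gap_us


# Usage: events = json.load(open("resnet50_trace.json"))["traceEvents"]
```

A large total gap relative to total kernel time is the signature of launch overhead or CPU-side stalls rather than slow kernels.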
## Comparing Results Across GPUs
Here’s a reference table for what you should roughly expect from common ML GPUs on the benchmarks above (ResNet-50 training, batch size 64):
| GPU | Memory BW (GB/s) | FP32 TFLOPS | FP16 TFLOPS | Train imgs/sec (FP32) | Train imgs/sec (AMP) |
|---|---|---|---|---|---|
| RTX 3090 | ~936 | ~12 | ~70 | ~450 | ~780 |
| RTX 4090 | ~1008 | ~18 | ~165 | ~620 | ~1100 |
| A100 80GB | ~2039 | ~19 | ~140 | ~780 | ~1350 |
| H100 SXM | ~3350 | ~60 | ~380 | ~1400 | ~2800 |
These are ballpark figures from real benchmarks, not theoretical peaks. Your numbers will vary based on driver version, CUDA version, cooling, and power limits. If you’re within 70-80% of these numbers, your setup is healthy.
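That 70-80% rule can be turned into a quick sanity check. A sketch only: the reference dictionary repeats the ballpark figures from the table above, not official specs, and `is_healthy` is a hypothetical helper:

```python
# Ballpark reference figures from the table above.
REFERENCE = {
    "RTX 4090": {"fp16_tflops": 165, "train_amp": 1100},
    "A100 80GB": {"fp16_tflops": 140, "train_amp": 1350},
}


def is_healthy(measured, reference, threshold=0.7):
    """True if every measured metric reaches at least `threshold` of its reference."""
    return all(measured[k] >= threshold * v for k, v in reference.items())


print(is_healthy({"fp16_tflops": 120, "train_amp": 1200}, REFERENCE["A100 80GB"]))
```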
## Common Errors and Fixes
### “CUDA out of memory” during GEMM benchmark
Large matrix sizes at FP32 eat memory fast. A 16384x16384 FP32 matrix is 1 GB. Two of them plus the output is 3 GB before you even start. Reduce the matrix size or use FP16:
```python
# Use smaller matrices for GPUs with less VRAM
measure_gemm_tflops(m=2048, n=2048, k=2048, dtype=torch.float16)
```
### Unexpectedly low TFLOPS
Check that tensor cores are actually being used. PyTorch only dispatches matmuls to tensor cores when the dimensions are multiples of 8 (FP16) or 16 (INT8). Use matrix sizes like 4096 or 8192, not 4100 or 4001.
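If you must benchmark an awkward size, round it up to an aligned one first. A tiny hypothetical helper (not a PyTorch API):

```python
def pad_to_multiple(dim, multiple=8):
    """Round a matrix dimension up to the next multiple (8 for FP16 tensor cores)."""
    return ((dim + multiple - 1) // multiple) * multiple


print(pad_to_multiple(4100))  # 4100 is not a multiple of 8; rounds up to 4104
```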
Also verify your GPU clock speeds aren’t locked to a low power state:
```bash
# Check current clocks
nvidia-smi -q -d CLOCK

# Set max performance mode (requires root)
sudo nvidia-smi -pm 1
# Replace MAX_CLOCK with your GPU's maximum SM clock in MHz
sudo nvidia-smi --lock-gpu-clocks=MAX_CLOCK
```
### torch.cuda.Event shows 0.0 ms timing
You forgot torch.cuda.synchronize() after end.record(). CUDA events are asynchronous. Without synchronization, you’re querying the event before it actually fires.
### Variance in benchmark results
GPU boost clocks fluctuate with temperature and power. For stable results, lock the GPU clocks, ensure adequate cooling, and run at least 50 iterations after warm-up. Also close any other GPU processes – even a desktop compositor uses some GPU resources.
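One way to quantify run-to-run stability is the coefficient of variation of the per-iteration latencies. The low-single-digit guideline below is a rough heuristic, not an official threshold:

```python
import statistics


def cv_percent(samples):
    """Coefficient of variation: sample std dev as a percentage of the mean."""
    return 100.0 * statistics.stdev(samples) / statistics.mean(samples)


# With locked clocks and proper warm-up, a well-behaved benchmark typically
# lands in the low single digits; a large CV suggests throttling or
# background GPU processes interfering.
```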
### “RuntimeError: No CUDA GPUs are available”
Check your driver and CUDA toolkit match your PyTorch build:
```bash
python -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"
nvidia-smi  # Check driver version
```
PyTorch CUDA 12.4 needs driver 550+ on Linux. If the versions mismatch, reinstall PyTorch with the correct CUDA version from the PyTorch website.