If you can’t reproduce your results, you can’t verify them. And if you can’t verify them, you can’t trust them. Reproducibility isn’t just a nice-to-have for ML experiments — it’s the foundation of trustworthy AI. When an auditor asks “how did you get these numbers?” and you shrug because your training run gives different results every time, that’s a credibility problem. Worse, it’s a safety problem. Non-reproducible experiments make it impossible to audit AI systems, trace bugs back to their source, or confirm that a model behaves the way you claim it does.
The fix is straightforward: seed every source of randomness, lock down non-deterministic operations, and log everything.
Seed Everything
Random numbers come from multiple sources in a typical ML pipeline. Python’s built-in random module, NumPy, PyTorch’s CPU and CUDA generators — each has its own internal state. Miss one and your results drift between runs.
Here’s a seed_everything() function that covers all of them:
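A minimal sketch of such a helper, assuming Python with NumPy and PyTorch installed:

```python
import os
import random

import numpy as np
import torch


def seed_everything(seed: int = 42) -> None:
    """Seed every common source of randomness in one place."""
    os.environ["PYTHONHASHSEED"] = str(seed)  # only fully effective if set before launch
    random.seed(seed)                 # Python's built-in random module
    np.random.seed(seed)              # NumPy's global generator
    torch.manual_seed(seed)           # PyTorch CPU generator
    torch.cuda.manual_seed_all(seed)  # every CUDA device (no-op without a GPU)
```

torch.cuda.manual_seed_all() is safe to call on CPU-only machines; it simply does nothing when no GPU is present.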
Call seed_everything(42) at the very top of your script, before you import models, create datasets, or initialize anything that touches random state.
The PYTHONHASHSEED environment variable controls Python’s hash randomization. It needs to be set before the interpreter starts for full effect, so either export it in your shell or set it in your launch script:
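In a launch script, that might look like the following (train.py stands in for your own entry point):

```shell
# PYTHONHASHSEED must be set before the Python interpreter starts.
export PYTHONHASHSEED=42

# Then launch training as usual, e.g.:
#   python train.py
```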
One important note: torch.cuda.manual_seed_all() seeds every GPU. If you only call torch.cuda.manual_seed(), you only seed the current device. Always use the _all variant unless you have a specific reason not to.
Enable Deterministic Operations in PyTorch
Seeding random number generators isn’t enough. PyTorch uses optimized CUDA kernels that are non-deterministic by default because the fastest algorithm for a given operation often involves race conditions between GPU threads. Two runs with the same seed can still produce different results.
Lock this down with three settings:
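In code, the three settings look like this:

```python
import torch

# Force cuDNN to pick deterministic convolution kernels.
torch.backends.cudnn.deterministic = True

# Stop cuDNN from benchmarking algorithms per input size;
# the winning algorithm can differ between runs.
torch.backends.cudnn.benchmark = False

# Make every op use a deterministic implementation,
# raising RuntimeError where none exists.
torch.use_deterministic_algorithms(True)
```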
The performance trade-off is real. cudnn.benchmark = False means PyTorch won't search for the fastest convolution algorithm for your specific input sizes. On some models, this can slow training by 10-20%. torch.use_deterministic_algorithms(True) is stricter: it forces every operation to use a deterministic implementation and raises a RuntimeError if one doesn't exist.
For debugging and auditing, the slowdown is worth it. For large-scale production training where you’ve already validated your pipeline, you might relax benchmark back to True and accept minor floating-point variance.
Put it all together at the start of your training script:
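A consolidated preamble might look like this (the seeding helper from earlier is inlined so the snippet stands alone):

```python
import os
import random

import numpy as np
import torch

SEED = 42

# Seed every RNG.
os.environ["PYTHONHASHSEED"] = str(SEED)
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)

# Lock down non-deterministic kernels.
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
torch.use_deterministic_algorithms(True)
```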
Handle Non-Deterministic Operations
Some operations simply don’t have deterministic CUDA implementations. When you enable torch.use_deterministic_algorithms(True), these will throw errors instead of silently giving different results. That’s the point — you want to know.
Operations that are non-deterministic on CUDA:
- torch.Tensor.scatter_add_() — used in GNN message passing and embedding lookups
- torch.Tensor.put_() with accumulate=True
- torch.bincount() on CUDA
- torch.nn.functional.interpolate() with certain modes
- Any operation that uses atomicAdd internally (reductions, index operations)
For DataLoader, multi-process data loading introduces randomness because worker processes have independent random states. Fix this with a worker_init_fn:
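A sketch of the pattern, following PyTorch's documented approach; the toy TensorDataset is only for illustration:

```python
import random

import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset


def seed_worker(worker_id: int) -> None:
    # Derive each worker's NumPy/Python seed from PyTorch's
    # per-worker initial seed, so it tracks the global seed.
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)


g = torch.Generator()
g.manual_seed(42)  # controls the shuffle order

dataset = TensorDataset(torch.arange(100, dtype=torch.float32))  # toy data
loader = DataLoader(
    dataset,
    batch_size=8,
    shuffle=True,
    num_workers=2,
    worker_init_fn=seed_worker,
    generator=g,
)
```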
The generator argument seeds the shuffling. The worker_init_fn seeds each worker’s NumPy and Python random state based on the PyTorch seed. Without both, your data order varies between runs even with a global seed set.
If you hit a non-deterministic operation you can’t avoid, you have two options. First, try running on CPU for that specific operation — CPU implementations are usually deterministic. Second, set the environment variable CUBLAS_WORKSPACE_CONFIG to force deterministic cuBLAS behavior:
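Two values are accepted; both give deterministic behavior, trading workspace memory for throughput:

```shell
# Larger workspace, typically faster:
export CUBLAS_WORKSPACE_CONFIG=:4096:8

# Or a smaller workspace with less memory overhead:
# export CUBLAS_WORKSPACE_CONFIG=:16:8
```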
This is required for deterministic behavior in some matrix multiplication and convolution operations on CUDA 10.2+.
Log and Track Seeds
Setting a seed is useless if you don’t record which seed you used. Build a simple tracker that logs the seed alongside your experiment metadata:
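A minimal tracker might look like this; the function name, file layout, and logged fields are illustrative choices, not a fixed API:

```python
import json
import platform
import subprocess
import time
from pathlib import Path

import torch


def log_experiment_metadata(seed: int, log_dir: str = "runs") -> Path:
    """Write the seed and environment metadata for one run to a JSON file."""
    try:
        commit = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True
        ).strip()
    except (subprocess.CalledProcessError, OSError):
        commit = "unknown"  # not in a git repo, or git not installed

    meta = {
        "seed": seed,
        "git_commit": commit,
        "python_version": platform.python_version(),
        "torch_version": torch.__version__,
        "cuda_available": torch.cuda.is_available(),
        "gpu": torch.cuda.get_device_name(0) if torch.cuda.is_available() else "cpu",
        "timestamp": time.strftime("%Y-%m-%dT%H-%M-%S"),
    }

    out_dir = Path(log_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_file = out_dir / f"run_{meta['timestamp']}_seed{seed}.json"
    out_file.write_text(json.dumps(meta, indent=2))
    return out_file
```

Call it right after seeding, before any training state is created.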
Call it at the start of every run, immediately after seeding, so the logged metadata matches the state the run actually used.
This gives you a JSON file for every experiment with the exact seed, commit hash, PyTorch version, and GPU info. When results don’t match, diff the log files. Nine times out of ten, the problem is a library version change or a different GPU architecture — not the seed.
Common Errors and Fixes
RuntimeError: Deterministic behavior was enabled with torch.use_deterministic_algorithms(True), but a non-deterministic operation was called.
This happens when you call an operation that has no deterministic CUDA kernel. The full error message tells you which operation caused it. Common culprits are scatter_add_, index_add_, and ctc_loss.
Fix: Set CUBLAS_WORKSPACE_CONFIG and check if a CPU fallback is feasible for that specific operation:
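A sketch of both fixes; the scatter_add_ tensors are illustrative, and the op is shown on CPU (in real code you would call .cpu() on the GPU tensors first and move the result back):

```python
import os

# Must be set before the first CUDA context is created,
# ideally at the very top of the script.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

import torch

torch.use_deterministic_algorithms(True)

# CPU fallback for an op without a deterministic CUDA kernel:
# scatter_add_ is deterministic on CPU.
index = torch.tensor([0, 1, 0])
src = torch.ones(3, 5)
out = torch.zeros(2, 5)
result = out.scatter_add_(0, index.unsqueeze(1).expand(3, 5), src)
```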
If you absolutely need the non-deterministic op on GPU and can accept it, use torch.use_deterministic_algorithms(True, warn_only=True) to get warnings instead of errors.
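The warn-only mode is a one-liner:

```python
import torch

# Non-deterministic ops now emit a UserWarning instead of raising.
torch.use_deterministic_algorithms(True, warn_only=True)
```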
Results differ between GPU architectures (e.g., A100 vs V100).
Floating-point operations aren't guaranteed to produce identical results across different GPU architectures. The hardware-level implementation of fused multiply-add instructions varies, and parallel reductions can sum in different orders. This isn't a bug; floating-point addition simply isn't associative.
Fix: Pin your GPU architecture in experiment logs and compare results only across the same hardware. If you need cross-architecture reproducibility, run on CPU.
DataLoader returns different batches despite setting a seed.
Usually caused by missing worker_init_fn or not passing a seeded generator to the DataLoader. Also check that you’re not calling seed_everything() after the DataLoader is constructed.
Fix: Use the worker_init_fn and generator pattern shown in the DataLoader section above. Always seed before creating any data loaders or models.
Results differ when changing num_workers in DataLoader.
Each worker processes a different subset of data and applies transforms independently. Changing the number of workers changes which worker handles which sample, altering the random augmentations applied.
Fix: Keep num_workers constant between comparison runs. Log it as part of your experiment config. If you’re using random augmentations, make sure each worker’s seed is derived deterministically from the global seed and worker ID, which the worker_init_fn pattern handles.
Related Guides
- How to Build Adversarial Robustness Testing for Vision Models
- How to Build Automated Fairness Testing for LLM-Generated Content
- How to Build Adversarial Test Suites for ML Models
- How to Build Hallucination Scoring and Grounding Verification for LLMs
- How to Build Automated Prompt Leakage Detection for LLM Apps
- How to Build Automated PII Redaction Testing for LLM Outputs
- How to Build Output Grounding and Fact-Checking for LLM Apps
- How to Build Model Cards and Document AI Systems Responsibly
- How to Build Automated Age and Content Gating for AI Applications
- How to Build Automated Jailbreak Detection for LLM Applications