Your ML model works great locally. Then you build a Docker image and it’s 12GB, takes 20 minutes to pull, and breaks in production because CUDA versions don’t match. Here’s how to fix all of that.
The Core Pattern: Multi-Stage Builds
The single biggest win is separating your build environment from your runtime environment. Your build stage needs compilers, headers, and dev tools. Your runtime stage needs almost none of that.
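A minimal sketch of the pattern for a CPU image (the `serve.py` and `requirements.txt` file names match the examples in this guide; adapt them to your project):

```dockerfile
# ---- Build stage: compilers and headers live here ----
FROM python:3.11-slim AS builder

RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential \
    && rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
# Install everything into /install instead of the system site-packages
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt

# ---- Runtime stage: no compilers, no pip cache ----
FROM python:3.11-slim

# Copy only the installed packages, nothing else from the build stage
COPY --from=builder /install /usr/local

WORKDIR /app
COPY serve.py .

EXPOSE 8000
CMD ["python", "serve.py"]
```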
That --prefix=/install trick is key. It installs all Python packages into a separate directory tree, which you then copy cleanly into the runtime stage. No pip cache, no wheel files, no build artifacts bleeding through.
Here’s a minimal serve.py that the Dockerfile expects:
And the requirements.txt:
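An illustrative `requirements.txt` to go with it (the pins are examples; use the versions you have actually tested):

```text
fastapi==0.115.0
uvicorn==0.30.6
torch==2.4.0
```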
Choosing the Right Base Image
This decision matters more than most people think. Here’s the real breakdown:
| Base Image | Size | Use Case | Gotchas |
|---|---|---|---|
| `python:3.11-slim` | ~150MB | CPU inference | Missing some C libs; add them explicitly |
| `python:3.11-alpine` | ~50MB | Tiny services | musl libc breaks NumPy, PyTorch, most ML libs. Avoid. |
| `nvidia/cuda:12.4.0-runtime-ubuntu22.04` | ~3.5GB | GPU inference | Use `runtime`, never `devel`, for serving |
| `nvidia/cuda:12.4.0-base-ubuntu22.04` | ~230MB | GPU with custom CUDA needs | Minimal CUDA; add libraries yourself |
Strong opinion: don’t use Alpine for ML workloads. The musl libc incompatibility with scientific Python packages will cost you hours of debugging segfaults. python:3.11-slim is the right default for CPU. For GPU, start with nvidia/cuda:*-runtime-* and install Python on top.
Caching pip Layers Properly
Docker layer caching is your best friend, but you have to structure your Dockerfile to actually use it. The requirements file must be copied and installed before your application code:
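In Dockerfile terms, the ordering looks like this (a sketch, assuming your code lives under `/app`):

```dockerfile
WORKDIR /app

# Copy ONLY the requirements file first -- this layer stays cached until
# requirements.txt itself changes
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Application code changes constantly; keep it in a later layer so code
# edits never invalidate the pip install layer above
COPY . /app
```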
For even faster builds, use BuildKit’s cache mounts:
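A sketch using a BuildKit cache mount (requires BuildKit; note that `--no-cache-dir` would defeat the mount, so leave it off here):

```dockerfile
# syntax=docker/dockerfile:1
FROM python:3.11-slim

WORKDIR /app
COPY requirements.txt .

# The cache mount persists pip's download cache across builds on this host,
# without the cache ever ending up inside the image
RUN --mount=type=cache,target=/root/.cache/pip \
    pip install -r requirements.txt
```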
This persists the pip cache across builds on the same machine. Rebuilds that only change a few packages drop from minutes to seconds.
GPU Image: Full Dockerfile
Here’s a production-ready GPU serving image:
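A sketch of one way to structure it (the `appuser` account name is illustrative; pip's `--target` plus `PYTHONPATH` sidesteps Debian's nonstandard `--prefix` install paths):

```dockerfile
# ---- Build stage: full CUDA toolchain for compiling any C extensions ----
FROM nvidia/cuda:12.4.0-devel-ubuntu22.04 AS builder

RUN apt-get update && apt-get install -y --no-install-recommends \
    python3 python3-pip && rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
# Match the PyTorch wheel index to the CUDA version of the base image
RUN python3 -m pip install --no-cache-dir --target=/opt/pkgs \
    --extra-index-url https://download.pytorch.org/whl/cu124 \
    -r requirements.txt

# ---- Runtime stage: CUDA runtime libraries only, no toolchain ----
FROM nvidia/cuda:12.4.0-runtime-ubuntu22.04

RUN apt-get update && apt-get install -y --no-install-recommends \
    python3 \
    && rm -rf /var/lib/apt/lists/* \
    && useradd --create-home appuser

COPY --from=builder /opt/pkgs /opt/pkgs
ENV PYTHONPATH=/opt/pkgs \
    NVIDIA_VISIBLE_DEVICES=all \
    NVIDIA_DRIVER_CAPABILITIES=compute,utility

WORKDIR /app
COPY --chown=appuser:appuser serve.py .

USER appuser
EXPOSE 8000
CMD ["python3", "serve.py"]
```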
Run it with nvidia-container-toolkit:
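Something like the following, assuming the image was tagged `ml-serve:gpu`:

```bash
docker run --rm --gpus all -p 8000:8000 ml-serve:gpu
```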
Make sure you’ve installed nvidia-container-toolkit on the host:
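On an Ubuntu/Debian host, the install follows NVIDIA's documented steps (check NVIDIA's install guide for the current repository URLs before copying these):

```bash
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit

# Register the NVIDIA runtime with Docker and restart the daemon
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```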
Model Weights: Bake In vs. Download at Startup
Two schools of thought here, and the right answer depends on your deployment:
Bake weights into the image when:
- Model is under 2GB
- You want immutable, reproducible deployments
- Cold start time matters (the image is already pulled and cached)
Download at startup when:
- Model is large (5GB+) and you use a shared model store like S3 or GCS
- Multiple services share the same weights
- You update weights more frequently than code
For baked-in weights, use a dedicated stage to avoid cache invalidation:
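One way to sketch this, using `huggingface_hub` and an illustrative model ID (substitute your own model source):

```dockerfile
# ---- Model stage: cached independently of code changes ----
FROM python:3.11-slim AS model
RUN pip install --no-cache-dir huggingface_hub && \
    python -c "from huggingface_hub import snapshot_download; \
snapshot_download('distilbert-base-uncased', local_dir='/model')"

# ---- Runtime stage ----
FROM python:3.11-slim
WORKDIR /app
COPY --from=model /model /app/model
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY serve.py .
CMD ["python", "serve.py"]
```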
This way, the model download is cached in its own layer. Code changes don’t re-download the model.
Concrete Size Comparisons
Here’s what these techniques actually save. These are real numbers from a FastAPI + PyTorch text classification server:
| Approach | Image Size | Cold Start |
|---|---|---|
| Naive `FROM python:3.11`, pip install everything | 8.2GB | ~45s |
| Multi-stage with `python:3.11-slim` | 2.1GB | ~15s |
| Multi-stage + explicit dependency pruning | 1.4GB | ~10s |
| GPU: naive `FROM nvidia/cuda:*-devel-*` | 13.6GB | ~90s |
| GPU: multi-stage with `runtime` base | 5.8GB | ~30s |
The devel vs runtime CUDA base image swap alone saves you nearly 8GB. That’s the compiler toolchain, headers, and static libraries you absolutely don’t need at inference time.
Security Hardening
A few non-negotiable items for production ML containers:
- Run as non-root. The `useradd` + `USER` directive shown above is mandatory; ML containers running as root with GPU access are a security nightmare.
- Use `.dockerignore` to keep training data, notebooks, and credentials out of the image.
- Pin your base image digests in production, not just tags. Tags are mutable.
- Scan your images with `docker scout` or `trivy` before deploying.
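A starting `.dockerignore` might look like this (entries are illustrative; add whatever else lives in your repo but not in the image):

```text
.git
.env
*.ipynb
notebooks/
data/
checkpoints/
__pycache__/
*.pyc
```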
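For scanning and digest pinning, commands along these lines (image names are illustrative; pick whichever scanner you already use):

```bash
# Scan the built image for known CVEs
trivy image ml-serve:latest
docker scout cves ml-serve:latest

# Resolve a tag to the immutable digest you should pin in FROM
docker buildx imagetools inspect python:3.11-slim
```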
Common Errors and Fixes
“ImportError: libcudnn.so.8: cannot open shared object file”
You’re using the base CUDA image instead of runtime. Switch to nvidia/cuda:12.4.0-runtime-ubuntu22.04 or explicitly install libcudnn8.
“CUDA error: no kernel image is available for execution on the device”
Your PyTorch was compiled for a different CUDA compute capability. Match the CUDA version in your base image to the PyTorch index URL. For CUDA 12.4, use --extra-index-url https://download.pytorch.org/whl/cu124.
Image builds are slow even with caching
Check your COPY order. If COPY . /app comes before pip install, every code change invalidates the pip layer. Always copy requirements.txt first, install, then copy the rest.
“OOMKilled” during builds on CI
Large packages like PyTorch can exhaust memory during install, either while compiling C extensions or while pip handles multi-gigabyte wheels. Raise the limit with `--memory=8g` on your Docker build, or use `pip install --no-cache-dir`, which cuts pip's memory and disk footprint for large wheels.
Container starts but GPU isn’t detected
Three things to check: (1) the --gpus all flag on docker run, (2) nvidia-container-toolkit installed on the host, (3) NVIDIA_VISIBLE_DEVICES=all in the container environment. The nvidia/cuda base images set that variable for you; on any other base you must set it yourself. Miss any one of the three and the GPU stays invisible.
Image works locally but fails in Kubernetes
K8s doesn’t use --gpus all. You need the NVIDIA device plugin deployed to your cluster and a resources.limits block requesting nvidia.com/gpu: 1 in your pod spec.
Related Guides
- How to Build a Model Serving Cluster with Ray Serve and Docker
- How to Serve ML Models with NVIDIA Triton Inference Server
- How to Build a Model Serving Gateway with Envoy and gRPC
- How to Build a Model Serving Autoscaler with Custom Metrics and Kubernetes
- How to Build a Model Serving Cost Dashboard with Prometheus and Grafana
- How to Optimize Model Inference with ONNX Runtime
- How to Build a Model Serving Pipeline with Docker Compose and Traefik
- How to Optimize LLM Serving with KV Cache and PagedAttention
- How to Build a Model Inference Cost Tracking Pipeline with OpenTelemetry
- How to Build a Model Training Pipeline with Lightning Fabric