Your ML model works great locally. Then you build a Docker image and it’s 12GB, takes 20 minutes to pull, and breaks in production because CUDA versions don’t match. Here’s how to fix all of that.
The Core Pattern: Multi-Stage Builds
The single biggest win is separating your build environment from your runtime environment. Your build stage needs compilers, headers, and dev tools. Your runtime stage needs almost none of that.
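A minimal sketch of the pattern for a CPU image (the `serve.py` and `requirements.txt` file names match the examples in this guide; adapt them to your project):

```dockerfile
# ---- Build stage: compilers and headers live here ----
FROM python:3.11-slim AS builder

RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential \
    && rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
# Install everything into /install instead of the system site-packages
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt

# ---- Runtime stage: no compilers, no pip cache ----
FROM python:3.11-slim

# Copy only the installed packages, nothing else from the build stage
COPY --from=builder /install /usr/local

WORKDIR /app
COPY serve.py .

EXPOSE 8000
CMD ["python", "serve.py"]
```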
That --prefix=/install trick is key. It installs all Python packages into a separate directory tree, which you then copy cleanly into the runtime stage. No pip cache, no wheel files, no build artifacts bleeding through.
Here’s a minimal serve.py that the Dockerfile expects:
And the requirements.txt:
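An illustrative `requirements.txt` to go with it (the pins are examples; use the versions you have actually tested):

```text
fastapi==0.115.0
uvicorn==0.30.6
torch==2.4.0
```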
Choosing the Right Base Image
This decision matters more than most people think. Here’s the real breakdown:
| Base Image | Size | Use Case | Gotchas |
|---|---|---|---|
| `python:3.11-slim` | ~150MB | CPU inference | Missing some C libs; add them explicitly |
| `python:3.11-alpine` | ~50MB | Tiny services | musl libc breaks NumPy, PyTorch, most ML libs. Avoid. |
| `nvidia/cuda:12.4.0-runtime-ubuntu22.04` | ~3.5GB | GPU inference | Use `runtime`, never `devel`, for serving |
| `nvidia/cuda:12.4.0-base-ubuntu22.04` | ~230MB | GPU with custom CUDA needs | Minimal CUDA; add libraries yourself |
Strong opinion: don’t use Alpine for ML workloads. The musl libc incompatibility with scientific Python packages will cost you hours of debugging segfaults. python:3.11-slim is the right default for CPU. For GPU, start with nvidia/cuda:*-runtime-* and install Python on top.
Caching pip Layers Properly
Docker layer caching is your best friend, but you have to structure your Dockerfile to actually use it. The requirements file must be copied and installed before your application code:
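In Dockerfile terms, the ordering looks like this (a sketch, assuming your code lives under `/app`):

```dockerfile
WORKDIR /app

# Copy ONLY the requirements file first -- this layer stays cached until
# requirements.txt itself changes
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Application code changes constantly; keep it in a later layer so code
# edits never invalidate the pip install layer above
COPY . /app
```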
For even faster builds, use BuildKit’s cache mounts:
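A sketch using a BuildKit cache mount (requires BuildKit; note that `--no-cache-dir` would defeat the mount, so leave it off here):

```dockerfile
# syntax=docker/dockerfile:1
FROM python:3.11-slim

WORKDIR /app
COPY requirements.txt .

# The cache mount persists pip's download cache across builds on this host,
# without the cache ever ending up inside the image
RUN --mount=type=cache,target=/root/.cache/pip \
    pip install -r requirements.txt
```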
This persists the pip cache across builds on the same machine. Rebuilds that only change a few packages drop from minutes to seconds.
GPU Image: Full Dockerfile
Here’s a production-ready GPU serving image:
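A sketch of one way to structure it (the `appuser` account name is illustrative; pip's `--target` plus `PYTHONPATH` sidesteps Debian's nonstandard `--prefix` install paths):

```dockerfile
# ---- Build stage: full CUDA toolchain for compiling any C extensions ----
FROM nvidia/cuda:12.4.0-devel-ubuntu22.04 AS builder

RUN apt-get update && apt-get install -y --no-install-recommends \
    python3 python3-pip && rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
# Match the PyTorch wheel index to the CUDA version of the base image
RUN python3 -m pip install --no-cache-dir --target=/opt/pkgs \
    --extra-index-url https://download.pytorch.org/whl/cu124 \
    -r requirements.txt

# ---- Runtime stage: CUDA runtime libraries only, no toolchain ----
FROM nvidia/cuda:12.4.0-runtime-ubuntu22.04

RUN apt-get update && apt-get install -y --no-install-recommends \
    python3 \
    && rm -rf /var/lib/apt/lists/* \
    && useradd --create-home appuser

COPY --from=builder /opt/pkgs /opt/pkgs
ENV PYTHONPATH=/opt/pkgs \
    NVIDIA_VISIBLE_DEVICES=all \
    NVIDIA_DRIVER_CAPABILITIES=compute,utility

WORKDIR /app
COPY --chown=appuser:appuser serve.py .

USER appuser
EXPOSE 8000
CMD ["python3", "serve.py"]
```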
Run it with nvidia-container-toolkit:
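Something like the following, assuming the image was tagged `ml-serve:gpu`:

```bash
docker run --rm --gpus all -p 8000:8000 ml-serve:gpu
```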
Make sure you’ve installed nvidia-container-toolkit on the host:
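On an Ubuntu/Debian host, the install follows NVIDIA's documented steps (check NVIDIA's install guide for the current repository URLs before copying these):

```bash
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit

# Register the NVIDIA runtime with Docker and restart the daemon
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```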
Model Weights: Bake In vs. Download at Startup
Two schools of thought here, and the right answer depends on your deployment:
Bake weights into the image when:
- Model is under 2GB
- You want immutable, reproducible deployments
- Cold start time matters (the image is already pulled and cached)
Download at startup when:
- Model is large (5GB+) and you use a shared model store like S3 or GCS
- Multiple services share the same weights
- You update weights more frequently than code
For baked-in weights, use a dedicated stage to avoid cache invalidation:
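One way to sketch this, using `huggingface_hub` and an illustrative model ID (substitute your own model source):

```dockerfile
# ---- Model stage: cached independently of code changes ----
FROM python:3.11-slim AS model
RUN pip install --no-cache-dir huggingface_hub && \
    python -c "from huggingface_hub import snapshot_download; \
snapshot_download('distilbert-base-uncased', local_dir='/model')"

# ---- Runtime stage ----
FROM python:3.11-slim
WORKDIR /app
COPY --from=model /model /app/model
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY serve.py .
CMD ["python", "serve.py"]
```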
This way, the model download is cached in its own layer. Code changes don’t re-download the model.
Concrete Size Comparisons
Here’s what these techniques actually save. These are real numbers from a FastAPI + PyTorch text classification server:
| Approach | Image Size | Cold Start |
|---|---|---|
| Naive `FROM python:3.11`, pip install everything | 8.2GB | ~45s |
| Multi-stage with `python:3.11-slim` | 2.1GB | ~15s |
| Multi-stage + explicit dependency pruning | 1.4GB | ~10s |
| GPU: naive `FROM nvidia/cuda:*-devel-*` | 13.6GB | ~90s |
| GPU: multi-stage with `runtime` base | 5.8GB | ~30s |
The devel vs runtime CUDA base image swap alone saves you nearly 8GB. That’s the compiler toolchain, headers, and static libraries you absolutely don’t need at inference time.
Security Hardening
A few non-negotiable items for production ML containers:
- Run as non-root. The `useradd` + `USER` directive shown above is mandatory; ML containers running as root with GPU access are a security nightmare.
- Use `.dockerignore` to keep training data, notebooks, and credentials out of the image.
- Pin your base image digests in production, not just tags. Tags are mutable.
- Scan your images with `docker scout` or `trivy` before deploying.
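A starting `.dockerignore` might look like this (entries are illustrative; add whatever else lives in your repo but not in the image):

```text
.git
.env
*.ipynb
notebooks/
data/
checkpoints/
__pycache__/
*.pyc
```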
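For scanning and digest pinning, commands along these lines (image names are illustrative; pick whichever scanner you already use):

```bash
# Scan the built image for known CVEs
trivy image ml-serve:latest
docker scout cves ml-serve:latest

# Resolve a tag to the immutable digest you should pin in FROM
docker buildx imagetools inspect python:3.11-slim
```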
Common Errors and Fixes
“ImportError: libcudnn.so.8: cannot open shared object file”
You’re using the base CUDA image instead of runtime. Switch to nvidia/cuda:12.4.0-runtime-ubuntu22.04 or explicitly install libcudnn8.
“CUDA error: no kernel image is available for execution on the device”
Your PyTorch was compiled for a different CUDA compute capability. Match the CUDA version in your base image to the PyTorch index URL. For CUDA 12.4, use --extra-index-url https://download.pytorch.org/whl/cu124.
Image builds are slow even with caching
Check your COPY order. If COPY . /app comes before pip install, every code change invalidates the pip layer. Always copy requirements.txt first, install, then copy the rest.
“OOMKilled” during builds on CI
Large packages like PyTorch can exhaust memory during install, either while compiling C extensions or while pip handles multi-gigabyte wheels. Raise the limit with `--memory=8g` on your Docker build, or use `pip install --no-cache-dir`, which cuts pip's memory and disk footprint for large wheels.
Container starts but GPU isn’t detected
Three things to check: (1) the --gpus all flag on docker run, (2) nvidia-container-toolkit installed on the host, (3) NVIDIA_VISIBLE_DEVICES=all in the container environment. The nvidia/cuda base images set that variable for you; on any other base you must set it yourself. Miss any one of the three and the GPU stays invisible.
Image works locally but fails in Kubernetes
K8s doesn’t use --gpus all. You need the NVIDIA device plugin deployed to your cluster and a resources.limits block requesting nvidia.com/gpu: 1 in your pod spec.
Related Guides
- How to Build a Model Serving Cluster with Ray Serve and Docker
- How to Serve ML Models with NVIDIA Triton Inference Server
- How to Build a Model Serving Gateway with Envoy and gRPC
- How to Build a Model Serving Autoscaler with Custom Metrics and Kubernetes
- How to Build a Model Serving Cost Dashboard with Prometheus and Grafana
- How to Optimize Model Inference with ONNX Runtime
- How to Build a Model Serving Pipeline with Docker Compose and Traefik
- How to Optimize LLM Serving with KV Cache and PagedAttention
- How to Build a Model Inference Cost Tracking Pipeline with OpenTelemetry
- How to Build a Model Training Pipeline with Lightning Fabric