The first request to your ML API takes 10x longer than every request after it. That’s the cold-start problem. Your model loads weights lazily, CUDA kernels compile on first use, and JIT optimizations haven’t kicked in yet. The fix is straightforward: warm up the model before you accept traffic. Here’s a minimal FastAPI setup that does exactly that.

from contextlib import asynccontextmanager
from fastapi import FastAPI
from sentence_transformers import SentenceTransformer
import logging
import time

logger = logging.getLogger("uvicorn.error")

models = {}

@asynccontextmanager
async def lifespan(app: FastAPI):
    start = time.perf_counter()
    logger.info("Loading model...")
    models["embedder"] = SentenceTransformer("all-MiniLM-L6-v2")

    logger.info("Running warm-up inference...")
    warmup_texts = ["warm up sentence one", "warm up sentence two"]
    models["embedder"].encode(warmup_texts)

    elapsed = time.perf_counter() - start
    logger.info(f"Model ready in {elapsed:.2f}s")
    models["ready"] = True
    yield
    models.clear()

app = FastAPI(lifespan=lifespan)

That lifespan context manager is the modern FastAPI pattern. Everything before yield runs at startup, everything after runs at shutdown. The old @app.on_event("startup") decorator is deprecated – don’t use it.

Why Cold Starts Hurt

When a model loads for the first time, several things happen behind the scenes:

  • Weight loading: The model reads hundreds of megabytes (or gigabytes) from disk into memory.
  • CUDA kernel compilation: PyTorch selects and compiles GPU kernels the first time it encounters a specific tensor shape. This can add 2-5 seconds to the first inference call.
  • JIT tracing: If you’re using TorchScript or torch.compile, the first pass traces the computation graph and optimizes it.
  • Tokenizer initialization: Tokenizers build their vocabulary lookup structures on first use.

A warm-up routine forces all of this to happen before any real user request hits the server. You run a dummy inference with representative input shapes, and by the time traffic arrives, every kernel is compiled and every cache is hot.

Building the Warm-Up Routine

A good warm-up routine does more than just call the model once. You want to exercise the code paths your production traffic will actually hit. If your API handles variable-length inputs, warm up with short and long texts. If you batch requests, warm up with your typical batch size.

import numpy as np
import time
import logging

logger = logging.getLogger("uvicorn.error")

def warmup_model(model, rounds: int = 3) -> float:
    """Run warm-up inference and return total elapsed time in seconds."""
    sample_inputs = [
        ["short query"],
        ["a medium length sentence that exercises the tokenizer more thoroughly"],
        [f"sentence number {i} in a batch" for i in range(16)],
    ]

    start = time.perf_counter()
    for round_num in range(rounds):
        for batch in sample_inputs:
            embeddings = model.encode(batch, show_progress_bar=False)
            assert embeddings.shape[0] == len(batch), "Output shape mismatch"
        logger.info(f"Warm-up round {round_num + 1}/{rounds} complete")

    elapsed = time.perf_counter() - start
    logger.info(f"Warm-up finished in {elapsed:.2f}s across {rounds} rounds")
    return elapsed

Three rounds is a good default. The first round triggers compilation, the second confirms caching works, and the third gives you a stable baseline latency. The assertions catch shape mismatches early – better to fail at startup than on a live request.
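If you want to see round one paying the compilation cost, a per-round timing variant makes the drop visible. This is a sketch: the FakeEmbedder stand-in is hypothetical, there only so the function can run without a real model; with a SentenceTransformer instance you would pass the model itself.

```python
import time
import logging

logger = logging.getLogger("uvicorn.error")

def warmup_with_timings(model, rounds: int = 3) -> list[float]:
    """Like warmup_model, but returns per-round latencies so you can
    verify that round 1 (compilation) is slower than later rounds."""
    sample_inputs = [
        ["short query"],
        [f"sentence number {i} in a batch" for i in range(16)],
    ]
    timings = []
    for round_num in range(rounds):
        start = time.perf_counter()
        for batch in sample_inputs:
            model.encode(batch, show_progress_bar=False)
        timings.append(time.perf_counter() - start)
        logger.info(f"Round {round_num + 1}: {timings[-1] * 1000:.1f}ms")
    return timings

# Hypothetical stand-in so the sketch runs without a real model.
class FakeEmbedder:
    def encode(self, batch, show_progress_bar=False):
        return [[0.0] * 384 for _ in batch]

timings = warmup_with_timings(FakeEmbedder())
```

Logging the per-round numbers also gives you a baseline to compare against future deploys.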

Health Check Endpoints

Two endpoints. Two different jobs.

/health is a liveness probe. It answers one question: is the process alive? If this fails, something is fundamentally broken and the container should be restarted.

/ready is a readiness probe. It answers: is the service ready to handle requests? During model loading and warm-up, liveness returns 200 but readiness returns 503. This tells your load balancer to hold off on routing traffic until the model is actually ready.

from fastapi import FastAPI, Response
from fastapi.responses import JSONResponse
from contextlib import asynccontextmanager
from sentence_transformers import SentenceTransformer
import time
import logging

logger = logging.getLogger("uvicorn.error")

models = {}
health_state = {"ready": False, "warmup_time": 0.0, "model_name": "all-MiniLM-L6-v2"}

def warmup_model(model, rounds: int = 3) -> float:
    sample_inputs = [
        ["short query"],
        ["a medium length sentence that exercises the tokenizer more thoroughly"],
        [f"sentence number {i} in a batch" for i in range(16)],
    ]
    start = time.perf_counter()
    for round_num in range(rounds):
        for batch in sample_inputs:
            embeddings = model.encode(batch, show_progress_bar=False)
            assert embeddings.shape[0] == len(batch)
        logger.info(f"Warm-up round {round_num + 1}/{rounds} complete")
    elapsed = time.perf_counter() - start
    return elapsed

@asynccontextmanager
async def lifespan(app: FastAPI):
    logger.info(f"Loading model: {health_state['model_name']}")
    models["embedder"] = SentenceTransformer(health_state["model_name"])

    logger.info("Starting warm-up...")
    health_state["warmup_time"] = warmup_model(models["embedder"])
    health_state["ready"] = True
    logger.info(f"Service ready (warm-up took {health_state['warmup_time']:.2f}s)")
    yield
    models.clear()
    health_state["ready"] = False

app = FastAPI(lifespan=lifespan)

@app.get("/health")
async def health():
    return {"status": "alive"}

@app.get("/ready")
async def ready(response: Response):
    if not health_state["ready"]:
        response.status_code = 503
        return {"status": "not_ready", "detail": "Model is still loading or warming up"}
    return {
        "status": "ready",
        "model": health_state["model_name"],
        "warmup_seconds": round(health_state["warmup_time"], 2),
    }

@app.post("/embed")
async def embed(texts: list[str]):
    embeddings = models["embedder"].encode(texts, show_progress_bar=False)
    return {"embeddings": embeddings.tolist()}

The /ready endpoint returns the warm-up time, which is useful for debugging. If warm-up suddenly takes 30 seconds instead of 5, you know something changed in the environment.
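Outside Kubernetes, the same readiness signal is easy to consume from a deploy script. Here is a small polling sketch: the probe is injected as a callable so the logic can be exercised without a live server, and the commented urllib probe at the bottom is a hypothetical example of wiring it to the /ready endpoint.

```python
import time

def wait_until_ready(probe, timeout: float = 120.0, interval: float = 2.0) -> bool:
    """Poll `probe` (a callable returning an HTTP status code) until it
    returns 200 or the timeout expires. Returns True once ready."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if probe() == 200:
                return True
        except OSError:
            pass  # server not up yet (or returned 503 via HTTPError); keep polling
        time.sleep(interval)
    return False

# Hypothetical probe against the /ready endpoint:
# import urllib.request
# def http_probe():
#     with urllib.request.urlopen("http://localhost:8000/ready") as resp:
#         return resp.status
# wait_until_ready(http_probe)
```

urllib raises HTTPError (an OSError subclass) on a 503, so the not-ready case falls into the same retry path as a connection refusal.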

Kubernetes Probe Configuration

Kubernetes uses liveness and readiness probes to manage pod lifecycle. Map them directly to the endpoints above.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: embedding-api
spec:
  replicas: 2
  selector:
    matchLabels:
      app: embedding-api
  template:
    metadata:
      labels:
        app: embedding-api
    spec:
      containers:
        - name: api
          image: embedding-api:latest
          ports:
            - containerPort: 8000
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 5
            periodSeconds: 10
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /ready
              port: 8000
            initialDelaySeconds: 10
            periodSeconds: 5
            failureThreshold: 12
          resources:
            requests:
              memory: "2Gi"
              cpu: "1"
            limits:
              memory: "4Gi"
              nvidia.com/gpu: "1"

Key settings: the readiness probe has a higher failureThreshold (12 retries at 5s intervals, a 60-second window on top of the 10s initial delay) because model loading takes time. The liveness probe is more aggressive – if the process is unresponsive for 30 seconds (3 failures at 10s intervals), kill it. Set initialDelaySeconds on the readiness probe high enough that the model has time to start loading. Too low, and Kubernetes will mark the pod unready and withhold traffic before it even has a chance.
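A quick sanity check before deploying: compute the worst-case readiness window implied by the probe settings (initial delay plus one period per allowed failure) and compare it against the warm-up time your startup logs report. A minimal sketch:

```python
def readiness_window_seconds(initial_delay: int, period: int, failure_threshold: int) -> int:
    """Worst-case time the probe settings allow before readiness is
    considered failing: initial delay plus one period per allowed failure."""
    return initial_delay + period * failure_threshold

# With the settings from the manifest above:
window = readiness_window_seconds(initial_delay=10, period=5, failure_threshold=12)
# window is 70 seconds; it must comfortably exceed model load + warm-up time.
```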

Common Errors and Fixes

503 Service Unavailable on every request after deployment

Your readiness probe is failing because the warm-up takes longer than initialDelaySeconds + (periodSeconds * failureThreshold). Increase failureThreshold or initialDelaySeconds in your Kubernetes config. Check your warm-up logs to see how long it actually takes.

CUDA out of memory during warm-up

Your warm-up batch size is too large for the GPU. Reduce the batch size in sample_inputs or set a smaller max_seq_length on the model. Warm-up inputs should match production sizes, not exceed them.
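One defensive pattern, sketched here as an assumption rather than anything SentenceTransformers provides: halve the warm-up batch until the encode call stops raising, so an over-sized warm-up batch degrades gracefully instead of crashing startup. `encode` is any callable taking a list of strings.

```python
def warmup_with_backoff(encode, texts, min_batch: int = 1) -> int:
    """Try encoding `texts`; on a memory error, halve the batch size
    and retry. Returns the batch size that eventually succeeded."""
    batch = len(texts)
    while batch >= min_batch:
        try:
            for i in range(0, len(texts), batch):
                encode(texts[i:i + batch])
            return batch
        except RuntimeError as e:
            # PyTorch surfaces CUDA OOM as a RuntimeError mentioning "out of memory".
            if "out of memory" not in str(e).lower():
                raise
            batch //= 2
    raise RuntimeError("Warm-up failed even at the minimum batch size")
```

With a real GPU model you would also free the cached allocator memory (torch.cuda.empty_cache()) before retrying, otherwise the fragmented allocation from the failed attempt can trigger the same OOM again.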

Model loads twice (doubling startup time)

This usually involves the reloader: starting the server with uvicorn.run("main:app", reload=True) from inside main.py imports the module in both the reloader parent and the worker process, so anything loaded at module level loads twice. Keep model loading inside the lifespan (which runs only in the worker), don't pass --reload in production (it's off by default), or gate the warm-up behind an environment variable:

import os

SKIP_WARMUP = os.getenv("SKIP_WARMUP", "false").lower() == "true"

@asynccontextmanager
async def lifespan(app: FastAPI):
    models["embedder"] = SentenceTransformer("all-MiniLM-L6-v2")
    if not SKIP_WARMUP:
        health_state["warmup_time"] = warmup_model(models["embedder"])
    health_state["ready"] = True
    yield
    models.clear()

Health check returns ready but first real request is still slow

Your warm-up inputs don’t match production input shapes. If production sends 512-token sequences but you warmed up with 5-token strings, CUDA kernels for the longer sequence lengths haven’t been compiled yet. Make your warm-up inputs representative of real traffic.
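One way to keep warm-up inputs honest is to derive them from observed production lengths. This sketch builds filler texts at chosen word counts; the percentile figures are placeholders, so substitute lengths measured from your own request logs.

```python
def build_warmup_inputs(word_counts, batch_size: int = 16) -> list[list[str]]:
    """Generate warm-up batches whose lengths mirror production traffic.
    `word_counts` are target lengths in words (a rough proxy for tokens)."""
    batches = []
    for n in word_counts:
        text = " ".join(f"word{i}" for i in range(n))
        batches.append([text] * batch_size)
    return batches

# Placeholder percentiles (e.g. p50, p90, p99 of logged request lengths).
batches = build_warmup_inputs(word_counts=[8, 64, 400])
```

Feeding these through the model during warm-up ensures the kernels for your longest common sequence lengths are compiled before traffic arrives.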

TypeError: 'NoneType' object is not callable on the model

The model failed to load but you didn’t catch the exception, so models["embedder"] is never set. Always wrap model loading in a try/except and log the error clearly:

@asynccontextmanager
async def lifespan(app: FastAPI):
    try:
        models["embedder"] = SentenceTransformer("all-MiniLM-L6-v2")
        health_state["warmup_time"] = warmup_model(models["embedder"])
        health_state["ready"] = True
    except Exception as e:
        logger.error(f"Failed to load model: {e}")
        health_state["ready"] = False
        health_state["error"] = str(e)
    yield
    models.clear()

This way, /ready returns 503 with a useful error message instead of the whole service crashing on the first request.
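The same response logic, factored into a plain function, is easy to unit-test without spinning up FastAPI. A sketch, using the same field names as the health_state dict above:

```python
def readiness_response(state: dict) -> tuple[int, dict]:
    """Return (status_code, payload) for the /ready endpoint."""
    if not state.get("ready"):
        detail = state.get("error", "Model is still loading or warming up")
        return 503, {"status": "not_ready", "detail": detail}
    return 200, {
        "status": "ready",
        "model": state.get("model_name"),
        "warmup_seconds": round(state.get("warmup_time", 0.0), 2),
    }
```

The endpoint then reduces to unpacking the tuple into the Response status code and the JSON body, and your test suite can cover the loading, failed, and ready states with three dict literals.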