Cold starts are the single worst thing about ML inference in production. A 3GB model loading from disk takes 30-60 seconds. Your user is staring at a spinner. Your SLA is toast. The fix is a warm pool: a set of ECS containers that have models already loaded in memory, sitting idle and ready to serve the moment a request arrives. You pay for idle compute, but you get sub-100ms response times from the first request. That tradeoff is worth it for any latency-sensitive workload.

The idea is simple. Bake the model into your Docker image, load it into memory at container startup, and keep more containers running than you strictly need. ECS handles the orchestration. You handle the architecture.

The Preloaded Container Pattern

The key insight: load your model during container startup, not at request time. FastAPI’s lifespan context manager is the right place for this. The model lives in application state for the entire lifetime of the process.

Here’s a Dockerfile that bakes a sentence-transformers model directly into the image:

FROM python:3.11-slim

WORKDIR /app

RUN pip install --no-cache-dir fastapi uvicorn sentence-transformers

# Download the model at build time so it's baked into the image layer
RUN python -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('all-MiniLM-L6-v2')"

COPY app.py .

EXPOSE 8000

HEALTHCHECK --interval=10s --timeout=5s --start-period=60s --retries=3 \
    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')"

CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]

Notice --start-period=60s on the health check. Failed checks during that window don't count against the container, which gives the model time to load into memory before health status matters. Without it, ECS marks the container unhealthy mid-load and replaces it before the model ever finishes loading.

Now the FastAPI app that loads the model at startup using lifespan:

from contextlib import asynccontextmanager
from fastapi import FastAPI, Response
from pydantic import BaseModel
from sentence_transformers import SentenceTransformer

model_state = {"model": None, "ready": False}


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup: load model into memory
    model_state["model"] = SentenceTransformer("all-MiniLM-L6-v2")
    model_state["ready"] = True
    print("Model loaded and ready to serve")
    yield
    # Shutdown: cleanup
    model_state["model"] = None
    model_state["ready"] = False


app = FastAPI(lifespan=lifespan)


class EmbeddingRequest(BaseModel):
    text: str


@app.get("/health")
async def health():
    return {"status": "alive"}


@app.get("/ready")
async def ready():
    if model_state["ready"]:
        return {"status": "ready"}
    return Response(status_code=503, content='{"status": "loading"}', media_type="application/json")


@app.post("/embed")
async def embed(request: EmbeddingRequest):
    if not model_state["ready"]:
        return Response(status_code=503, content='{"error": "model not loaded"}', media_type="application/json")
    embedding = model_state["model"].encode(request.text).tolist()
    return {"embedding": embedding}

The model is already on disk inside the image (downloaded during docker build). At startup, SentenceTransformer("all-MiniLM-L6-v2") loads from the local cache into RAM in a few seconds instead of downloading from Hugging Face Hub. That’s the preloaded pattern in action.

ECS Service Configuration

The warm pool strategy is straightforward: run more containers than your current traffic demands. If you need 2 containers to handle peak load, run 4. The extra 2 sit idle with models in memory, ready to absorb traffic spikes instantly.

Here’s the boto3 code to set up the ECS service with a warm pool:

import boto3

ecs = boto3.client("ecs", region_name="us-east-1")
autoscaling = boto3.client("application-autoscaling", region_name="us-east-1")

CLUSTER = "ml-inference"
SERVICE = "embedding-warm-pool"

# Register a task definition with enough memory for the model
task_response = ecs.register_task_definition(
    family="embedding-service",
    networkMode="awsvpc",
    requiresCompatibilities=["FARGATE"],
    cpu="1024",
    memory="4096",  # 4GB - enough for all-MiniLM-L6-v2 plus headroom
    executionRoleArn="arn:aws:iam::123456789012:role/ecsTaskExecutionRole",  # Fargate needs this to pull from ECR and write awslogs
    containerDefinitions=[
        {
            "name": "embedding",
            "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/embedding-service:latest",
            "portMappings": [{"containerPort": 8000, "protocol": "tcp"}],
            "healthCheck": {
                "command": ["CMD-SHELL", "python -c \"import urllib.request; urllib.request.urlopen('http://localhost:8000/ready')\" || exit 1"],
                "interval": 15,
                "timeout": 5,
                "retries": 3,
                "startPeriod": 90,
            },
            "logConfiguration": {
                "logDriver": "awslogs",
                "options": {
                    "awslogs-group": "/ecs/embedding-service",
                    "awslogs-region": "us-east-1",
                    "awslogs-stream-prefix": "ecs",
                },
            },
        }
    ],
)

task_def_arn = task_response["taskDefinition"]["taskDefinitionArn"]

# Create the service with desired count higher than minimum
# This is the warm pool - extra containers stay idle and ready
ecs.create_service(
    cluster=CLUSTER,
    serviceName=SERVICE,
    taskDefinition=task_def_arn,
    desiredCount=4,  # Warm pool: 4 containers even if 2 handle current load
    launchType="FARGATE",
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-abc123", "subnet-def456"],
            "securityGroups": ["sg-warm-pool"],
            "assignPublicIp": "DISABLED",
        }
    },
    deploymentConfiguration={
        "maximumPercent": 200,
        "minimumHealthyPercent": 100,  # Never drop below current count during deploys
    },
    enableExecuteCommand=True,
    tags=[{"key": "purpose", "value": "warm-pool"}],
)

# Set up auto-scaling that maintains a minimum warm buffer
autoscaling.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId=f"service/{CLUSTER}/{SERVICE}",
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=4,  # Never go below 4 - this IS the warm pool floor
    MaxCapacity=12,
)

# Scale up aggressively on CPU, but the floor stays at 4
autoscaling.put_scaling_policy(
    PolicyName="embedding-cpu-scaling",
    ServiceNamespace="ecs",
    ResourceId=f"service/{CLUSTER}/{SERVICE}",
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 60.0,  # Scale up when average CPU hits 60%
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
        },
        "ScaleInCooldown": 300,   # 5 min cooldown before scaling in
        "ScaleOutCooldown": 60,   # Scale out fast
    },
)

The MinCapacity=4 is critical. Auto-scaling might want to scale down to 1 container during low traffic, which defeats the whole point. Set the floor to whatever your warm pool size should be.

Health Check and Readiness Probes

There’s an important distinction between “alive” and “ready.” A container can be alive (process running, accepting TCP connections) but not ready (model still loading). ECS needs to know the difference.

The /health endpoint returns 200 as soon as the server starts. The /ready endpoint returns 200 only after the model finishes loading. Use /ready in your ECS health check and ALB target group health check so traffic only routes to containers with loaded models.

In your ECS task definition, the health check hits /ready:

{
  "healthCheck": {
    "command": ["CMD-SHELL", "python -c \"import urllib.request; urllib.request.urlopen('http://localhost:8000/ready')\" || exit 1"],
    "interval": 15,
    "timeout": 5,
    "retries": 3,
    "startPeriod": 90
  }
}

The startPeriod of 90 seconds is your grace window. During this time, failed health checks don’t count. For larger models (1GB+), bump this to 120 or even 180 seconds. If ECS marks a container as unhealthy during model load, it kills and replaces it, creating an infinite restart loop. You’ll see tasks cycling in the ECS console with no clear error – check the startPeriod first.

For the ALB target group, configure the health check path to /ready with a matcher of 200 and a healthy threshold of 2 consecutive successes. This ensures the load balancer only sends traffic to containers that have confirmed the model is in memory and serving.
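That target group setup can also be scripted. A minimal boto3 sketch, where the target group name and VPC id are placeholders rather than values from this deployment:

```python
# Health check settings that mirror the container's /ready gate: traffic only
# flows after two consecutive 200s from a container with the model in memory.
READY_CHECK = {
    "HealthCheckProtocol": "HTTP",
    "HealthCheckPath": "/ready",
    "HealthCheckIntervalSeconds": 15,
    "HealthCheckTimeoutSeconds": 5,
    "HealthyThresholdCount": 2,   # two consecutive successes before routing
    "UnhealthyThresholdCount": 3,
    "Matcher": {"HttpCode": "200"},
}


def create_ready_target_group(vpc_id: str):
    import boto3  # imported here so the sketch loads even without boto3 installed

    elbv2 = boto3.client("elbv2", region_name="us-east-1")
    return elbv2.create_target_group(
        Name="embedding-warm-pool-tg",  # placeholder name
        Protocol="HTTP",
        Port=8000,
        VpcId=vpc_id,
        TargetType="ip",  # awsvpc tasks register by private IP, not instance id
        **READY_CHECK,
    )
```

TargetType must be "ip" because Fargate tasks in awsvpc mode register their task ENI address with the load balancer.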

Common Errors and Fixes

OOM kills during model load. You'll see tasks stop with exit code 137 and an OutOfMemoryError reason in the ECS console (a CannotPullContainerError is a different failure: the image pull itself failed). The model plus the Python runtime plus FastAPI overhead adds up fast: a 1GB model file can expand to 2-3GB in memory. Set your ECS task memory to at least 3x the model's disk size plus runtime headroom. For the all-MiniLM-L6-v2 example above, the model is about 90MB on disk, but 4096MB of task memory gives comfortable headroom for the runtime, tokenizer, and inference buffers.
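That sizing rule can be folded into a small helper. This is a rule-of-thumb sketch, not an AWS formula: it assumes ~3x in-memory expansion, 1GB of runtime overhead, and rounds up to Fargate's 1024MB memory steps with a 2048MB floor:

```python
def fargate_task_memory_mb(model_disk_mb: float, overhead_mb: int = 1024) -> int:
    """Estimate ECS task memory for a preloaded model.

    Heuristic assumptions: the model expands ~3x from disk to RAM, and the
    Python runtime, tokenizer, and inference buffers need about 1GB on top.
    Rounds up to a 1024MB boundary and never returns less than 2048MB.
    """
    needed = int(model_disk_mb * 3) + overhead_mb
    # Ceiling-divide to the next 1024MB step
    return max(2048, -(-needed // 1024) * 1024)
```

For the 90MB all-MiniLM-L6-v2 this floors out at 2048MB; a 1GB model lands at 4096MB, matching the task definition above; a 3GB model needs 10240MB.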

Image too large to pull. Baking models into Docker images makes them big: a 3GB model means a 4GB+ image, and the ECR pull alone can take 2-3 minutes on Fargate. Use multi-stage builds to strip build tools from the final image. If the image grows past 5GB, consider an init container that pulls the model from S3 at startup instead. But for models under 2GB, baking them in is the simplest approach and avoids S3 download failures at runtime.

Health check timeout during model load. If startPeriod is too short, ECS will restart your container before the model finishes loading. Symptoms: tasks cycling between RUNNING and STOPPED repeatedly. Fix: increase startPeriod to at least 2x your observed model load time. Check CloudWatch logs for the “Model loaded and ready to serve” message to measure actual load time.
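To measure actual load time directly, you can wrap the loader in a timer inside the lifespan startup block. A minimal sketch using a hypothetical `timed_load` helper (not part of the app above):

```python
import time


def timed_load(loader):
    """Run a model-loading callable and report wall-clock load time.

    Wrap the SentenceTransformer(...) call in the lifespan startup block
    with this, then size startPeriod to at least 2x the printed number.
    """
    start = time.perf_counter()
    model = loader()
    elapsed = time.perf_counter() - start
    print(f"Model loaded in {elapsed:.1f}s")
    return model


# Usage inside lifespan:
#   model_state["model"] = timed_load(lambda: SentenceTransformer("all-MiniLM-L6-v2"))
```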

ECS scaling in drains warm containers. Auto-scaling doesn’t know which containers are “warm” and idle versus actively serving. When it scales in, it might kill a warm container that was about to receive traffic. The ScaleInCooldown of 300 seconds helps, but the real fix is setting MinCapacity equal to your desired warm pool size. Don’t let auto-scaling go below your warm floor.
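ECS also offers task scale-in protection, which tells the scheduler not to terminate specific tasks while they're doing useful work. A sketch of flipping protection on for a serving task, where the cluster name and task ARN are placeholders:

```python
def protect_task(cluster: str, task_arn: str, minutes: int = 30):
    """Shield an ECS task from scale-in while it is actively serving.

    ECS will not terminate a protected task during scale-in until the
    protection expires or is explicitly removed.
    """
    import boto3  # imported here so the sketch loads even without boto3 installed

    ecs = boto3.client("ecs", region_name="us-east-1")
    return ecs.update_task_protection(
        cluster=cluster,
        tasks=[task_arn],
        protectionEnabled=True,
        expiresInMinutes=minutes,  # protection auto-expires as a safety net
    )
```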

Container starts but /ready never returns 200. Usually means the model path is wrong or the model cache directory doesn’t exist inside the container. If you’re using Hugging Face models, the default cache is ~/.cache/huggingface/. In Docker, ~ resolves to /root/ which is fine if running as root, but breaks if you switch to a non-root user. Set HF_HOME=/app/.cache in your Dockerfile and download to that path explicitly.
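A sketch of that cache fix in the Dockerfile, assuming a non-root user named `appuser` (the user name is illustrative):

```dockerfile
# Pin the Hugging Face cache to a path that exists regardless of which user runs
ENV HF_HOME=/app/.cache

# Download at build time into the pinned cache location
RUN python -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('all-MiniLM-L6-v2')"

# Switch to a non-root user AFTER the cache exists, and make it readable
RUN useradd --create-home appuser && chown -R appuser /app
USER appuser
```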