Cold starts are the single worst thing about ML inference in production. A 3GB model loading from disk takes 30-60 seconds. Your user is staring at a spinner. Your SLA is toast. The fix is a warm pool: a set of ECS containers that have models already loaded in memory, sitting idle and ready to serve the moment a request arrives. You pay for idle compute, but you get sub-100ms response times from the first request. That tradeoff is worth it for any latency-sensitive workload.
The idea is simple. Bake the model into your Docker image, load it into memory at container startup, and keep more containers running than you strictly need. ECS handles the orchestration. You handle the architecture.
The Preloaded Container Pattern
The key insight: load your model during container startup, not at request time. FastAPI’s lifespan context manager is the right place for this. The model lives in application state for the entire lifetime of the process.
Here’s a Dockerfile that bakes a sentence-transformers model directly into the image:
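The original snippet is not reproduced here, so the following is a reconstructed sketch consistent with the rest of the article (base image, port, and the exact layer order are assumptions; the `HF_HOME` path and `--start-period=60s` match what the article describes):

```dockerfile
FROM python:3.11-slim

WORKDIR /app

# Keep the Hugging Face cache at a fixed path inside the image so it
# still resolves if the container runs as a non-root user.
ENV HF_HOME=/app/.cache

RUN pip install --no-cache-dir fastapi uvicorn sentence-transformers

# Bake the model in: download at build time so startup loads from
# local disk instead of the Hugging Face Hub.
RUN python -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('all-MiniLM-L6-v2')"

COPY app.py .

# Give the model time to load before failed checks count.
HEALTHCHECK --interval=10s --timeout=5s --start-period=60s --retries=3 \
  CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/ready')" || exit 1

EXPOSE 8000
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
```

The health check uses Python's `urllib` rather than `curl` because slim base images don't ship `curl`.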
Notice --start-period=60s on the health check. That gives the model time to load into memory before ECS starts checking. Without it, ECS kills your container before the model finishes loading.
Now the FastAPI app that loads the model at startup using lifespan:
The model is already on disk inside the image (downloaded during docker build). At startup, SentenceTransformer("all-MiniLM-L6-v2") loads from the local cache into RAM in a few seconds instead of downloading from Hugging Face Hub. That’s the preloaded pattern in action.
ECS Service Configuration
The warm pool strategy is straightforward: run more containers than your current traffic demands. If you need 2 containers to handle peak load, run 4. The extra 2 sit idle with models in memory, ready to absorb traffic spikes instantly.
Here’s the boto3 code to set up the ECS service with a warm pool:
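The snippet is missing; the sketch below builds the request parameters as plain dicts and pushes them in a separate step. Cluster, service, task definition, subnet, and security group names are placeholders; the warm pool size of 4, `MinCapacity=4`, and `ScaleInCooldown=300` match the article:

```python
# Warm pool of 4 tasks with an auto-scaling floor pinned to that size.
WARM_POOL_SIZE = 4

service_params = {
    "cluster": "inference-cluster",
    "serviceName": "embedding-service",
    "taskDefinition": "embedding-task",
    "desiredCount": WARM_POOL_SIZE,
    "launchType": "FARGATE",
    "networkConfiguration": {
        "awsvpcConfiguration": {
            "subnets": ["subnet-0123456789abcdef0"],
            "securityGroups": ["sg-0123456789abcdef0"],
            "assignPublicIp": "DISABLED",
        }
    },
    "healthCheckGracePeriodSeconds": 90,  # mirrors the startPeriod below
}

scaling_target = {
    "ServiceNamespace": "ecs",
    "ResourceId": "service/inference-cluster/embedding-service",
    "ScalableDimension": "ecs:service:DesiredCount",
    "MinCapacity": WARM_POOL_SIZE,  # never scale below the warm pool
    "MaxCapacity": 12,
}

scaling_policy = {
    "PolicyName": "cpu-target-tracking",
    "ServiceNamespace": "ecs",
    "ResourceId": scaling_target["ResourceId"],
    "ScalableDimension": scaling_target["ScalableDimension"],
    "PolicyType": "TargetTrackingScaling",
    "TargetTrackingScalingPolicyConfiguration": {
        "TargetValue": 60.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
        },
        "ScaleInCooldown": 300,  # slow scale-in so warm tasks aren't churned
        "ScaleOutCooldown": 60,
    },
}

def apply():
    """Push the configuration to AWS (requires credentials)."""
    import boto3
    boto3.client("ecs").create_service(**service_params)
    aas = boto3.client("application-autoscaling")
    aas.register_scalable_target(**scaling_target)
    aas.put_scaling_policy(**scaling_policy)
```

Separating the parameter dicts from the `apply()` call makes the configuration easy to review and unit-test before it touches AWS.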
The MinCapacity=4 is critical. Auto-scaling might want to scale down to 1 container during low traffic, which defeats the whole point. Set the floor to whatever your warm pool size should be.
Health Check and Readiness Probes
There’s an important distinction between “alive” and “ready.” A container can be alive (process running, accepting TCP connections) but not ready (model still loading). ECS needs to know the difference.
The /health endpoint returns 200 as soon as the server starts. The /ready endpoint returns 200 only after the model finishes loading. Use /ready in your ECS health check and ALB target group health check so traffic only routes to containers with loaded models.
In your ECS task definition, the health check hits /ready:
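The fragment didn't survive; a sketch of the relevant part of the task definition looks like this (the container name is a placeholder, and the 90-second `startPeriod` matches the discussion below):

```json
{
  "containerDefinitions": [
    {
      "name": "embedding-service",
      "healthCheck": {
        "command": [
          "CMD-SHELL",
          "python -c \"import urllib.request; urllib.request.urlopen('http://localhost:8000/ready')\" || exit 1"
        ],
        "interval": 10,
        "timeout": 5,
        "retries": 3,
        "startPeriod": 90
      }
    }
  ]
}
```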
The startPeriod of 90 seconds is your grace window. During this time, failed health checks don’t count. For larger models (1GB+), bump this to 120 or even 180 seconds. If ECS marks a container as unhealthy during model load, it kills and replaces it, creating an infinite restart loop. You’ll see tasks cycling in the ECS console with no clear error – check the startPeriod first.
For the ALB target group, configure the health check path to /ready with a matcher of 200 and a healthy threshold of 2 consecutive successes. This ensures the load balancer only sends traffic to containers that have confirmed the model is in memory and serving.
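As a sketch, the same target group configuration in boto3 might look like this (the group name, port, and VPC ID are placeholders):

```python
# Target group whose health check gates traffic on /ready.
target_group_params = {
    "Name": "embedding-warm-pool",
    "Protocol": "HTTP",
    "Port": 8000,
    "VpcId": "vpc-0123456789abcdef0",
    "TargetType": "ip",  # Fargate tasks register by IP address
    "HealthCheckPath": "/ready",
    "HealthCheckIntervalSeconds": 10,
    "HealthyThresholdCount": 2,   # two consecutive 200s before traffic flows
    "UnhealthyThresholdCount": 3,
    "Matcher": {"HttpCode": "200"},
}

def apply():
    """Create the target group (requires AWS credentials)."""
    import boto3
    boto3.client("elbv2").create_target_group(**target_group_params)
```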
Common Errors and Fixes
OOM kills during model load. You’ll see tasks stop with exit code 137 and an OutOfMemoryError reason in the ECS console or CloudWatch. The model plus the Python runtime plus FastAPI overhead adds up fast. A 1GB model file can expand to 2-3GB in memory. Set your ECS task memory to at least 2x the model’s disk size. For the all-MiniLM-L6-v2 example above, the model is about 90MB on disk, but 4096MB of task memory gives comfortable headroom for the runtime, tokenizer, and inference buffers.
Image too large to pull. Baking models into Docker images makes them big. A 3GB model means a 4GB+ image, and ECR pull times alone can take 2-3 minutes on Fargate. Use multi-stage builds to strip build tools from the final image. If the image grows past 5GB, consider pulling the model from S3 at startup instead (ECS approximates init containers with a dependent container via dependsOn and a shared volume). But for models under 2GB, baking them in is the simplest approach and avoids S3 download failures.
Health check timeout during model load. If startPeriod is too short, ECS will restart your container before the model finishes loading. Symptoms: tasks cycling between RUNNING and STOPPED repeatedly. Fix: increase startPeriod to at least 2x your observed model load time. Check CloudWatch logs for the “Model loaded and ready to serve” message to measure actual load time.
ECS scaling in drains warm containers. Auto-scaling doesn’t know which containers are “warm” and idle versus actively serving. When it scales in, it might kill a warm container that was about to receive traffic. The ScaleInCooldown of 300 seconds helps, but the real fix is setting MinCapacity equal to your desired warm pool size. Don’t let auto-scaling go below your warm floor.
Container starts but /ready never returns 200. Usually means the model path is wrong or the model cache directory doesn’t exist inside the container. If you’re using Hugging Face models, the default cache is ~/.cache/huggingface/. In Docker, ~ resolves to /root/ which is fine if running as root, but breaks if you switch to a non-root user. Set HF_HOME=/app/.cache in your Dockerfile and download to that path explicitly.
Related Guides
- How to Build a Model Warm-Up and Health Check Pipeline with FastAPI
- How to Build a Model Serving Pipeline with LitServe and Lightning
- How to Build a Model Deployment Pipeline with Terraform and AWS
- How to Build a Model A/B Testing Framework with FastAPI
- How to Build a Model Endpoint Load Balancer with NGINX and FastAPI
- How to Build Feature Flags for ML Model Rollouts
- How to Build a Model Load Testing Pipeline with Locust and FastAPI
- How to Build a Model Monitoring Dashboard with Prometheus and Grafana
- How to Build a Model Explainability Dashboard with SHAP and Streamlit
- How to Build a Model Canary Analysis Pipeline with Statistical Tests