Running a single model server works until it doesn’t. One replica goes down, or traffic spikes during a demo, and suddenly your inference pipeline is a bottleneck. The fix is straightforward: run multiple FastAPI model servers behind NGINX and let it distribute traffic across them.
This guide walks through the full setup – multiple FastAPI model servers, NGINX as a reverse proxy with different load balancing strategies, health checks, sticky sessions, and a Docker Compose config that ties it all together.
The FastAPI Model Server
Each replica runs the same FastAPI app. The model loads once at startup using a lifespan context manager, and a /health endpoint lets NGINX know the server is alive.
The SERVER_ID environment variable identifies which replica handled the request. That’s useful for debugging – you can see exactly which server responded.
NGINX Configuration
NGINX sits in front of the FastAPI replicas. The upstream block defines the pool of servers, and the server block routes traffic to them.
Round-Robin (Default)
Round-robin sends requests to each server in order. It’s the simplest strategy and works well when all replicas have the same capacity.
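A minimal nginx.conf sketch — the hostnames model-server-1 through model-server-3 are assumed to match your Docker Compose service names, and port 8000 is where uvicorn listens:

```nginx
# Round-robin is the default: no balancing directive needed in the upstream block
events {}

http {
    upstream model_servers {
        server model-server-1:8000;
        server model-server-2:8000;
        server model-server-3:8000;
    }

    server {
        listen 80;

        location / {
            proxy_pass http://model_servers;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        }
    }
}
```

The later examples show only the upstream block; the surrounding http and server blocks stay the same.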
Least Connections
If some requests take longer than others – common with variable-length text inputs – least_conn sends new requests to the server with the fewest active connections.
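Only the upstream block changes — one directive at the top, same server names as before:

```nginx
upstream model_servers {
    least_conn;   # route each new request to the server with the fewest active connections
    server model-server-1:8000;
    server model-server-2:8000;
    server model-server-3:8000;
}
```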
Weighted Distribution
When your servers have different hardware, weighted distribution gives more traffic to the beefier machines. A server with weight=3 gets three times as many requests as one with weight=1.
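A sketch assuming model-server-1 runs on the stronger machine:

```nginx
upstream model_servers {
    server model-server-1:8000 weight=3;  # beefier machine: 3x the share of requests
    server model-server-2:8000 weight=1;  # weight=1 is the default, shown for clarity
    server model-server-3:8000 weight=1;
}
```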
Health Checks and Failover
NGINX’s passive health checks mark a server as down after repeated failures. The max_fails and fail_timeout parameters control this behavior.
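A sketch of the failover setup described below — two primaries with passive health-check limits, and a third replica held in reserve:

```nginx
upstream model_servers {
    server model-server-1:8000 max_fails=3 fail_timeout=30s;
    server model-server-2:8000 max_fails=3 fail_timeout=30s;
    server model-server-3:8000 backup;   # only used when both primaries are down
}
```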
Here, model-server-3 is a backup – it only receives traffic when the primary servers are both down. If a server fails 3 times within 30 seconds, NGINX stops sending it traffic for the next 30 seconds, then retries.
Sticky Sessions for Stateful Models
Some models maintain conversation state or session context. Use ip_hash to route requests from the same client to the same server consistently.
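Again, a single directive in the upstream block:

```nginx
upstream model_servers {
    ip_hash;   # same client IP always maps to the same backend
    server model-server-1:8000;
    server model-server-2:8000;
    server model-server-3:8000;
}
```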
ip_hash hashes the client’s IP address to pick a server, so the same IP always hits the same backend. If that backend is marked down, requests from those clients are routed to another server instead.
Docker Compose Setup
This is where everything comes together. Three FastAPI replicas, one NGINX load balancer, all on the same network.
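A docker-compose.yml sketch, assuming the app lives in app.py and the NGINX config in ./nginx.conf. The YAML anchor (&model-server) just avoids repeating the build and healthcheck settings for each replica:

```yaml
services:
  model-server-1: &model-server
    build: .
    environment:
      SERVER_ID: server-1
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 10s
      timeout: 5s
      retries: 3
      start_period: 30s   # give the model time to load before counting failures

  model-server-2:
    <<: *model-server
    environment:
      SERVER_ID: server-2

  model-server-3:
    <<: *model-server
    environment:
      SERVER_ID: server-3

  nginx:
    image: nginx:alpine
    ports:
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      model-server-1:
        condition: service_healthy
      model-server-2:
        condition: service_healthy
      model-server-3:
        condition: service_healthy
```

Compose puts all services on a shared default network, so NGINX can reach each replica by its service name.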
And the Dockerfile for the model servers:
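A minimal Dockerfile sketch — note that curl is installed because the Compose healthcheck uses it and slim images don't ship it:

```dockerfile
FROM python:3.11-slim

# curl is needed for the Compose healthcheck
RUN apt-get update && apt-get install -y --no-install-recommends curl \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app
RUN pip install --no-cache-dir fastapi uvicorn
COPY app.py .

EXPOSE 8000
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
```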
Start the whole stack:
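From the directory containing the Compose file:

```shell
docker compose up --build -d

# Wait until every model server reports "healthy" before sending traffic
docker compose ps
```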
Test that load balancing works by hitting the predict endpoint multiple times:
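A quick loop against the proxy — the request body matches the PredictRequest schema assumed earlier:

```shell
for i in 1 2 3 4 5 6; do
  curl -s -X POST http://localhost/predict \
    -H "Content-Type: application/json" \
    -d '{"text": "hello world"}'
  echo
done
```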
With round-robin, you’ll see server_id rotate through server-1, server-2, server-3 in sequence.
Monitoring the Upstream Status
Add NGINX’s stub status module to see active connections and request counts:
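An extra server block in nginx.conf, listening on a separate port so status traffic stays off the main one. Adjust the allow rules for your network:

```nginx
server {
    listen 8080;

    location /nginx_status {
        stub_status;       # exposes connection and request counters
        allow 127.0.0.1;   # restrict access; widen for your monitoring hosts
        deny all;
    }
}
```

Remember to publish port 8080 in the Compose file if you want to query it from the host.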
Query it with:
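With the allow rules above, query it from inside the NGINX container:

```shell
docker compose exec nginx curl -s http://127.0.0.1:8080/nginx_status
```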
The output shows active connections, total accepted connections, and requests handled – enough to spot traffic imbalances across replicas.
Common Errors and Fixes
502 Bad Gateway after starting Docker Compose
NGINX starts before the FastAPI servers finish loading the model. The depends_on with condition: service_healthy in the Compose file fixes this. If you’re still seeing 502s, increase start_period in the health check to give models more time to load.
Uneven traffic distribution with least_conn
If one server handles requests much faster, it gets more traffic – that’s least_conn working as intended. If you want equal distribution regardless of response time, switch to plain round-robin by removing the least_conn directive.
ip_hash not working behind a CDN or another proxy
When all requests appear to come from the same IP (the CDN’s IP), ip_hash sends everything to one server. Pass the real client IP via X-Forwarded-For and use $http_x_forwarded_for in a hash directive instead:
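A sketch of the forwarded-IP variant — this replaces ip_hash in the upstream block and assumes the CDN or upstream proxy sets X-Forwarded-For honestly:

```nginx
upstream model_servers {
    # Hash on the client IP the CDN forwards, not the TCP peer IP
    hash $http_x_forwarded_for consistent;
    server model-server-1:8000;
    server model-server-2:8000;
    server model-server-3:8000;
}
```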
Health check endpoint returns 200 but the model isn’t actually loaded
The /health endpoint in the code above checks that the model key exists in ml_models. If your model loading is async or lazy, add a real inference check – run a small test input through the model in the health endpoint to confirm it’s actually ready.
Connection refused errors after scaling replicas
If you add more replicas with docker compose up --scale model-server=5, NGINX doesn’t know about them – the upstream block has hardcoded server names. For dynamic scaling, use NGINX’s resolver directive to re-resolve DNS periodically:
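A sketch of that approach — a single scalable service name resolved through Docker's embedded DNS, with the caveat about the resolve parameter noted just below:

```nginx
resolver 127.0.0.11 valid=10s;        # Docker's embedded DNS server

upstream model_servers {
    zone model_servers 64k;            # shared-memory zone, required for resolve
    server model-server:8000 resolve;  # re-resolve as replicas come and go
}
```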
Note: dynamic DNS resolution with resolve requires NGINX Plus or the open-source nginx-upstream-dynamic-servers module. For the free version, restart NGINX after scaling: docker compose restart nginx.
Related Guides
- How to Build a Model Warm-Up and Health Check Pipeline with FastAPI
- How to Build a Model A/B Testing Framework with FastAPI
- How to Build a Model Warm Pool with Preloaded Containers on ECS
- How to Build a Model Serving Pipeline with LitServe and Lightning
- How to Build a Model Monitoring Dashboard with Prometheus and Grafana
- How to Build Feature Flags for ML Model Rollouts
- How to Build a Model Rollback Pipeline with Health Checks
- How to Build a Model Serving Pipeline with Ray Serve and FastAPI
- How to Build a Model Load Testing Pipeline with Locust and FastAPI
- How to Build a Model Feature Store Pipeline with Redis and FastAPI