Running a single model server works until it doesn’t. One replica goes down, or traffic spikes during a demo, and suddenly your inference pipeline is a bottleneck. The fix is straightforward: run multiple FastAPI model servers behind NGINX and let it distribute traffic across them.

This guide walks through the full setup – multiple FastAPI model servers, NGINX as a reverse proxy with different load balancing strategies, health checks, sticky sessions, and a Docker Compose config that ties it all together.

The FastAPI Model Server

Each replica runs the same FastAPI app. The model loads once at startup using a lifespan context manager, and a /health endpoint reports whether the server is ready – Docker Compose's health checks rely on it, and an external checker (or NGINX Plus active checks) can probe it too.

# app.py
import os
import time
from contextlib import asynccontextmanager
from typing import Any

from fastapi import FastAPI
from pydantic import BaseModel

ml_models: dict[str, Any] = {}

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Simulate loading a model at startup
    model_name = os.getenv("MODEL_NAME", "sentiment-v1")
    print(f"Loading model: {model_name}")
    # Replace this with your actual model loading logic:
    # ml_models["model"] = joblib.load("model.pkl")
    ml_models["model"] = lambda text: {"label": "positive", "score": 0.95}
    ml_models["model_name"] = model_name
    ml_models["loaded_at"] = time.time()
    yield
    ml_models.clear()
    print("Model unloaded")

app = FastAPI(lifespan=lifespan)

class PredictRequest(BaseModel):
    text: str

class PredictResponse(BaseModel):
    label: str
    score: float
    server_id: str

@app.get("/health")
async def health():
    if "model" not in ml_models:
        from fastapi.responses import JSONResponse
        return JSONResponse({"status": "unhealthy"}, status_code=503)
    return {
        "status": "healthy",
        "model": ml_models["model_name"],
        "uptime_seconds": round(time.time() - ml_models["loaded_at"], 1),
        "server_id": os.getenv("SERVER_ID", "unknown"),
    }

@app.post("/predict", response_model=PredictResponse)
async def predict(req: PredictRequest):
    model = ml_models["model"]
    result = model(req.text)
    return PredictResponse(
        label=result["label"],
        score=result["score"],
        server_id=os.getenv("SERVER_ID", "unknown"),
    )

The SERVER_ID environment variable identifies which replica handled the request. That’s useful for debugging – you can see exactly which server responded.

NGINX Configuration

NGINX sits in front of the FastAPI replicas. The upstream block defines the pool of servers, and the server block routes traffic to them.

Round-Robin (Default)

Round-robin sends requests to each server in order. It’s the simplest strategy and works well when all replicas have the same capacity.

# nginx.conf
upstream model_servers {
    server model-server-1:8000;
    server model-server-2:8000;
    server model-server-3:8000;
}

server {
    listen 80;
    server_name _;

    location / {
        proxy_pass http://model_servers;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_read_timeout 120s;
        proxy_connect_timeout 10s;
    }

    location /nginx-health {
        return 200 'ok';
        add_header Content-Type text/plain;
    }
}
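Conceptually, the rotation is just a pointer cycling over the pool. A quick Python sketch of what the scheduler does (nginx's real implementation also tracks weights and failure state):

```python
from itertools import cycle

# Round-robin is just a cycling pointer over the upstream pool.
servers = ["server-1", "server-2", "server-3"]
rotation = cycle(servers)

first_six = [next(rotation) for _ in range(6)]
print(first_six)
# ['server-1', 'server-2', 'server-3', 'server-1', 'server-2', 'server-3']
```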

Least Connections

If some requests take longer than others – common with variable-length text inputs – least_conn sends new requests to the server with the fewest active connections.

upstream model_servers {
    least_conn;
    server model-server-1:8000;
    server model-server-2:8000;
    server model-server-3:8000;
}
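The selection logic itself is simple. A sketch of the core idea in Python (real nginx also factors in server weights and failure state):

```python
# Sketch of least-connections selection: pick the upstream with the
# fewest in-flight requests. Simplified – no weights, no failure state.
def pick_least_conn(active: dict[str, int]) -> str:
    return min(active, key=active.get)

in_flight = {"model-server-1": 4, "model-server-2": 1, "model-server-3": 7}
print(pick_least_conn(in_flight))  # model-server-2
```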

Weighted Distribution

When your servers have different hardware, weighted distribution sends more traffic to the beefier machines. A server with weight=3 receives three times as many requests as one with weight=1.

upstream model_servers {
    server model-server-1:8000 weight=3;  # GPU server
    server model-server-2:8000 weight=1;  # CPU fallback
    server model-server-3:8000 weight=1;  # CPU fallback
}
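Conceptually, weight=3 behaves like listing that server three times in the rotation. A naive sketch (nginx actually uses a "smooth" weighted round-robin, but the long-run ratio is the same):

```python
from collections import Counter
from itertools import cycle

# Naive weighted round-robin: repeat each server by its weight.
# nginx uses a "smooth" variant, but the long-run ratio matches.
weights = {"model-server-1": 3, "model-server-2": 1, "model-server-3": 1}
pool = [name for name, w in weights.items() for _ in range(w)]

rotation = cycle(pool)
counts = Counter(next(rotation) for _ in range(500))
print(counts["model-server-1"])  # 300 – three fifths of the traffic
```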

Health Checks and Failover

NGINX’s passive health checks mark a server as down after repeated failures. The max_fails and fail_timeout parameters control this behavior.

upstream model_servers {
    least_conn;
    server model-server-1:8000 max_fails=3 fail_timeout=30s;
    server model-server-2:8000 max_fails=3 fail_timeout=30s;
    server model-server-3:8000 max_fails=3 fail_timeout=30s backup;
}

Here, model-server-3 is a backup – it only receives traffic when both primary servers are down. If a server fails 3 times within a 30-second window, NGINX stops sending it traffic for the next 30 seconds, then retries.

Sticky Sessions for Stateful Models

Some models maintain conversation state or session context. Use ip_hash to route requests from the same client to the same server consistently.

upstream model_servers {
    ip_hash;
    server model-server-1:8000;
    server model-server-2:8000;
    server model-server-3:8000;
}

ip_hash hashes the client’s IP address to pick a server. Same IP always hits the same backend. If that backend goes down, NGINX rehashes to a different one.
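For IPv4, nginx hashes the first three octets of the address, so all clients in the same /24 land on the same backend. A rough illustration of the bucketing (this is not nginx's actual hash function):

```python
# Rough illustration of ip_hash bucketing. For IPv4, nginx hashes the
# first three octets, so a whole /24 maps to one backend.
# NOT nginx's actual hash function – just the idea.
servers = ["model-server-1", "model-server-2", "model-server-3"]

def pick_server(client_ip: str) -> str:
    prefix = ".".join(client_ip.split(".")[:3])  # first three octets
    return servers[hash(prefix) % len(servers)]

a = pick_server("203.0.113.10")
b = pick_server("203.0.113.99")  # same /24 -> same backend
print(a == b)  # True
```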

Docker Compose Setup

This is where everything comes together. Three FastAPI replicas, one NGINX load balancer, all on the same network.

# docker-compose.yml
services:
  model-server-1:
    build: .
    environment:
      - SERVER_ID=server-1
      - MODEL_NAME=sentiment-v1
    expose:
      - "8000"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 15s
      timeout: 5s
      retries: 3
      start_period: 30s

  model-server-2:
    build: .
    environment:
      - SERVER_ID=server-2
      - MODEL_NAME=sentiment-v1
    expose:
      - "8000"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 15s
      timeout: 5s
      retries: 3
      start_period: 30s

  model-server-3:
    build: .
    environment:
      - SERVER_ID=server-3
      - MODEL_NAME=sentiment-v1
    expose:
      - "8000"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 15s
      timeout: 5s
      retries: 3
      start_period: 30s

  nginx:
    image: nginx:1.25
    ports:
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/conf.d/default.conf:ro
    depends_on:
      model-server-1:
        condition: service_healthy
      model-server-2:
        condition: service_healthy
      model-server-3:
        condition: service_healthy

And the Dockerfile for the model servers:

# Dockerfile
FROM python:3.12-slim

WORKDIR /app

# curl is needed by the Compose health check
RUN apt-get update && apt-get install -y --no-install-recommends curl && rm -rf /var/lib/apt/lists/*
RUN pip install --no-cache-dir fastapi uvicorn pydantic

COPY app.py .

CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]

Start the whole stack:

docker compose up --build -d

Test that load balancing works by hitting the predict endpoint multiple times:

for i in $(seq 1 6); do
  curl -s -X POST http://localhost/predict \
    -H "Content-Type: application/json" \
    -d '{"text": "this product is great"}' | python3 -m json.tool
done

With round-robin, you’ll see server_id rotate through server-1, server-2, server-3 in sequence.

Monitoring the Upstream Status

Add NGINX’s stub status module to see active connections and request counts:

server {
    listen 80;
    server_name _;

    location /nginx-status {
        stub_status on;
        allow 172.16.0.0/12;  # Docker network range
        allow 127.0.0.1;
        deny all;
    }

    location / {
        proxy_pass http://model_servers;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_read_timeout 120s;
        proxy_connect_timeout 10s;
    }
}

Query it with:

curl http://localhost/nginx-status

The output shows active connections, total accepted connections, and requests handled – enough to spot traffic imbalances across replicas.
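The format is stable enough to parse in a few lines if you want to feed the counters into your own metrics – a sketch, assuming the standard four-line stub_status output:

```python
import re

def parse_stub_status(text: str) -> dict[str, int]:
    """Parse standard ngx_http_stub_status_module output into counters."""
    lines = text.strip().splitlines()
    active = int(lines[0].rsplit(":", 1)[1])
    accepts, handled, requests = (int(n) for n in lines[2].split())
    rww = dict(re.findall(r"(Reading|Writing|Waiting): (\d+)", lines[3]))
    return {
        "active": active, "accepts": accepts,
        "handled": handled, "requests": requests,
        **{k.lower(): int(v) for k, v in rww.items()},
    }

sample = """Active connections: 3
server accepts handled requests
 120 120 457
Reading: 0 Writing: 1 Waiting: 2"""
print(parse_stub_status(sample))
```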

Common Errors and Fixes

502 Bad Gateway after starting Docker Compose

NGINX starts before the FastAPI servers finish loading the model. The depends_on with condition: service_healthy in the Compose file fixes this. If you’re still seeing 502s, increase start_period in the health check to give models more time to load.

Uneven traffic distribution with least_conn

If one server handles requests much faster, it gets more traffic – that’s least_conn working as intended. If you want equal distribution regardless of response time, switch to plain round-robin by removing the least_conn directive.

ip_hash not working behind a CDN or another proxy

When all requests appear to come from the same IP (the CDN’s IP), ip_hash sends everything to one server. Pass the real client IP via X-Forwarded-For and use $http_x_forwarded_for in a hash directive instead:

upstream model_servers {
    hash $http_x_forwarded_for consistent;
    server model-server-1:8000;
    server model-server-2:8000;
    server model-server-3:8000;
}
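Alternatively, if your application logs and ip_hash should both see the real client address, the ngx_http_realip_module can rewrite $remote_addr from the forwarded header. A sketch – 203.0.113.0/24 is a placeholder for your actual CDN or proxy range:

```nginx
server {
    listen 80;
    # Trust X-Forwarded-For only from known proxy addresses, so that
    # $remote_addr (and therefore ip_hash) reflects the real client.
    set_real_ip_from 203.0.113.0/24;
    real_ip_header X-Forwarded-For;
    real_ip_recursive on;
    # ...
}
```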

Health check endpoint returns 200 but the model isn’t actually loaded

The /health endpoint in the code above checks that the model key exists in ml_models. If your model loading is async or lazy, add a real inference check – run a small test input through the model in the health endpoint to confirm it’s actually ready.
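A minimal sketch of that deeper check, assuming the same ml_models registry as the app above (framework plumbing omitted so the check itself stays clear – a hypothetical deep_health_check helper):

```python
import time
from typing import Any, Callable

# Assumes the same ml_models registry as app.py above.
ml_models: dict[str, Any] = {}

def deep_health_check(canary_input: str = "health check") -> tuple[int, dict]:
    """Return (status_code, body). 200 only if a real inference succeeds."""
    model: Callable | None = ml_models.get("model")
    if model is None:
        return 503, {"status": "unhealthy", "reason": "model not loaded"}
    try:
        # Run a tiny canary input through the model to prove it's ready.
        start = time.perf_counter()
        result = model(canary_input)
        latency_ms = round((time.perf_counter() - start) * 1000, 2)
    except Exception as exc:
        return 503, {"status": "unhealthy", "reason": str(exc)}
    if "label" not in result:  # sanity-check the output shape
        return 503, {"status": "unhealthy", "reason": "unexpected output"}
    return 200, {"status": "healthy", "canary_latency_ms": latency_ms}

ml_models["model"] = lambda text: {"label": "positive", "score": 0.95}
print(deep_health_check()[0])  # 200
```

Wire the same logic into the /health endpoint and return a JSONResponse with the appropriate status code, as in app.py.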

Connection refused errors after scaling replicas

Note that docker compose up --scale model-server=5 only works if you collapse the three named services into a single model-server service. Even then, NGINX doesn't know about the new replicas – the upstream block has hardcoded server names. For dynamic scaling, use NGINX's resolver directive to re-resolve DNS periodically:

upstream model_servers {
    least_conn;
    server model-server:8000 resolve;
}

server {
    resolver 127.0.0.11 valid=10s;  # Docker's internal DNS
    # ...
}

Note: dynamic DNS resolution with resolve requires NGINX Plus or the open-source nginx-upstream-dynamic-servers module. For the free version, restart NGINX after scaling: docker compose restart nginx.