You built a model API. It handles one request just fine. Now you need to know what happens at 100 concurrent users, 500, or 1,000. Guessing is not a strategy. You need a repeatable load testing pipeline that tells you exactly where your endpoint falls over, and Locust paired with FastAPI is the fastest way to get there.
Here’s the full setup – a FastAPI endpoint serving a real sentence-transformers model, a Locust test file with realistic traffic patterns, and a pass/fail gate you can plug into CI.
Setting Up the FastAPI Model Endpoint
Use FastAPI’s lifespan context manager to load the model once at startup and clean up on shutdown. This avoids the deprecated @app.on_event pattern and keeps the model in application state where all request handlers can access it.
Start it with uvicorn app:app --host 0.0.0.0 --port 8000 --workers 1. One worker is intentional here – you want to establish a single-process baseline first before scaling workers.
Writing the Locust Test File
The key to useful load tests is realistic traffic. Don’t just hammer the endpoint with identical payloads. Vary the batch size, mix in health checks, and weight tasks by how often they actually happen in production.
The @task weights matter. With single_embed at weight 8, batch_embed at weight 2, and the health check at weight 1, roughly 73% of requests (8/11) are small batches and 18% (2/11) are larger ones. That matches a typical production traffic mix where most callers send one or two texts at a time.
Configuring Load Profiles
Run Locust in headless mode for CI pipelines. Define three profiles that cover the scenarios you actually care about: gradual ramp-up, sustained load, and traffic spikes.
Ramp-up test – find your throughput ceiling:
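A headless invocation matching that profile might look like this – the 3-minute run time is an assumed cap, long enough for the 20-second ramp plus sustained measurement:

```shell
locust -f locustfile.py --host http://localhost:8000 \
  --headless --users 100 --spawn-rate 5 --run-time 3m \
  --csv results/ramp_up --html results/ramp_up.html
```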
This starts at 0 users and adds 5 per second until it hits 100. The --csv flag writes results/ramp_up_stats.csv, results/ramp_up_stats_history.csv, and results/ramp_up_failures.csv. The --html flag generates a self-contained HTML report with charts.
Sustained load test – check stability over time:
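One way to run it – the results/sustained CSV prefix is an illustrative choice:

```shell
locust -f locustfile.py --host http://localhost:8000 \
  --headless --users 50 --spawn-rate 50 --run-time 5m \
  --csv results/sustained --html results/sustained.html
```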
Setting --spawn-rate equal to --users means all 50 users start immediately. Run it for 5 minutes and look for latency drift – if p95 keeps climbing, you have a memory leak or resource exhaustion issue.
Spike test – simulate a traffic burst:
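A spike run might look like this – the 2-minute duration is an assumed choice; the point is that --spawn-rate matches --users so everyone arrives at once:

```shell
locust -f locustfile.py --host http://localhost:8000 \
  --headless --users 200 --spawn-rate 200 --run-time 2m \
  --csv results/spike --html results/spike.html
```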
All 200 users hit at once. This is where you find out if your endpoint queues gracefully or starts dropping requests.
Analyzing Results and Setting Pass/Fail Thresholds
The CSV files Locust generates have everything you need. Parse them to build an automated quality gate that blocks deployment when latency or error rate crosses your thresholds.
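A sketch of such a gate in Python – the 500 ms p95 budget and 1% error budget are placeholder thresholds, and the column names ("95%", "Request Count", "Failure Count", the "Aggregated" row) match Locust 2.x CSV output but are worth verifying against your version:

```python
# check_thresholds.py
import csv
import sys

P95_LIMIT_MS = 500      # placeholder budget; tune to your SLO
MAX_ERROR_RATE = 0.01   # allow up to 1% failed requests

def check(stats_csv: str) -> bool:
    with open(stats_csv, newline="") as f:
        rows = list(csv.DictReader(f))
    # Locust writes a final "Aggregated" row summarizing all endpoints.
    agg = next((r for r in rows if r["Name"] == "Aggregated"), None)
    if agg is None:
        print("FAIL: no Aggregated row - test may not have finished cleanly")
        return False
    p95 = float(agg["95%"])
    requests = int(agg["Request Count"])
    failures = int(agg["Failure Count"])
    error_rate = failures / requests if requests else 1.0
    ok = True
    if p95 > P95_LIMIT_MS:
        print(f"FAIL: p95 {p95:.0f}ms > {P95_LIMIT_MS}ms")
        ok = False
    if error_rate > MAX_ERROR_RATE:
        print(f"FAIL: error rate {error_rate:.2%} > {MAX_ERROR_RATE:.0%}")
        ok = False
    if ok:
        print(f"PASS: p95 {p95:.0f}ms, error rate {error_rate:.2%}")
    return ok

if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(0 if check(sys.argv[1]) else 1)
```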
Run it after your load test:
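Assuming the gate script is saved as check_thresholds.py (a filename chosen here for illustration) and you ran the sustained profile with a results/sustained CSV prefix:

```shell
python check_thresholds.py results/sustained_stats.csv
```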
If any metric crosses the threshold, the script exits with code 1. Wire this into your CI pipeline and bad deployments never make it past staging.
Common Errors and Fixes
ConnectionRefusedError: [Errno 111] Connection refused
Locust started before the FastAPI server was ready. The model takes a few seconds to load. Add a wait loop in your CI script:
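One way to write that wait loop, polling the /health endpoint until the model reports ready – the 30-attempt cap and 2-second sleep are arbitrary choices:

```shell
# Poll /health until the lifespan startup has finished loading the model.
for i in $(seq 1 30); do
  if curl -sf http://localhost:8000/health | grep -q '"status":"ok"'; then
    echo "server ready"
    break
  fi
  if [ "$i" -eq 30 ]; then
    echo "server never became ready" >&2
    exit 1
  fi
  sleep 2
done
```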
locust: error: unrecognized arguments: --autostart
The --autostart flag only exists in newer Locust releases (it arrived partway through the 2.x line), so older installs reject it. For non-interactive CI runs, use --headless instead, which works across versions.
High p99 latency with low p50 on CPU inference
This usually means garbage collection pauses or CPU thermal throttling under sustained load. Check with --workers 1 first to isolate the cause. If the problem disappears with fresh processes, GC is the likely culprit – call gc.freeze() after the model loads, tune the collection thresholds with gc.set_threshold(), and consider running the model in a subprocess pool.
RuntimeError: No model loaded or KeyError on ml_models
Your Locust test started sending requests before the model finished loading. The /health endpoint returns "not_ready" until the lifespan context finishes. Gate your tests on that health check as shown above.
CSV files are empty or missing the Aggregated row
Locust only writes the final aggregated stats after the test completes. If the process is killed before that – a SIGKILL, or a container being torn down – the CSVs end up truncated or missing the Aggregated row. Always use --run-time so Locust exits cleanly, or send SIGTERM and wait for the graceful shutdown to finish.
Related Guides
- How to Build a Model Warm-Up and Health Check Pipeline with FastAPI
- How to Build a Model A/B Testing Framework with FastAPI
- How to Load Test and Benchmark LLM APIs with Locust
- How to Build a Model Warm Pool with Preloaded Containers on ECS
- How to Build a Model Monitoring Dashboard with Prometheus and Grafana
- How to Build a Model Endpoint Load Balancer with NGINX and FastAPI
- How to Build Feature Flags for ML Model Rollouts
- How to Build a Model Canary Analysis Pipeline with Statistical Tests
- How to Build a Model Deployment Pipeline with Terraform and AWS
- How to Build a Model Serving Pipeline with LitServe and Lightning