Your LLM API works fine with one request at a time. Now throw 50 concurrent users at it and watch the latency spike from 800ms to 12 seconds. Load testing before you ship saves you from discovering these limits in production at 2 AM.
Locust is the best tool for this. It’s Python-native (so you can write custom LLM-specific logic), supports distributed testing across multiple machines, and gives you a real-time web dashboard showing requests per second, latency percentiles, and failure rates. Unlike generic HTTP benchmarking tools like wrk or hey, Locust lets you model realistic user behavior – variable prompt lengths, streaming vs. non-streaming calls, and token-aware throughput metrics.
Install it and get a baseline test running in under 5 minutes:
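A minimal install looks like this (assumes Python 3.9+ and pip on your PATH):

```shell
# Install Locust into your current environment
pip install locust

# Confirm it's on your PATH
locust --version
```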
Writing Your First LLM Load Test
Here’s a Locust file that hits an OpenAI-compatible API (works with OpenAI, vLLM, Ollama, or any server that speaks the same protocol). It sends chat completion requests with varying prompt lengths and tracks both HTTP-level metrics and LLM-specific metrics like tokens per second.
Run it against your API:
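Assuming the file is named locustfile.py and your API listens on port 8000:

```shell
locust -f locustfile.py --host http://localhost:8000
```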
This opens the Locust web UI at http://localhost:8089. Set your target number of users and spawn rate, then watch the results stream in. For a quick CLI-only run without the web UI:
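```shell
# Headless run: 10 users, spawn 2/s, 60 seconds, CSV output to results_*.csv
locust -f locustfile.py --host http://localhost:8000 \
  --headless -u 10 -r 2 -t 60s --csv results
```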
That runs 10 concurrent users, spawning 2 per second, for 60 seconds, and dumps CSV results. The -u flag is total users, -r is users spawned per second. Start low and ramp up – jumping straight to 100 users will just get you rate limited.
Measuring What Actually Matters
HTTP response time alone doesn’t tell you much for LLM APIs. A 3-second response that generated 500 tokens is great. A 3-second response that generated 20 tokens is terrible. You need to track these metrics:
- Time to First Token (TTFT): How long before the first token arrives. This is what users perceive as “responsiveness.”
- Tokens per second (TPS): Output tokens divided by generation time. This is your real throughput metric.
- End-to-end latency at p50, p95, p99: Averages lie. The p99 is where your users feel pain.
- Error rate under load: At what concurrency does your API start returning 429s or 503s?
Here’s a more advanced test that measures TTFT by using streaming responses:
Run this with the same locust command. The custom metrics show up in the Locust web UI alongside standard HTTP metrics. Export to CSV for deeper analysis.
Analyzing Results
After a test run, you’ll have CSV files of aggregated stats from --csv, plus any per-request data you log yourself from a request event listener. Here’s how to crunch per-request latencies into a useful benchmark report:
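A stdlib-only sketch (no pandas dependency) that computes those percentiles from a per-request CSV. The column names success and response_time_ms are assumptions about how you logged your data — rename them to match your file:

```python
# analyze.py -- turn a per-request CSV into a latency report.
import csv
import statistics


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    k = max(0, min(len(ordered) - 1, round(pct / 100 * len(ordered)) - 1))
    return ordered[k]


def summarize(csv_path, latency_col="response_time_ms"):
    latencies, failures, total = [], 0, 0
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            total += 1
            if row.get("success", "true").lower() != "true":
                failures += 1
                continue
            latencies.append(float(row[latency_col]))
    return {
        "requests": total,
        "error_rate": failures / total if total else 0.0,
        "p50": percentile(latencies, 50),
        "p95": percentile(latencies, 95),
        "p99": percentile(latencies, 99),
        "mean": statistics.mean(latencies),
    }


if __name__ == "__main__":
    import sys
    for key, value in summarize(sys.argv[1]).items():
        print(f"{key}: {value}")
```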
The numbers you care about: if p99 latency is more than 3x your p50, you have a queuing problem. If error rate climbs above 1% under expected load, you need to either scale horizontally or add request queuing with backpressure.
Scaling Tests Across Multiple Machines
A single laptop can simulate maybe 50-100 concurrent LLM users before your local network or CPU becomes the bottleneck. For serious load testing, distribute Locust across multiple worker machines.
Start the master:
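```shell
# On the master machine (also serves the web UI on :8089)
locust -f locustfile.py --master --host http://your-api:8000
```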
Then on each worker machine:
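```shell
# Replace 192.168.1.10 with your master's address
locust -f locustfile.py --worker --master-host 192.168.1.10
```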
Each worker runs its own set of simulated users. The master aggregates all metrics. You can also run workers as Docker containers:
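Using the official locustio/locust image (the master IP is a placeholder):

```shell
# Mount the current directory so the worker can read the locustfile
docker run -v "$PWD:/mnt/locust" locustio/locust \
  -f /mnt/locust/locustfile.py --worker --master-host 192.168.1.10
```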
My recommendation: start with a single-machine test at low concurrency (5-10 users) to establish your baseline. Then scale up. If you jump straight to 200 users across 4 workers, you won’t know whether the bottleneck is your API, your test infrastructure, or network saturation.
Setting Realistic Load Profiles
Don’t just ramp to max users and hold. Real traffic has patterns. Locust supports custom load shapes that mimic production usage:
Add the shape class to your locustfile (or pass its file with an additional -f flag); Locust picks up any LoadTestShape subclass automatically. This step pattern lets you identify the exact concurrency level where latency starts degrading – that’s your capacity ceiling.
Common Errors and Fixes
ConnectionError: Max retries exceeded during the test.
Your test machine is running out of TCP connections. Increase the file descriptor limit:
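```shell
# Check the current open-files limit, then raise it for this shell session
ulimit -n
ulimit -n 65536
```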
Also make sure you’re not overwhelming a local API. If you’re hitting localhost, the server and the load generator compete for CPU. Run them on separate machines.
All requests return 429 (rate limited).
You’ve hit the API provider’s rate limit. For OpenAI, the limits depend on your tier. Add a wait-and-retry pattern to your locustfile, or use Locust’s wait_time = between(2, 5) to slow down. If you’re testing your own API, disable or raise rate limits during the test.
Locust reports 0 requests per second but users are running.
Check that --host matches your actual API URL including the scheme. http://localhost:8000 and https://localhost:8000 are different. Also verify the endpoint path in your task matches what the server expects.
Streaming test shows tokens_per_second of 0.
The SSE parsing is failing silently. Most OpenAI-compatible servers prefix data lines with data: but some add extra whitespace or use different framing. Add debug logging to your iter_lines() loop to see the raw bytes.
Metrics look wrong after a distributed test.
Locust workers report raw data to the master, which aggregates it. If workers have clock skew, timing metrics get distorted. Sync clocks with NTP across all machines, or just run workers on the same host using Docker containers.
MemoryError during long test runs.
Locust (via requests) holds response content and connections in memory until they’re released. For long runs with large LLM responses, call response.close() after processing each streamed response, and with catch_response=True keep only the metrics you need rather than the full body.
Benchmarking Self-Hosted vs. API Providers
Here’s the test matrix I recommend when comparing LLM serving options:
| Metric | What to measure | Target |
|---|---|---|
| TTFT | Time to first token at p50/p95 | < 500ms for interactive use |
| TPS | Output tokens per second per user | > 30 tok/s for good UX |
| Throughput | Total tokens/second across all users | Depends on hardware |
| Error rate | % of failed requests | < 1% under expected load |
| Cost | Dollars per million tokens at your volume | Provider-specific |
Run the same Locust test against each provider with identical prompts and max_tokens. This gives you an apples-to-apples comparison. Just swap the --host flag and API key.
The one thing most benchmarks miss: sustained load vs. burst performance. An API might handle 100 concurrent requests for 10 seconds but degrade badly over 10 minutes as GPU memory fragments or KV caches fill up. Always run tests for at least 5 minutes at each concurrency level to catch this.
Related Guides
- How to Build a Model Load Testing Pipeline with Locust and FastAPI
- How to A/B Test LLM Prompts and Models in Production
- How to Route LLM Traffic by Cost and Complexity Using Intelligent Model Routing
- How to Autoscale LLM Inference on Kubernetes with KEDA
- How to Monitor LLM Apps with LangSmith
- How to Serve ML Models with BentoML and Build Prediction APIs
- How to Build a Model Compression Pipeline with Pruning and Quantization
- How to Implement Canary Deployments for ML Models
- How to Build a Model Configuration Management Pipeline with Hydra
- How to Serve LLMs in Production with SGLang