Your LLM API works fine with one request at a time. Now throw 50 concurrent users at it and watch the latency spike from 800ms to 12 seconds. Load testing before you ship saves you from discovering these limits in production at 2 AM.
Locust is the best tool for this. It’s Python-native (so you can write custom LLM-specific logic), supports distributed testing across multiple machines, and gives you a real-time web dashboard showing requests per second, latency percentiles, and failure rates. Unlike generic HTTP benchmarking tools like wrk or hey, Locust lets you model realistic user behavior – variable prompt lengths, streaming vs. non-streaming calls, and token-aware throughput metrics.
Install it and get a baseline test running in under 5 minutes:
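A minimal install looks like this (assumes Python 3.9+ and pip on your PATH):

```shell
# Install Locust into your current environment
pip install locust

# Confirm it's on your PATH
locust --version
```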
Writing Your First LLM Load Test
Here’s a Locust file that hits an OpenAI-compatible API (works with OpenAI, vLLM, Ollama, or any server that speaks the same protocol). It sends chat completion requests with varying prompt lengths and tracks both HTTP-level metrics and LLM-specific metrics like tokens per second.
Run it against your API:
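Assuming the file is named locustfile.py and your API listens on port 8000:

```shell
locust -f locustfile.py --host http://localhost:8000
```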
This opens the Locust web UI at http://localhost:8089. Set your target number of users and spawn rate, then watch the results stream in. For a quick CLI-only run without the web UI:
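```shell
# Headless run: 10 users, spawn 2/s, 60 seconds, CSV output to results_*.csv
locust -f locustfile.py --host http://localhost:8000 \
  --headless -u 10 -r 2 -t 60s --csv results
```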
That runs 10 concurrent users, spawning 2 per second, for 60 seconds, and dumps CSV results. The -u flag is total users, -r is users spawned per second. Start low and ramp up – jumping straight to 100 users will just get you rate limited.
Measuring What Actually Matters
HTTP response time alone doesn’t tell you much for LLM APIs. A 3-second response that generated 500 tokens is great. A 3-second response that generated 20 tokens is terrible. You need to track these metrics:
- Time to First Token (TTFT): How long before the first token arrives. This is what users perceive as “responsiveness.”
- Tokens per second (TPS): Output tokens divided by generation time. This is your real throughput metric.
- End-to-end latency at p50, p95, p99: Averages lie. The p99 is where your users feel pain.
- Error rate under load: At what concurrency does your API start returning 429s or 503s?
Here’s a more advanced test that measures TTFT by using streaming responses:
Run this with the same locust command. The custom metrics show up in the Locust web UI alongside standard HTTP metrics. Export to CSV for deeper analysis.
Analyzing Results
After a test run, you’ll have CSV files of aggregated stats from --csv, plus any per-request data you log yourself from a request event listener. Here’s how to crunch per-request latencies into a useful benchmark report:
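A stdlib-only sketch (no pandas dependency) that computes those percentiles from a per-request CSV. The column names success and response_time_ms are assumptions about how you logged your data — rename them to match your file:

```python
# analyze.py -- turn a per-request CSV into a latency report.
import csv
import statistics


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    k = max(0, min(len(ordered) - 1, round(pct / 100 * len(ordered)) - 1))
    return ordered[k]


def summarize(csv_path, latency_col="response_time_ms"):
    latencies, failures, total = [], 0, 0
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            total += 1
            if row.get("success", "true").lower() != "true":
                failures += 1
                continue
            latencies.append(float(row[latency_col]))
    return {
        "requests": total,
        "error_rate": failures / total if total else 0.0,
        "p50": percentile(latencies, 50),
        "p95": percentile(latencies, 95),
        "p99": percentile(latencies, 99),
        "mean": statistics.mean(latencies),
    }


if __name__ == "__main__":
    import sys
    for key, value in summarize(sys.argv[1]).items():
        print(f"{key}: {value}")
```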
The numbers you care about: if p99 latency is more than 3x your p50, you have a queuing problem. If error rate climbs above 1% under expected load, you need to either scale horizontally or add request queuing with backpressure.
Scaling Tests Across Multiple Machines
A single laptop can simulate maybe 50-100 concurrent LLM users before your local network or CPU becomes the bottleneck. For serious load testing, distribute Locust across multiple worker machines.
Start the master:
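```shell
# On the master machine (also serves the web UI on :8089)
locust -f locustfile.py --master --host http://your-api:8000
```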
Then on each worker machine:
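```shell
# Replace 192.168.1.10 with your master's address
locust -f locustfile.py --worker --master-host 192.168.1.10
```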
Each worker runs its own set of simulated users. The master aggregates all metrics. You can also run workers as Docker containers:
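Using the official locustio/locust image (the master IP is a placeholder):

```shell
# Mount the current directory so the worker can read the locustfile
docker run -v "$PWD:/mnt/locust" locustio/locust \
  -f /mnt/locust/locustfile.py --worker --master-host 192.168.1.10
```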
My recommendation: start with a single-machine test at low concurrency (5-10 users) to establish your baseline. Then scale up. If you jump straight to 200 users across 4 workers, you won’t know whether the bottleneck is your API, your test infrastructure, or network saturation.
Setting Realistic Load Profiles
Don’t just ramp to max users and hold. Real traffic has patterns. Locust supports custom load shapes that mimic production usage:
Add the shape class to your locustfile (or pass its file with an additional -f flag); Locust picks up any LoadTestShape subclass automatically. This step pattern lets you identify the exact concurrency level where latency starts degrading – that’s your capacity ceiling.
Common Errors and Fixes
ConnectionError: Max retries exceeded during the test.
Your test machine is running out of TCP connections. Increase the file descriptor limit:
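```shell
# Check the current open-files limit, then raise it for this shell session
ulimit -n
ulimit -n 65536
```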
Also make sure you’re not overwhelming a local API. If you’re hitting localhost, the server and the load generator compete for CPU. Run them on separate machines.
All requests return 429 (rate limited).
You’ve hit the API provider’s rate limit. For OpenAI, the limits depend on your tier. Add a wait-and-retry pattern to your locustfile, or use Locust’s wait_time = between(2, 5) to slow down. If you’re testing your own API, disable or raise rate limits during the test.
Locust reports 0 requests per second but users are running.
Check that --host matches your actual API URL including the scheme. http://localhost:8000 and https://localhost:8000 are different. Also verify the endpoint path in your task matches what the server expects.
Streaming test shows tokens_per_second of 0.
The SSE parsing is failing silently. Most OpenAI-compatible servers prefix data lines with data: but some add extra whitespace or use different framing. Add debug logging to your iter_lines() loop to see the raw bytes.
Metrics look wrong after a distributed test.
Locust workers report raw data to the master, which aggregates it. If workers have clock skew, timing metrics get distorted. Sync clocks with NTP across all machines, or just run workers on the same host using Docker containers.
MemoryError during long test runs.
Locust (via requests) holds response content and connections in memory until they’re released. For long runs with large LLM responses, call response.close() after processing each streamed response, and with catch_response=True keep only the metrics you need rather than the full body.
Benchmarking Self-Hosted vs. API Providers
Here’s the test matrix I recommend when comparing LLM serving options:
| Metric | What to measure | Target |
|---|---|---|
| TTFT | Time to first token at p50/p95 | < 500ms for interactive use |
| TPS | Output tokens per second per user | > 30 tok/s for good UX |
| Throughput | Total tokens/second across all users | Depends on hardware |
| Error rate | % of failed requests | < 1% under expected load |
| Cost | Dollars per million tokens at your volume | Provider-specific |
Run the same Locust test against each provider with identical prompts and max_tokens. This gives you an apples-to-apples comparison. Just swap the --host flag and API key.
The one thing most benchmarks miss: sustained load vs. burst performance. An API might handle 100 concurrent requests for 10 seconds but degrade badly over 10 minutes as GPU memory fragments or KV caches fill up. Always run tests for at least 5 minutes at each concurrency level to catch this.
Related Guides
- How to Build a Model Load Testing Pipeline with Locust and FastAPI
- How to A/B Test LLM Prompts and Models in Production
- How to Route LLM Traffic by Cost and Complexity Using Intelligent Model Routing
- How to Autoscale LLM Inference on Kubernetes with KEDA
- How to Monitor LLM Apps with LangSmith
- How to Serve ML Models with BentoML and Build Prediction APIs
- How to Build a Model Compression Pipeline with Pruning and Quantization
- How to Implement Canary Deployments for ML Models
- How to Build a Model Configuration Management Pipeline with Hydra
- How to Serve LLMs in Production with SGLang