The KV Cache Is Your Bottleneck

Every transformer-based LLM stores key-value tensors from previous tokens so it doesn’t recompute attention over the entire sequence on each new token. This is the KV cache, and during inference it’s almost always the thing that limits how many requests you can serve concurrently.

For a Llama 3.1 8B model (32 transformer layers, 8 KV heads, head dimension 128) with a 4096-token context, each request’s KV cache eats roughly 0.5 GB of VRAM in bf16 – about 128 KB per token. Your 80 GB A100 loads the model weights (~16 GB in bf16), reserves some headroom, and the remaining ~55 GB goes to KV cache. That caps you at about 110 full-length concurrent sequences. Want to serve several hundred? You need to manage that memory much more carefully.
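
You can reproduce this back-of-envelope math yourself. The sketch below assumes Llama 3.1 8B’s published config (32 layers, 8 KV heads via GQA, head dim 128, bf16) and an illustrative ~55 GB KV budget:

```python
def kv_cache_bytes(seq_len: int, num_layers: int = 32, num_kv_heads: int = 8,
                   head_dim: int = 128, dtype_bytes: int = 2) -> int:
    """KV cache size for one sequence: 2 tensors (K and V) per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes * seq_len

per_token = kv_cache_bytes(1)        # 131072 bytes = 128 KiB per token
per_request = kv_cache_bytes(4096)   # 0.5 GiB at a full 4096-token context

kv_budget = 55 * 1024**3             # ~55 GB left after weights + headroom
max_full_seqs = kv_budget // per_request
print(f"{per_token / 1024:.0f} KiB/token, "
      f"{per_request / 1024**3:.2f} GiB/request, "
      f"{max_full_seqs} full-length sequences")
```

Swap in your own model’s layer count, KV head count, and head dimension to size any deployment.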

Here’s the fastest way to see this in action with vLLM:

pip install vllm

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --gpu-memory-utilization 0.92 \
  --max-model-len 4096 \
  --enable-prefix-caching

That single command gives you PagedAttention, continuous batching, and prefix caching out of the box. The rest of this guide explains what each of those does and how to tune them.

How PagedAttention Works

Traditional KV cache implementations allocate a contiguous block of GPU memory for each request upfront, sized for the maximum possible sequence length. If your max_model_len is 4096 tokens but the average request only uses 500 tokens, you’re wasting 87% of allocated KV cache memory. Worse, the contiguous allocation causes fragmentation – small gaps between blocks that can’t be used by new requests.

PagedAttention (introduced in the vLLM paper) borrows the concept of virtual memory paging from operating systems. Instead of one contiguous allocation per sequence, it splits the KV cache into fixed-size blocks (default 16 tokens per block). Each sequence gets a page table that maps logical KV positions to physical blocks scattered anywhere in GPU memory.

This gives you three wins:

  • No internal fragmentation. Memory is allocated in small blocks, so you waste at most one block per sequence instead of thousands of tokens.
  • No external fragmentation. Blocks don’t need to be contiguous, so free memory can always be used.
  • Memory sharing. Sequences with the same prefix (like a shared system prompt) can point to the same physical blocks via copy-on-write.
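
The page-table idea fits in a few lines of Python. This is a toy model – the class names and the free-list are illustrative, not vLLM internals – but it shows why contiguity stops mattering:

```python
BLOCK_SIZE = 16  # tokens per block, matching vLLM's default

class BlockAllocator:
    """Toy pool of fixed-size KV blocks scattered across 'physical' memory."""
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))  # physical block IDs

    def alloc(self) -> int:
        return self.free.pop()  # any free block works; no contiguity needed

    def release(self, block_id: int) -> None:
        self.free.append(block_id)

class Sequence:
    """Each sequence keeps a page table: logical index -> physical block."""
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []
        self.num_tokens = 0

    def append_token(self) -> None:
        if self.num_tokens % BLOCK_SIZE == 0:  # current block is full
            self.block_table.append(self.allocator.alloc())
        self.num_tokens += 1

pool = BlockAllocator(num_blocks=1024)
seq = Sequence(pool)
for _ in range(40):  # 40 tokens -> ceil(40/16) = 3 blocks
    seq.append_token()
print(seq.block_table)  # three physical block IDs, not necessarily adjacent
```

Blocks are grabbed one at a time as the sequence grows, so the only waste is the unfilled tail of the last block.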

In practice, PagedAttention increases serving throughput by 2-4x compared to naive contiguous allocation on the same hardware. vLLM uses it by default – you don’t need to enable it.

Configuring vLLM for KV Cache Performance

The parameters that matter most for KV cache behavior:

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --gpu-memory-utilization 0.92 \
  --max-model-len 8192 \
  --block-size 16 \
  --max-num-seqs 256 \
  --enable-prefix-caching \
  --kv-cache-dtype auto

Flag                        What It Controls                                        Default
--gpu-memory-utilization    Fraction of GPU memory for model + KV cache             0.9
--max-model-len             Maximum context length (sets KV cache upper bound)      Model default
--block-size                Tokens per PagedAttention block                         16
--max-num-seqs              Maximum concurrent sequences                            256
--enable-prefix-caching     Reuse KV blocks across requests with shared prefixes    Disabled
--kv-cache-dtype            Data type for KV cache (auto, fp8, fp8_e5m2, fp8_e4m3)  auto

The biggest lever is --gpu-memory-utilization. Bump it to 0.92-0.95 if your GPU isn’t running other workloads. Every percentage point you recover translates directly to more KV cache blocks and more concurrent requests.

Setting --max-model-len lower than the model’s maximum context is the second-biggest lever. If your use case only needs 4096 tokens, don’t leave it at 128K – you’ll allocate space for blocks you’ll never use, and the scheduler’s memory profiler will reserve headroom based on that ceiling.
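
To see why that ceiling matters, compare the worst-case per-sequence KV reservation at a few --max-model-len values, using the same back-of-envelope cost of 128 KiB per token for Llama 3.1 8B in bf16:

```python
# 2 tensors (K, V) x 32 layers x 8 KV heads x 128 head dim x 2 bytes (bf16)
KV_BYTES_PER_TOKEN = 2 * 32 * 8 * 128 * 2  # = 128 KiB

for max_len in (4096, 32768, 131072):
    worst_case_gib = max_len * KV_BYTES_PER_TOKEN / 1024**3
    print(f"--max-model-len {max_len:>6}: up to {worst_case_gib:5.1f} GiB "
          f"of KV cache per sequence")
```

At the full 131072-token ceiling a single sequence can claim 16 GiB – the size of the model weights themselves – which is why capping the context length buys so much concurrency.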

Prefix Caching for Shared System Prompts

If you’re running a chatbot where every request starts with the same 500-token system prompt, you’re recomputing and storing those KV values for every single request. Prefix caching eliminates that waste.

With --enable-prefix-caching, vLLM hashes the token blocks of each prompt and stores them in an LRU cache. When a new request comes in with the same prefix, the server reuses the cached KV blocks instead of recomputing them.
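
The block-hashing scheme can be sketched as a chain: each block’s hash covers its own tokens plus every block before it, so a lookup can only ever match a true prefix. This is toy code – vLLM’s real hashing and LRU eviction are more involved – but the chaining idea is the same:

```python
import hashlib

BLOCK_SIZE = 16

def block_hashes(token_ids: list[int]) -> list[str]:
    """Chained hashes: block i's hash depends on blocks 0..i-1 too."""
    hashes, prev = [], ""
    full = len(token_ids) - len(token_ids) % BLOCK_SIZE  # only full blocks
    for start in range(0, full, BLOCK_SIZE):
        block = token_ids[start:start + BLOCK_SIZE]
        prev = hashlib.sha256((prev + repr(block)).encode()).hexdigest()
        hashes.append(prev)
    return hashes

shared_prefix = list(range(32))  # two full blocks of identical tokens
a = block_hashes(shared_prefix + [100, 101])
b = block_hashes(shared_prefix + [200, 201])
# Both requests hash the shared full blocks identically, so the second
# request can reuse the first request's cached KV blocks for them.
print(a == b)  # True: the differing partial tail is never hashed
```

Note that only complete blocks participate, which is why a prefix shorter than one block gains nothing from caching.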

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

SYSTEM_PROMPT = """You are a senior software engineer at a fintech company.
You help users debug Python code related to payment processing.
Always include error handling. Follow PEP 8 style conventions.
Reference the internal payments API documentation when relevant."""

# Every call reuses the cached KV blocks for SYSTEM_PROMPT
# First request computes the KV cache; subsequent requests skip that work
for user_query in [
    "How do I retry a failed payment?",
    "Why is my webhook handler timing out?",
    "Show me how to validate a card token.",
]:
    response = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_query},
        ],
        max_tokens=512,
    )
    print(f"Q: {user_query}")
    print(f"A: {response.choices[0].message.content[:100]}...\n")

The performance gain depends on your prefix length and request volume. For a 500-token system prompt serving 100 requests/second, prefix caching can cut time-to-first-token by 30-50% and save proportional KV memory. Check the vLLM metrics to verify it’s working:

# Look for cache hit rate in Prometheus metrics
curl -s http://localhost:8000/metrics | grep prefix_cache

KV Cache Quantization with FP8

By default, KV cache values are stored in the same dtype as the model (usually bf16 or fp16). On Hopper GPUs (H100, H200) and Ada Lovelace (RTX 4090), you can quantize the KV cache to FP8, cutting its memory footprint in half and roughly doubling the number of concurrent sequences.

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --gpu-memory-utilization 0.92 \
  --max-model-len 8192 \
  --kv-cache-dtype fp8 \
  --enable-prefix-caching

The --kv-cache-dtype fp8 flag tells vLLM to quantize the KV cache to 8-bit floating point. The model weights themselves stay in bf16 – only the cached attention keys and values get quantized. In most benchmarks, FP8 KV cache has negligible impact on output quality while nearly doubling throughput capacity.

Two variants are available:

  • fp8 or fp8_e4m3 – 4 exponent bits, 3 mantissa bits. Better precision, recommended for most use cases.
  • fp8_e5m2 – 5 exponent bits, 2 mantissa bits. Wider range but less precision.
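
The trade-off between the two is easy to quantify. A quick sketch of each format’s largest finite value and smallest positive normal, using the standard FP8 definitions (the e4m3 numbers are for the “fn” variant used in ML, which reclaims the infinity encodings for finite values):

```python
# Largest finite value and smallest positive normal per FP8 format
E4M3_MAX = 1.75 * 2**8     # 448.0  (all-ones mantissa at top exponent is NaN)
E4M3_MIN_NORMAL = 2**-6    # 0.015625
E5M2_MAX = 1.75 * 2**15    # 57344.0 (IEEE-style, reserves inf/NaN encodings)
E5M2_MIN_NORMAL = 2**-14   # ~6.1e-05

print(f"e4m3: range +/-{E4M3_MAX:.0f}, smallest normal {E4M3_MIN_NORMAL}")
print(f"e5m2: range +/-{E5M2_MAX:.0f}, smallest normal {E5M2_MIN_NORMAL:.2e}")
# e5m2 spans a 128x wider magnitude range; e4m3 has an extra mantissa bit,
# i.e. finer resolution, which usually matters more for attention K/V values.
```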

FP8 KV cache requires hardware support. It works on H100, H200, L40S, RTX 4090, and newer. On older GPUs like A100, you’ll get an error.

Monitoring KV Cache Usage

vLLM exposes Prometheus metrics that tell you exactly how your KV cache is being used. These are the ones you should alert on:

# Key metrics to track
curl -s http://localhost:8000/metrics | grep -E "cache_usage|num_requests|batch_size"

The critical metrics:

  • vllm:gpu_cache_usage_perc – Percentage of KV cache blocks in use. When this hits 100%, new requests queue up or get preempted.
  • vllm:cpu_cache_usage_perc – Percentage of CPU swap space in use. If this is high, sequences are being swapped out frequently.
  • vllm:num_requests_running – Currently processing requests.
  • vllm:num_requests_waiting – Queued requests waiting for KV cache space.

Here’s a quick monitoring script:

import requests
import time

def get_vllm_stats(base_url: str = "http://localhost:8000") -> dict:
    """Parse key vLLM metrics from the Prometheus endpoint."""
    resp = requests.get(f"{base_url}/metrics", timeout=5)
    stats = {}
    for line in resp.text.split("\n"):
        if line.startswith("#"):
            continue
        for metric in [
            "vllm:gpu_cache_usage_perc",
            "vllm:num_requests_running",
            "vllm:num_requests_waiting",
        ]:
            if line.startswith(metric):
                stats[metric] = float(line.split()[-1])
    return stats

# Poll every 5 seconds
while True:
    stats = get_vllm_stats()
    gpu_cache = stats.get("vllm:gpu_cache_usage_perc", 0) * 100
    running = int(stats.get("vllm:num_requests_running", 0))
    waiting = int(stats.get("vllm:num_requests_waiting", 0))
    print(f"KV Cache: {gpu_cache:.1f}% | Running: {running} | Waiting: {waiting}")

    if gpu_cache > 95:
        print("WARNING: KV cache near capacity, requests will queue")
    time.sleep(5)

If gpu_cache_usage_perc is consistently above 90%, you have three options: lower --max-model-len, enable --kv-cache-dtype fp8, or add more GPUs with tensor parallelism.

Continuous Batching and KV Cache Interaction

Traditional static batching waits for all requests in a batch to finish before accepting new ones. A batch of 8 requests where 7 finish in 50ms and 1 takes 500ms means 7 GPU slots sit idle for 450ms.

Continuous batching (also called iteration-level scheduling) evaluates the batch at every token generation step. When a request completes, a new request takes its slot immediately. This is built into vLLM and works hand-in-hand with PagedAttention.

The KV cache interaction matters here: when a new request joins a running batch, PagedAttention allocates blocks incrementally as the sequence grows. When a request finishes, its blocks are freed instantly for the next request. No defragmentation needed, no memory reservation gaps.

Tune these parameters together for optimal throughput:

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --gpu-memory-utilization 0.92 \
  --max-model-len 4096 \
  --max-num-seqs 128 \
  --max-num-batched-tokens 4096 \
  --enable-chunked-prefill

--enable-chunked-prefill is worth calling out. It breaks long prefill (prompt processing) operations into chunks so they don’t block decoding (token generation) for other requests. Without it, a single 8K-token prompt stalls all other requests during its prefill phase. With it, the scheduler interleaves prefill chunks with decode steps, keeping latency predictable for all requests.

Common Errors and Fixes

ValueError: The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache

Your GPU doesn’t have enough memory to allocate KV cache blocks for the full context length. Lower --max-model-len to what you actually need:

# Set context to what your workload actually requires
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.92

torch.cuda.OutOfMemoryError during server startup

The model itself doesn’t fit. Try quantization or tensor parallelism:

# Quantize the model weights
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --quantization awq \
  --max-model-len 4096

# Or split across multiple GPUs
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --max-model-len 4096

RuntimeError: FlashAttention only supports fp16 and bf16 when using --kv-cache-dtype fp8

Your GPU doesn’t support FP8. This requires Hopper (H100) or Ada Lovelace (RTX 4090) architecture. On A100 or older, remove the --kv-cache-dtype fp8 flag and use bf16 KV cache instead.

Requests are queuing but GPU utilization is low.

This usually means the KV cache is full but the GPU compute isn’t saturated. Check vllm:gpu_cache_usage_perc – if it’s at 100%, you need more KV cache space. Try:

  1. Lower --max-model-len if your sequences are shorter than the limit
  2. Enable --kv-cache-dtype fp8 on supported hardware
  3. Reduce --max-num-seqs so fewer sequences compete for cache space
  4. Increase --gpu-memory-utilization to 0.95

Prefix caching isn’t improving performance.

Verify your requests actually share a common prefix. The prefix must match at the token level, not just the string level. If you’re using different tokenizer settings or slightly different system prompts, the hashes won’t match. Also, prefix caching has a warmup period – the first request with a given prefix always computes it from scratch.

# Confirm prefix caching is active in the metrics
curl -s http://localhost:8000/metrics | grep -i "prefix"

WARNING: Sequence group has been preempted

vLLM is evicting sequences when the KV cache runs out. This is expected under heavy load, but frequent preemption kills throughput. Reduce concurrent load with --max-num-seqs, increase GPU memory allocation with --gpu-memory-utilization 0.95, or add more GPU capacity with tensor parallelism.