The KV Cache Is Your Bottleneck
Every transformer-based LLM stores key-value tensors from previous tokens so it doesn’t recompute attention over the entire sequence on each new token. This is the KV cache, and during inference it’s almost always the thing that limits how many requests you can serve concurrently.
For a Llama 3.1 8B model with a 4096-token context, each request’s KV cache eats roughly 0.5 GB of VRAM (the model’s grouped-query attention – 8 KV heads instead of 32 – is what keeps it that low). Your 80 GB A100 loads the model weights (~16 GB in bf16), reserves some headroom, and the remaining ~55 GB goes to KV cache. That caps you at roughly 100 concurrent sequences. Want to serve 200? You need to manage that memory much more carefully.
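That per-request figure can be sanity-checked from the model config; the constants below are Llama 3.1 8B’s published values (verify against the `config.json` of whatever model you deploy):

```python
# KV cache sizing from the model config (Llama 3.1 8B values)
layers = 32          # num_hidden_layers
kv_heads = 8         # num_key_value_heads (GQA)
head_dim = 128       # hidden_size / num_attention_heads = 4096 / 32
dtype_bytes = 2      # bf16
context = 4096       # tokens per request

per_token = 2 * layers * kv_heads * head_dim * dtype_bytes  # x2 for K and V
per_request_gib = per_token * context / 2**30

print(f"{per_token} bytes/token")            # 131072 bytes = 128 KiB
print(f"{per_request_gib:.2f} GiB/request")  # 0.50 GiB
```

Grouped-query attention is doing the heavy lifting here: a model with full multi-head attention (32 KV heads) would need four times the space, 2 GB per 4096-token request.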
Here’s the fastest way to see this in action with vLLM (something like the following, with the model name adjusted to taste):

```shell
vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --max-model-len 4096 \
    --enable-prefix-caching
```
That single command gives you PagedAttention, continuous batching, and prefix caching out of the box. The rest of this guide explains what each of those does and how to tune them.
How PagedAttention Works
Traditional KV cache implementations allocate a contiguous block of GPU memory for each request upfront, sized for the maximum possible sequence length. If your max_model_len is 4096 tokens but the average request only uses 500 tokens, you’re wasting 87% of allocated KV cache memory. Worse, the contiguous allocation causes fragmentation – small gaps between blocks that can’t be used by new requests.
PagedAttention (introduced in the vLLM paper) borrows the concept of virtual memory paging from operating systems. Instead of one contiguous allocation per sequence, it splits the KV cache into fixed-size blocks (default 16 tokens per block). Each sequence gets a page table that maps logical KV positions to physical blocks scattered anywhere in GPU memory.
This gives you three wins:
- No internal fragmentation. Memory is allocated in small blocks, so you waste at most one block per sequence instead of thousands of tokens.
- No external fragmentation. Blocks don’t need to be contiguous, so free memory can always be used.
- Memory sharing. Sequences with the same prefix (like a shared system prompt) can point to the same physical blocks via copy-on-write.
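The bookkeeping behind those wins can be sketched in a few lines (illustrative only, not vLLM’s actual implementation): a free list of fixed-size physical blocks plus per-block reference counts, which is what makes prefix sharing cheap.

```python
# Toy PagedAttention-style block allocator: physical blocks live in a free
# list, and reference counts let multiple sequences share a block.
BLOCK_SIZE = 16  # tokens per block, matching vLLM's default

class BlockAllocator:
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))  # physical block IDs
        self.refcount: dict[int, int] = {}

    def alloc(self) -> int:
        block = self.free.pop()              # any free block will do
        self.refcount[block] = 1
        return block

    def share(self, block: int) -> int:
        self.refcount[block] += 1            # prefix sharing: no copy yet
        return block

    def release(self, block: int) -> None:
        self.refcount[block] -= 1
        if self.refcount[block] == 0:        # last reference: reclaim instantly
            self.free.append(block)

allocator = BlockAllocator(num_blocks=4)
seq_a = [allocator.alloc(), allocator.alloc()]  # 32-token sequence: 2 blocks
seq_b = [allocator.share(seq_a[0])]             # shares seq_a's first block
print(allocator.refcount[seq_a[0]], len(allocator.free))  # 2 2
```

A real allocator also handles copy-on-write (cloning a shared block before one sequence appends to it) and swapping blocks to CPU memory, but the free-list-plus-refcount core is why freeing a finished sequence is O(1) with no defragmentation.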
In practice, PagedAttention increases serving throughput by 2-4x compared to naive contiguous allocation on the same hardware. vLLM uses it by default – you don’t need to enable it.
Configuring vLLM for KV Cache Performance
The parameters that matter most for KV cache behavior:
| Flag | What It Controls | Default |
|---|---|---|
| `--gpu-memory-utilization` | Fraction of GPU memory for model + KV cache | 0.9 |
| `--max-model-len` | Maximum context length (sets KV cache upper bound) | Model default |
| `--block-size` | Tokens per PagedAttention block | 16 |
| `--max-num-seqs` | Maximum concurrent sequences | 256 |
| `--enable-prefix-caching` | Reuse KV blocks across requests with shared prefixes | Disabled |
| `--kv-cache-dtype` | Data type for KV cache (`auto`, `fp8`, `fp8_e5m2`, `fp8_e4m3`) | `auto` |
The biggest lever is --gpu-memory-utilization. Bump it to 0.92-0.95 if your GPU isn’t running other workloads. Every percentage point you recover translates directly to more KV cache blocks and more concurrent requests.
Setting --max-model-len lower than the model’s maximum context is the second-biggest lever. If your use case only needs 4096 tokens, don’t leave it at 128K – vLLM verifies at startup that a full-length sequence fits and budgets KV blocks against that ceiling, so you reserve space for blocks you’ll never use.
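To see how those flags turn into a block budget, here’s a rough estimate (the inputs are assumptions, not measured values: an A100 80 GB serving Llama 3.1 8B in bf16, 128 KiB of KV per token, and a guessed 2 GB of activation headroom):

```python
# Rough KV block budget; all inputs are assumptions, not measured values.
gpu_gib = 80
weights_gib = 16             # Llama 3.1 8B in bf16
activation_headroom_gib = 2  # working memory for forward passes (a guess)
utilization = 0.90           # --gpu-memory-utilization

kv_budget_gib = gpu_gib * utilization - weights_gib - activation_headroom_gib

per_token_kib = 128          # KV bytes per token for this model in bf16
block_tokens = 16            # --block-size
block_kib = per_token_kib * block_tokens

num_blocks = int(kv_budget_gib * 2**20 / block_kib)
print(num_blocks, num_blocks * block_tokens // 4096)  # blocks, full 4096-token seqs
```

About 27,648 blocks, or around 108 simultaneous full 4096-token sequences – which is why raising `--gpu-memory-utilization` a few points, lowering `--max-model-len`, or halving the KV dtype moves concurrency so directly.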
Prefix Caching for Shared System Prompts
If you’re running a chatbot where every request starts with the same 500-token system prompt, you’re recomputing and storing those KV values for every single request. Prefix caching eliminates that waste.
With --enable-prefix-caching, vLLM hashes the token blocks of each prompt and stores them in an LRU cache. When a new request comes in with the same prefix, the server reuses the cached KV blocks instead of recomputing them.
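The idea can be sketched as a chain of block hashes (illustrative; vLLM’s real scheme hashes token IDs plus metadata, not `repr` strings):

```python
# Sketch of prefix-cache keying: each full 16-token block is hashed together
# with its predecessor's hash, so identical token prefixes yield identical
# chains of cache keys. (Illustrative; not vLLM's exact hashing scheme.)
import hashlib

BLOCK_SIZE = 16

def block_hashes(token_ids: list[int]) -> list[str]:
    hashes: list[str] = []
    parent = ""
    full = len(token_ids) - len(token_ids) % BLOCK_SIZE  # only full blocks cached
    for i in range(0, full, BLOCK_SIZE):
        block = token_ids[i:i + BLOCK_SIZE]
        parent = hashlib.sha256((parent + repr(block)).encode()).hexdigest()
        hashes.append(parent)
    return hashes

shared_system_prompt = list(range(32))  # pretend token IDs: two full blocks
a = block_hashes(shared_system_prompt + [1] * 16)
b = block_hashes(shared_system_prompt + [2] * 16)
print(a[:2] == b[:2], a[2] == b[2])  # True False – shared blocks hit, tails miss
```

Because each hash folds in its predecessor, a cache hit on block *n* guarantees blocks 1 through *n* matched too – and it also shows why a one-token difference early in the prompt invalidates everything after it.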
For example:

```shell
vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --enable-prefix-caching
```
The performance gain depends on your prefix length and request volume. For a 500-token system prompt serving 100 requests/second, prefix caching can cut time-to-first-token by 30-50% and save proportional KV memory. Check the vLLM metrics to verify it’s working:
```shell
curl -s http://localhost:8000/metrics | grep prefix_cache
```
KV Cache Quantization with FP8
By default, KV cache values are stored in the same dtype as the model (usually bf16 or fp16). On Hopper GPUs (H100, H200) and Ada Lovelace (RTX 4090), you can quantize the KV cache to FP8, cutting its memory footprint in half and roughly doubling the number of concurrent sequences.
```shell
vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --kv-cache-dtype fp8
```
The --kv-cache-dtype fp8 flag tells vLLM to quantize the KV cache to 8-bit floating point. The model weights themselves stay in bf16 – only the cached attention keys and values get quantized. In most benchmarks, FP8 KV cache has negligible impact on output quality while nearly doubling throughput capacity.
Two variants are available:
- `fp8` or `fp8_e4m3` – 4 exponent bits, 3 mantissa bits. Better precision; recommended for most use cases.
- `fp8_e5m2` – 5 exponent bits, 2 mantissa bits. Wider range but less precision.
FP8 KV cache requires hardware support. It works on H100, H200, L40S, RTX 4090, and newer. On older GPUs like A100, you’ll get an error.
Monitoring KV Cache Usage
vLLM exposes Prometheus metrics that tell you exactly how your KV cache is being used. These are the ones you should alert on:
```shell
curl -s http://localhost:8000/metrics | grep -E 'vllm:(gpu_cache_usage_perc|cpu_cache_usage_perc|num_requests)'
```
The critical metrics:
- `vllm:gpu_cache_usage_perc` – Percentage of KV cache blocks in use. When this hits 100%, new requests queue up or get preempted.
- `vllm:cpu_cache_usage_perc` – Percentage of CPU swap space in use. If this is high, sequences are being swapped out frequently.
- `vllm:num_requests_running` – Currently processing requests.
- `vllm:num_requests_waiting` – Queued requests waiting for KV cache space.
Here’s a quick monitoring sketch (assumes the default server endpoint at `http://localhost:8000/metrics`):

```python
import re
import time
import urllib.request

METRICS_URL = "http://localhost:8000/metrics"

def scrape(name: str) -> float | None:
    """Fetch the Prometheus endpoint and return one metric's value."""
    body = urllib.request.urlopen(METRICS_URL, timeout=5).read().decode()
    m = re.search(rf"^{re.escape(name)}(?:{{[^}}]*}})?\s+(\S+)", body, re.MULTILINE)
    return float(m.group(1)) if m else None

while True:
    usage = scrape("vllm:gpu_cache_usage_perc")   # reported as a 0-1 fraction
    waiting = scrape("vllm:num_requests_waiting")
    if usage is not None and waiting is not None:
        print(f"KV cache usage: {usage:.1%}  waiting requests: {waiting:.0f}")
    time.sleep(10)
```
If gpu_cache_usage_perc is consistently above 90%, you have three options: lower --max-model-len, enable --kv-cache-dtype fp8, or add more GPUs with tensor parallelism.
Continuous Batching and KV Cache Interaction
Traditional static batching waits for all requests in a batch to finish before accepting new ones. A batch of 8 requests where 7 finish in 50ms and 1 takes 500ms means 7 GPU slots sit idle for 450ms.
Continuous batching (also called iteration-level scheduling) evaluates the batch at every token generation step. When a request completes, a new request takes its slot immediately. This is built into vLLM and works hand-in-hand with PagedAttention.
The KV cache interaction matters here: when a new request joins a running batch, PagedAttention allocates blocks incrementally as the sequence grows. When a request finishes, its blocks are freed instantly for the next request. No defragmentation needed, no memory reservation gaps.
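A toy trace makes the scheduling concrete (request IDs and token counts are made up): sequences join and leave the running batch at individual decode steps, never waiting for a batch boundary.

```python
# Iteration-level scheduling in miniature: admit a waiting request the moment
# a running one finishes, instead of waiting for the whole batch to drain.
from collections import deque

waiting = deque([("A", 3), ("B", 1), ("C", 2), ("D", 2)])  # (id, tokens left)
running: dict[str, int] = {}
MAX_SEQS = 2  # stand-in for --max-num-seqs
steps = []

while waiting or running:
    while waiting and len(running) < MAX_SEQS:  # fill freed slots immediately
        rid, n = waiting.popleft()
        running[rid] = n
    steps.append(sorted(running))               # who decodes this step
    for rid in list(running):                   # one token for each sequence
        running[rid] -= 1
        if running[rid] == 0:
            del running[rid]                    # KV blocks freed instantly

print(steps)  # [['A', 'B'], ['A', 'C'], ['A', 'C'], ['D'], ['D']]
```

B’s slot is handed to C on the very next step; under static batching, C and D would have waited for A – the longest request in the batch – to finish.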
Tune these parameters together for optimal throughput:
```shell
vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --max-num-seqs 256 \
    --enable-chunked-prefill \
    --gpu-memory-utilization 0.92
```
--enable-chunked-prefill is worth calling out. It breaks long prefill (prompt processing) operations into chunks so they don’t block decoding (token generation) for other requests. Without it, a single 8K-token prompt stalls all other requests during its prefill phase. With it, the scheduler interleaves prefill chunks with decode steps, keeping latency predictable for all requests.
Common Errors and Fixes
ValueError: The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache
Your GPU doesn’t have enough memory to allocate KV cache blocks for the full context length. Lower --max-model-len to what you actually need:
```shell
vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --max-model-len 8192
```
torch.cuda.OutOfMemoryError during server startup
The model itself doesn’t fit. Try quantization or tensor parallelism:
```shell
vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --tensor-parallel-size 2
```
RuntimeError: FlashAttention only supports fp16 and bf16 when using --kv-cache-dtype fp8
Your GPU doesn’t support FP8. This requires Hopper (H100) or Ada Lovelace (RTX 4090) architecture. On A100 or older, remove the --kv-cache-dtype fp8 flag and use bf16 KV cache instead.
Requests are queuing but GPU utilization is low.
This usually means the KV cache is full but the GPU compute isn’t saturated. Check vllm:gpu_cache_usage_perc – if it’s at 100%, you need more KV cache space. Try:
- Lower `--max-model-len` if your sequences are shorter than the limit
- Enable `--kv-cache-dtype fp8` on supported hardware
- Reduce `--max-num-seqs` so fewer sequences compete for cache space
- Increase `--gpu-memory-utilization` to 0.95
Prefix caching isn’t improving performance.
Verify your requests actually share a common prefix. The prefix must match at the token level, not just the string level. If you’re using different tokenizer settings or slightly different system prompts, the hashes won’t match. Also, prefix caching has a warmup period – the first request with a given prefix always computes it from scratch.
WARNING: Sequence group has been preempted
vLLM is evicting sequences when the KV cache runs out. This is expected under heavy load, but frequent preemption kills throughput. Reduce concurrent load with --max-num-seqs, increase GPU memory allocation with --gpu-memory-utilization 0.95, or add more GPU capacity with tensor parallelism.
Related Guides
- How to Speed Up LLM Inference with Speculative Decoding
- How to Optimize Docker Images for ML Model Serving
- How to Profile and Optimize GPU Memory for LLM Training
- How to Deploy DeepSeek R1 on NVIDIA Blackwell with vLLM’s Disaggregated Serving
- How to Use PyTorch FlexAttention for Fast LLM Inference
- How to Build a Model Serving Cluster with Ray Serve and Docker
- How to Scale ML Training and Inference with Ray
- How to Build a Model Serving Gateway with Envoy and gRPC
- How to Build a Model Serving Autoscaler with Custom Metrics and Kubernetes
- How to Build a Model Inference Cache with Redis and Semantic Hashing