Disaggregated serving — splitting LLM inference into separate prefill and decode worker pools — has moved from research paper to production necessity for DeepSeek R1 on NVIDIA Blackwell. The reason is straightforward: prefill is compute-bound and benefits from packing many tokens at high utilization, while decode is memory-bandwidth-bound and sensitive to per-token latency. Running both phases on the same GPUs forces constant compromise. Separating them lets you optimize each independently.
On GB200 hardware, vLLM’s disaggregated setup delivers 26.2K prefill tokens/GPU/second and 10.1K decode tokens/GPU/second, roughly 3–5x what you’d see on H200. At the software level, the latest optimizations added 38% more throughput at the max-throughput end of the Pareto curve and 13% better interactivity at the minimum-latency end.
Here’s how to configure it.
Why Disaggregated Serving Matters for DeepSeek R1
DeepSeek R1 is a 671B-parameter MoE model. Only a fraction of its experts activate per token, but the KV cache grows quickly during prefill of long reasoning traces. The interaction between expert parallelism and KV cache transfer is the crux of why disaggregation matters here.
In a standard (aggregated) deployment, when a long prefill request lands on a GPU that’s also mid-decode, the decode requests see their inter-token latency spike: the prefill’s compute burst interrupts their steady KV cache read pattern. Chunked prefill mitigates this, but doesn’t eliminate it.
Disaggregated serving routes all prefill to dedicated GPU pools and all decode to separate pools. The KV cache is transferred over NVLink or NCCL after prefill completes, then the decode worker picks up from there. On GB200 clusters where compute nodes share NVLink-C2C fabric, this transfer is fast enough that the added latency is small compared to the throughput and interactivity gains.
Prerequisites and Installation
You need vLLM 0.8.0 or later; the disaggregated prefill feature was still experimental in the 0.7.x series.
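A typical install, assuming a recent CUDA toolchain. The `flashinfer-python` package is optional but provides the Blackwell-tuned kernels used later in this guide:

```shell
# Install vLLM 0.8.0+ and the optional FlashInfer kernel package.
pip install "vllm>=0.8.0"
pip install flashinfer-python

# Confirm the installed version meets the minimum.
python -c "import vllm; print(vllm.__version__)"
```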
For multi-node setups, all nodes need shared model weight storage (NFS or object storage mounted at the same path) and NCCL connectivity between them.
Core Disaggregated Setup: Single-Node Two-GPU
Start with the minimal setup — one GPU for prefill, one for decode — to validate the configuration before scaling up.
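A minimal sketch of the two-worker launch using vLLM’s `--kv-transfer-config` with the `PyNcclConnector`; the model path, ports, and KV port here are illustrative:

```shell
# Prefill worker (KV producer) on GPU 0, serving HTTP on 8100.
CUDA_VISIBLE_DEVICES=0 vllm serve deepseek-ai/DeepSeek-R1 \
  --port 8100 \
  --kv-transfer-config '{"kv_connector":"PyNcclConnector","kv_role":"kv_producer","kv_rank":0,"kv_parallel_size":2,"kv_buffer_size":2e9,"kv_port":14579}' &

# Decode worker (KV consumer) on GPU 1, serving HTTP on 8200.
CUDA_VISIBLE_DEVICES=1 vllm serve deepseek-ai/DeepSeek-R1 \
  --port 8200 \
  --kv-transfer-config '{"kv_connector":"PyNcclConnector","kv_role":"kv_consumer","kv_rank":1,"kv_parallel_size":2,"kv_buffer_size":2e9,"kv_port":14579}' &
```

A lightweight proxy then fronts both workers on a single port, sending each new request to the prefill worker first and the continuation to the decode worker; vLLM ships a reference proxy in its disaggregated-prefill examples.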
Key parameters to understand:
- `kv_role: "kv_producer"` — this worker runs prefill and ships KV cache to consumers
- `kv_role: "kv_consumer"` — receives KV cache and runs decode
- `kv_buffer_size` — buffer in bytes for in-flight KV transfers; `2e9` (2 GB) is a safe starting point for DeepSeek R1’s layer count
- `kv_port` — dedicated port for NCCL KV transfer traffic; keep it separate from the HTTP serving port
GB200 Optimized Configuration
For production GB200 deployments, you want tensor parallelism across all NVL-connected GPUs in each instance, FP4 quantization on MoE weights, and expert parallelism turned on. The benchmark numbers (26.2K prefill TPGS, 10.1K decode TPGS) used a 4-prefill-instance × 2-GB200 + 1-decode-instance × 8-GB200 layout.
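A sketch of one prefill instance spanning a 2-GPU NVLink domain; model path, ports, and rank assignments are illustrative:

```shell
# Prefill instance: TP=2 across an NVLink-connected GB200 pair.
# Prefill is compute-bound, so DBO and large CUDA graph sizes matter less here.
VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1 \
CUDA_VISIBLE_DEVICES=0,1 \
vllm serve deepseek-ai/DeepSeek-R1 \
  --tensor-parallel-size 2 \
  --quantization fp8 \
  --enable-expert-parallel \
  --port 8100 \
  --kv-transfer-config '{"kv_connector":"PyNcclConnector","kv_role":"kv_producer","kv_rank":0,"kv_parallel_size":5,"kv_buffer_size":2e9,"kv_port":14579}'
```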
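And a sketch of the decode instance, again with illustrative paths and ports:

```shell
# Decode instance: TP=8 across eight GB200s, with EPLB and Dual Batch
# Overlap enabled for the memory-bandwidth-bound decode phase.
VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1 \
vllm serve deepseek-ai/DeepSeek-R1 \
  --tensor-parallel-size 8 \
  --quantization fp8 \
  --enable-expert-parallel \
  --enable-eplb \
  --enable-dbo \
  --cuda-graph-capture-size 2048 \
  --port 8200 \
  --kv-transfer-config '{"kv_connector":"PyNcclConnector","kv_role":"kv_consumer","kv_rank":4,"kv_parallel_size":5,"kv_buffer_size":2e9,"kv_port":14579}'
```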
Flag breakdown:
- `--enable-expert-parallel` — activates Wide-EP, distributing MoE experts across all TP ranks. Critical for DeepSeek R1’s 256-expert architecture
- `--enable-eplb` — Expert Parallel Load Balancing; rebalances expert assignments based on observed activation frequency, preventing hot-expert bottlenecks
- `--enable-dbo` — Dual Batch Overlap; overlaps compute with all-to-all collective communication during decode, hiding ~20% of dispatch latency
- `--quantization fp8` — uses FP8 for attention layers and FP4 for MoE expert weights when FlashInfer MXFP4 backends are enabled (set `VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1`)
- `--cuda-graph-capture-size 2048` — captures CUDA graphs up to batch size 2048; larger values reduce Python overhead but increase GPU memory at startup
Enabling FP4 MoE Kernels
GB200’s FP4 tensor cores are the hardware advantage that makes the throughput numbers possible. FP4 for MoE expert weights reduces memory bandwidth pressure by 2x vs FP8, and the FlashInfer TRTLLM-Gen GEMM kernels are specifically tuned for GB200’s tensor core layout.
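Enabling the FP4 path is an environment variable set before launching each worker:

```shell
# Route MoE GEMMs through FlashInfer's TRTLLM-Gen MXFP4 kernels.
# Blackwell-only; this is a no-op on H100/H200.
export VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1
```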
On B200 and GB200 GPUs you should see MoE GEMM throughput roughly double compared to FP8-only mode. On older H100/H200 hardware this env var has no effect — FP4 tensor cores are Blackwell-exclusive.
Testing the Setup
Once the prefill worker, decode worker, and proxy are all running, test with a standard OpenAI-compatible request against the proxy port:
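Assuming the proxy listens on port 8000 (illustrative), the request looks like any other OpenAI-compatible call:

```shell
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "deepseek-ai/DeepSeek-R1",
        "messages": [{"role": "user", "content": "Explain KV cache in one paragraph."}],
        "max_tokens": 128
      }'
```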
Verify the proxy is routing correctly by checking vLLM’s /metrics endpoint on each worker:
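Assuming illustrative worker ports of 8100 (prefill) and 8200 (decode):

```shell
# Check the in-flight request gauge on each worker.
curl -s http://localhost:8100/metrics | grep vllm:num_requests_running
curl -s http://localhost:8200/metrics | grep vllm:num_requests_running
```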
You should see vllm:num_requests_running increment on the prefill worker when a request arrives, then drop to zero and increment on the decode worker once KV cache transfer completes.
Monitoring with Prometheus and Grafana
vLLM exposes Prometheus metrics at /metrics on each server. Wire both workers into your scrape config:
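A minimal Prometheus scrape config, with hostnames as placeholders:

```yaml
scrape_configs:
  - job_name: "vllm-prefill"
    static_configs:
      - targets: ["prefill-host:8100"]
  - job_name: "vllm-decode"
    static_configs:
      - targets: ["decode-host:8200"]
```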
The metrics to watch for disaggregated serving health:
| Metric | Worker | What it tells you |
| --- | --- | --- |
| `vllm:time_to_first_token_seconds` | prefill | TTFT; should track prefill compute time |
| `vllm:time_per_output_token_seconds` | decode | inter-token latency (ITL) |
| `vllm:num_requests_waiting` | both | queue depth per pool |
| `vllm:num_requests_running` | both | in-flight requests; confirms routing |
| `vllm:gpu_cache_usage_perc` | decode | KV cache memory pressure |
A healthy disaggregated deployment shows TTFT dominated by prefill compute time, ITL dominated by decode memory bandwidth, and neither queue backing up. If the decode queue grows while prefill stays empty, add decode instances. If prefill backs up, add prefill instances.
Scaling to Multi-Node with Ray Serve
For larger clusters, Ray Serve’s LLM API handles the orchestration layer — prefill/decode disaggregation, data parallel routing, and prefix cache affinity routing.
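A sketch assuming Ray Serve’s `ray.serve.llm` API; the exact prefill/decode disaggregation entry point varies by Ray version, so treat the `kv_transfer_config` pass-through and replica counts here as illustrative rather than a definitive wiring:

```python
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

# Prefill pool: smaller TP per replica, scaled out horizontally.
prefill_config = LLMConfig(
    model_loading_config={"model_id": "deepseek-ai/DeepSeek-R1"},
    deployment_config={"autoscaling_config": {"min_replicas": 4, "max_replicas": 4}},
    engine_kwargs={
        "tensor_parallel_size": 2,
        "kv_transfer_config": {"kv_connector": "PyNcclConnector", "kv_role": "kv_producer"},
    },
)

# Decode pool: wide TP per replica for memory bandwidth.
decode_config = LLMConfig(
    model_loading_config={"model_id": "deepseek-ai/DeepSeek-R1"},
    deployment_config={"autoscaling_config": {"min_replicas": 1, "max_replicas": 2}},
    engine_kwargs={
        "tensor_parallel_size": 8,
        "kv_transfer_config": {"kv_connector": "PyNcclConnector", "kv_role": "kv_consumer"},
    },
)

app = build_openai_app({"llm_configs": [prefill_config, decode_config]})
serve.run(app)
```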
Ray Serve handles request routing automatically — prefill requests go to prefill replicas, KV cache transfers happen over Ray’s object store or direct NCCL, and decode picks up from there.
Common Pitfalls
KV transfer hanging at startup — both workers must be fully initialized before the proxy starts routing requests. Add a readiness check loop (GET /health) rather than a fixed sleep 30.
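A hypothetical readiness loop (worker ports are illustrative) that gates proxy startup on both workers reporting healthy:

```shell
# Poll each worker's /health endpoint until it responds, then start the proxy.
for port in 8100 8200; do
  until curl -sf "http://localhost:${port}/health" > /dev/null; do
    echo "waiting for worker on :${port}..."
    sleep 2
  done
done
echo "all workers ready; starting proxy"
```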
kv_rank conflicts — each worker in the NCCL group needs a unique kv_rank. In a 4-prefill + 1-decode setup, assign ranks 0–3 to prefill and rank 4 to decode, with kv_parallel_size: 5.
Out of memory during KV transfer — the kv_buffer_size must fit in GPU memory alongside the model weights and KV cache. On a 671B FP8 model, each GB200 holds ~140GB of weights, leaving ~40GB headroom. Keep kv_buffer_size under 10GB.
Expert load imbalance without EPLB — without --enable-eplb, hot experts get overloaded when certain token types dominate your traffic (common with long reasoning chains). Enable it and set a reasonable --eplb-rebalance-interval (default 1000 steps is fine to start).
TTFT not improving despite disaggregation — check that your prefill workers aren’t also processing decode requests. The proxy’s routing logic must send all non-first tokens exclusively to decode workers. Monitor vllm:num_requests_running per worker to verify.
Related Guides
- How to Optimize LLM Serving with KV Cache and PagedAttention
- How to Scale ML Training and Inference with Ray
- How to Build a Model Inference Cache with Redis and Semantic Hashing
- How to Speed Up LLM Inference with Speculative Decoding
- How to Build a Model Serving Cluster with Ray Serve and Docker
- How to Set Up Multi-GPU Training with PyTorch
- How to Optimize Docker Images for ML Model Serving
- How to Monitor GPU Utilization and Debug Training Bottlenecks
- How to Quantize LLMs with GPTQ and AWQ
- How to Speed Up Training with Mixed Precision and PyTorch AMP