Disaggregated serving — splitting LLM inference into separate prefill and decode worker pools — has moved from research paper to production necessity for DeepSeek R1 on NVIDIA Blackwell. The reason is straightforward: prefill is compute-bound and benefits from packing many tokens at high utilization, while decode is memory-bandwidth-bound and sensitive to per-token latency. Running both phases on the same GPUs forces constant compromise. Separating them lets you optimize each independently.

On GB200 hardware, vLLM’s disaggregated setup delivers 26.2K prefill tokens/GPU/second and 10.1K decode tokens/GPU/second — roughly 3–5x what you’d see on H200. At the software level, the latest optimizations added 38% more throughput at maximum load and 13% better interactivity at the minimum-latency point, improvements that hold across the full Pareto curve.

Here’s how to configure it.

Why Disaggregated Serving Matters for DeepSeek R1

DeepSeek R1 is a 671B-parameter MoE model. Only a fraction of its experts activate per token, but the KV cache grows quickly during prefill of long reasoning traces. The interaction between expert parallelism and KV cache transfer is the crux of why disaggregation matters here.

In a standard (aggregated) deployment, when a long prefill request lands on a GPU that’s also mid-decode, the decode requests see their inter-token latency spike — the compute-heavy prefill work interrupts their steady KV cache read pattern. Chunked prefill mitigates this, but doesn’t eliminate it.

Disaggregated serving routes all prefill to dedicated GPU pools and all decode to separate pools. The KV cache is transferred over NVLink (or via NCCL across nodes) after prefill completes, and the decode worker picks up from there. On GB200 clusters, where compute nodes share an NVLink fabric, this transfer is fast enough that the latency penalty is far outweighed by the benefit.

Prerequisites and Installation

You need vLLM 0.8.0 or later; the disaggregated prefill feature graduated from experimental status in the 0.7.x series.

# Install vLLM with CUDA 12.4 support (quote the spec so the shell doesn't treat > as a redirect)
pip install "vllm>=0.8.0"

# Verify GPU visibility
python -c "import torch; print(torch.cuda.device_count(), torch.version.cuda)"

# Download DeepSeek R1 weights (671B — the native FP8 checkpoint is roughly 680GB; plan storage accordingly)
# Use HuggingFace Hub or a pre-cached model directory
huggingface-cli download deepseek-ai/DeepSeek-R1 --local-dir /models/deepseek-r1

For multi-node setups, all nodes need shared model weight storage (NFS or object storage mounted at the same path) and NCCL connectivity between them.
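Mount problems tend to surface only after a worker has spent minutes loading, so it's worth a preflight pass on every node first. The helper below is an illustrative sketch, not part of vLLM; the directory layout and env var names are assumptions to adapt to your cluster:

```python
import os
from pathlib import Path

def preflight(model_dir: str, required_env: tuple[str, ...] = ()) -> list[str]:
    """Return a list of problems found; an empty list means the node looks ready.

    Checks that the shared model directory is mounted and non-empty, and that
    any cluster-specific env vars (e.g. NCCL interface settings) are present.
    """
    problems = []
    path = Path(model_dir)
    if not path.is_dir():
        problems.append(f"model dir missing: {model_dir}")
    elif not any(path.iterdir()):
        problems.append(f"model dir empty (mount failed?): {model_dir}")
    for var in required_env:
        if var not in os.environ:
            problems.append(f"env var not set: {var}")
    return problems
```

Run it on each node (via pdsh, Ansible, or whatever you already use) before launching any workers, so a bad NFS mount fails in seconds rather than mid-startup.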

Core Disaggregated Setup: Single-Node Two-GPU

Start with the minimal setup — one GPU for prefill, one for decode — to validate the configuration before scaling up.

#!/usr/bin/env bash
# disagg_serve.sh
# Single-node disaggregated prefill/decode for DeepSeek R1
# Requires: 2x GPUs (or adjust CUDA_VISIBLE_DEVICES per your node layout)

MODEL="/models/deepseek-r1"
PREFILL_PORT=8100
DECODE_PORT=8200
PROXY_PORT=8000

# --- Prefill Worker ---
CUDA_VISIBLE_DEVICES=0 vllm serve "$MODEL" \
  --host 0.0.0.0 \
  --port $PREFILL_PORT \
  --tensor-parallel-size 1 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.85 \
  --kv-transfer-config \
    '{"kv_connector":"P2pNcclConnector","kv_role":"kv_producer","kv_rank":0,"kv_parallel_size":2,"kv_buffer_size":"2e9","kv_port":14579}' \
  --trust-remote-code &

PREFILL_PID=$!

# --- Decode Worker ---
CUDA_VISIBLE_DEVICES=1 vllm serve "$MODEL" \
  --host 0.0.0.0 \
  --port $DECODE_PORT \
  --tensor-parallel-size 1 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.85 \
  --kv-transfer-config \
    '{"kv_connector":"P2pNcclConnector","kv_role":"kv_consumer","kv_rank":1,"kv_parallel_size":2,"kv_buffer_size":"2e9","kv_port":14579}' \
  --trust-remote-code &

DECODE_PID=$!

echo "Prefill PID: $PREFILL_PID on port $PREFILL_PORT"
echo "Decode PID: $DECODE_PID on port $DECODE_PORT"

# Wait for both servers to be ready (crude; a /health polling loop is more robust — see Common Pitfalls)
sleep 30

# --- Proxy Server ---
# Routes requests: send to prefill first (max_tokens=1), then route to decode
# Clone vLLM to get the proxy script:
#   git clone https://github.com/vllm-project/vllm
python3 vllm/examples/online_serving/disagg_prefill_proxy_server.py \
  --port $PROXY_PORT \
  --prefill-url "http://localhost:$PREFILL_PORT" \
  --decode-url "http://localhost:$DECODE_PORT" &

PROXY_PID=$!
echo "Proxy PID: $PROXY_PID on port $PROXY_PORT"

# Cleanup on exit
trap "kill $PREFILL_PID $DECODE_PID $PROXY_PID 2>/dev/null" EXIT
wait

Key parameters to understand:

  • kv_role: "kv_producer" — this worker runs prefill and ships KV cache to consumers
  • kv_role: "kv_consumer" — receives KV cache and runs decode
  • kv_buffer_size — buffer in bytes for in-flight KV transfers; 2e9 (2GB) is a safe starting point for DeepSeek R1’s layer count
  • kv_port — dedicated port for NCCL KV transfer traffic, keep it separate from the HTTP serving port
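To sanity-check a buffer size, estimate DeepSeek R1's per-token KV footprint. With MLA, each token stores one compressed latent plus a RoPE key per layer; the figures below (61 layers, 512-dim latent, 64-dim RoPE key, 1 byte/element at FP8) come from the published model config and are stated here as assumptions rather than read from vLLM:

```python
def kv_bytes_per_token(num_layers: int = 61, kv_lora_rank: int = 512,
                       rope_head_dim: int = 64, bytes_per_elem: int = 1) -> int:
    """Approximate per-token KV cache size for an MLA model (FP8 = 1 byte/elem)."""
    return num_layers * (kv_lora_rank + rope_head_dim) * bytes_per_elem

per_token = kv_bytes_per_token()          # ~35 KB per token
tokens_in_buffer = int(2e9) // per_token  # tokens a 2 GB kv_buffer_size covers in flight
```

At roughly 35 KB/token, a 2 GB buffer covers on the order of 57K in-flight tokens, comfortably more than one full 32K-token prefill, which is why 2e9 is a safe starting point here.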

GB200 Optimized Configuration

For production GB200 deployments, you want tensor parallelism across all NVL-connected GPUs in each instance, FP4 quantization on MoE weights, and expert parallelism turned on. The benchmark numbers (26.2K prefill TPGS, 10.1K decode TPGS) used a 4-prefill-instance × 2-GB200 + 1-decode-instance × 8-GB200 layout.

#!/usr/bin/env bash
# gb200_prefill_instance.sh
# Run on a node with 2x GB200 GPUs dedicated to prefill

MODEL="/models/deepseek-r1"
NODE_IP=$(hostname -I | awk '{print $1}')

CUDA_VISIBLE_DEVICES=0,1 vllm serve "$MODEL" \
  --host 0.0.0.0 \
  --port 8100 \
  --tensor-parallel-size 2 \
  --max-model-len 65536 \
  --gpu-memory-utilization 0.90 \
  --quantization fp8 \
  --enable-expert-parallel \
  --enable-eplb \
  --cuda-graph-capture-size 2048 \
  --kv-transfer-config \
    "{\"kv_connector\":\"P2pNcclConnector\",\"kv_role\":\"kv_producer\",\"kv_rank\":0,\"kv_parallel_size\":5,\"kv_buffer_size\":\"4e9\",\"kv_port\":14579}" \
  --trust-remote-code
#!/usr/bin/env bash
# gb200_decode_instance.sh
# Run on a node with 8x GB200 GPUs dedicated to decode

MODEL="/models/deepseek-r1"

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 vllm serve "$MODEL" \
  --host 0.0.0.0 \
  --port 8200 \
  --tensor-parallel-size 8 \
  --max-model-len 65536 \
  --gpu-memory-utilization 0.90 \
  --quantization fp8 \
  --enable-expert-parallel \
  --enable-eplb \
  --enable-dbo \
  --cuda-graph-capture-size 2048 \
  --kv-transfer-config \
    "{\"kv_connector\":\"P2pNcclConnector\",\"kv_role\":\"kv_consumer\",\"kv_rank\":4,\"kv_parallel_size\":5,\"kv_buffer_size\":\"4e9\",\"kv_port\":14579}" \
  --trust-remote-code

Flag breakdown:

  • --enable-expert-parallel — activates Wide-EP, distributes MoE experts across all TP ranks. Critical for DeepSeek R1’s 256-expert architecture
  • --enable-eplb — Expert Parallel Load Balancing; rebalances expert assignments based on observed activation frequency, preventing hot-expert bottlenecks
  • --enable-dbo — Dual Batch Overlap; overlaps compute with all-to-all collective communication during decode, hiding ~20% of dispatch latency
  • --quantization fp8 — uses FP8 for attention layers and FP4 for MoE expert weights when FlashInfer MXFP4 backends are enabled (set VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1)
  • --cuda-graph-capture-size 2048 — captures CUDA graphs up to batch size 2048; larger values reduce Python overhead but increase GPU memory at startup

Enabling FP4 MoE Kernels

GB200’s FP4 tensor cores are the hardware advantage that makes the throughput numbers possible. FP4 for MoE expert weights reduces memory bandwidth pressure by 2x vs FP8, and the FlashInfer TRTLLM-Gen GEMM kernels are specifically tuned for GB200’s tensor core layout.

# Add this to your environment before launching vllm serve
export VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8=1
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

# Then launch with fp8 quantization flag — vLLM will use FP4 for MoE layers
# and FP8 for attention layers automatically when this env var is set
vllm serve /models/deepseek-r1 \
  --quantization fp8 \
  --enable-expert-parallel \
  ...

On B200 and GB200 GPUs you should see MoE GEMM throughput roughly double compared to FP8-only mode. On older H100/H200 hardware this env var has no effect — FP4 tensor cores are Blackwell-exclusive.
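Since the env var is harmless but useless on pre-Blackwell parts, a launcher script can gate it on compute capability (FP4 tensor cores arrive with SM 10.0). The helpers below are a sketch; at runtime you would feed them `torch.cuda.get_device_capability()`:

```python
def supports_mxfp4(compute_capability: tuple[int, int]) -> bool:
    """True on Blackwell-class GPUs (SM 10.0+), where FP4 tensor cores exist."""
    return compute_capability >= (10, 0)

def moe_env(compute_capability: tuple[int, int]) -> dict[str, str]:
    """Env vars to export before `vllm serve`, gated on GPU generation."""
    if supports_mxfp4(compute_capability):
        return {"VLLM_USE_FLASHINFER_MOE_MXFP4_MXFP8": "1"}
    return {}
```

This keeps one launcher working across mixed H200/GB200 fleets without silently carrying a no-op flag on Hopper.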

Testing the Setup

Once the prefill worker, decode worker, and proxy are all running, test with a standard OpenAI-compatible request against the proxy port:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-required",  # vLLM doesn't enforce API keys by default
)

def test_inference(prompt: str, max_tokens: int = 512) -> str:
    """Send a test request through the disaggregated proxy."""
    response = client.chat.completions.create(
        model="/models/deepseek-r1",  # must match the path passed to vllm serve, or set --served-model-name
        messages=[
            {"role": "user", "content": prompt}
        ],
        max_tokens=max_tokens,
        temperature=0.6,
        stream=False,
    )
    return response.choices[0].message.content

# Quick sanity check
result = test_inference("What is the time complexity of quicksort in the average case? Explain briefly.")
print(result)

Verify the proxy is routing correctly by checking vLLM’s /metrics endpoint on each worker:

# Check prefill worker metrics
curl -s http://localhost:8100/metrics | grep -E "vllm:num_requests_running|vllm:gpu_cache_usage"

# Check decode worker metrics
curl -s http://localhost:8200/metrics | grep -E "vllm:num_requests_running|vllm:gpu_cache_usage"

You should see vllm:num_requests_running increment on the prefill worker when a request arrives, then drop to zero and increment on the decode worker once KV cache transfer completes.

Monitoring with Prometheus and Grafana

vLLM exposes Prometheus metrics at /metrics on each server. Wire both workers into your scrape config:

# prometheus.yml excerpt
scrape_configs:
  - job_name: "vllm-prefill"
    static_configs:
      - targets: ["prefill-node-ip:8100"]
    metrics_path: /metrics

  - job_name: "vllm-decode"
    static_configs:
      - targets: ["decode-node-ip:8200"]
    metrics_path: /metrics

The metrics to watch for disaggregated serving health:

# KV cache hit rate — low hit rate on decode means transfer latency is hurting you
vllm:gpu_prefix_cache_hit_rate

# Queue depth per worker — if prefill queue is backing up, add prefill instances
vllm:num_requests_waiting{job="vllm-prefill"}
vllm:num_requests_waiting{job="vllm-decode"}

# Time-to-first-token — captures prefill latency end-to-end
vllm:time_to_first_token_seconds

# Inter-token latency — pure decode speed
vllm:time_per_output_token_seconds

A healthy disaggregated deployment shows TTFT dominated by prefill compute time, ITL dominated by decode memory bandwidth, and neither queue backing up. If the decode queue grows while prefill stays empty, add decode instances. If prefill backs up, add prefill instances.
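That scaling rule is mechanical enough to encode in an autoscaler hook. A minimal sketch, with the queue-depth threshold chosen arbitrarily for illustration:

```python
def scaling_action(prefill_waiting: int, decode_waiting: int,
                   threshold: int = 8) -> str:
    """Map per-pool queue depths (vllm:num_requests_waiting) to a scaling action."""
    if decode_waiting > threshold and prefill_waiting <= threshold:
        return "add-decode-instance"
    if prefill_waiting > threshold and decode_waiting <= threshold:
        return "add-prefill-instance"
    if prefill_waiting > threshold and decode_waiting > threshold:
        return "add-both"
    return "steady"
```

In practice you would feed this from Prometheus queries over a sliding window rather than instantaneous samples, to avoid flapping on bursty traffic.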

Scaling to Multi-Node with Ray Serve

For larger clusters, Ray Serve’s LLM API handles the orchestration layer — prefill/decode disaggregation, data parallel routing, and prefix cache affinity routing.

# ray_serve_disagg.py
import ray
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

ray.init(address="auto")

prefill_config = LLMConfig(
    model_id="deepseek-ai/DeepSeek-R1",
    accelerator_type="GB200",
    tensor_parallel_size=2,
    pipeline_parallel_size=1,
    role="prefill",                        # Ray Serve LLM disagg role
    max_model_len=65536,
    engine_kwargs={
        "enable_expert_parallel": True,
        "enable_eplb": True,
        "quantization": "fp8",
        "gpu_memory_utilization": 0.90,
    },
)

decode_config = LLMConfig(
    model_id="deepseek-ai/DeepSeek-R1",
    accelerator_type="GB200",
    tensor_parallel_size=8,
    pipeline_parallel_size=1,
    role="decode",
    max_model_len=65536,
    engine_kwargs={
        "enable_expert_parallel": True,
        "enable_eplb": True,
        "enable_dbo": True,
        "quantization": "fp8",
        "gpu_memory_utilization": 0.90,
    },
)

# Build the OpenAI-compatible app with disaggregated routing
app = build_openai_app(
    {"prefill": prefill_config, "decode": decode_config},
    disaggregation=True,
)

serve.run(app, host="0.0.0.0", port=8000)

Ray Serve handles request routing automatically — prefill requests go to prefill replicas, KV cache transfers happen over Ray’s object store or direct NCCL, and decode picks up from there.

Common Pitfalls

KV transfer hanging at startup — both workers must be fully initialized before the proxy starts routing requests. Add a readiness check loop (GET /health) rather than a fixed sleep 30.
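A polling loop along these lines can replace the fixed sleep (a sketch; it assumes each vLLM server exposes GET /health, which returns 200 once the engine is up, and the `probe` parameter is ours, added so the logic is testable):

```python
import time
import urllib.request
import urllib.error

def wait_until_ready(urls, timeout_s: float = 300.0, poll_s: float = 2.0,
                     probe=None) -> bool:
    """Block until every URL answers HTTP 200 on /health, or time out.

    `probe` is injectable for testing; by default it issues a real GET.
    """
    def default_probe(url: str) -> bool:
        try:
            with urllib.request.urlopen(f"{url}/health", timeout=5) as resp:
                return resp.status == 200
        except (urllib.error.URLError, OSError):
            return False

    probe = probe or default_probe
    deadline = time.monotonic() + timeout_s
    pending = list(urls)
    while pending and time.monotonic() < deadline:
        pending = [u for u in pending if not probe(u)]
        if pending:
            time.sleep(poll_s)
    return not pending
```

Call it as `wait_until_ready(["http://localhost:8100", "http://localhost:8200"])` before starting the proxy, and abort the launch if it returns False.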

kv_rank conflicts — each worker in the NCCL group needs a unique kv_rank. In a 4-prefill + 1-decode setup, assign ranks 0–3 to prefill and rank 4 to decode, with kv_parallel_size: 5.
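For larger layouts it's worth generating the rank assignments rather than hand-editing JSON per worker. A sketch for the 4 + 1 layout described above (the helper name is ours; the field names mirror the configs shown earlier):

```python
import json

def kv_configs(num_prefill: int, num_decode: int,
               buffer: str = "4e9", port: int = 14579) -> list[str]:
    """One --kv-transfer-config JSON string per worker, prefill ranks first."""
    total = num_prefill + num_decode
    configs = []
    for rank in range(total):
        role = "kv_producer" if rank < num_prefill else "kv_consumer"
        configs.append(json.dumps({
            "kv_connector": "P2pNcclConnector",
            "kv_role": role,
            "kv_rank": rank,
            "kv_parallel_size": total,
            "kv_buffer_size": buffer,
            "kv_port": port,
        }))
    return configs
```

Generating all configs from one call makes duplicate-rank and wrong-parallel-size bugs structurally impossible.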

Out of memory during KV transfer — the kv_buffer_size must fit in GPU memory alongside the model weights and KV cache. On a 671B FP8 model, each GB200 holds ~140GB of weights, leaving ~40GB headroom. Keep kv_buffer_size under 10GB.
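The budget is easy to check before launch. Using the figures above as assumptions (~140 GB of weights per GPU, roughly 186 GB of HBM3e per GB200 GPU, and a few GB reserved for activations and fragmentation):

```python
def kv_buffer_fits(hbm_gb: float, weights_gb: float, kv_cache_gb: float,
                   buffer_gb: float, reserve_gb: float = 4.0) -> bool:
    """Check that weights + KV cache + transfer buffer + a safety reserve fit in HBM."""
    return weights_gb + kv_cache_gb + buffer_gb + reserve_gb <= hbm_gb

# ~140 GB weights, 30 GB KV cache allocation, 4 GB transfer buffer on a 186 GB GPU
assert kv_buffer_fits(186, 140, 30, 4)       # fits
assert not kv_buffer_fits(186, 140, 30, 16)  # a 16 GB transfer buffer blows the budget
```

The same arithmetic explains the 10 GB ceiling: past that point the buffer starts displacing KV cache capacity, which directly costs decode batch size.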

Expert load imbalance without EPLB — without --enable-eplb, hot experts get overloaded when certain token types dominate your traffic (common with long reasoning chains). Enable it and set a reasonable --eplb-rebalance-interval (default 1000 steps is fine to start).

TTFT not improving despite disaggregation — check that your prefill workers aren’t also processing decode requests. The proxy must send only the prefill pass (max_tokens=1) to prefill workers and route all subsequent token generation exclusively to decode workers. Monitor vllm:num_requests_running per worker to verify.