Speculative decoding is the single best trick for cutting LLM latency without changing your model or losing a single token of quality. The core idea is dead simple: a tiny draft model guesses ahead, and the big model checks those guesses in one forward pass instead of generating tokens one at a time.
Here’s the fastest way to try it with Hugging Face Transformers:
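A minimal sketch, assuming a Llama 3.1 8B target with a Llama 3.2 1B draft (the model names are illustrative; any same-family target/draft pair works the same way):

```python
# Assisted generation sketch: the draft model is passed via the
# assistant_model argument to generate(). Model names are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
target = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    torch_dtype=torch.float16,
    device_map="auto",
)
draft = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    torch_dtype=torch.float16,
    device_map="auto",
)

inputs = tokenizer(
    "Explain speculative decoding in one paragraph.", return_tensors="pt"
).to(target.device)

# The only change from plain generate(): pass the draft as assistant_model.
outputs = target.generate(**inputs, assistant_model=draft, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```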
That’s it. One extra argument (assistant_model) and you get 1.5-3x faster generation with mathematically identical output for greedy decoding.
How Speculative Decoding Actually Works
Standard autoregressive generation is painfully sequential. Each token requires a full forward pass through the big model, and most of that GPU compute sits idle because you’re memory-bandwidth bound, not compute bound.
Speculative decoding flips this on its head with a two-phase loop:
Draft phase – A small, fast model generates K candidate tokens autoregressively. This is cheap because the draft model is tiny (1-2B parameters vs your 8-70B target).
Verification phase – The big model runs a single forward pass on all K draft tokens simultaneously. It checks each token against its own probability distribution and accepts tokens that match. The first token that disagrees gets replaced with the big model’s choice, and everything after it gets thrown away.
The key insight: verifying K tokens in parallel costs roughly the same as generating one token, because the bottleneck is loading model weights into GPU cache, not the actual matrix multiplications. You’re spending cheap draft computation to eliminate expensive sequential steps in the target model.
When the draft model’s acceptance rate is high (meaning it guesses correctly most of the time), you effectively generate multiple tokens per target model forward pass. A good draft model on code or structured text can hit 70-85% acceptance rates, giving you 2-3x speedups.
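The greedy accept/reject step above can be sketched in a few lines of pure Python. This is a toy with made-up token IDs; the names verify_greedy, draft_tokens, and target_argmax are illustrative, and real implementations use rejection sampling over full probability distributions when sampling with temperature:

```python
# Toy sketch of greedy verification: accept draft tokens until the first
# disagreement with the target model's own top-1 predictions.
def verify_greedy(draft_tokens, target_argmax):
    """draft_tokens:  K token ids proposed by the draft model.
    target_argmax:    K+1 token ids -- the target's top-1 choice at each
                      draft position, plus one 'bonus' position.
    Returns the tokens committed this step."""
    accepted = []
    for i, tok in enumerate(draft_tokens):
        if tok == target_argmax[i]:
            accepted.append(tok)               # target agrees: free token
        else:
            accepted.append(target_argmax[i])  # first mismatch: take the
            return accepted                    # target's pick, drop the rest
    # All K drafts accepted: the same forward pass also yields a bonus token.
    accepted.append(target_argmax[-1])
    return accepted

# Draft guesses [5, 7, 9]; target would pick [5, 7, 2, 4]:
# 5 and 7 are accepted, 9 is replaced by 2 -> three tokens committed.
print(verify_greedy([5, 7, 9], [5, 7, 2, 4]))  # [5, 7, 2]
```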
Implementing with vLLM for Production
Hugging Face’s assisted_generation works great for experimentation, but for production serving, vLLM’s speculative decoding is significantly faster. It integrates directly into the continuous batching engine.
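A launch command along these lines, assuming the same Llama target/draft pair as before. Note that vLLM’s speculative flags have changed across releases: older versions take --speculative-model and --num-speculative-tokens directly, while newer releases fold these into a single --speculative-config JSON argument, so check the docs for your version:

```shell
# OpenAI-compatible server with a draft model for speculation
# (flag spelling follows vLLM's older CLI; newer versions use
#  --speculative-config instead).
vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --speculative-model meta-llama/Llama-3.2-1B-Instruct \
    --num-speculative-tokens 5 \
    --gpu-memory-utilization 0.9 \
    --port 8000
```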
Now query it like any OpenAI-compatible endpoint:
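For example, with the openai Python client (a sketch assuming a vLLM server on localhost:8000; the model name must match what the server was launched with):

```python
# Speculative decoding is transparent to clients -- this is a plain
# OpenAI-compatible request. Endpoint and model name are illustrative.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "user", "content": "Write a function to merge two sorted lists."}
    ],
    max_tokens=256,
)
print(response.choices[0].message.content)
```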
The --num-speculative-tokens flag controls how many tokens the draft model proposes per step. 5 is a solid default. Going higher increases potential speedup but also increases wasted compute when tokens get rejected.
Choosing Your Draft Model
This is where most people mess up. The draft model choice makes or breaks your speedup. Here’s what actually matters:
Use the same model family. A Llama 3.2 1B draft for a Llama 3.1 8B target works far better than a Qwen 1.5B draft, because same-family models share vocabulary and tend to agree on token distributions. Mismatched vocabularies mean the draft model’s guesses get rejected constantly.
Smaller is better, to a point. The draft model needs to be fast enough that drafting K tokens is cheaper than one target forward pass. A 1B draft for an 8B target is ideal. A 3B draft for an 8B target barely helps – the draft model is too slow relative to the savings.
My recommended pairings:
| Target Model | Draft Model | Expected Speedup |
|---|---|---|
| Llama 3.1 8B | Llama 3.2 1B | 1.8-2.5x |
| Llama 3.1 70B | Llama 3.1 8B | 2.0-3.0x |
| Mistral 7B | Mistral-tiny (custom distill) | 1.5-2.0x |
| CodeLlama 34B | CodeLlama 7B | 2.5-3.5x (code tasks) |
Code generation sees the highest speedups because code is highly predictable – the draft model nails most tokens.
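You can sanity-check these numbers with the standard back-of-envelope model from the speculative sampling literature: if each of K draft tokens is accepted independently with probability α, the expected number of tokens committed per target forward pass is (1 − α^(K+1)) / (1 − α). A quick calculator (my own illustrative helper, not a library function):

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens committed per target forward pass, assuming each of
    k draft tokens is accepted i.i.d. with probability alpha
    (geometric-series form; real acceptance is not i.i.d., so treat this
    as an upper-bound estimate)."""
    if alpha >= 1.0:
        return float(k + 1)  # every draft accepted, plus the bonus token
    return (1 - alpha ** (k + 1)) / (1 - alpha)

# At an 80% acceptance rate with 5 draft tokens, each target pass commits
# about 3.7 tokens on average instead of 1.
print(round(expected_tokens_per_pass(0.8, 5), 2))  # 3.69
```

This is why acceptance rate dominates everything else: at α = 0.5 the same K = 5 barely reaches two tokens per pass, and the draft overhead eats most of that.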
Benchmarking Your Speedup
Don’t trust vibes. Measure actual tokens per second with and without speculative decoding on your real workload.
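A minimal harness along these lines works: time any generate callable end to end, then compare the baseline against the speculative path. The tokens_per_second helper is my own sketch; the commented Transformers wiring assumes target, draft, and tokenizer are already loaded as shown earlier:

```python
# Minimal benchmark harness: measure end-to-end tokens/sec for any
# generate callable, so the same code compares with/without a draft model.
import time

def tokens_per_second(generate_fn, prompts):
    """generate_fn(prompt) -> number of new tokens produced."""
    total_tokens, total_seconds = 0, 0.0
    for prompt in prompts:
        start = time.perf_counter()
        total_tokens += generate_fn(prompt)
        total_seconds += time.perf_counter() - start
    return total_tokens / total_seconds

# Example wiring with Transformers (assumes target, draft, tokenizer
# are already loaded):
#
# def baseline(prompt):
#     ids = tokenizer(prompt, return_tensors="pt").to(target.device)
#     out = target.generate(**ids, max_new_tokens=200)
#     return out.shape[1] - ids["input_ids"].shape[1]
#
# def speculative(prompt):
#     ids = tokenizer(prompt, return_tensors="pt").to(target.device)
#     out = target.generate(**ids, assistant_model=draft, max_new_tokens=200)
#     return out.shape[1] - ids["input_ids"].shape[1]
#
# prompts = [...]  # a representative sample of real queries
# print("baseline   :", tokens_per_second(baseline, prompts))
# print("speculative:", tokens_per_second(speculative, prompts))
```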
Run this on a representative sample of your actual queries. Speedups vary wildly between tasks – factual Q&A might see 1.5x while code generation hits 3x on the same model pair.
When Speculative Decoding Hurts
Speculative decoding is not a free lunch. It actively makes things worse in certain scenarios:
High-entropy generation. Creative writing, brainstorming, or anything with high temperature sampling tanks acceptance rates. The draft model’s guesses diverge from what the target would pick when there are many plausible next tokens. If your acceptance rate drops below 40%, you’re paying for draft computation and getting almost nothing back.
Short outputs. If you’re generating 10-20 tokens (classification, short answers), the per-request overhead of running the draft model and the verification loop outweighs any savings.
Batch inference at high throughput. When you’re already saturating GPU compute with large batches, speculative decoding adds memory pressure from the draft model without helping much. The target model is already compute-bound rather than memory-bound, so the fundamental premise breaks down.
Limited GPU memory. The draft model needs to live in GPU memory alongside the target. For a 70B model on 2x A100 80GB, squeezing in an 8B draft model might force you to reduce batch size or context length.
My rule of thumb: use speculative decoding when you’re serving a single user or small batch with latency as the priority. Skip it when you’re maximizing throughput across many concurrent requests.
Common Errors
RuntimeError: Expected all tensors to be on the same device
The draft and target models landed on different GPUs. Force them onto the same device or use matching device_map strategies:
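An illustrative fix: pin both models to one explicit device instead of letting device_map="auto" shard them differently (model names are placeholders for your own pair):

```python
# Load both models onto the same single GPU so assisted generation
# never has to move tensors between devices.
import torch
from transformers import AutoModelForCausalLM

device = "cuda:0"
target = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct", torch_dtype=torch.float16
).to(device)
draft = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct", torch_dtype=torch.float16
).to(device)
# target.generate(..., assistant_model=draft) now runs entirely on cuda:0.
```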
ValueError: assistant_model and model must use the same tokenizer
You picked a draft model with a different vocabulary. Stick to the same model family. If you must mix families, the two tokenizers need to produce identical token IDs for the same text – a matching tokenizer.vocab_size is necessary but not sufficient.
torch.cuda.OutOfMemoryError after enabling speculative decoding
The draft model pushed you over your VRAM budget. Options: use a quantized draft model (load_in_4bit=True), reduce --gpu-memory-utilization in vLLM, or pick a smaller draft model. A 500M parameter draft still helps.
vLLM: Speculative decoding is not supported with this model architecture
Not all model architectures support speculative decoding in vLLM yet. Check the vLLM docs for supported model pairs. As of early 2026, Llama, Mistral, and Qwen families all work. Mixtral MoE models have limited support.
Low acceptance rate (below 50%) with no speedup
Your draft model is a poor match for the target. Try: (1) a draft model from the same family, (2) reducing --num-speculative-tokens to 3, or (3) fine-tuning a small model on your target model’s outputs using knowledge distillation.
Related Guides
- How to Optimize LLM Serving with KV Cache and PagedAttention
- How to Use PyTorch FlexAttention for Fast LLM Inference
- How to Scale ML Training and Inference with Ray
- How to Build a Model Inference Queue with Celery and Redis
- How to Build a Model Inference Cost Tracking Pipeline with OpenTelemetry
- How to Deploy DeepSeek R1 on NVIDIA Blackwell with vLLM’s Disaggregated Serving
- How to Build a Model Inference Cache with Redis and Semantic Hashing
- How to Build a Model Artifact CDN with CloudFront and S3
- How to Quantize LLMs with GPTQ and AWQ
- How to Build a Model Serving Gateway with Envoy and gRPC