Speculative decoding is the single best trick for cutting LLM latency without changing your model or losing a single token of quality. The core idea is dead simple: a tiny draft model guesses ahead, and the big model checks those guesses in one forward pass instead of generating tokens one at a time.

Here’s the fastest way to try it with Hugging Face Transformers:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the big target model and a small draft model
target = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct", device_map="auto")
draft = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

prompt = "Explain quicksort in Python step by step."
inputs = tokenizer(prompt, return_tensors="pt").to(target.device)

# Assisted generation = speculative decoding in HF
outputs = target.generate(
    **inputs,
    assistant_model=draft,
    max_new_tokens=256,
    do_sample=False,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

That’s it. One extra argument (assistant_model) and you get 1.5-3x faster generation with mathematically identical output for greedy decoding.

How Speculative Decoding Actually Works

Standard autoregressive generation is painfully sequential. Each token requires a full forward pass through the big model, and most of that GPU compute sits idle because you’re memory-bandwidth bound, not compute bound.
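The bandwidth claim is easy to sanity-check with a back-of-the-envelope calculation (the model size and bandwidth below are illustrative assumptions, roughly an 8B model in bf16 on A100-class hardware):

```python
# Decode speed ceiling from memory bandwidth alone: every generated token
# requires streaming all model weights from HBM through the compute units.
params = 8e9                 # 8B-parameter target model (assumption)
bytes_per_param = 2          # bf16/fp16 weights
bandwidth = 2.0e12           # ~2 TB/s HBM, A100-class (assumption)

weight_bytes = params * bytes_per_param
ceiling = bandwidth / weight_bytes
print(f"{ceiling:.0f} tokens/s upper bound at batch size 1")  # ~125 tok/s
```

The GPU can execute far more FLOPs per second than that ceiling requires; the gap between the two is the idle compute speculative decoding reclaims.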

Speculative decoding flips this on its head with a two-phase loop:

  1. Draft phase – A small, fast model generates K candidate tokens autoregressively. This is cheap because the draft model is tiny (1-2B parameters vs your 8-70B target).

  2. Verification phase – The big model runs a single forward pass on all K draft tokens simultaneously. It checks each token against its own probability distribution and accepts tokens that match. The first token that disagrees gets replaced with the big model’s choice, and everything after it gets thrown away.

The key insight: verifying K tokens in parallel costs roughly the same as generating one token, because the bottleneck is loading model weights into GPU cache, not the actual matrix multiplications. You’re replacing expensive sequential steps in the target model with cheap draft computation plus one parallel verification pass.

When the draft model’s acceptance rate is high (meaning it guesses correctly most of the time), you effectively generate multiple tokens per target model forward pass. A good draft model on code or structured text can hit 70-85% acceptance rates, giving you 2-3x speedups.
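To make the loop concrete, here is a toy greedy version of one draft-then-verify round. The `target_next`/`draft_next` callables are stand-ins for model forward passes; in a real implementation the verification loop below is a single batched forward pass, and a fully accepted draft also yields a bonus token from the target:

```python
def speculative_round(target_next, draft_next, prefix, k):
    """One draft-then-verify round of greedy speculative decoding (sketch).

    target_next / draft_next map a token sequence to that model's greedy
    next token. Returns the tokens emitted this round.
    """
    # Draft phase: the small model proposes k tokens autoregressively.
    seq = list(prefix)
    proposed = []
    for _ in range(k):
        tok = draft_next(seq)
        proposed.append(tok)
        seq.append(tok)

    # Verification phase: compare the target's greedy choice at each draft
    # position; accept matches, substitute the first mismatch, discard the rest.
    emitted, ctx = [], list(prefix)
    for tok in proposed:
        choice = target_next(ctx)
        if choice == tok:
            emitted.append(tok)
            ctx.append(tok)
        else:
            emitted.append(choice)  # target's own token replaces the mismatch
            break
    return emitted

# Toy models over a digit vocabulary: the target always continues n -> n+1
# (mod 10); the draft agrees on short contexts, then diverges.
target = lambda s: (s[-1] + 1) % 10
draft = lambda s: (s[-1] + 1) % 10 if len(s) < 4 else 0
print(speculative_round(target, draft, [1, 2], k=3))  # [3, 4, 5]
```

Two draft tokens are accepted and the mismatch is replaced by the target’s own choice, so three tokens come out of a single (conceptual) target pass.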

Implementing with vLLM for Production

Hugging Face’s assisted generation works great for experimentation, but for production serving, vLLM’s speculative decoding is significantly faster. It integrates directly into the continuous batching engine.

# Install vLLM
pip install vllm

# Start vLLM server with speculative decoding enabled
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --speculative-model meta-llama/Llama-3.2-1B-Instruct \
    --num-speculative-tokens 5 \
    --gpu-memory-utilization 0.90 \
    --max-model-len 4096 \
    --port 8000

Now query it like any OpenAI-compatible endpoint:

import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Write a Python function to merge two sorted lists."}],
    max_tokens=512,
    temperature=0,
)

print(response.choices[0].message.content)

The --num-speculative-tokens flag controls how many tokens the draft model proposes per step. 5 is a solid default. Going higher increases potential speedup but also increases wasted compute when tokens get rejected.
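You can estimate the payoff of a given K from the acceptance rate. Assuming (simplistically) an independent per-token acceptance probability alpha, the expected number of tokens emitted per target forward pass is 1 + alpha + alpha² + … + alpha^K: the accepted run plus the target’s correction token.

```python
def expected_tokens_per_round(alpha: float, k: int) -> float:
    # Geometric sum: equals (1 - alpha**(k+1)) / (1 - alpha) for alpha < 1.
    return sum(alpha ** i for i in range(k + 1))

for k in (3, 5, 8):
    print(k, round(expected_tokens_per_round(0.8, k), 2))  # 2.95, 3.69, 4.33
```

Note the diminishing returns past K ≈ 5 at realistic acceptance rates, which is why cranking the flag up rarely pays for the extra rejected draft work.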

Choosing Your Draft Model

This is where most people mess up. The draft model choice makes or breaks your speedup. Here’s what actually matters:

Use the same model family. A Llama 3.2 1B draft for a Llama 3.1 8B target works far better than a Qwen 1.5B draft, because same-family models share vocabulary and tend to agree on token distributions. Mismatched vocabularies mean the draft model’s guesses get rejected constantly.

Smaller is better, to a point. The draft model needs to be fast enough that drafting K tokens is cheaper than one target forward pass. A 1B draft for an 8B target is ideal. A 3B draft for an 8B target barely helps – the draft model is too slow relative to the savings.
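A rough model shows why draft size matters so much. If one draft forward pass costs a fraction c of a target pass, a round of K drafts plus one verification costs about 1 + K·c target-pass equivalents, so the speedup is the expected tokens per round divided by that (alpha, c, and the size pairings here are illustrative assumptions, not measurements):

```python
def estimated_speedup(alpha: float, k: int, c: float) -> float:
    """c = draft forward-pass time / target forward-pass time."""
    tokens = sum(alpha ** i for i in range(k + 1))  # expected tokens per round
    return tokens / (1 + k * c)

# Roughly: a 1B draft for an 8B target (c ~ 0.12) vs a 3B draft (c ~ 0.4).
print(round(estimated_speedup(0.8, 5, 0.12), 2))  # 2.31
print(round(estimated_speedup(0.8, 5, 0.40), 2))  # 1.23
```

The fat draft eats most of its own savings even at the same acceptance rate.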

My recommended pairings:

| Target Model | Draft Model | Expected Speedup |
| --- | --- | --- |
| Llama 3.1 8B | Llama 3.2 1B | 1.8-2.5x |
| Llama 3.1 70B | Llama 3.1 8B | 2.0-3.0x |
| Mistral 7B | Mistral-tiny (custom distill) | 1.5-2.0x |
| CodeLlama 34B | CodeLlama 7B | 2.5-3.5x (code tasks) |
Code generation sees the highest speedups because code is highly predictable – the draft model nails most tokens.

Benchmarking Your Speedup

Don’t trust vibes. Measure actual tokens per second with and without speculative decoding on your real workload.

import time
from transformers import AutoModelForCausalLM, AutoTokenizer

target = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct", device_map="auto")
draft = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

prompts = [
    "Write a Python class for a binary search tree with insert and search methods.",
    "Explain the difference between TCP and UDP in detail.",
    "What are the key principles of distributed systems?",
]

for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt").to(target.device)

    # Baseline: standard generation
    start = time.perf_counter()
    out_base = target.generate(**inputs, max_new_tokens=256, do_sample=False)
    base_time = time.perf_counter() - start
    base_tokens = out_base.shape[1] - inputs["input_ids"].shape[1]

    # Speculative decoding
    start = time.perf_counter()
    out_spec = target.generate(**inputs, assistant_model=draft, max_new_tokens=256, do_sample=False)
    spec_time = time.perf_counter() - start
    spec_tokens = out_spec.shape[1] - inputs["input_ids"].shape[1]

    print(f"Prompt: {prompt[:50]}...")
    print(f"  Baseline: {base_tokens/base_time:.1f} tok/s ({base_time:.2f}s)")
    print(f"  Speculative: {spec_tokens/spec_time:.1f} tok/s ({spec_time:.2f}s)")
    print(f"  Speedup: {base_time/spec_time:.2f}x")
    print()

Run this on a representative sample of your actual queries. Speedups vary wildly between tasks – factual Q&A might see 1.5x while code generation hits 3x on the same model pair.

When Speculative Decoding Hurts

Speculative decoding is not a free lunch. It actively makes things worse in certain scenarios:

High-entropy generation. Creative writing, brainstorming, or anything sampled at high temperature tanks acceptance rates. The draft model’s guesses diverge from what the target would pick when there are many plausible next tokens. If your acceptance rate drops below 40%, you’re paying for draft computation and getting almost nothing back.

Short outputs. If you’re generating 10-20 tokens (classification, short answers), the overhead of loading the draft model and running the verification loop outweighs any savings.

Batch inference at high throughput. When you’re already saturating GPU compute with large batches, speculative decoding adds memory pressure from the draft model without helping much. The target model is already compute-bound rather than memory-bound, so the fundamental premise breaks down.
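A rough model of that crossover (all hardware numbers are assumptions, roughly A100-class with an 8B bf16 model): at batch size B, each weight load is amortized over B tokens, so the memory-bandwidth ceiling grows with B until the peak-FLOPs ceiling takes over.

```python
def decode_ceiling(batch: int,
                   weight_bytes: float = 16e9,     # 8B params in bf16
                   bandwidth: float = 2.0e12,      # ~2 TB/s HBM (assumption)
                   flops_per_token: float = 16e9,  # ~2 * params per token
                   peak_flops: float = 300e12) -> float:
    """Upper bound on decode tokens/s: min of memory- and compute-bound rates."""
    mem_bound = bandwidth / weight_bytes * batch  # weights amortized over batch
    compute_bound = peak_flops / flops_per_token
    return min(mem_bound, compute_bound)

print(decode_ceiling(1))    # 125.0   -> memory-bound: speculation helps
print(decode_ceiling(512))  # 18750.0 -> compute-bound: verification isn't free
```

Once the compute bound binds, verifying K tokens no longer rides along with a single weight load, and the technique’s premise collapses.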

Limited GPU memory. The draft model needs to live in GPU memory alongside the target. For a 70B model on 2x A100 80GB, squeezing in an 8B draft model might force you to reduce batch size or context length.

My rule of thumb: use speculative decoding when you’re serving a single user or small batch with latency as the priority. Skip it when you’re maximizing throughput across many concurrent requests.

Common Errors

RuntimeError: Expected all tensors to be on the same device

The draft and target models landed on different GPUs. Force them onto the same device or use matching device_map strategies:

# Force both models to the same device
target = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct", device_map="auto")
draft = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    device_map=target.hf_device_map,  # match target's device placement
)

ValueError: assistant_model and model must use the same tokenizer

You picked a draft model with a different vocabulary. Stick to the same model family. If you must mix families, the two tokenizers need the identical vocabulary, not just the same vocab_size – the token-to-id mappings must agree.
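A quick compatibility check before wiring the models together (a tiny helper; `get_vocab()` is the standard Hugging Face tokenizer method returning the token-to-id dict):

```python
def vocabs_match(tok_a, tok_b) -> bool:
    """True only if both tokenizers use the identical token-to-id mapping.

    Two vocabularies can have the same size yet map strings to different
    ids, which makes every draft token look wrong during verification.
    """
    return tok_a.get_vocab() == tok_b.get_vocab()
```

Run it on the target and draft tokenizers before paying for a full benchmark.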

torch.cuda.OutOfMemoryError after enabling speculative decoding

The draft model pushed you over your VRAM budget. Options: use a quantized draft model (load_in_4bit=True), reduce --gpu-memory-utilization in vLLM, or pick a smaller draft model. A 500M parameter draft still helps.

vLLM: Speculative decoding is not supported with this model architecture

Not all model architectures support speculative decoding in vLLM yet. Check the vLLM docs for supported model pairs. As of early 2026, Llama, Mistral, and Qwen families all work. Mixtral MoE models have limited support.

Low acceptance rate (below 50%) with no speedup

Your draft model is a poor match for the target. Try: (1) a draft model from the same family, (2) reducing --num-speculative-tokens to 3, or (3) fine-tuning a small model on your target model’s outputs using knowledge distillation.