Speculative Decoding: How Two LLMs Can Be Faster Than One
Table of Contents
- Introduction
- The Problem: Autoregressive Decoding Is Fundamentally Slow
- How Speculative Decoding Works
- Technique Variants: EAGLE, Medusa, Draft Model
- The Acceptance Rate Math: When Does It Break Even
- Production Deployment: EAGLE-3, vLLM, Cloud
- Comparison: Speculative Decoding vs. Other Inference Speedup Techniques
- Getting Started: 3 Ways to Enable Speculative Decoding Today
- Conclusion & Next Steps
Introduction
[Figure: Speculative decoding architecture: draft model + target model parallel inference flow]
In 2023, Google DeepMind published a paper that quietly defied one of the most fundamental assumptions of transformer-based AI: that generating text one token at a time is the only way to do it. They called the technique speculative sampling — and the research community quickly discovered its industrial-scale potential under the broader banner of speculative decoding.
Today, speculative decoding is a production-ready, open-source technique that any team serving LLMs at scale should understand. It achieves something that previously seemed contradictory: 2–3× speedups in LLM inference with no change to the output distribution. vLLM, TensorRT-LLM, Hugging Face Text Generation Inference (TGI), and cloud providers have all shipped production-grade implementations. Snowflake reported 3.6× throughput improvements with its own variant (Arctic Inference) in December 2025.
The counterintuitive insight is simple: running two models can be faster than running one. The execution is elegant, and it mirrors how a senior scientist works with an assistant: the assistant drafts the obvious parts quickly, and the scientist only reviews the draft and steps in for the hard decisions.
This article traces the technique from its mathematical origins through the three major implementation families (draft-target, EAGLE, Medusa), the acceptance-rate math that governs speedups, production benchmarking data, and a concrete guide to deploying it in your own inference stack.
The Problem: Autoregressive Decoding Is Fundamentally Slow
To understand why speculative decoding works, you first have to appreciate just how bottlenecked standard LLM inference truly is.
The Autoregressive Bottleneck
Standard LLM generation is sequential by design. To produce the next token, the model performs a complete forward pass — loading every layer's weights from VRAM, computing attention over all previous tokens, projecting the final hidden state through the language-model head, and sampling the next token. Then repeat.
This means every token requires a full weight-load cycle. On modern GPUs, the arithmetic capability is enormous — but feeding those tensor cores requires pulling weights through a finite memory bus. The result: LLM inference is memory-bound, and the arithmetic units spend a majority of their time idle, waiting on VRAM reads.
This is not a hardware bug — it is an architectural consequence of autoregressive generation. Fixing it requires a different approach to the generation loop.
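To make the memory-bound argument concrete, here is a rough back-of-the-envelope sketch. It assumes every generated token requires reading all weights from VRAM once and ignores KV-cache traffic and compute time. The function name and the hardware numbers are hypothetical and chosen only for illustration.

```python
def decode_tokens_per_second(num_params: float, bytes_per_param: float,
                             bandwidth_bytes_per_s: float) -> float:
    """Upper bound on single-stream decode speed when each token re-reads all weights."""
    weight_bytes = num_params * bytes_per_param
    return bandwidth_bytes_per_s / weight_bytes

# Hypothetical setup: 8e9 parameters stored in FP16 (2 bytes each)
# on a GPU with 1e12 bytes/s of memory bandwidth.
fp16_rate = decode_tokens_per_second(8e9, 2.0, 1e12)
print(f"{fp16_rate:.1f} tokens/s")  # 62.5 tokens/s
```

Under these assumptions the ceiling is set by the memory bus, not by the arithmetic units, which is why adding more FLOPs alone does not speed up single-stream decoding.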
Why Post-Training Quantization Is Not Enough
Speedup techniques like INT4/INT8 quantization, GPTQ, and AWQ shrink the weights, so each forward pass moves fewer bytes from VRAM and finishes sooner. But they are fundamentally limited: they reduce the cost of each sequential step without changing the number of sequential steps. To generate 100 tokens, you still need 100 forward passes, even if each pass is faster.
What you actually need is to produce multiple tokens per forward pass from the large, high-quality model. Speculative decoding does this without compromising the statistical guarantees of the output.
How Speculative Decoding Works
Speculative decoding (SD) operates on a simple but powerful draft-then-verify paradigm.
The Core Algorithm
Step 1: The draft mechanism proposes K tokens in advance
Step 2: The target model verifies ALL K draft tokens in a SINGLE parallel forward pass
Step 3: The longest accepted prefix of draft tokens is appended to the output, followed by one token from the target model itself (a correction at the first rejected position, or a bonus token if every draft token was accepted)
Step 4: The cycle repeats from the last accepted token
The mathematical guarantee is that the final output distribution is exactly identical to what the target model would generate on its own. There is no approximation and no loss of fidelity.
Here is the worked example from NVIDIA's implementation:
Input prefix: "The quick"
Draft model proposes: brown → fox → hopped → over
Target model verifies all 4 tokens in parallel:
- brown = ✅ (P_target ≥ P_draft)
- fox = ✅ (P_target ≥ P_draft)
- hopped = ❌ (P_target << P_draft)
- over = ❌ (discarded, because it follows the first rejection)
Target model generates the corrected continuation from "The quick brown fox": jumped → the → lazy → dog
Result: 2 of the 4 draft tokens are accepted, and the replacement token jumped comes out of the same verification pass, so this round emits 3 tokens for one target forward pass. The rest of the continuation (the → lazy → dog) is produced by the following rounds.
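The accept/reject step in the example can be sketched in a few lines. This is a minimal illustration of the usual speculative sampling rule (accept a draft token with probability min(1, P_target / P_draft), otherwise resample from the normalized residual max(0, P_target − P_draft)), not any framework's actual implementation. The names `sample` and `verify_round` and the toy one-hot distributions are assumptions made for this sketch.

```python
import random

def sample(dist, rng):
    """Draw one token from a dict of token -> non-negative weight."""
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens])[0]

def verify_round(draft_tokens, q_dists, p_dists, rng=random):
    """Return the tokens emitted by one verification round.

    q_dists[i] / p_dists[i] are the draft / target distributions at draft
    position i; p_dists has one extra entry for the bonus position.
    """
    emitted = []
    for i, token in enumerate(draft_tokens):
        p, q = p_dists[i], q_dists[i]
        # Accept the draft token with probability min(1, p(token) / q(token)).
        if rng.random() < min(1.0, p.get(token, 0.0) / q[token]):
            emitted.append(token)
            continue
        # First rejection: resample from the residual max(0, p - q), then stop.
        residual = {t: max(0.0, p[t] - q.get(t, 0.0)) for t in p}
        emitted.append(sample(residual, rng))
        return emitted
    # Every draft token was accepted: the target supplies one bonus token.
    emitted.append(sample(p_dists[len(draft_tokens)], rng))
    return emitted

# Toy one-hot distributions that reproduce the example deterministically.
draft = ["brown", "fox", "hopped", "over"]
q_dists = [{"brown": 1.0}, {"fox": 1.0}, {"hopped": 1.0}, {"over": 1.0}]
p_dists = [{"brown": 1.0}, {"fox": 1.0}, {"jumped": 1.0}, {"over": 1.0}, {"the": 1.0}]
print(verify_round(draft, q_dists, p_dists))  # ['brown', 'fox', 'jumped']
```

The residual resampling is what keeps the output distribution equal to the target's: a rejected position is never filled with a draft token, only with a token drawn from the part of the target distribution the draft under-covered.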
The Key Intuition
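Expected tokens per target pass = (1 − α^(γ+1)) / (1 − α), where:
- α (alpha) = fraction of draft tokens accepted by the target model, from 0.0 to 1.0
- γ (gamma) = number of speculative tokens proposed per draft round
This is the standard estimate when each draft token is accepted independently with probability α. It counts the accepted draft tokens plus the one token the target always contributes, and it is an upper bound on speedup because it ignores the cost of drafting. The formula reveals why draft model quality matters enormously:
- At α = 0.8, γ = 5: (1 − 0.8^6) / (1 − 0.8) ≈ 3.69 tokens per target pass, close to a 3.7× ceiling
- At α = 0.5, γ = 4: (1 − 0.5^5) / (1 − 0.5) ≈ 1.94 tokens per target pass, and longer drafts barely help because the value can never exceed 1 / (1 − α) = 2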
The name of the game is getting α as close to 1.0 as possible. This is why draft model design, not just deployment, is the critical lever.
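As a quick sanity check on how α and γ interact, the sketch below evaluates the usual estimate of expected tokens per target pass, (1 − α^(γ+1)) / (1 − α). It assumes each draft token is accepted independently with probability α and ignores drafting cost. The function name is made up for this example.

```python
def expected_tokens_per_pass(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target forward pass (accepted drafts + 1 target token)."""
    if alpha >= 1.0:
        return float(gamma + 1)  # every draft token accepted, plus the bonus token
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)

print(f"{expected_tokens_per_pass(0.8, 5):.2f}")  # 3.69
print(f"{expected_tokens_per_pass(0.5, 4):.2f}")  # 1.94
```

Raising γ at a fixed α shows diminishing returns, while raising α lifts the ceiling itself, which is the reason the acceptance rate is the lever worth optimizing.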
Technique Variants: EAGLE, Medusa, Draft Model
There is no single way to implement speculative decoding. Each approach makes different tradeoffs between deployment complexity, speedup ceiling, and accuracy.
1. Draft Model (Classic)
The original formulation: a smaller, faster draft model (e.g. a distilled or quantized variant of the target, often 4–10× fewer parameters) runs autoregressively to propose γ tokens. The target model then verifies.
- Pros: Simple to set up, well-studied, works with any model family
- Cons: Draft and target distributions inevitably diverge, capping α around 0.5–0.7 in many real-world scenarios
- Best for: General-purpose inference, high diversity tasks (creative writing, open-ended chat)
2. EAGLE (Extrapolation Algorithm for Greater Language-Model Efficiency)
EAGLE replaces the separate draft model with a lightweight draft head attached directly to the target model's internal representations. The head consumes the target model's hidden states (the features just before the LM head) and extrapolates them to build an entire tree of candidate next tokens, which the target then verifies in a single forward pass.
- EAGLE-3 adds multi-layer fused feature representations (low, middle, high-level embeddings)
- Uses tree-based parallel verification — multiple token hypotheses explored simultaneously in a draft tree, then verified in one batch
- No separate draft model required — uses the target model's own KV-cache and internal states
Results: EAGLE-3 achieves 3.0–6.5× speedup over vanilla autoregressive decoding and a 20–40% improvement over EAGLE-2 (arXiv 2503.01840).
EAGLE Head Architecture:
┌──────────────────────────────────────────┐
│ Target Model (frozen weights)            │
│ ... Layer 28: hidden_state extracted     │
│ ... Layer 24: hidden_state extracted     │
│ ... Layer 20: hidden_state extracted     │
└──────────────┬───────────────────────────┘
               │ multi-layer feature concat
      ┌────────▼─────────┐
      │ EAGLE Head       │ ← tiny, trainable (~few % of
      │ (linear + norm   │   target model parameters)
      │ + softmax LM)    │
      └────────┬─────────┘
               │
      Draft tree of K tokens
               │
      ┌────────▼─────────┐
      │ Target LM Head   │ ← converts hidden states → token probabilities
      └──────────────────┘
Single forward pass = whole tree verified
- Pros: Highest α in practice (0.7–0.9+), no separate model to serve, zero distribution mismatch
- Cons: Requires attaching a head per target model and training it; head must be fine-tuned per model family
- Best for: Production inference services where acceptance rate is the bottleneck
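Tree-based verification works because the candidate tree can be flattened into one batch in which each node attends only to its own ancestors. The sketch below builds that ancestor mask for a small hypothetical draft tree. The `parents` encoding and the function name are assumptions for illustration, and attention to the shared prompt prefix is left out.

```python
def tree_attention_mask(parents):
    """parents[i] is the index of node i's parent, or -1 if it hangs off the prefix.

    Returns mask[i][j] == True when node i may attend to node j
    (j is node i itself or one of its ancestors).
    """
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:
            mask[i][j] = True
            j = parents[j]
    return mask

# Hypothetical draft tree after the prefix "The quick":
#   0: brown        1: red
#   2: fox (under brown)   3: dog (under brown)   4: fox (under red)
parents = [-1, -1, 0, 0, 1]
mask = tree_attention_mask(parents)
print(mask[2])  # [True, False, True, False, False]
print(mask[4])  # [False, True, False, False, True]
```

With such a mask, sibling branches never see each other, so one forward pass scores every root-to-leaf path as if it had been run on its own.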
3. Medusa (Multi-Head Decoding)
Medusa takes a structurally different approach: instead of a separate draft mechanism, it adds extra prediction heads directly on top of a frozen LLM. Each head predicts a different future token position:
Medusa Multi-Head Setup:
┌──────────────────────────┐
│ Base LLM (frozen)        │
└──────────┬───────────────┘
           │ hidden state at position t
    ┌──────▼──────┐
    │ LM Head 0   │ → predicts token t+1
    │ LM Head 1   │ → predicts token t+2
    │ LM Head 2   │ → predicts token t+3
    │ LM Head 3   │ → predicts token t+4
    └──────┬──────┘
           │
    Draft tokens [t+1, t+2, t+3, t+4]
           │
    ┌──────▼─────────────┐
    │ Target LM verifies │
    │ all in one pass    │
    └────────────────────┘
- Pros: Original model stays completely frozen, zero inference-time overhead for the drafting mechanism, Medusa heads are trivial to train
- Cons: Acceptance rate tends to lag EAGLE on long-horizon tasks; Medusa heads underperform on complex reasoning chains
- Best for: Teams that can fine-tune their target model, want zero inference overhead on draft generation
Note: Medusa predates EAGLE but remains a practical choice — especially for the open-source Medusa-2 bottle-neck architecture and the Hydra extension for sequential head dependency.
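Because each Medusa head guesses a different future position, the per-head guesses have to be combined into candidate continuations before verification. A minimal sketch, assuming each head contributes a short top-k list and candidates are formed as their Cartesian product. The function name and the example tokens are hypothetical.

```python
from itertools import product

def medusa_candidates(head_topk):
    """head_topk[i] holds the top-k guesses of head i (position t+1+i).

    Returns every candidate continuation obtained by picking one guess per head.
    """
    return [list(seq) for seq in product(*head_topk)]

# Hypothetical top-k outputs from three heads.
head_topk = [["brown", "red"], ["fox"], ["jumped", "hopped"]]
candidates = medusa_candidates(head_topk)
print(len(candidates))  # 4
print(candidates[0])    # ['brown', 'fox', 'jumped']
```

The candidate count grows as the product of the per-head k values, so practical setups keep k small for later heads and verify the resulting tree in one pass.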
The Acceptance Rate Math: When Does It Break Even?
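Speculative decoding is not free: each speculative round has a compute cost that must be recouped by avoided sequential forward passes. Let c be the cost of one draft step relative to one target forward pass. Assuming each draft token is accepted independently with probability α, a round costs about γ × c + 1 target-pass equivalents and yields (1 − α^(γ+1)) / (1 − α) tokens on average, so the break-even condition is:
Net speedup > 1.0 when: (1 − α^(γ+1)) / (1 − α) > γ × c + 1
Where γ is the number of speculative tokens and α is the acceptance rate. For a single draft token (γ = 1) this reduces to α > c: the draft must be accepted more often than it costs.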
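The break-even reasoning can be explored numerically. The sketch below divides the expected tokens per round by the round's cost in target-pass equivalents, where `c` is an assumed ratio of draft-step cost to target-pass cost. It relies on the same independence assumption for α, and both function names are invented for this illustration.

```python
def estimated_speedup(alpha: float, gamma: int, c: float) -> float:
    """Expected tokens per round divided by the round cost (gamma draft steps + 1 target pass)."""
    if alpha >= 1.0:
        tokens = float(gamma + 1)
    else:
        tokens = (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)
    return tokens / (gamma * c + 1.0)

def best_gamma(alpha: float, c: float, max_gamma: int = 16) -> int:
    """Draft length in 1..max_gamma with the highest estimated speedup."""
    return max(range(1, max_gamma + 1), key=lambda g: estimated_speedup(alpha, g, c))

# Hypothetical draft that costs 10% of a target pass per step.
print(f"{estimated_speedup(0.8, 5, 0.1):.2f}")  # 2.46
# A weak draft (alpha below c) is a net slowdown.
print(estimated_speedup(0.1, 1, 0.2) < 1.0)     # True
```

In this model a longer draft is not always better: past some γ, the extra draft steps cost more than the few additional tokens they are likely to land.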
Empirical Benchmarks from Real Deployments
Using Llama-3.1-8B-Instruct as the target model with baseline E2E latency of 4,065 ms (from BentoML's patched-vLLM test):
The practical takeaway: a well-designed draft mechanism achieving α ≥ 0.6 with γ ≥ 5 is the minimum threshold for meaningful speedup. At α ≥ 0.8, dramatic 3×+ speedups are achievable. This is exactly why EAGLE's feature-level drafting converges so effectively — it avoids the distribution mismatch that keeps draft-model α capped around 0.5–0.65.
Production Deployment: EAGLE-3, vLLM, Cloud
vLLM: Production Default
vLLM v0.8.4+ ships with 7 built-in speculation methods:
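vllm serve meta-llama/Llama-3-8B-Instruct \
  --speculative-config '{
    "method": "eagle3",
    "num_speculative_tokens": 7
  }'
The built-in --speculative-config flag handles the KV-cache bookkeeping, tree attention, and rejection sampling internally. Depending on the vLLM version, EAGLE-style methods may also need a "model" entry in this JSON that points at a draft head trained for the target model. Other frameworks (Hugging Face TGI, SGLang) follow a similar pattern: a single config toggle can bring a 2–3× speedup, subject to the acceptance rate on your workload.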
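Because speculation is enabled on the server side, client code stays the same. The sketch below shows one way to time a request against vLLM's OpenAI-compatible completions route using only the standard library, so the same script can be run with and without the speculative config. The helper names, the prompt, and the default address `http://localhost:8000` are assumptions for this example.

```python
import json
import time
import urllib.request

def build_completion_request(prompt, model, max_tokens=128,
                             base_url="http://localhost:8000"):
    """Build a POST request for the OpenAI-compatible /v1/completions route."""
    payload = {"model": model, "prompt": prompt,
               "max_tokens": max_tokens, "temperature": 0.0}
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def timed_completion(request):
    """Send the request and return (generated_text, tokens_per_second)."""
    start = time.perf_counter()
    with urllib.request.urlopen(request) as response:
        body = json.loads(response.read())
    elapsed = time.perf_counter() - start
    generated = body["usage"]["completion_tokens"]
    return body["choices"][0]["text"], generated / elapsed

request = build_completion_request("The quick", "meta-llama/Llama-3-8B-Instruct")
# With a server running, uncomment to measure end-to-end throughput:
# text, tokens_per_second = timed_completion(request)
```

Greedy decoding (temperature 0) is used here so that runs with and without speculation are easier to compare.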
Snowflake Arctic Inference
Pushing speculation further, Snowflake's Arctic Inference is reported to deliver the highest measured speculative decoding throughput on NVIDIA GPUs, specifically optimized under TensorRT-LLM and vLLM's deferral mechanism. Benchmarks on Llama-3.1-70B show Arctic Inference achieving a 3.6× throughput gain over standard autoregressive decoding (vLLM v0.8.5+ feature).
AWS Trainium
AWS published results running speculative decoding on their custom Trainium accelerators with vLLM. Key finding: for decode-heavy workloads (typical of chatbot-style generation with moderate context windows), speculative decoding reduced total latency per request by a factor of 1.4–1.8 across the tested model family, while maintaining ∼100% target model accuracy.
Draft Model Training
For draft-model-style speculative decoding, the draft model must propose tokens the target can verify, so it should share the target's tokenizer and ideally its architecture family. The BentoML team found that training a custom draft model matched to the inference workload distribution yielded much higher acceptance rates than generic out-of-the-box draft models: up to a 3× speedup over the sequential baseline, compared with only 1.8–2.0× for generic drafters.
Comparison: Speculative Decoding vs. Other Inference Speedup Techniques
Speedup technique comparison:
┌──────────────────────────────────┬──────────────┬──────────────┐
│ Technique                        │ Speedup      │ Cost         │
├──────────────────────────────────┼──────────────┼──────────────┤
│ INT4/INT8 quantization           │ 1.2–1.5×     │ ✓ Free       │
│ GPTQ / AWQ (4-bit)               │ 1.5–2.0×     │ ✓ Free       │
│ KV-cache quantization (KVCache)  │ 1.1–1.3×     │ ✓ Free       │
│ Continuous batching (vLLM)       │ 1.5–5.0×     │ ✓ Free       │
│ Speculative decoding (EAGLE)     │ 1.5–6.5×     │ 1–2% params  │
│ Speculative decoding (EAGLE-3)   │ 2.0–6.5×     │ 1–2% params  │
│ BitNet b1.58 (1.58-bit)          │ 2–5×         │ ⚠ Retrain    │
│ Distillation (TinyLlama, etc.)   │ 1.0–1.3×     │ High cost    │
└──────────────────────────────────┴──────────────┴──────────────┘
Note: BitNet b1.58 achieves similar raw throughput from a completely different mechanism: dropping from FP16 weights to ternary {-1, 0, +1}. It covers the efficiency dimension fully, while speculative decoding covers the speed-per-token dimension. They are not alternatives; they complement each other.
Getting Started: 3 Ways to Enable Speculative Decoding Today
Option 1: vLLM (EAGLE, zero config beyond flag)
# Explicitly select EAGLE-3 with 5 speculative tokens per round
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --speculative-config '{"method": "eagle3", "num_speculative_tokens": 5}'
Benchmark with the built-in script:
python3 examples/features/speculative_decoding/spec_decode_offline.py \
  --model meta-llama/Llama-3.1-8B-Instruct
Option 2: TensorRT-LLM (EAGLE + Arctic Inference)
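from tensorrt_llm import LLM, SamplingParams

# The accepted speculative_config schema can differ between TensorRT-LLM
# versions; check the release you run.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    speculative_config={
        "model": "sg2018/EAGLE-llama3.1-8B",  # EAGLE draft head for the target
        "method": "eagle",
        "num_speculative_tokens": 5,
    },
)
Option 3: HuggingFace Text Generation Inference (Medusa via Offload)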
TGI supports speculative decoding out of the box through the launcher's --speculate option, which sets how many tokens to speculate per step. If the model ships Medusa heads they are used automatically; otherwise TGI falls back to n-gram speculation:
text-generation-launcher --model-id meta-llama/Llama-3-8B-Instruct \
  --speculate 5
Conclusion & Next Steps
Speculative decoding is one of those rare techniques where the theory is elegant and the engineering payoff is real and immediate. At this stage in 2025, it has:
- ✅ Theoretical guarantee: the output distribution is mathematically identical to vanilla autoregressive generation
- ✅ Multiple production implementations: vLLM, TensorRT-LLM, TGI, and cloud-native stacks
- ✅ Speedups of roughly 1.4–3.6× in the production reports cited above, and 3.0–6.5× in EAGLE-3's published benchmarks
- ✅ Zero quality cost: only target-model-verified tokens are emitted, so the output follows the target model's distribution exactly
- ✅ Combinability: stacks with quantization, KV-cache optimization, and continuous batching for compounded speedups
The practical advice for any team serving LLMs today: decouple drafting from generation, choose a fast pathway (EAGLE-3 or Medusa for first-class model families, N-gram or suffix for zero-overhead cases), and benchmark your own α on your own workload — because the theoretical speedup numbers are only as good as your real acceptance rate.
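Measuring your own α does not require special tooling if your serving stack exposes per-round counts of proposed and accepted draft tokens. A minimal sketch, assuming such counts are available as (proposed, accepted) pairs. The function name and the sample numbers are hypothetical.

```python
def acceptance_stats(rounds):
    """rounds: list of (proposed, accepted) draft-token counts, one pair per verification round.

    Returns (alpha, tokens_per_target_pass).
    """
    if not rounds:
        return 0.0, 0.0
    proposed = sum(p for p, _ in rounds)
    accepted = sum(a for _, a in rounds)
    alpha = accepted / proposed if proposed else 0.0
    # Each round also emits one token from the target itself (correction or bonus).
    tokens_per_pass = (accepted + len(rounds)) / len(rounds)
    return alpha, tokens_per_pass

# Hypothetical log of four rounds with 4 draft tokens each.
alpha, tokens_per_pass = acceptance_stats([(4, 2), (4, 4), (4, 3), (4, 1)])
print(alpha, tokens_per_pass)  # 0.625 3.5
```

Tracking these two numbers per workload (chat, code, summarization) shows quickly where speculation pays off and where the draft mechanism needs retraining.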
What you can do today:
- 🚀 Enable vLLM speculative decoding: One flag, zero code changes, 2–3× speedup on any supported model in minutes
- 📊 Benchmark yourself: python3 examples/features/speculative_decoding/spec_decode_offline.py (measure α, TPS, and E2E latency on your workload)
- 📚 Read the foundational papers: Speculative Sampling (DeepMind, 2023) · EAGLE (2024) · EAGLE-3 (2025) · Medusa (2024)
- 🔧 Fine-tune a Medusa head: If you own your target model and can run a few training epochs, Medusa gives you a zero-overhead draft mechanism with no separate serving cost
Speculative decoding is not a future technology. It is already in your inference stack, a single flag away from turning 100 tokens per second into 300. The engineers who enable it first will deploy cheaper, run cooler, and serve happier users, without compromising a single point of accuracy.