
Speculative Decoding: How Two LLMs Can Be Faster Than One

Introduction
The Problem: Autoregressive Decoding Is Fundamentally Slow
- The Autoregressive Bottleneck
- Why Post-Training Quantization Is Not Enough
How Speculative Decoding Works
- The Core Algorithm
- The Key Intuition
Technique Variants: EAGLE, Medusa, Draft Model
The Acceptance Rate Math: When Does It Break Even
- Empirical Benchmarks from Real Deployments
Production Deployment: EAGLE-3, vLLM, Cloud
Comparison: Speculative Decoding vs. Other Inference Speedup Techniques
Getting Started: 3 Ways to Enable Speculative Decoding Today
Conclusion & Next Steps

Introduction

Figure: Speculative decoding architecture: draft model + target model parallel inference flow

In 2023, Google DeepMind published a paper that quietly defied one of the most fundamental assumptions of transformer-based AI: that generating text one token at a time is the only way to do it. They called the technique speculative sampling — and the research community quickly discovered its industrial-scale potential under the broader banner of speculative decoding.

Today, speculative decoding is a production-ready, open-source technology that every serious inference-serving team should understand. It achieves something that previously seemed contradictory: 2–3× speedups in LLM inference without sacrificing even a single iota of output quality. vLLM, TensorRT-LLM, Hugging Face text-generation-inference, and cloud providers have all shipped production-grade implementations. NVIDIA demonstrated 3.6× throughput improvements with their own variant (Arctic Inference) in December 2025.

The counterintuitive insight is simple: running two models can be faster than running one. The execution is elegant, much like a senior scientist who lets an assistant draft the routine work and steps in only to check it and make the hard decisions.

This article traces the technique from its mathematical origins through the three major implementation families (draft-target, EAGLE, Medusa), the acceptance-rate math that governs speedups, production benchmarking data, and a concrete guide to deploying it in your own inference stack.

The Problem: Autoregressive Decoding Is Fundamentally Slow

To understand why speculative decoding works, you first have to appreciate just how bottlenecked standard LLM inference truly is.

The Autoregressive Bottleneck

Standard LLM generation is sequential by design. To produce the next token, the model performs a complete forward pass — loading every layer's weights from VRAM, computing attention over all previous tokens, projecting the final hidden state through the language-model head, and sampling the next token. Then repeat.

This means every token requires a full weight-load cycle. On modern GPUs, the arithmetic capability is enormous — but feeding those tensor cores requires pulling weights through a finite memory bus. The result: LLM inference is memory-bound, and the arithmetic units spend a majority of their time idle, waiting on VRAM reads.

This is not a hardware bug — it is an architectural consequence of autoregressive generation. Fixing it requires a different approach to the generation loop.
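A back-of-the-envelope calculation makes the memory-bound claim concrete. The sketch below assumes batch size 1 and that every weight is read from memory once per generated token; the parameter count, precision, and bandwidth figure are illustrative placeholders, not measurements of any particular GPU.

```python
def decode_ceiling_tokens_per_s(n_params, bytes_per_param, mem_bandwidth_bytes_per_s):
    """Upper bound on single-stream decode speed when weight reads dominate."""
    bytes_per_token = n_params * bytes_per_param  # all weights read once per token
    return mem_bandwidth_bytes_per_s / bytes_per_token

# Assumed numbers: an 8e9-parameter model in FP16 (2 bytes per parameter)
# on an accelerator with 2e12 bytes/s of memory bandwidth.
ceiling = decode_ceiling_tokens_per_s(8e9, 2, 2e12)
print(f"{ceiling:.0f} tokens/s at most")  # 125 tokens/s at most
```

Under the same assumption, scoring several draft tokens in one pass reads the weights only once, which is why a verification pass costs roughly as much as generating a single token.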

Why Post-Training Quantization Is Not Enough

Speedup techniques like INT4/INT8 quantization, GPTQ, and AWQ do shrink the weights that each forward pass must read, so fewer bytes cross the memory bus per step. But they are fundamentally limited: they reduce the cost of each sequential step without changing the number of sequential steps. To generate 100 tokens, you still need 100 forward passes, even if each pass is faster.

What you actually need is to produce multiple tokens per forward pass from the large, high-quality model. Speculative decoding does this without compromising the statistical guarantees of the output.

How Speculative Decoding Works

Speculative decoding (SD) operates on a simple but powerful draft-then-verify paradigm.

The Core Algorithm

Step 1: The draft mechanism proposes K tokens in advance
Step 2: The target model verifies ALL K draft tokens in a SINGLE parallel forward pass
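Step 3: The longest accepted prefix of draft tokens is appended to the output, followed by one token from the target model itself (a correction at the first rejection, or a bonus token if all K drafts were accepted)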
Step 4: The cycle repeats from the last accepted token

The mathematical guarantee is: the final output distribution is exactly identical to what the target model would generate on its own. There is no approximation, no loss of fidelity.
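The four steps can be sketched with toy probability tables standing in for the two models. This is a minimal illustration of the standard speculative sampling acceptance rule (accept a draft token with probability min(1, p/q), otherwise resample from the residual max(0, p - q)); the function names and the three-token vocabulary are assumptions made up for the example, and a real system would score all positions in one batched forward pass rather than a loop.

```python
import random

def sample(dist, rng):
    """Draw one token id from a list of (possibly unnormalized) weights."""
    return rng.choices(range(len(dist)), weights=dist, k=1)[0]

def speculative_round(prefix, draft_dist, target_dist, gamma, rng):
    """Run one draft-then-verify round and return the tokens to append."""
    # Step 1: the draft proposes gamma tokens, one after another.
    drafted, q_dists = [], []
    for _ in range(gamma):
        q = draft_dist(prefix + drafted)
        drafted.append(sample(q, rng))
        q_dists.append(q)
    # Step 2: the target scores all gamma + 1 positions (a loop in this toy).
    p_dists = [target_dist(prefix + drafted[:i]) for i in range(gamma + 1)]
    # Step 3: accept left to right with probability min(1, p / q).
    out = []
    for tok, p, q in zip(drafted, p_dists, q_dists):
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
            continue
        # First rejection: resample from the residual max(0, p - q) and stop.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        out.append(sample(residual, rng))
        return out
    # Every draft token was accepted: take a bonus token from the target.
    out.append(sample(p_dists[gamma], rng))
    return out

# Toy, context-independent distributions over a 3-token vocabulary (assumed).
target = lambda ctx: [0.7, 0.2, 0.1]
draft = lambda ctx: [0.2, 0.2, 0.6]
rng = random.Random(0)
counts = [0, 0, 0]
for _ in range(20000):
    counts[speculative_round([], draft, target, 3, rng)[0]] += 1
print([round(c / 20000, 2) for c in counts])  # close to [0.7, 0.2, 0.1]
```

Even though the draft here is badly mismatched, the empirical frequencies of the emitted tokens follow the target distribution. A poor draft costs speed, not correctness.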

Here is the worked example from NVIDIA's implementation:

Input prefix: "The quick"
Draft model proposes: brown → fox → hopped → over
Target model verifies all 4 tokens in parallel:

brown = ✅ (P_target ≥ P_draft)

fox = ✅ (P_target ≥ P_draft)

hopped = ❌ (P_target << P_draft)

over = ❌ (discarded — first rejection)

At the first rejected position, the same verification pass supplies the target model's own token: "The quick brown fox" → jumped. Later rounds continue from there → the → lazy → dog

Result: 2 of the 4 drafted tokens are accepted and 1 corrected token comes from the same pass, so a single target forward pass yields 3 tokens instead of 1.
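With greedy decoding the acceptance test collapses to an exact match against the target's top token, which makes the example easy to trace in code. The sketch below is a hypothetical helper, not NVIDIA's implementation; the target predictions after the mismatch are invented placeholders and are never used.

```python
def verify_greedy(drafted, target_top):
    """Greedy verification: keep draft tokens while they match the target.

    target_top[i] is the target's top token at position i given drafted[:i],
    so it has len(drafted) + 1 entries, all obtained from one forward pass.
    """
    out = []
    for i, tok in enumerate(drafted):
        if tok != target_top[i]:
            out.append(target_top[i])  # the target's correction ends the round
            return out
        out.append(tok)
    out.append(target_top[len(drafted)])  # all matched: bonus token
    return out

drafted = ["brown", "fox", "hopped", "over"]
# Hypothetical target predictions; entries after the mismatch are never used.
target_top = ["brown", "fox", "jumped", "onto", "a"]
print(verify_greedy(drafted, target_top))  # ['brown', 'fox', 'jumped']
```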

The Key Intuition

Speedup ≈ (1 - α^(γ+1)) / ((1 - α) × (γc + 1)) where:

α (alpha) = probability that the target model accepts a draft token, from 0.0 to 1.0
γ (gamma) = number of speculative tokens proposed per draft round
c = cost of one draft step relative to one target forward pass

The term (1 - α^(γ+1)) / (1 - α) is the expected number of tokens produced per round. This formula reveals why draft model quality matters enormously (taking c = 0.05 as an illustrative value):

At α = 0.8, γ = 5: (1 - 0.8^6) / 0.2 ≈ 3.69 tokens per round, speedup ≈ 3.69 / 1.25 ≈ 2.95×
At α = 0.5, γ = 4: (1 - 0.5^5) / 0.5 ≈ 1.94 tokens per round, speedup ≈ 1.94 / 1.20 ≈ 1.61× (a weaker draft sharply cuts the gain)

The name of the game is getting α as close to 1.0 as possible. This is why draft model design, not just deployment, is the critical lever.
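For readers who want the reasoning behind the dependence on α, here is the standard derivation, under the simplifying assumption that each draft token is accepted independently with probability α. The symbol c (the cost of one draft step as a fraction of one target forward pass) is introduced here for the derivation.

```latex
\begin{aligned}
\Pr[\text{at least } i \text{ draft tokens accepted}] &= \alpha^{i}, \qquad 1 \le i \le \gamma \\
\mathbb{E}[\text{tokens per round}] &= 1 + \sum_{i=1}^{\gamma} \alpha^{i}
  = \frac{1 - \alpha^{\gamma + 1}}{1 - \alpha} \\
\text{cost per round (in target passes)} &= \gamma c + 1 \\
\text{expected speedup} &= \frac{1 - \alpha^{\gamma + 1}}{(1 - \alpha)(\gamma c + 1)}
\end{aligned}
```

The leading 1 in the sum is the token the target model always contributes, either as a correction or as a bonus. As α approaches 1 the expected tokens per round approach γ + 1, which is the ceiling for any draft length.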

Technique Variants: EAGLE, Medusa, Draft Model

There is no single way to implement speculative decoding. Each approach makes different tradeoffs between deployment complexity, speedup ceiling, and accuracy.

1. Draft Model (Classic)

The original formulation: a smaller, faster draft model (e.g. a distilled or quantized variant of the target, often 4–10× fewer parameters) runs autoregressively to propose γ tokens. The target model then verifies.

Pros: Simple to set up, well-studied, works with any model family
Cons: Draft and target distributions inevitably diverge, capping α around 0.5–0.7 in many real-world scenarios
Best for: General-purpose inference, high diversity tasks (creative writing, open-ended chat)

2. EAGLE (Extrapolation Algorithm for Greater Language-Model Efficiency)

EAGLE replaces the separate draft model with a lightweight draft head that reads the target model's internal representations. It takes the hidden states the target model produces before its LM head and extrapolates the next ones, running a few cheap head steps to grow an entire tree of candidate next tokens.

EAGLE-3 adds multi-layer fused feature representations (low, middle, high-level embeddings)
Uses tree-based parallel verification — multiple token hypotheses explored simultaneously in a draft tree, then verified in one batch
No separate draft model required — uses the target model's own KV-cache and internal states

Results: EAGLE-3 achieves 3.0–6.5× speedup over vanilla autoregressive decoding and a 20–40% improvement over EAGLE-2 (arXiv 2503.01840).

Figure: EAGLE Head Architecture
  ┌──────────────────────────────────────────┐
  │ Target Model (frozen weights)            │
  │  ... Layer 28: hidden_state extracted    │
  │  ... Layer 24: hidden_state extracted    │
  │  ... Layer 20: hidden_state extracted    │
  └──────────────┬───────────────────────────┘
                 │ multi-layer feature concat
        ┌────────▼─────────┐
        │   EAGLE Head     │  ← tiny, trainable (~few % of
        │  (linear + norm  │    target model parameters)
        │   + softmax LM)  │
        └────────┬─────────┘
                 │
          Draft tree of K tokens
                 │
        ┌────────▼─────────┐
        │ Target LM Head   │ ← converts hidden states → token probabilities
        └──────────────────┘
        Single forward pass = whole tree verified

Pros: Highest α in practice (0.7–0.9+), no separate model to serve, minimal distribution mismatch because the head drafts from the target's own features
Cons: Requires attaching a head per target model and training it; head must be fine-tuned per model family
Best for: Production inference services where acceptance rate is the bottleneck
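Tree-based verification relies on an attention mask that lets each candidate token see only its own ancestors, so several competing branches can share one forward pass. The sketch below builds such a mask from a parent-index list. It is a generic illustration of tree attention, not EAGLE's actual code, and the example tree and its token labels are invented.

```python
def tree_attention_mask(parents):
    """Return mask[i][j] == True when draft node i may attend to draft node j.

    parents[i] is the index of node i's parent in the draft tree, or -1 when
    the node hangs directly off the already-verified prefix. Every node also
    attends to the whole prefix, which is omitted here.
    """
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:  # walk up to the root, marking self and ancestors
            mask[i][j] = True
            j = parents[j]
    return mask

# Invented tree: two first-token candidates (nodes 0 and 1); node 0 has two
# children (nodes 2 and 3) and node 1 has one child (node 4).
parents = [-1, -1, 0, 0, 1]
for row in tree_attention_mask(parents):
    print("".join("1" if allowed else "0" for allowed in row))
```

Each root-to-leaf path in the tree is one candidate continuation. After the pass, the verifier keeps the longest path whose tokens were all accepted.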

3. Medusa (Multi-Head Decoding)

Medusa takes a structurally different approach: instead of a separate draft mechanism, it adds extra prediction heads directly on top of a frozen LLM. Each head predicts a different future token position:

Figure: Medusa Multi-Head Setup
  ┌──────────────────────────┐
  │   Base LLM (frozen)      │
  └──────────┬───────────────┘
             │ hidden state at position t
      ┌──────▼──────┐
      │ LM Head 0   │ → predicts token t+1
      │ LM Head 1   │ → predicts token t+2
      │ LM Head 2   │ → predicts token t+3
      │ LM Head 3   │ → predicts token t+4
      └──────┬──────┘
             │
       Draft tokens [t+1, t+2, t+3, t+4]
             │
      ┌──────▼──────────────┐
      │ Target LM verifies  │
      │ all in one pass     │
      └─────────────────────┘

Pros: Original model can stay completely frozen, near-zero inference-time overhead for the drafting mechanism, Medusa heads are cheap to train
Cons: Acceptance rate tends to lag EAGLE on long-horizon tasks; Medusa heads underperform on complex reasoning chains
Best for: Teams that can train extra heads on top of their target model and want minimal inference overhead on draft generation

Note: Medusa predates EAGLE but remains a practical choice — especially for the open-source Medusa-2 bottle-neck architecture and the Hydra extension for sequential head dependency.
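Because each Medusa head scores its position independently, candidate continuations are formed by combining the top few tokens from every head. The sketch below takes the full Cartesian product of per-head top-k choices. The logits are made-up numbers over a four-token vocabulary, the function names are hypothetical, and real deployments typically prune this set to a sparser tree before verification.

```python
from itertools import product

def top_k(logits, k):
    """Token ids of the k largest logits, best first."""
    return sorted(range(len(logits)), key=lambda t: logits[t], reverse=True)[:k]

def medusa_candidates(head_logits, k_per_head):
    """Combine each head's top-k tokens into candidate continuations."""
    per_position = [top_k(l, k) for l, k in zip(head_logits, k_per_head)]
    return [list(c) for c in product(*per_position)]

# Made-up logits for three heads (positions t+1, t+2, t+3).
head_logits = [
    [0.1, 2.0, 0.3, 1.5],  # head 0
    [1.0, 0.2, 0.9, 0.1],  # head 1
    [0.0, 0.1, 0.2, 3.0],  # head 2
]
for cand in medusa_candidates(head_logits, k_per_head=[2, 2, 1]):
    print(cand)
```

Later heads usually get a smaller k because their predictions are less reliable, which keeps the number of candidates the target must verify manageable.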

The Acceptance Rate Math: When Does It Break Even?

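Speculative decoding is not free: each speculative round has a compute cost that must be recouped by avoided sequential forward passes. The breakeven is:

Net speedup > 1.0 when:  (1 - α^(γ+1)) / (1 - α) > γc + 1

Where γ is the number of speculative tokens, α is the acceptance rate, and c is the cost of one draft step relative to one target forward pass. The left side is the expected number of tokens a round produces and the right side is what the round costs in target-pass units; for γ = 1 the condition reduces to α > c.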
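The break-even point can also be explored numerically. The sketch below uses the standard expected-speedup model, assuming each draft token is accepted independently with probability alpha and that one draft step costs a fraction c of a target forward pass. The value c = 0.05 and the function names are illustrative assumptions, not measurements.

```python
def speedup(alpha, gamma, c):
    """Expected tokens per round divided by the round's cost in target passes."""
    tokens = sum(alpha ** i for i in range(gamma + 1))  # 1 + a + ... + a^gamma
    return tokens / (gamma * c + 1)

def best_gamma(alpha, c, max_gamma=16):
    """Draft length that maximizes the modelled speedup."""
    return max(range(1, max_gamma + 1), key=lambda g: speedup(alpha, g, c))

for alpha in (0.03, 0.5, 0.8):
    g = best_gamma(alpha, c=0.05)
    print(f"alpha={alpha}: best gamma={g}, speedup={speedup(alpha, g, 0.05):.2f}x")
```

In this model, a draft whose acceptance rate is below its relative cost (alpha < c) never breaks even at any draft length, and longer drafts only pay off as alpha rises.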

Empirical Benchmarks from Real Deployments

Using Llama-3.1-8B-Instruct as the target model with baseline E2E latency of 4,065 ms (from BentoML's patched-vLLM test):

Acceptance Rate (α)	γ = 3 Spec Tokens	γ = 5 Spec Tokens	γ = 7 Spec Tokens
α = 0.20	1.08×	~1.0× (no gain)	Worse
α = 0.40	1.33×	1.50×	~1.0×
α = 0.60	1.54×	2.13×	2.62×
α = 0.80	1.75×	2.86×	3.78×

The practical takeaway: in this table, a draft mechanism needs α ≥ 0.6 with γ ≥ 5 to deliver a speedup above 2×, and at α = 0.8 with γ = 7 the speedup exceeds 3×. This is why EAGLE's feature-level drafting is so effective: it reduces the distribution mismatch that keeps draft-model α capped around 0.5–0.7.

Production Deployment: EAGLE-3, vLLM, Cloud

vLLM: Production Default

vLLM v0.8.4+ ships with 7 built-in speculation methods:

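# EAGLE-3 needs a draft-head checkpoint; the "model" entry is an example for Llama-3.1-8B-Instruct
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --speculative-config '{
    "method": "eagle3",
    "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B",
    "num_speculative_tokens": 7
  }'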

The built-in --speculative-config flag handles all the KV-cache bookkeeping, tree attention, and rejection sampling internally. Extension to other frameworks (HuggingFace TGI, SGLang) follows the same pattern — a single config toggle brings 2–3× speedup.

NVIDIA Arctic Inference

Pushing EAGLE further, NVIDIA's Arctic Inference delivers the highest measured speculative decoding throughput on NVIDIA GPUs, specifically optimized under TensorRT-LLM and vLLM's deferral mechanism. Benchmarks on Llama-3.1-70B show Arctic Inference achieving a 3.6× throughput gain over standard autoregressive decoding (vLLM v0.8.5+ feature).

AWS Trainium

AWS published results running speculative decoding on their custom Trainium accelerators with vLLM. Key finding: for decode-heavy workloads (typical of chatbot-style generation with moderate context windows), speculative decoding reduced total latency per request by a factor of 1.4–1.8 across the tested model family, while maintaining ∼100% target model accuracy.

Draft Model Training

For draft-model-style speculative decoding, the draft model should ideally share the same architecture and tokenizer as the target. The BentoML team found that training a custom draft model matched to the inference workload distribution yielded much higher acceptance rates than generic off-the-shelf draft models: up to a 3× speedup over the sequential baseline, compared with only 1.8–2.0× for generic drafters.
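Whether a custom drafter is worth training depends on the acceptance rate measured on your own traffic. A minimal sketch for computing it from per-round logs follows. The log format (a list with the number of accepted draft tokens in each round) is a hypothetical one chosen for the example; serving frameworks expose similar counters under their own names.

```python
def acceptance_stats(accepted_per_round, gamma):
    """Summarize speculative decoding logs for a fixed draft length gamma."""
    rounds = len(accepted_per_round)
    accepted = sum(accepted_per_round)
    drafted = gamma * rounds
    # A round stops at its first rejection, so only the accepted tokens plus
    # that one rejected token were actually tested against the target.
    tested = sum(a + (1 if a < gamma else 0) for a in accepted_per_round)
    return {
        "draft_acceptance_rate": accepted / drafted,  # accepted / all drafted
        "per_token_alpha": accepted / tested,         # accepted / tested
        "tokens_per_target_pass": accepted / rounds + 1,
    }

# Hypothetical log: five rounds with gamma = 4 draft tokens each.
stats = acceptance_stats([4, 2, 0, 4, 1], gamma=4)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

The two rates differ because draft tokens after the first rejection are discarded without being judged. The per-token figure is the one that corresponds to α in the speedup math.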

Comparison: Speculative Decoding vs. Other Inference Speedup Techniques

Table: Speedup technique comparison
┌──────────────────────────────────┬──────────────┬──────────────┐
│  Technique                       │  Speedup     │  Cost        │
├──────────────────────────────────┼──────────────┼──────────────┤
│  INT4/INT8 quantization          │  1.2–1.5×    │  ✓ Free      │
│  GPTQ / AWQ (4-bit)              │  1.5–2.0×    │  ✓ Free      │
│  KV-cache quantization (KVCache) │  1.1–1.3×    │  ✓ Free      │
│  Continuous batching (vLLM)      │  1.5–5.0×    │  ✓ Free      │
│  Speculative decoding (EAGLE)    │  1.5–6.5×    │  1–2% params │
│  Speculative decoding (EAGLE-3)  │  2.0–6.5×    │  1–2% params │
│  BitNet b1.58 (1.58-bit)         │  2–5×        │  ⚠ Retrain   │
│  Distillation (TinyLlama, etc.)  │  1.0–1.3×    │  High cost   │
└──────────────────────────────────┴──────────────┴──────────────┘

Note: BitNet b1.58 achieves similar raw throughput from a completely different mechanism — dropping from FP16 weights to ternary {-1, 0, +1}. It covers the efficiency dimension fully, while speculative decoding covers the speed-per-token dimension. They are not alternatives; they complement each other.

	Speculative Decoding	Quantization	KV-cache Quant	Continuous Batching
Mechanism	Draft + verify	Reduce precision	Compress cache	Batch multiple requests
When it helps most	Low-concurrency, latency-sensitive serving (chat, summarization)	All inference	Any inference	High-concurrency serving
Accuracy loss	Zero	0–5% typical	0–2%	Zero
Memory overhead	Draft model (~1–5%)	None	None	None
Setup complexity	Low–Medium	Low	Low	Low (vLLM)
Combinable	✅ With all others	✅ With all others	✅ With all others	✅ With all others

Getting Started: 3 Ways to Enable Speculative Decoding Today

Option 1: vLLM (EAGLE, zero config beyond flag)

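# Serve with EAGLE-3 speculation; the "model" entry is an example draft-head checkpoint, swap in one trained for your target
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --speculative-config '{"method": "eagle3", "model": "yuhuili/EAGLE3-LLaMA3.1-Instruct-8B", "num_speculative_tokens": 5}'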

Benchmark with the built-in script:

python3 examples/features/speculative_decoding/spec_decode_offline.py \
  --model meta-llama/Llama-3.1-8B-Instruct

Option 2: TensorRT-LLM (EAGLE + Arctic Inference)

from tensorrt_llm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    speculative_config={
        "model": "sg2018/EAGLE-llama3.1-8B",
        "method": "eagle",
        "num_speculative_tokens": 5,
    }
)

Option 3: HuggingFace Text Generation Inference (Medusa via Offload)

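TGI supports speculative decoding out of the box through the launcher's --speculate flag, which sets the number of speculated tokens. Pointing --model-id at a Medusa-enabled checkpoint uses its Medusa heads; otherwise TGI falls back to n-gram speculation:

text-generation-launcher --model-id meta-llama/Meta-Llama-3-8B-Instruct \
  --speculate 5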

Conclusion & Next Steps

Speculative decoding is one of those rare techniques where the theory is elegant and the engineering payoff is real and immediate. At this stage in 2025, it has:

✅ Theoretical guarantee: the output distribution is mathematically identical to vanilla autoregressive generation
✅ Multiple production implementations: vLLM, TensorRT-LLM, TGI, and cloud-native stacks
✅ Up to 3.0–6.5× speedups in published EAGLE-3 benchmarks, with 2–3× typical of production deployments
✅ Zero quality cost: every emitted token passes target-model verification, so the output distribution matches the target model's
✅ Combinability: stacks with quantization, KV-cache optimization, and continuous batching for compounded speedups

The practical advice for any team serving LLMs today: decouple drafting from generation, choose a fast pathway (EAGLE-3 or Medusa for first-class model families, N-gram or suffix for zero-overhead cases), and benchmark your own α on your own workload — because the theoretical speedup numbers are only as good as your real acceptance rate.

What you can do today:

🚀 Enable vLLM speculative decoding: One flag, zero code changes, 2–3× speedup on any supported model in minutes
📊 Benchmark yourself: python3 examples/features/speculative_decoding/spec_decode_offline.py — measure α, TPS, and E2E latency on your workload
📚 Read the foundational papers: Speculative Sampling (DeepMind, 2023) · EAGLE (2024) · EAGLE-3 (2025) · Medusa (2024)
🔧 Fine-tune a Medusa head: If you own your target model and can run a few training epochs, Medusa gives you a zero-overhead draft mechanism with no separate serving cost

Speculative decoding is not a future technology. It is here, in your inference stack, ready to be a single flag away from turning 100 tokens-per-second into 300. The engineers who enable it first will deploy cheaper, run cooler, and serve happier users — without compromising a single point of accuracy.
