[Paper] Accelerate Speculative Decoding with Sparse Computation in Verification
Source: arXiv - 2512.21911v1
Overview
The paper tackles a hidden performance killer in speculative decoding—a technique that speeds up large language model (LLM) inference by letting the model “guess” several tokens at once and then verify them in parallel. While the guessing step is fast, the verification step can dominate the runtime, especially for long inputs or mixture‑of‑experts (MoE) models. The authors introduce a sparse verification framework that trims unnecessary computation in attention, feed‑forward, and MoE layers, delivering noticeable speed‑ups without sacrificing answer quality.
Key Contributions
- Systematic sparsification of verification: Adapts and evaluates several sparsity methods (structured pruning, top‑k selection, etc.) specifically for the verification phase of speculative decoding.
- Joint sparsity across model components: Simultaneously sparsifies attention, feed‑forward networks (FFNs), and MoE routing, uncovering redundancy that prior token‑wise sparsity work missed.
- Inter‑draft token & inter‑layer reuse: Reuses intermediate results between draft tokens and across transformer layers, cutting repeated work without extra training.
- Broad empirical validation: Experiments on summarization, QA, and math‑reasoning benchmarks show a favorable efficiency‑accuracy trade‑off and stable “acceptance length” (the number of draft tokens that pass verification).
Methodology
- Speculative Decoding Recap – A small, fast draft model proposes several tokens; the full LLM then runs a single verification pass over the context plus the drafts in parallel to decide which of them to accept (a minimal sketch of this loop follows this list).
- Identifying Redundancy – The authors profile verification on long contexts and MoE models, finding that many attention heads, FFN neurons, and expert routes contribute little to the final logits for draft tokens.
- Sparse Verification Engine (a simplified sketch of the attention and MoE variants follows this list)
  - Attention sparsity: Keep only the top‑k keys/values per query (structured block‑sparsity) based on a cheap relevance score.
  - FFN sparsity: Apply magnitude‑based pruning to hidden dimensions, re‑activating only the most influential neurons per layer.
  - MoE sparsity: Limit the number of experts consulted per token (dynamic top‑k routing) and prune low‑weight expert parameters on‑the‑fly.
- Reuse Strategies
  - Inter‑draft token reuse: Cache attention scores and intermediate activations that are identical across draft tokens, avoiding recomputation.
  - Inter‑layer reuse: Propagate cached activations from earlier layers to later ones when the computation pattern repeats.
- No Extra Training Required – All sparsity decisions are made at inference time using lightweight heuristics, so the model can be deployed unchanged.
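For orientation, here is a minimal sketch of the draft‑then‑verify loop that the verification pass belongs to. It uses greedy acceptance and hypothetical `draft_model` / `target_model` callables as stand‑ins for real models; it is not the authors' code, and production systems typically use probabilistic acceptance rules as well.

```python
# Minimal draft-then-verify loop (greedy acceptance). `draft_model` and
# `target_model` are hypothetical callables standing in for real LLMs.
from typing import Callable, List

def speculative_step(
    prefix: List[int],
    draft_model: Callable[[List[int]], int],         # next token for a sequence
    target_model: Callable[[List[int]], List[int]],  # argmax token after every position
    num_draft: int = 4,
) -> List[int]:
    """One speculative step: draft `num_draft` tokens, then verify them in parallel."""
    # 1) Drafting: the cheap model proposes tokens autoregressively.
    drafts: List[int] = []
    seq = list(prefix)
    for _ in range(num_draft):
        t = draft_model(seq)
        drafts.append(t)
        seq.append(t)

    # 2) Verification: one parallel pass of the full model over prefix + drafts.
    #    This is the pass the paper sparsifies (attention / FFN / MoE).
    preds = target_model(prefix + drafts)  # preds[i] = target's token after position i

    # 3) Greedy acceptance: keep drafts while they match the target's predictions,
    #    then emit the target's own token at the first mismatch (or a bonus token).
    accepted: List[int] = []
    for i, d in enumerate(drafts):
        expected = preds[len(prefix) - 1 + i]
        if d == expected:
            accepted.append(d)
        else:
            accepted.append(expected)
            break
    else:
        accepted.append(preds[len(prefix) + num_draft - 1])
    return accepted

if __name__ == "__main__":
    # Toy demo: both "models" just count upward, so every draft is accepted.
    draft = lambda seq: seq[-1] + 1
    target = lambda seq: [t + 1 for t in seq]
    print(speculative_step([1, 2, 3], draft, target))  # [4, 5, 6, 7, 8]
```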
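The attention‑ and MoE‑sparsity items above can be illustrated with per‑query top‑k key selection and per‑token top‑k expert routing. The raw dot‑product relevance score, the per‑query (rather than block‑structured) granularity, and the NumPy implementation are simplifying assumptions for clarity, not the paper's fused sparse kernels.

```python
# Illustrative sketches of two sparsification knobs; simplified, not the paper's kernels.
import numpy as np

def topk_sparse_attention(q: np.ndarray, K: np.ndarray, V: np.ndarray, k: int) -> np.ndarray:
    """Attend from one query to only its k highest-scoring keys."""
    scores = K @ q / np.sqrt(q.shape[-1])           # (seq_len,) relevance scores
    keep = np.argpartition(scores, -k)[-k:]         # indices of the top-k keys
    w = np.exp(scores[keep] - scores[keep].max())   # softmax over the kept keys only
    w /= w.sum()
    return w @ V[keep]                              # (head_dim,) sparse attention output

def topk_expert_routing(gate_logits: np.ndarray, k: int) -> list:
    """Route a token to only its k highest-scoring experts (dynamic top-k MoE)."""
    keep = np.argpartition(gate_logits, -k)[-k:]
    gates = np.exp(gate_logits[keep] - gate_logits[keep].max())
    gates /= gates.sum()
    # Return (expert_id, weight) pairs; only these experts' FFNs would be executed.
    return sorted(zip(keep.tolist(), gates.tolist()))

# Toy usage: 128 keys/values of dimension 64, keep 16 keys; 8 experts, keep 2.
rng = np.random.default_rng(0)
q, K, V = rng.normal(size=64), rng.normal(size=(128, 64)), rng.normal(size=(128, 64))
print(topk_sparse_attention(q, K, V, k=16).shape)    # (64,)
print(topk_expert_routing(rng.normal(size=8), k=2))  # two (expert, weight) pairs
```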
Results & Findings
| Task | Model | Baseline speculative | With sparse verification | Relative gain* | Accuracy Δ |
|---|---|---|---|---|---|
| Summarization (XSum) | LLaMA‑7B | 1.8× | 2.4× | +33% | –0.2% ROUGE |
| QA (SQuAD) | MoE‑GLaM‑1.2B | 2.1× | 2.9× | +38% | –0.1% EM |
| Math (MATH) | LLaMA‑13B | 1.6× | 2.2× | +38% | –0.3% accuracy |
*The baseline and sparse‑verification columns report end‑to‑end speed‑up (draft + verification) over standard autoregressive decoding; the relative gain compares sparse verification against the speculative baseline.
- Stable acceptance length: The number of draft tokens accepted per verification pass stays roughly unchanged, so the added sparsity does not push the system back toward token‑by‑token decoding (a back‑of‑the‑envelope throughput model follows this list).
- Efficiency‑accuracy trade‑off: Adjusting the sparsity hyper‑parameters (e.g., the top‑k sizes) lets developers dial in the desired balance: more speed at the cost of a small drop in task metrics, or near‑full accuracy with more modest gains.
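Why a stable acceptance length matters can be seen from a back‑of‑the‑envelope throughput model (not taken from the paper); all timings below are made‑up placeholders. Cheaper verification raises throughput only as long as the average number of tokens emitted per verification pass holds up.

```python
# Back-of-the-envelope throughput model for speculative decoding.
# All timings are illustrative placeholders, not measurements from the paper.

def tokens_per_second(accept_len: float, t_draft: float, t_verify: float, num_draft: int) -> float:
    """Average generated tokens per second for one speculative step.

    accept_len: average tokens emitted per verification pass (accepted drafts + bonus token)
    t_draft:    seconds per draft-model forward pass
    t_verify:   seconds per verification pass of the full model
    num_draft:  draft tokens proposed per step
    """
    step_time = num_draft * t_draft + t_verify
    return accept_len / step_time

# Same acceptance length, cheaper verification -> higher throughput.
dense  = tokens_per_second(accept_len=3.0, t_draft=0.002, t_verify=0.060, num_draft=4)
sparse = tokens_per_second(accept_len=3.0, t_draft=0.002, t_verify=0.040, num_draft=4)
print(f"dense verification : {dense:.1f} tok/s")   # ~44 tok/s
print(f"sparse verification: {sparse:.1f} tok/s")  # ~62 tok/s
```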
Practical Implications
- Faster LLM APIs: Cloud providers can integrate sparse verification to cut latency for services that already use speculative decoding (e.g., chat assistants, code completion).
- Cost savings on MoE deployments: MoE models are notoriously expensive because routing many experts per token is heavy; sparsifying the verification step reduces GPU memory bandwidth and compute, lowering operational costs.
- Edge‑friendly inference: The approach requires only inference‑time heuristics, making it compatible with existing model checkpoints and hardware accelerators without retraining.
- Scalable to longer contexts: As prompts grow (e.g., document‑level summarization), verification becomes the bottleneck; sparse verification mitigates this, enabling real‑time performance even with 8‑16 k token windows.
Limitations & Future Work
- Heuristic sensitivity: The sparsity thresholds (top‑k values) are manually tuned; sub‑optimal settings can hurt accuracy, especially on tasks with fine‑grained reasoning.
- Hardware‑specific gains: The reported speed‑ups assume GPUs with efficient sparse kernels; on older hardware the benefit may be smaller.
- No training‑time sparsity: While avoiding retraining is a plus, the method cannot exploit model‑specific sparsity patterns that could be learned during fine‑tuning.
- Future directions: The authors suggest learning adaptive sparsity policies via reinforcement learning, extending the framework to multimodal models, and integrating with other inference accelerators (e.g., FlashAttention).
Authors
- Jikai Wang
- Jianchao Tan
- Yuxuan Hu
- Jiayu Qin
- Yerui Sun
- Yuchen Xie
- Xunliang Cai
- Juntao Li
- Min Zhang
Paper Information
- arXiv ID: 2512.21911v1
- Categories: cs.CL
- Published: December 26, 2025
- PDF: https://arxiv.org/pdf/2512.21911v1