[Paper] Accelerate Speculative Decoding with Sparse Computation in Verification
Source: arXiv - 2512.21911v1
Overview
The paper tackles a hidden performance killer in speculative decoding—a technique that speeds up large language model (LLM) inference by letting the model “guess” several tokens at once and then verify them in parallel. While the guessing step is fast, the verification step can dominate the runtime, especially for long inputs or mixture‑of‑experts (MoE) models. The authors introduce a sparse verification framework that trims unnecessary computation in attention, feed‑forward, and MoE layers, delivering noticeable speed‑ups without sacrificing answer quality.
Key Contributions
- Systematic sparsification of verification: Adapts and evaluates several sparsity methods (structured pruning, top‑k selection, etc.) specifically for the verification phase of speculative decoding.
- Joint sparsity across model components: Simultaneously sparsifies attention, feed‑forward networks (FFNs), and MoE routing, uncovering redundancy that prior token‑wise sparsity work missed.
- Inter‑draft token & inter‑layer reuse: Reuses intermediate results between draft tokens and across transformer layers, cutting repeated work without extra training.
- Broad empirical validation: Experiments on summarization, QA, and math‑reasoning benchmarks show a favorable efficiency‑accuracy trade‑off and stable “acceptance length” (the number of draft tokens that pass verification).
Methodology
- Speculative Decoding Recap – A small, fast draft model proposes several tokens; the full LLM then runs a single verification pass over the context plus the drafts in parallel to decide which of them to accept (a minimal sketch of this loop follows this list).
- Identifying Redundancy – The authors profile verification on long contexts and MoE models, finding that many attention heads, FFN neurons, and expert routes contribute little to the final logits for draft tokens.
- Sparse Verification Engine (a simplified sketch of the attention and MoE variants follows this list)
  - Attention sparsity: Keep only the top‑k keys/values per query (structured block‑sparsity) based on a cheap relevance score.
  - FFN sparsity: Apply magnitude‑based pruning to hidden dimensions, re‑activating only the most influential neurons per layer.
  - MoE sparsity: Limit the number of experts consulted per token (dynamic top‑k routing) and prune low‑weight expert parameters on‑the‑fly.
- Reuse Strategies
  - Inter‑draft token reuse: Cache attention scores and intermediate activations that are identical across draft tokens, avoiding recomputation.
  - Inter‑layer reuse: Propagate cached activations from earlier layers to later ones when the computation pattern repeats.
- No Extra Training Required – All sparsity decisions are made at inference time using lightweight heuristics, so the model can be deployed unchanged.
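For orientation, here is a minimal sketch of the draft‑then‑verify loop that the verification pass belongs to. It uses greedy acceptance and hypothetical `draft_model` / `target_model` callables as stand‑ins for real models; it is not the authors' code, and production systems typically use probabilistic acceptance rules as well.

```python
# Minimal draft-then-verify loop (greedy acceptance). `draft_model` and
# `target_model` are hypothetical callables standing in for real LLMs.
from typing import Callable, List

def speculative_step(
    prefix: List[int],
    draft_model: Callable[[List[int]], int],         # next token for a sequence
    target_model: Callable[[List[int]], List[int]],  # argmax token after every position
    num_draft: int = 4,
) -> List[int]:
    """One speculative step: draft `num_draft` tokens, then verify them in parallel."""
    # 1) Drafting: the cheap model proposes tokens autoregressively.
    drafts: List[int] = []
    seq = list(prefix)
    for _ in range(num_draft):
        t = draft_model(seq)
        drafts.append(t)
        seq.append(t)

    # 2) Verification: one parallel pass of the full model over prefix + drafts.
    #    This is the pass the paper sparsifies (attention / FFN / MoE).
    preds = target_model(prefix + drafts)  # preds[i] = target's token after position i

    # 3) Greedy acceptance: keep drafts while they match the target's predictions,
    #    then emit the target's own token at the first mismatch (or a bonus token).
    accepted: List[int] = []
    for i, d in enumerate(drafts):
        expected = preds[len(prefix) - 1 + i]
        if d == expected:
            accepted.append(d)
        else:
            accepted.append(expected)
            break
    else:
        accepted.append(preds[len(prefix) + num_draft - 1])
    return accepted

if __name__ == "__main__":
    # Toy demo: both "models" just count upward, so every draft is accepted.
    draft = lambda seq: seq[-1] + 1
    target = lambda seq: [t + 1 for t in seq]
    print(speculative_step([1, 2, 3], draft, target))  # [4, 5, 6, 7, 8]
```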
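The attention‑ and MoE‑sparsity items above can be illustrated with per‑query top‑k key selection and per‑token top‑k expert routing. The raw dot‑product relevance score, the per‑query (rather than block‑structured) granularity, and the NumPy implementation are simplifying assumptions for clarity, not the paper's fused sparse kernels.

```python
# Illustrative sketches of two sparsification knobs; simplified, not the paper's kernels.
import numpy as np

def topk_sparse_attention(q: np.ndarray, K: np.ndarray, V: np.ndarray, k: int) -> np.ndarray:
    """Attend from one query to only its k highest-scoring keys."""
    scores = K @ q / np.sqrt(q.shape[-1])           # (seq_len,) relevance scores
    keep = np.argpartition(scores, -k)[-k:]         # indices of the top-k keys
    w = np.exp(scores[keep] - scores[keep].max())   # softmax over the kept keys only
    w /= w.sum()
    return w @ V[keep]                              # (head_dim,) sparse attention output

def topk_expert_routing(gate_logits: np.ndarray, k: int) -> list:
    """Route a token to only its k highest-scoring experts (dynamic top-k MoE)."""
    keep = np.argpartition(gate_logits, -k)[-k:]
    gates = np.exp(gate_logits[keep] - gate_logits[keep].max())
    gates /= gates.sum()
    # Return (expert_id, weight) pairs; only these experts' FFNs would be executed.
    return sorted(zip(keep.tolist(), gates.tolist()))

# Toy usage: 128 keys/values of dimension 64, keep 16 keys; 8 experts, keep 2.
rng = np.random.default_rng(0)
q, K, V = rng.normal(size=64), rng.normal(size=(128, 64)), rng.normal(size=(128, 64))
print(topk_sparse_attention(q, K, V, k=16).shape)    # (64,)
print(topk_expert_routing(rng.normal(size=8), k=2))  # two (expert, weight) pairs
```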
Results & Findings
| Task | Model | Baseline speculative | With sparse verification | Relative gain* | Accuracy Δ |
|---|---|---|---|---|---|
| Summarization (XSum) | LLaMA‑7B | 1.8× | 2.4× | +33% | –0.2% ROUGE |
| QA (SQuAD) | MoE‑GLaM‑1.2B | 2.1× | 2.9× | +38% | –0.1% EM |
| Math (MATH) | LLaMA‑13B | 1.6× | 2.2× | +38% | –0.3% accuracy |
*The baseline and sparse‑verification columns report end‑to‑end speed‑up (draft + verification) over standard autoregressive decoding; the relative gain compares sparse verification against the speculative baseline.
- Stable acceptance length: The number of draft tokens accepted per verification pass stays roughly unchanged, so the added sparsity does not push the system back toward token‑by‑token decoding (a back‑of‑the‑envelope throughput model follows this list).
- Efficiency‑accuracy trade‑off: Adjusting the sparsity hyper‑parameters (e.g., the top‑k sizes) lets developers dial in the desired balance: more speed at the cost of a small drop in task metrics, or near‑full accuracy with more modest gains.
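Why a stable acceptance length matters can be seen from a back‑of‑the‑envelope throughput model (not taken from the paper); all timings below are made‑up placeholders. Cheaper verification raises throughput only as long as the average number of tokens emitted per verification pass holds up.

```python
# Back-of-the-envelope throughput model for speculative decoding.
# All timings are illustrative placeholders, not measurements from the paper.

def tokens_per_second(accept_len: float, t_draft: float, t_verify: float, num_draft: int) -> float:
    """Average generated tokens per second for one speculative step.

    accept_len: average tokens emitted per verification pass (accepted drafts + bonus token)
    t_draft:    seconds per draft-model forward pass
    t_verify:   seconds per verification pass of the full model
    num_draft:  draft tokens proposed per step
    """
    step_time = num_draft * t_draft + t_verify
    return accept_len / step_time

# Same acceptance length, cheaper verification -> higher throughput.
dense  = tokens_per_second(accept_len=3.0, t_draft=0.002, t_verify=0.060, num_draft=4)
sparse = tokens_per_second(accept_len=3.0, t_draft=0.002, t_verify=0.040, num_draft=4)
print(f"dense verification : {dense:.1f} tok/s")   # ~44 tok/s
print(f"sparse verification: {sparse:.1f} tok/s")  # ~62 tok/s
```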
Practical Implications
- Faster LLM APIs: Cloud providers can integrate sparse verification to cut latency for services that already use speculative decoding (e.g., chat assistants, code completion).
- Cost savings on MoE deployments: MoE models are notoriously expensive because routing many experts per token is heavy; sparsifying the verification step reduces GPU memory bandwidth and compute, lowering operational costs.
- Edge‑friendly inference: The approach requires only inference‑time heuristics, making it compatible with existing model checkpoints and hardware accelerators without retraining.
- Scalable to longer contexts: As prompts grow (e.g., document‑level summarization), verification becomes the bottleneck; sparse verification mitigates this, enabling real‑time performance even with 8‑16 k token windows.
Limitations & Future Work
- Heuristic sensitivity: The sparsity thresholds (top‑k values) are manually tuned; sub‑optimal settings can hurt accuracy, especially on tasks with fine‑grained reasoning.
- Hardware‑specific gains: The reported speed‑ups assume GPUs with efficient sparse kernels; on older hardware the benefit may be smaller.
- No training‑time sparsity: While avoiding retraining is a plus, the method cannot exploit model‑specific sparsity patterns that could be learned during fine‑tuning.
- Future directions: The authors suggest learning adaptive sparsity policies via reinforcement learning, extending the framework to multimodal models, and integrating with other inference accelerators (e.g., FlashAttention).
Authors
- Jikai Wang
- Jianchao Tan
- Yuxuan Hu
- Jiayu Qin
- Yerui Sun
- Yuchen Xie
- Xunliang Cai
- Juntao Li
- Min Zhang
Paper Information
- arXiv ID: 2512.21911v1
- Categories: cs.CL
- Published: December 26, 2025
- PDF: https://arxiv.org/pdf/2512.21911v1