[Paper] D^2SD: Accelerating Speculative Decoding with Dual Diffusion Draft Models

Published: June 3, 2026 at 12:48 AM EDT

Source: arXiv - 2606.04446v1

Overview

The paper “D²SD: Accelerating Speculative Decoding with Dual Diffusion Draft Models” targets a major bottleneck in deploying large language models (LLMs): the latency of autoregressive inference. It combines diffusion‑based draft generation with a two‑stage drafting and verification strategy to raise the token throughput of speculative decoding while preserving the quality of the final output.

Key Contributions

Dual‑diffusion drafting framework: Introduces a two‑stage diffusion draft process that produces multiple candidate continuations in parallel, rather than a single linear draft.
Confidence‑guided prefix tree: Uses per‑position confidence scores from the first diffusion drafter to build a prefix tree, automatically identifying the most likely rejection point and selecting the top‑K promising prefix ranges.
Variable‑prefix re‑anchoring: A second diffusion drafter re‑generates alternative continuations for each selected prefix in a single batched pass, raising the acceptance rate of drafted tokens.
Cascade attention verification: Jointly verifies all shared‑prefix candidates with a single target‑model forward pass, cutting down the verification overhead.
Empirical gains: Demonstrates consistent speed‑up over both the baseline diffusion speculative decoder and strong autoregressive speculative decoding baselines across multiple LLM sizes and benchmark datasets.

Methodology

First Diffusion Draft
- A diffusion model generates a block of N tokens in parallel.
- At each position it also outputs a confidence score (how likely the token will be accepted by the target LLM).
Prefix Tree Construction
- The confidence scores are scanned to locate the most probable rejection boundary, i.e., the position where the draft is most likely to diverge from the target LLM's distribution.
- The tree keeps the top‑K prefix intervals (e.g., tokens 1‑3, 1‑5, 1‑7) that have the highest chance of being correct.
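One way to picture the prefix selection is the minimal sketch below. It is not the paper's algorithm: it assumes each confidence score is an independent acceptance probability, so the chance that the first rejection falls at position i is the product of the earlier confidences times (1 - confidence at i). The function names `first_rejection_probs` and `select_prefix_lengths` are hypothetical.

```python
from typing import List


def first_rejection_probs(conf: List[float]) -> List[float]:
    """P(first rejection happens at position i), under an independence assumption."""
    probs = []
    survive = 1.0  # probability that all earlier positions were accepted
    for c in conf:
        probs.append(survive * (1.0 - c))
        survive *= c
    return probs


def select_prefix_lengths(conf: List[float], k: int) -> List[int]:
    """Return the k prefix lengths whose end is the most likely rejection boundary.

    A prefix length of n means draft tokens 0..n-1 are kept and position n
    is where the second drafter would re-anchor.
    """
    probs = first_rejection_probs(conf)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return sorted(ranked[:k])


# Toy confidences for a 5-token draft: positions 2 and 4 look shaky.
conf = [0.9, 0.9, 0.5, 0.9, 0.4]
prefix_lengths = select_prefix_lengths(conf, k=2)
print(prefix_lengths)
```

With these toy scores the two selected prefixes keep 2 and 4 draft tokens respectively, which matches the intuition that low-confidence positions are the natural places to branch.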
Second Variable‑Prefix Diffusion Draft
- For each selected prefix, a second diffusion model “re‑anchors” at that prefix and generates alternative continuations that share the same prefix but differ afterwards.
- All these alternatives are produced in a single batched diffusion pass, keeping GPU utilization high.
Cascade Attention Verification
- The target autoregressive LLM receives the set of candidate continuations that share prefixes.
- Using cascade attention, the model evaluates them together, accepting the longest prefix that matches the target’s predictions and discarding the rest.
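The acceptance rule in this step can be illustrated with a simplified sketch. It only shows the "keep the longest matching prefix" logic under greedy decoding; it does not implement cascade attention or the shared-prefix single forward pass, and it calls a hypothetical `target_next_token` function once per position instead. All names here are assumptions for illustration.

```python
from typing import Callable, List, Sequence

NextTokenFn = Callable[[Sequence[int]], int]


def accepted_length(context: Sequence[int], candidate: Sequence[int],
                    target_next_token: NextTokenFn) -> int:
    """Count leading candidate tokens that match the target's greedy prediction."""
    seq = list(context)
    accepted = 0
    for tok in candidate:
        if target_next_token(seq) != tok:
            break
        seq.append(tok)
        accepted += 1
    return accepted


def verify_candidates(context: Sequence[int], candidates: List[List[int]],
                      target_next_token: NextTokenFn) -> List[int]:
    """Return the longest accepted prefix across all candidate continuations."""
    best: List[int] = []
    for cand in candidates:
        n = accepted_length(context, cand, target_next_token)
        if n > len(best):
            best = cand[:n]
    return best


# Toy target model: always predicts (last token + 1) mod 10.
def toy_target(seq: Sequence[int]) -> int:
    return (seq[-1] + 1) % 10


candidates = [[3, 4, 9, 9], [3, 4, 5, 0], [3, 7, 7, 7]]
accepted = verify_candidates([1, 2], candidates, toy_target)
print(accepted)
```

In this toy run the three candidates share the prefix `[3]` or `[3, 4]`; the second candidate matches the target for three tokens, so `[3, 4, 5]` is kept and the rest is discarded.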
Iterative Loop
- The process repeats, sliding the window forward by the number of accepted tokens, until the desired output length is reached.
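The overall loop might be sketched as follows. The drafter, verifier, and target model are passed in as hypothetical callables (`draft_block`, `verify_block`, `target_next_token`), and the toy versions below stand in for real models. The fallback to one target token when nothing is accepted is an assumption added here so the loop always makes progress; the summary above does not describe how the paper handles that case.

```python
from typing import Callable, List, Sequence


def speculative_loop(
    prompt: Sequence[int],
    max_new_tokens: int,
    draft_block: Callable[[Sequence[int]], List[List[int]]],
    verify_block: Callable[[Sequence[int], List[List[int]]], List[int]],
    target_next_token: Callable[[Sequence[int]], int],
) -> List[int]:
    """Draft, verify, then slide forward by the number of accepted tokens."""
    out = list(prompt)
    produced = 0
    while produced < max_new_tokens:
        candidates = draft_block(out)
        accepted = verify_block(out, candidates)
        if not accepted:
            # Assumed fallback: emit one token from the target model.
            accepted = [target_next_token(out)]
        accepted = accepted[: max_new_tokens - produced]
        out.extend(accepted)
        produced += len(accepted)
    return out


# Toy target: next token is (last + 1) mod 10.
def toy_target(seq: Sequence[int]) -> int:
    return (seq[-1] + 1) % 10


# Toy drafter: two correct tokens followed by a deliberately wrong third one.
def toy_draft(seq: Sequence[int]) -> List[List[int]]:
    a = toy_target(seq)
    b = (a + 1) % 10
    return [[a, b, b]]


# Toy verifier: longest prefix matching the target's greedy predictions.
def toy_verify(seq: Sequence[int], candidates: List[List[int]]) -> List[int]:
    best: List[int] = []
    for cand in candidates:
        ctx, ok = list(seq), []
        for tok in cand:
            if toy_target(ctx) != tok:
                break
            ctx.append(tok)
            ok.append(tok)
        if len(ok) > len(best):
            best = ok
    return best


result = speculative_loop([0], 5, toy_draft, toy_verify, toy_target)
print(result)
```

Each iteration of the toy run accepts two drafted tokens, so five new tokens are produced in three draft-and-verify rounds rather than five sequential target steps.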

Results & Findings

Model / Setting	Tokens per second (TPS)	Acceptance Rate	Speed‑up vs. Baseline
Standard autoregressive decoding	45	—	1×
Single‑diffusion speculative decoding	78	38 %	1.7×
D²SD (dual diffusion)	112	55 %	2.5×
Autoregressive speculative (e.g., Draft‑LLM)	95	48 %	2.1×
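
As a quick consistency check, the speed‑up column can be recomputed from the throughput column by dividing each TPS figure by the 45 TPS autoregressive baseline. The snippet only uses the numbers from the table above; the dictionary keys are shorthand labels chosen here.

```python
baseline_tps = 45.0  # standard autoregressive decoding, from the table

reported_tps = {
    "single_diffusion": 78.0,
    "d2sd": 112.0,
    "ar_speculative": 95.0,
}

# Speed-up relative to plain autoregressive decoding, rounded as in the table.
speedups = {name: round(tps / baseline_tps, 1) for name, tps in reported_tps.items()}

# Relative gain of D2SD over the two speculative baselines.
gain_over_single = round(reported_tps["d2sd"] / reported_tps["single_diffusion"], 2)
gain_over_ar_spec = round(reported_tps["d2sd"] / reported_tps["ar_speculative"], 2)

print(speedups)
print(gain_over_single, gain_over_ar_spec)
```

The recomputed ratios agree with the table (1.7×, 2.5×, 2.1×), and they put D²SD at roughly 1.44× the single‑diffusion throughput and 1.18× the autoregressive speculative throughput.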

Higher acceptance: By exploring multiple prefixes, D²SD accepts more drafted tokens per verification step (≈55 % vs. ≈38 % for the single‑draft baseline).
Better GPU efficiency: The batched second diffusion pass keeps the accelerator busy, reducing idle time that plagues naive multi‑draft attempts.
Quality preservation: BLEU / ROUGE scores on standard generation benchmarks remain within 0.1 % of the full autoregressive baseline, confirming that speed gains do not sacrifice output fidelity.

Practical Implications

Faster LLM APIs: Cloud providers can serve more requests per GPU by integrating D²SD into their inference pipelines, lowering latency for chat‑bots, code assistants, and real‑time translation services.
Cost reduction: Higher token‑throughput translates directly into lower compute spend per generated token, making large‑scale deployments (e.g., multi‑tenant SaaS) more economical.
Edge deployment: The dual‑diffusion approach is amenable to batch processing on limited hardware (e.g., on‑device accelerators) because it reduces the number of expensive autoregressive passes.
Framework compatibility: D²SD builds on existing diffusion‑based draft models and standard transformer APIs, meaning it can be retro‑fitted into PyTorch/TensorFlow pipelines with modest engineering effort.

Limitations & Future Work

Model overhead: Training two diffusion draft models adds memory and compute cost during the development phase; the paper notes this as a trade‑off.
Confidence calibration: The effectiveness of the prefix tree hinges on reliable per‑token confidence scores; mis‑calibrated scores can lead to sub‑optimal prefix selection.
Scalability of K: While a small K (e.g., 3‑5) works well, larger K values increase verification complexity and may yield diminishing returns.
Future directions: The authors suggest exploring adaptive K selection, tighter integration with quantized target models, and extending the approach to multimodal generation (e.g., text‑to‑image pipelines).

Authors

Liyuan Zhang
Jiarui Zhang
Jinwei Yao
Ran Yan
Yuchen Yang
Jiahao Zhang
Tongkai Yang
Yi Wu
Binhang Yuan

Paper Information

arXiv ID: 2606.04446v1
Categories: cs.DC, cs.LG
Published: June 3, 2026
