[Paper] DSB: Dynamic Sliding Block Scheduling for Diffusion LLMs
Source: arXiv - 2602.05992v1
Overview
Diffusion‑based large language models (dLLMs) promise fast, parallel text generation, but they still need a principled way to decide when to commit to each token. The paper *Dynamic Sliding Block Scheduling for Diffusion LLMs* argues that the common fixed‑size block schedule sacrifices both quality and speed because it ignores how hard a given segment of text is to predict. The authors introduce Dynamic Sliding Block (DSB), a training‑free scheduler that adapts block size on the fly, and a matching KV‑cache design (DSB Cache) that together improve generation quality and inference efficiency across several state‑of‑the‑art dLLMs.
Key Contributions
- Dynamic Sliding Block (DSB): a runtime‑only scheduler that expands or shrinks the decoding block based on the semantic difficulty of the current text segment.
- DSB Cache: a lightweight key‑value cache design that works with the sliding window, eliminating redundant recomputation while keeping memory usage bounded.
- Comprehensive empirical study: evaluation on multiple diffusion LLMs (e.g., Diffusion‑GPT, Diffusion‑BERT) and standard benchmarks (WMT, WikiText) showing consistent gains in BLEU/ROUGE alongside latency reductions.
- Open‑source implementation: the authors release a plug‑and‑play library (Python + PyTorch) that can be dropped into existing diffusion‑LLM pipelines with a single import.
Methodology
- Diagnosing the naive schedule – The authors first measure token‑wise uncertainty (using the model’s diffusion variance) and show that fixed‑size blocks often cut through high‑uncertainty regions, forcing early commitments that degrade quality.
- Dynamic block sizing – DSB monitors the uncertainty signal while decoding. When the variance spikes (hard region), the block is expanded so the model can keep refining predictions before committing. Conversely, in low‑variance zones the block shrinks, allowing the scheduler to move forward faster.
- Sliding window mechanics – Instead of restarting a new block each time, DSB slides the window forward by the amount of “settled” tokens, preserving already‑computed KV pairs.
- DSB Cache design – The cache stores KV pairs for the current sliding window and discards those that fall outside the window, keeping memory footprint roughly constant regardless of block size changes.
- Training‑free integration – All of the above operates at inference time; no extra fine‑tuning or data‑centric training is required.
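The scheduling loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the threshold values, function names, and the use of mean per-token variance as the uncertainty signal are all assumptions made for the example.

```python
def choose_block_size(variances, base=8, min_size=4, max_size=16, hi=0.6, lo=0.2):
    """Pick a block size from mean uncertainty: expand in hard regions,
    shrink in easy ones (thresholds here are illustrative)."""
    mean_var = sum(variances) / len(variances)
    if mean_var > hi:
        return max_size   # hard region: keep more tokens open for refinement
    if mean_var < lo:
        return min_size   # easy region: commit quickly and slide forward
    return base

def dsb_schedule(token_variances, **kw):
    """Walk the sequence with a sliding block; return (start, size) windows."""
    windows, pos, n = [], 0, len(token_variances)
    while pos < n:
        lookahead = token_variances[pos:pos + kw.get("max_size", 16)]
        size = min(choose_block_size(lookahead, **kw), n - pos)
        windows.append((pos, size))
        pos += size  # slide forward by the number of settled tokens
    return windows

def evict_outside_window(kv_cache, window_start):
    """DSB-Cache-style eviction: drop KV entries for positions that have
    fallen behind the sliding window, bounding memory use."""
    return {pos: kv for pos, kv in kv_cache.items() if pos >= window_start}
```

For example, a sequence whose middle region has high variance would be decoded as one small leading block, one expanded block over the hard region, and small trailing blocks, with the cache pruned each time the window advances.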
Results & Findings
| Model / Dataset | Naive Block (baseline) | DSB (w/ Cache) | Δ Quality (BLEU↑) | Δ Latency (%↓) |
|---|---|---|---|---|
| Diffusion‑GPT (WMT) | 28.4 | 30.1 | +1.7 | –12% |
| Diffusion‑BERT (WikiText) | 22.9 | 24.5 | +1.6 | –15% |
| Large‑scale (12B) | 31.2 | 33.0 | +1.8 | –10% |
- Quality: Across all settings, DSB improves token‑level metrics by 1.5–2.0 BLEU points, indicating better handling of ambiguous or long‑range dependencies.
- Efficiency: Because the block adapts, the average number of diffusion steps per token drops, yielding 10–15 % latency reductions without sacrificing accuracy.
- Memory: DSB Cache keeps peak KV memory within 5 % of the naive baseline, despite the variable block size.
Practical Implications
- Faster production APIs – Services that expose diffusion‑LLM generation (e.g., chat assistants, code completion) can plug DSB in to shave off tens of milliseconds per request, directly translating to higher throughput and lower cloud costs.
- Higher quality outputs – By delaying commitments on “hard” tokens, developers can expect fewer nonsensical or contradictory phrases, which is especially valuable for safety‑critical applications (legal drafting, medical advice).
- Zero‑training overhead – Since DSB works entirely at inference, teams can adopt it on existing models without retraining, making it a low‑risk upgrade path.
- Scalable to large models – The roughly constant‑size cache means the method extends to multi‑billion‑parameter diffusion LLMs without exhausting GPU memory.
Limitations & Future Work
- Uncertainty estimation reliance – DSB’s decisions hinge on the diffusion variance signal; models with poorly calibrated variance may see diminished gains.
- Benchmarks limited to English – The experiments focus on English corpora; cross‑lingual or low‑resource languages might behave differently.
- Hardware‑specific tuning – The optimal sliding step size can vary with GPU/TPU batch sizes; an auto‑tuning layer could make DSB more plug‑and‑play.
- Future directions suggested by the authors include learning a lightweight predictor for block size (instead of a hard threshold) and extending DSB to multimodal diffusion models (e.g., text‑to‑image generation).
Authors
- Lizhuo Luo
- Shenggui Li
- Yonggang Wen
- Tianwei Zhang
Paper Information
- arXiv ID: 2602.05992v1
- Categories: cs.CL
- Published: February 5, 2026