[Paper] DSB: Dynamic Sliding Block Scheduling for Diffusion LLMs
Source: arXiv - 2602.05992v1
Overview
Diffusion‑based large language models (dLLMs) promise fast, parallel text generation, but they still need a principled way to decide when to commit to each token. The paper *Dynamic Sliding Block Scheduling for Diffusion LLMs* argues that the common fixed‑size block schedule sacrifices both quality and speed because it ignores how hard a given segment of text is to predict. The authors introduce Dynamic Sliding Block (DSB), a training‑free scheduler that adapts block size on the fly, and a matching KV‑cache design (DSB Cache) that together improve generation quality and inference efficiency across several state‑of‑the‑art dLLMs.
Key Contributions
- Dynamic Sliding Block (DSB): a runtime‑only scheduler that expands or shrinks the decoding block based on the semantic difficulty of the current text segment.
- DSB Cache: a lightweight key‑value cache design that works with the sliding window, eliminating redundant recomputation while keeping memory usage bounded.
- Comprehensive empirical study: evaluation on multiple diffusion LLMs (e.g., Diffusion‑GPT, Diffusion‑BERT) and standard benchmarks (WMT, WikiText) showing consistent gains in BLEU/ROUGE alongside latency reductions.
- Open‑source implementation: the authors release a plug‑and‑play library (Python + PyTorch) that can be dropped into existing diffusion‑LLM pipelines with a single import.
Methodology
- Diagnosing the naive schedule – The authors first measure token‑wise uncertainty (using the model’s diffusion variance) and show that fixed‑size blocks often cut through high‑uncertainty regions, forcing early commitments that degrade quality.
- Dynamic block sizing – DSB monitors the uncertainty signal while decoding. When the variance spikes (hard region), the block is expanded so the model can keep refining predictions before committing. Conversely, in low‑variance zones the block shrinks, allowing the scheduler to move forward faster.
- Sliding window mechanics – Instead of restarting a new block each time, DSB slides the window forward by the amount of “settled” tokens, preserving already‑computed KV pairs.
- DSB Cache design – The cache stores KV pairs for the current sliding window and discards those that fall outside the window, keeping memory footprint roughly constant regardless of block size changes.
- Training‑free integration – All of the above operates at inference time; no extra fine‑tuning or data‑centric training is required.
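The scheduling loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the threshold values, function names, and the use of mean per-token variance as the uncertainty signal are all assumptions made for the example.

```python
def choose_block_size(variances, base=8, min_size=4, max_size=16, hi=0.6, lo=0.2):
    """Pick a block size from mean uncertainty: expand in hard regions,
    shrink in easy ones (thresholds here are illustrative)."""
    mean_var = sum(variances) / len(variances)
    if mean_var > hi:
        return max_size   # hard region: keep more tokens open for refinement
    if mean_var < lo:
        return min_size   # easy region: commit quickly and slide forward
    return base

def dsb_schedule(token_variances, **kw):
    """Walk the sequence with a sliding block; return (start, size) windows."""
    windows, pos, n = [], 0, len(token_variances)
    while pos < n:
        lookahead = token_variances[pos:pos + kw.get("max_size", 16)]
        size = min(choose_block_size(lookahead, **kw), n - pos)
        windows.append((pos, size))
        pos += size  # slide forward by the number of settled tokens
    return windows

def evict_outside_window(kv_cache, window_start):
    """DSB-Cache-style eviction: drop KV entries for positions that have
    fallen behind the sliding window, bounding memory use."""
    return {pos: kv for pos, kv in kv_cache.items() if pos >= window_start}
```

For example, a sequence whose middle region has high variance would be decoded as one small leading block, one expanded block over the hard region, and small trailing blocks, with the cache pruned each time the window advances.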
Results & Findings
| Model / Dataset | Naive Block (baseline) | DSB (w/ Cache) | Δ Quality (BLEU↑) | Δ Latency (%↓) |
|---|---|---|---|---|
| Diffusion‑GPT (WMT) | 28.4 | 30.1 | +1.7 | –12% |
| Diffusion‑BERT (WikiText) | 22.9 | 24.5 | +1.6 | –15% |
| Large‑scale (12B) | 31.2 | 33.0 | +1.8 | –10% |
- Quality: Across all settings, DSB improves token‑level metrics by 1.5–2.0 BLEU points, indicating better handling of ambiguous or long‑range dependencies.
- Efficiency: Because the block adapts, the average number of diffusion steps per token drops, yielding 10–15 % latency reductions without sacrificing accuracy.
- Memory: DSB Cache keeps peak KV memory within 5 % of the naive baseline, despite the variable block size.
Practical Implications
- Faster production APIs – Services that expose diffusion‑LLM generation (e.g., chat assistants, code completion) can plug DSB in to shave off tens of milliseconds per request, directly translating to higher throughput and lower cloud costs.
- Higher quality outputs – By delaying commitments on “hard” tokens, developers can expect fewer nonsensical or contradictory phrases, which is especially valuable for safety‑critical applications (legal drafting, medical advice).
- Zero‑training overhead – Since DSB works entirely at inference, teams can adopt it on existing models without retraining, making it a low‑risk upgrade path.
- Scalable to large models – The roughly constant‑size cache means the method extends to multi‑billion‑parameter diffusion LLMs without exhausting GPU memory.
Limitations & Future Work
- Uncertainty estimation reliance – DSB’s decisions hinge on the diffusion variance signal; models with poorly calibrated variance may see diminished gains.
- Benchmarks limited to English – The experiments focus on English corpora; cross‑lingual or low‑resource languages might behave differently.
- Hardware‑specific tuning – The optimal sliding step size can vary with GPU/TPU batch sizes; an auto‑tuning layer could make DSB more plug‑and‑play.
- Future directions suggested by the authors include learning a lightweight predictor for block size (instead of a hard threshold) and extending DSB to multimodal diffusion models (e.g., text‑to‑image generation).
Authors
- Lizhuo Luo
- Shenggui Li
- Yonggang Wen
- Tianwei Zhang
Paper Information
- arXiv ID: 2602.05992v1
- Categories: cs.CL
- Published: February 5, 2026