[Paper] DAWN: Dependency-Aware Fast Inference for Diffusion LLMs
Source: arXiv - 2602.06953v1
Overview
Diffusion‑based large language models (dLLMs) promise fast, parallel text generation, but current inference tricks sacrifice speed to keep output quality. The paper “DAWN: Dependency‑Aware Fast Inference for Diffusion LLMs” introduces a training‑free decoding strategy that explicitly models token‑level dependencies, enabling far more aggressive parallelism without the usual quality drop.
Key Contributions
- Dependency‑aware decoding algorithm (DAWN) that builds a graph of inter‑token constraints during inference.
- Two principled heuristics:
  - Positions that depend on already‑unmasked (certain) tokens become more reliable.
  - Unmasking strongly coupled uncertain positions together tends to cause errors.
- Training‑free: works with any pre‑trained diffusion LLM out‑of‑the‑box.
- Speed‑up of 1.8×–8.0× over strong baselines (e.g., standard parallel decoding, greedy masking) while keeping BLEU/ROUGE and human‑rated quality essentially unchanged.
- Open‑source implementation (https://github.com/lizhuo-luo/DAWN) for immediate adoption.
Methodology
- Dependency Extraction – During a forward pass, the model’s attention scores are examined to infer which token positions influence each other. These relationships are stored in a lightweight directed graph.
- Reliability Scoring – Each masked position receives a “certainty” score (e.g., low entropy of its token distribution). Positions that depend on high‑certainty tokens inherit higher reliability.
- Iterative Unmasking – At each inference step, DAWN selects a batch of positions to unmask:
  - Prefer positions with high reliability.
  - Avoid unmasking two or more strongly coupled uncertain positions simultaneously (they are kept masked until one becomes certain).
- Parallel Decoding Loop – The selected batch is decoded in parallel, the graph is updated, and the process repeats until all tokens are filled.
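The dependency‑extraction step can be sketched as thresholding an attention matrix into a directed graph. This is a minimal illustration, assuming attention scores have already been aggregated across heads and layers into a single matrix; the function name and threshold value are assumptions, not the paper's exact procedure:

```python
import numpy as np

def build_dependency_graph(attn, threshold=0.1):
    """Turn an attention matrix into a directed dependency graph.

    attn: (seq_len, seq_len) array where attn[i, j] is the attention
          that position i pays to position j (aggregated over heads).
    Returns a dict mapping each position i to the set of positions it
    depends on, i.e. those j with attn[i, j] above the threshold.
    Hypothetical sketch: the paper's actual aggregation may differ.
    """
    seq_len = attn.shape[0]
    graph = {i: set() for i in range(seq_len)}
    for i in range(seq_len):
        for j in range(seq_len):
            if i != j and attn[i, j] > threshold:
                graph[i].add(j)  # edge: i depends on j
    return graph
```

Because the graph only needs a threshold pass over scores the model already computes, this step adds little work on top of the existing forward pass.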
The whole pipeline requires no extra training; it only adds a cheap graph‑construction and selection step on top of the existing diffusion sampling loop.
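The reliability‑scoring and selection steps above can be sketched as a single scoring rule. The entropy‑based certainty measure, the way reliability combines entropy with the fraction of already‑unmasked dependencies, and the threshold `tau` are all illustrative assumptions rather than the paper's exact formulas:

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a token distribution, in nats."""
    p = np.clip(p, 1e-12, 1.0)
    return float(-np.sum(p * np.log(p)))

def select_positions(probs, masked, graph, unmasked, batch_size=4, tau=1.0):
    """Choose which masked positions to unmask this step.

    probs:    dict pos -> predicted token distribution (1-D array).
    masked:   set of still-masked positions.
    graph:    dict pos -> set of positions it depends on (from attention).
    unmasked: set of already-decoded positions.
    tau:      entropy threshold above which a position counts as uncertain.
    Hypothetical scoring: reliability rises as entropy falls and as more
    of a position's dependencies are already unmasked.
    """
    scored = []
    for pos in masked:
        h = entropy(probs[pos])
        deps = graph.get(pos, set())
        certain_frac = len(deps & unmasked) / len(deps) if deps else 1.0
        scored.append((certain_frac - h, h, pos))
    scored.sort(reverse=True)  # highest reliability first (heuristic 1)

    selected = []  # list of (pos, entropy)
    for _, h, pos in scored:
        if len(selected) >= batch_size:
            break
        # Heuristic 2: never unmask two strongly coupled *uncertain*
        # positions in the same step.
        conflict = any(
            h > tau and sh > tau and
            (s in graph.get(pos, set()) or pos in graph.get(s, set()))
            for s, sh in selected
        )
        if not conflict:
            selected.append((pos, h))
    return [p for p, _ in selected]
```

Greedy selection in reliability order means a confident position is never blocked by a less confident one; a coupled uncertain pair simply waits until one member becomes certain, matching the second heuristic.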
Results & Findings
| Model / Dataset | Baseline (parallel) | DAWN Speed‑up | BLEU Δ | Human Rating Δ |
|---|---|---|---|---|
| Diffusion‑GPT (WikiText) | 1.0× | 3.2× | –0.2% | –0.1 |
| Diffusion‑BART (CNN/DailyMail) | 1.0× | 5.6× | –0.1% | –0.05 |
| Diffusion‑T5 (XSum) | 1.0× | 8.0× | –0.3% | –0.08 |
- Quality preservation: Across all benchmarks, the drops in automatic metrics and human preference scores are not statistically significant.
- Scalability: Larger sequence lengths (up to 512 tokens) benefit even more, because the dependency graph can prune many unnecessary parallel steps.
- Ablation: Removing the “avoid coupled uncertain positions” rule cuts speed‑up by ~30% and hurts quality noticeably, confirming the importance of the second heuristic.
Practical Implications
- Faster LLM services: Cloud providers can serve diffusion‑based models with up to 8× faster decoding, reducing latency and compute cost per request.
- Edge deployment: The training‑free nature and modest overhead make DAWN suitable for on‑device inference where memory and power are limited.
- Hybrid pipelines: Existing diffusion LLM APIs can be retro‑fitted with DAWN without retraining, offering an immediate performance boost.
- Better user experience: Real‑time applications (e.g., code completion, chat assistants) that previously avoided diffusion models due to latency can now consider them for their parallel‑generation strengths.
Limitations & Future Work
- Dependency estimation relies on attention patterns, which may be noisy for models that do not use explicit attention (e.g., some transformer‑free diffusion variants).
- Graph construction adds a small constant overhead, which becomes noticeable for very short sequences (< 32 tokens).
- The current heuristics are hand‑crafted; learning a more adaptive policy (e.g., via reinforcement learning) could further improve the trade‑off between speed and quality.
- Extending DAWN to multimodal diffusion models (text‑to‑image, audio) remains an open research direction.
Authors
- Lizhuo Luo
- Zhuoran Shi
- Jiajun Luo
- Zhi Wang
- Shen Ren
- Wenya Wang
- Tianwei Zhang
Paper Information
- arXiv ID: 2602.06953v1
- Categories: cs.CL
- Published: February 6, 2026