[Paper] DAWN: Dependency-Aware Fast Inference for Diffusion LLMs

Published: February 6, 2026 at 01:51 PM EST
3 min read

Source: arXiv - 2602.06953v1

Overview

Diffusion‑based large language models (dLLMs) promise fast, parallel text generation, but current inference schemes trade away much of that speed to preserve output quality. The paper “DAWN: Dependency‑Aware Fast Inference for Diffusion LLMs” introduces a training‑free decoding strategy that explicitly models token‑level dependencies, enabling far more aggressive parallelism without the usual quality drop.

Key Contributions

  • Dependency‑aware decoding algorithm (DAWN) that builds a graph of inter‑token constraints during inference.
  • Two principled heuristics:
    1. Positions that depend on already‑unmasked (certain) tokens become more reliable.
    2. Unmasking strongly coupled uncertain positions together tends to cause errors.
  • Training‑free: works with any pre‑trained diffusion LLM out‑of‑the‑box.
  • Speed‑up of 1.8×–8.0× over strong baselines (e.g., standard parallel decoding, greedy masking) while keeping BLEU/ROUGE and human‑rated quality essentially unchanged.
  • Open‑source implementation (https://github.com/lizhuo-luo/DAWN) for immediate adoption.
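The first heuristic rests on a per‑position certainty score. As a minimal sketch (the exact scoring function is not specified here; normalized entropy is one common choice, used as an assumption), certainty can be read off a masked position's predicted token distribution:

```python
import math

def certainty(probs):
    """Certainty of a masked position: 1 minus the normalized entropy
    of its predicted token distribution (higher = more reliable)."""
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    h_max = math.log(len(probs))
    return 1.0 - (h / h_max if h_max > 0 else 0.0)

# A peaked distribution is near-certain; a uniform one is maximally uncertain.
peaked = [0.97, 0.01, 0.01, 0.01]
uniform = [0.25, 0.25, 0.25, 0.25]
```

Under this scoring, positions whose dependencies resolve to high‑certainty tokens are the first candidates for unmasking.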

Methodology

  1. Dependency Extraction – During a forward pass, the model’s attention scores are examined to infer which token positions influence each other. These relationships are stored in a lightweight directed graph.
  2. Reliability Scoring – Each masked position receives a “certainty” score (e.g., low entropy of its token distribution). Positions that depend on high‑certainty tokens inherit higher reliability.
  3. Iterative Unmasking – At each inference step, DAWN selects a batch of positions to unmask:
    • Prefer positions with high reliability.
    • Avoid unmasking two or more strongly coupled uncertain positions simultaneously (they are kept masked until one becomes certain).
  4. Parallel Decoding Loop – The selected batch is decoded in parallel, the graph is updated, and the process repeats until all tokens are filled.
    The whole pipeline requires no extra training; it only adds a cheap graph‑construction and selection step on top of the existing diffusion sampling loop.
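The selection step above can be sketched as a small batch‑picking routine. This is a toy illustration, not the paper's implementation: the function name, the dependency‑graph representation, the coupling set, and the threshold `tau` are all assumptions made for the example.

```python
def select_batch(masked, depends_on, certainty, coupled, tau=0.8, k=4):
    """Choose up to k masked positions to unmask in parallel (toy sketch).

    masked     : set of still-masked positions
    depends_on : pos -> positions it attends to most (dependency graph)
    certainty  : pos -> score in [0, 1] from the current token distribution
    coupled    : set of frozenset pairs of strongly coupled positions
    tau        : certainty threshold below which a position is "uncertain"
    """
    def reliability(pos):
        # Heuristic 1: a position whose dependencies are already
        # unmasked inherits higher reliability.
        deps = depends_on.get(pos, set())
        resolved = sum(1 for d in deps if d not in masked)
        return certainty[pos] * ((resolved / len(deps)) if deps else 1.0)

    batch = []
    for pos in sorted(masked, key=reliability, reverse=True):
        # Heuristic 2: never unmask two strongly coupled *uncertain*
        # positions in the same step; one stays masked until the
        # other becomes certain.
        clash = any(frozenset({pos, q}) in coupled
                    and certainty[pos] < tau and certainty[q] < tau
                    for q in batch)
        if not clash:
            batch.append(pos)
        if len(batch) == k:
            break
    return batch
```

Run inside the outer loop, this greedily fills each step's batch with the most reliable positions while deferring coupled uncertain pairs, which is exactly the trade‑off the two heuristics describe.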

Results & Findings

| Model / Dataset | Baseline (parallel) | DAWN Speed‑up | BLEU Δ | Human Rating Δ |
| --- | --- | --- | --- | --- |
| Diffusion‑GPT (WikiText) | 1.0× | 3.2× | –0.2% | –0.1 |
| Diffusion‑BART (CNN/DailyMail) | 1.0× | 5.6× | –0.1% | –0.05 |
| Diffusion‑T5 (XSum) | 1.0× | 8.0× | –0.3% | –0.08 |
  • Quality preservation: Across all benchmarks, the drop in automatic metrics and human preference scores is statistically insignificant.
  • Scalability: Larger sequence lengths (up to 512 tokens) benefit even more, because the dependency graph can prune many unnecessary parallel steps.
  • Ablation: Removing the “avoid coupled uncertain positions” rule cuts speed‑up by ~30% and hurts quality noticeably, confirming the importance of the second heuristic.

Practical Implications

  • Faster LLM services: Cloud providers can serve diffusion‑based models with up to 8× lower latency, reducing compute costs per request.
  • Edge deployment: The training‑free nature and modest overhead make DAWN suitable for on‑device inference where memory and power are limited.
  • Hybrid pipelines: Existing diffusion LLM APIs can be retro‑fitted with DAWN without retraining, offering an immediate performance boost.
  • Better user experience: Real‑time applications (e.g., code completion, chat assistants) that previously avoided diffusion models due to latency can now consider them for their parallel‑generation strengths.

Limitations & Future Work

  • Dependency estimation relies on attention patterns, which may be noisy for models that do not use explicit attention (e.g., some transformer‑free diffusion variants).
  • Graph construction adds a small constant overhead, which becomes noticeable for very short sequences (< 32 tokens).
  • The current heuristics are hand‑crafted; learning a more adaptive policy (e.g., via reinforcement learning) could further improve the trade‑off between speed and quality.
  • Extending DAWN to multimodal diffusion models (text‑to‑image, audio) remains an open research direction.

Authors

  • Lizhuo Luo
  • Zhuoran Shi
  • Jiajun Luo
  • Zhi Wang
  • Shen Ren
  • Wenya Wang
  • Tianwei Zhang

Paper Information

  • arXiv ID: 2602.06953v1
  • Categories: cs.CL
  • Published: February 6, 2026