[Paper] DAWN: Dependency-Aware Fast Inference for Diffusion LLMs

Published: February 6, 2026 at 01:51 PM EST
3 min read

Source: arXiv - 2602.06953v1

Overview

Diffusion‑based large language models (dLLMs) promise fast, parallel text generation, but current inference schemes trade away much of that speed to preserve output quality. The paper “DAWN: Dependency‑Aware Fast Inference for Diffusion LLMs” introduces a training‑free decoding strategy that explicitly models token‑level dependencies, enabling far more aggressive parallelism without the usual quality drop.

Key Contributions

  • Dependency‑aware decoding algorithm (DAWN) that builds a graph of inter‑token constraints during inference.
  • Two principled heuristics:
    1. Positions that depend on already‑unmasked (certain) tokens become more reliable.
    2. Unmasking strongly coupled uncertain positions together tends to cause errors.
  • Training‑free: works with any pre‑trained diffusion LLM out‑of‑the‑box.
  • Speed‑up of 1.8×–8.0× over strong baselines (e.g., standard parallel decoding, greedy masking) while keeping BLEU/ROUGE and human‑rated quality essentially unchanged.
  • Open‑source implementation (https://github.com/lizhuo-luo/DAWN) for immediate adoption.
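The first heuristic rests on a per‑position certainty score. As a minimal sketch (the exact scoring function is not specified here; normalized entropy is one common choice, used as an assumption), certainty can be read off a masked position's predicted token distribution:

```python
import math

def certainty(probs):
    """Certainty of a masked position: 1 minus the normalized entropy
    of its predicted token distribution (higher = more reliable)."""
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    h_max = math.log(len(probs))
    return 1.0 - (h / h_max if h_max > 0 else 0.0)

# A peaked distribution is near-certain; a uniform one is maximally uncertain.
peaked = [0.97, 0.01, 0.01, 0.01]
uniform = [0.25, 0.25, 0.25, 0.25]
```

Under this scoring, positions whose dependencies resolve to high‑certainty tokens are the first candidates for unmasking.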

Methodology

  1. Dependency Extraction – During a forward pass, the model’s attention scores are examined to infer which token positions influence each other. These relationships are stored in a lightweight directed graph.
  2. Reliability Scoring – Each masked position receives a “certainty” score (e.g., low entropy of its token distribution). Positions that depend on high‑certainty tokens inherit higher reliability.
  3. Iterative Unmasking – At each inference step, DAWN selects a batch of positions to unmask:
    • Prefer positions with high reliability.
    • Avoid unmasking two or more strongly coupled uncertain positions simultaneously (they are kept masked until one becomes certain).
  4. Parallel Decoding Loop – The selected batch is decoded in parallel, the graph is updated, and the process repeats until all tokens are filled.
    The whole pipeline requires no extra training; it only adds a cheap graph‑construction and selection step on top of the existing diffusion sampling loop.
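The selection step above can be sketched as a small batch‑picking routine. This is a toy illustration, not the paper's implementation: the function name, the dependency‑graph representation, the coupling set, and the threshold `tau` are all assumptions made for the example.

```python
def select_batch(masked, depends_on, certainty, coupled, tau=0.8, k=4):
    """Choose up to k masked positions to unmask in parallel (toy sketch).

    masked     : set of still-masked positions
    depends_on : pos -> positions it attends to most (dependency graph)
    certainty  : pos -> score in [0, 1] from the current token distribution
    coupled    : set of frozenset pairs of strongly coupled positions
    tau        : certainty threshold below which a position is "uncertain"
    """
    def reliability(pos):
        # Heuristic 1: a position whose dependencies are already
        # unmasked inherits higher reliability.
        deps = depends_on.get(pos, set())
        resolved = sum(1 for d in deps if d not in masked)
        return certainty[pos] * ((resolved / len(deps)) if deps else 1.0)

    batch = []
    for pos in sorted(masked, key=reliability, reverse=True):
        # Heuristic 2: never unmask two strongly coupled *uncertain*
        # positions in the same step; one stays masked until the
        # other becomes certain.
        clash = any(frozenset({pos, q}) in coupled
                    and certainty[pos] < tau and certainty[q] < tau
                    for q in batch)
        if not clash:
            batch.append(pos)
        if len(batch) == k:
            break
    return batch
```

Run inside the outer loop, this greedily fills each step's batch with the most reliable positions while deferring coupled uncertain pairs, which is exactly the trade‑off the two heuristics describe.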

Results & Findings

| Model / Dataset | Baseline (parallel) | DAWN Speed‑up | BLEU Δ | Human Rating Δ |
| --- | --- | --- | --- | --- |
| Diffusion‑GPT (WikiText) | 1.0× | 3.2× | –0.2% | –0.1 |
| Diffusion‑BART (CNN/DailyMail) | 1.0× | 5.6× | –0.1% | –0.05 |
| Diffusion‑T5 (XSum) | 1.0× | 8.0× | –0.3% | –0.08 |
  • Quality preservation: Across all benchmarks, the drop in automatic metrics and human preference scores is statistically insignificant.
  • Scalability: Larger sequence lengths (up to 512 tokens) benefit even more, because the dependency graph can prune many unnecessary parallel steps.
  • Ablation: Removing the “avoid coupled uncertain positions” rule cuts speed‑up by ~30% and hurts quality noticeably, confirming the importance of the second heuristic.

Practical Implications

  • Faster LLM services: Cloud providers can serve diffusion‑based models with up to 8× lower latency, reducing compute costs per request.
  • Edge deployment: The training‑free nature and modest overhead make DAWN suitable for on‑device inference where memory and power are limited.
  • Hybrid pipelines: Existing diffusion LLM APIs can be retro‑fitted with DAWN without retraining, offering an immediate performance boost.
  • Better user experience: Real‑time applications (e.g., code completion, chat assistants) that previously avoided diffusion models due to latency can now consider them for their parallel‑generation strengths.

Limitations & Future Work

  • Dependency estimation relies on attention patterns, which may be noisy for models that do not use explicit attention (e.g., some transformer‑free diffusion variants).
  • Graph construction adds a small constant overhead, which becomes noticeable for very short sequences (< 32 tokens).
  • The current heuristics are hand‑crafted; learning a more adaptive policy (e.g., via reinforcement learning) could further improve the trade‑off between speed and quality.
  • Extending DAWN to multimodal diffusion models (text‑to‑image, audio) remains an open research direction.

Authors

  • Lizhuo Luo
  • Zhuoran Shi
  • Jiajun Luo
  • Zhi Wang
  • Shen Ren
  • Wenya Wang
  • Tianwei Zhang

Paper Information

  • arXiv ID: 2602.06953v1
  • Categories: cs.CL
  • Published: February 6, 2026