[Paper] Optimizing Decoding Paths in Masked Diffusion Models by Quantifying Uncertainty
Source: arXiv - 2512.21336v1
Overview
Masked Diffusion Models (MDMs) have emerged as a powerful alternative to traditional autoregressive generators, enabling fast, non‑sequential generation of text, code, and plans. However, the order in which masked tokens are “unmasked” (the decoding path) can dramatically affect the final output quality. This paper formalizes that problem, introduces a new uncertainty metric called Denoising Entropy, and shows how to steer the decoding process toward higher‑quality results.
Key Contributions
- Formalization of decoding‑path sensitivity – Demonstrates that output variance in MDMs stems from the cumulative predictive uncertainty along the chosen generation trajectory.
- Denoising Entropy metric – A tractable, model‑internal measure that quantifies the uncertainty of each denoising step.
- Two entropy‑guided algorithms:
  - Post‑hoc path selection – evaluates multiple sampled paths after generation and picks the one with the lowest total entropy.
  - Real‑time guidance – dynamically chooses the next mask to fill based on the current entropy landscape, steering the process on‑the‑fly.
- Empirical validation – Consistent gains across a suite of challenging benchmarks (reasoning, planning, code synthesis), often surpassing strong autoregressive baselines.
- Open‑source tooling – The authors release code for computing Denoising Entropy and integrating the guidance strategies into existing MDM pipelines.
Methodology
Quantifying Uncertainty
- At each diffusion step the model predicts a distribution over possible token values for the currently masked positions.
- The Denoising Entropy is simply the Shannon entropy of that distribution, summed over all masked tokens. Lower entropy indicates the model is more confident about the next denoising move.
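As a concrete illustration, the per-step entropy can be computed directly from the model's output logits. The sketch below assumes a NumPy array of logits with shape `(seq_len, vocab_size)` and a boolean array marking the still-masked positions; this array layout is an assumption for illustration, not the authors' released API.

```python
import numpy as np

def denoising_entropy(logits: np.ndarray, masked: np.ndarray) -> float:
    """Shannon entropy of the predictive distribution, summed over
    all currently masked positions.

    logits: (seq_len, vocab_size) raw scores from the denoiser
    masked: (seq_len,) boolean, True where the token is still masked
    """
    # Numerically stable softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    # Per-position Shannon entropy in nats; epsilon guards log(0).
    per_position = -(probs * np.log(probs + 1e-12)).sum(axis=-1)
    # Sum only over positions that are still masked.
    return float(per_position[masked].sum())
```

For a uniform prediction over a vocabulary of size V, each masked position contributes log V nats; as the model's prediction sharpens toward a single token, the contribution approaches zero.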
Path Optimization Strategies
- Post‑hoc selection: Run the MDM multiple times with different random masking orders, compute the total entropy for each full trajectory, and keep the trajectory with the smallest sum. This is cheap to parallelize and requires no changes to the model itself.
- Real‑time guidance: While generating, evaluate the entropy that would result from unmasking each candidate token next. Pick the token (or small group of tokens) that yields the lowest immediate entropy, then proceed. This turns the decoding process into a greedy, uncertainty‑driven search.
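The two strategies above can be sketched compactly. The code below assumes a hypothetical `predict(tokens, masked)` callable that returns logits of shape `(seq_len, vocab_size)`; this interface and the helper names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def token_entropies(logits: np.ndarray) -> np.ndarray:
    """Per-position Shannon entropy of a (seq_len, vocab) logit array."""
    z = logits - logits.max(axis=-1, keepdims=True)
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

def greedy_entropy_decode(predict, seq_len):
    """Real-time guidance: repeatedly unmask the position whose
    predictive distribution currently has the lowest entropy."""
    masked = np.ones(seq_len, dtype=bool)
    tokens = np.full(seq_len, -1)
    total_entropy = 0.0
    while masked.any():
        logits = predict(tokens, masked)           # (seq_len, vocab)
        ents = token_entropies(logits)
        ents[~masked] = np.inf                     # only masked slots compete
        pos = int(np.argmin(ents))                 # most confident position
        tokens[pos] = int(np.argmax(logits[pos]))  # commit its likeliest token
        total_entropy += float(ents[pos])
        masked[pos] = False
    return tokens, total_entropy

def post_hoc_select(decode_once, num_paths):
    """Post-hoc selection: sample several full trajectories and keep
    the one with the smallest cumulative Denoising Entropy."""
    runs = [decode_once(seed) for seed in range(num_paths)]
    return min(runs, key=lambda run: run[1])
```

Because each trajectory in `post_hoc_select` is independent, the N runs can be dispatched across devices in parallel and compared only at the end, matching the paper's observation that post-hoc selection is cheap to parallelize.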
Evaluation Protocol
- Benchmarks: GSM8K (math reasoning), MiniWoB (interactive planning), HumanEval (code generation).
- Metrics: pass@k for code generation, success rate for planning, and exact‑match accuracy for reasoning.
- Baselines: Standard MDM with random decoding order, and strong autoregressive transformers (e.g., GPT‑Neo, CodeGen).
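For reference, pass@k on HumanEval is conventionally computed with the unbiased estimator introduced alongside that benchmark: given n samples of which c pass, pass@k = 1 − C(n−c, k)/C(n, k). Whether this paper follows the same convention is an assumption; the estimator itself is standard.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples
    drawn (without replacement) from n generations, c of which are
    correct, passes."""
    if n - c < k:  # fewer than k failures exist: a hit is guaranteed
        return 1.0
    # 1 - C(n-c, k)/C(n, k), evaluated as a stable running product.
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))
```

Note that pass@1 reduces to the plain pass rate c/n, so the table's pass@1 numbers can be read as single-sample success rates.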
Results & Findings
| Benchmark | Standard MDM | Entropy‑Guided (post‑hoc) | Entropy‑Guided (real‑time) | Autoregressive Baseline |
|---|---|---|---|---|
| GSM8K (accuracy) | 71.2 % | 78.5 % | 77.9 % | 73.4 % |
| MiniWoB (success) | 58.1 % | 66.3 % | 65.8 % | 62.0 % |
| HumanEval (pass@1) | 24.7 % | 31.4 % | 30.9 % | 28.5 % |
- Both entropy‑guided methods consistently outperform the vanilla MDM: roughly 7–8 percentage points on the reasoning and planning tasks, and about 7 points on code synthesis.
- Real‑time guidance comes within a percentage point of post‑hoc selection while requiring only a single forward pass per token, making it practical for production settings.
- Ablation studies confirm that the gains are primarily due to the entropy‑driven ordering rather than extra compute.
Practical Implications
- Higher‑quality non‑autoregressive generation: Developers can now deploy MDM‑based services (e.g., code autocomplete, plan synthesis) that retain the speed advantage of parallel decoding without sacrificing output fidelity.
- Plug‑and‑play improvement: Since the entropy metric is derived from the model’s own logits, existing MDM checkpoints can be upgraded with minimal engineering effort—just add the entropy computation and the greedy selector.
- Resource‑efficient sampling: The post‑hoc method leverages parallel hardware to explore multiple decoding orders simultaneously, offering a “best‑of‑N” strategy that scales with GPU count.
- Uncertainty‑aware debugging: Denoising Entropy can be visualized to pinpoint steps where the model is unsure, helping engineers diagnose failure modes in generated text or code.
- Broader AI safety angle: By steering generation away from high‑entropy (i.e., uncertain) regions, the approach may reduce hallucinations or unsafe outputs in downstream applications.
Limitations & Future Work
- Computational overhead: Real‑time guidance adds a modest per‑step cost (entropy evaluation for each candidate mask). In extremely high‑throughput settings this could offset some of the parallelism gains.
- Greedy nature: The current guidance is locally optimal; more sophisticated search (e.g., beam‑search over entropy) might capture better global trajectories but would increase complexity.
- Domain specificity: Experiments focus on reasoning, planning, and code; it remains to be seen how entropy‑guided decoding performs on open‑ended text generation (e.g., story writing).
- Theoretical guarantees: While entropy correlates with quality empirically, a formal link between Denoising Entropy and downstream task metrics is still an open research question.
Future work could explore hybrid strategies that combine entropy guidance with learned policies, extend the metric to multimodal diffusion models, and integrate it into reinforcement‑learning‑based fine‑tuning pipelines.
Authors
- Ziyu Chen
- Xinbei Jiang
- Peng Sun
- Tao Lin
Paper Information
- arXiv ID: 2512.21336v1
- Categories: cs.CL, cs.AI, cs.LG
- Published: December 24, 2025
- PDF: Download PDF