[Paper] Scaling Beyond Masked Diffusion Language Models
Source: arXiv - 2602.15014v1
Overview
The paper “Scaling Beyond Masked Diffusion Language Models” investigates how different discrete diffusion strategies, especially the widely used masked diffusion, behave when scaled up to billions of parameters. By systematically measuring perplexity, compute cost, and sampling speed, the authors show that perplexity alone can be a misleading comparison metric across diffusion families, and that alternative diffusion schemes can be more efficient and even outperform masked diffusion on certain downstream tasks.
Key Contributions
- First large‑scale empirical study of uniform‑state and interpolating diffusion methods alongside Masked diffusion, covering models up to 1.7 B parameters.
- Demonstrates a ~12 % FLOPs‑efficiency gain for Masked diffusion when trained with a simple cross‑entropy loss instead of the usual denoising objective.
- Shows that perplexity is not a universal proxy for generation quality across diffusion families; speed‑quality trade‑offs matter more in practice.
- Finds that uniform‑state diffusion matches or exceeds Masked diffusion on likelihood benchmarks and outperforms both Masked and autoregressive models on GSM8K (a math‑reasoning benchmark), despite higher validation perplexity.
- Releases code, pretrained checkpoints, and tutorial videos, enabling the community to reproduce and extend the work.
Methodology
- Model Families – The authors train three families of discrete diffusion language models:
- Masked diffusion (the current de facto standard).
- Uniform‑state diffusion, which treats each token as a uniform random state during diffusion.
- Interpolating diffusion, which blends masked and uniform steps.
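The masked and uniform-state corruption processes can be contrasted with a minimal sketch. This is illustrative only: the `mask_id` argument, the vocabulary size, and corrupting each token independently with probability `t` are assumptions, not the paper's exact parameterization.

```python
import numpy as np

def corrupt(tokens, t, vocab_size, mask_id, mode="masked", rng=None):
    """Corrupt a batch of token ids at noise level t in [0, 1].

    mode="masked":  each token is replaced by mask_id with probability t.
    mode="uniform": each token is resampled uniformly over the vocabulary
                    with probability t (uniform-state diffusion).
    """
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.random(tokens.shape) < t  # which positions get corrupted
    if mode == "masked":
        replacement = np.full_like(tokens, mask_id)
    else:  # uniform-state
        replacement = rng.integers(0, vocab_size, size=tokens.shape)
    return np.where(noise, replacement, tokens)
```

An interpolating scheme would mix the two replacement distributions rather than choosing one or the other.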
- Scaling Regime – For each family they train models at several sizes (≈125 M, 350 M, and 1.7 B parameters) while keeping the training compute budget comparable across families.
- Training Objective – Instead of the traditional denoising loss, they experiment with a plain cross‑entropy loss that predicts the original token directly from the noisy input. This simple change yields the reported FLOPs savings for Masked diffusion.
- Evaluation Suite –
- Perplexity on standard language‑modeling test sets (e.g., WikiText‑103, C4).
- Sampling speed measured in tokens per second on a single GPU.
- Downstream task performance on GSM8K (grade‑school math) and other reasoning benchmarks.
- Pareto analysis to visualize the trade‑off between generation quality (perplexity or task accuracy) and computational cost.
- Analysis – The authors fit scaling laws (log‑log relationships between model size, compute, and performance) for each diffusion family, enabling extrapolation to larger models.
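A scaling-law fit of this kind is a linear regression in log-log space. The sketch below uses synthetic numbers, not the paper's data, and assumes the common power-law form loss ≈ a · size^(−b):

```python
import numpy as np

def fit_scaling_law(sizes, losses):
    """Fit loss ≈ a * size^(-b) via linear regression in log-log space.

    Returns (a, b): the intercept and the scaling exponent.
    """
    slope, intercept = np.polyfit(np.log(sizes), np.log(losses), 1)
    return np.exp(intercept), -slope
```

Comparing families then reduces to comparing the fitted exponents (slope of improvement) and intercepts (performance at a given compute level), which is how the paper distinguishes "similar exponent, higher intercept" regimes.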
Results & Findings
| Metric | Masked Diffusion | Uniform‑State Diffusion | Interpolating Diffusion |
|---|---|---|---|
| Perplexity (validation) | Best within its family | Slightly worse than Masked | Between the two |
| FLOPs per training step | Baseline (~12 % lower with cross‑entropy loss) | Comparable | Comparable |
| Sampling speed (tokens/s) | ~1.0× (baseline) | ~1.4× faster | ~1.2× faster |
| GSM8K accuracy | 71 % | 78 % (top) | 74 % |
| Scaling exponent (size → perf.) | Consistent with prior diffusion work | Similar exponent, but higher intercept (better low‑compute regime) | Intermediate |
Interpretation
- Perplexity alone is insufficient: Uniform‑state diffusion has higher perplexity but generates faster and solves more math problems.
- Cross‑entropy training cuts compute without hurting quality, suggesting that the denoising objective is over‑engineered for Masked diffusion.
- Pareto frontiers show that for a given compute budget, uniform‑state diffusion often dominates Masked diffusion, especially when fast sampling is required.
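Computing such a Pareto frontier takes only a few lines. The (cost, accuracy) points below are hypothetical, chosen just to exercise the function:

```python
def pareto_frontier(points):
    """Return the points not dominated on (cost: lower is better,
    accuracy: higher is better).

    points: list of (cost, accuracy) tuples.
    A point dominates another if it has <= cost and >= accuracy,
    with at least one strict inequality.
    """
    frontier = []
    # Ascending cost; for cost ties, highest accuracy first so that
    # dominated duplicates are skipped.
    for cost, acc in sorted(points, key=lambda p: (p[0], -p[1])):
        if not frontier or acc > frontier[-1][1]:
            frontier.append((cost, acc))
    return frontier
```

Reporting this frontier, rather than a single perplexity number, is exactly the benchmarking practice the paper advocates.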
Practical Implications
- Faster Generation for Production: Developers building chatbots, code assistants, or real‑time translation services can consider uniform‑state diffusion for its faster sampling (~1.4× in the paper's measurements) while staying within the same compute budget.
- Cost‑Effective Model Scaling: The 12 % FLOPs reduction means lower cloud‑training bills, making large diffusion models more accessible for startups and research labs.
- Task‑Specific Model Choice: For reasoning‑heavy workloads (e.g., math tutoring, data‑analysis assistants), uniform‑state diffusion may yield higher downstream accuracy even if perplexity looks worse.
- Simplified Training Pipelines: Switching to a plain cross‑entropy loss removes the need for complex noise‑schedule engineering, easing integration with existing deep‑learning frameworks (PyTorch, JAX).
- Benchmarking Guidance: The paper encourages the community to report speed‑quality Pareto curves rather than a single perplexity number when comparing diffusion families.
Limitations & Future Work
- Evaluation Scope: The study focuses on English language data and a limited set of downstream tasks (primarily GSM8K). Generalization to multilingual or domain‑specific corpora remains open.
- Sampling Algorithms: While the authors use a basic reverse‑diffusion sampler, more sophisticated samplers (e.g., adaptive step‑size, classifier‑guided) could further shift the speed‑quality frontier.
- Model Size Ceiling: Experiments stop at 1.7 B parameters; it is unclear whether the observed trends hold for 10 B+ models where memory and parallelism constraints dominate.
- Theoretical Understanding: The paper empirically shows perplexity’s limits across families but does not provide a formal analysis of why uniform‑state diffusion yields better downstream reasoning. Future work could explore the connection between diffusion noise patterns and reasoning capabilities.
All code, pretrained checkpoints, and tutorial videos are publicly available at the project page: http://s-sahoo.github.io/scaling-dllms.
Authors
- Subham Sekhar Sahoo
- Jean‑Marie Lemercier
- Zhihan Yang
- Justin Deschenaux
- Jingyu Liu
- John Thickstun
- Ante Jukic
Paper Information
- arXiv ID: 2602.15014v1
- Categories: cs.LG, cs.CL
- Published: February 16, 2026