[Paper] Scaling Behavior of Discrete Diffusion Language Models
Source: arXiv - 2512.10858v1
Overview
This paper investigates how Discrete Diffusion Language Models (DLMs) scale compared to the dominant autoregressive language models (ALMs). By systematically varying the diffusion noise (from masked to uniform) and tuning key hyper‑parameters, the authors uncover distinct scaling regimes that could make diffusion‑based models more compute‑ or data‑efficient in real‑world settings.
Key Contributions
- Comprehensive scaling study of DLMs across a spectrum of diffusion noises (masked ↔ uniform).
- Identification of noise‑dependent scaling laws: uniform diffusion favors parameter‑rich, data‑light regimes, while masked diffusion behaves oppositely.
- Empirical validation of the predicted laws by training a 10‑billion‑parameter uniform diffusion model with a budget of $\sim 10^{22}$ FLOPs, the largest publicly reported uniform diffusion LM.
- Practical guidance on batch‑size and learning‑rate schedules for diffusion LMs, filling a gap left by prior work.
- Open‑source release of training scripts and checkpoints, enabling reproducibility and community extensions.
Methodology
- Model family – The authors use the same transformer backbone for all experiments, swapping only the diffusion objective (masked, uniform, or interpolations).
- Noise interpolation – A scalar $\alpha$ smoothly blends masked and uniform corruption, allowing a continuous sweep of diffusion types (a corruption sketch follows this list).
- Training regimes – Two primary axes are explored:
  - Compute‑bound: fixed FLOP budget, varying model size and data volume.
  - Data‑bound: fixed dataset size, scaling up parameters and compute.
- Hyper‑parameter sweeps – Systematic grid searches over batch size (256 to 8192) and learning‑rate schedules (linear warm‑up + cosine decay) to isolate their impact on scaling curves (a schedule sketch follows this list).
- Metrics – Standard cross‑entropy loss on a held‑out validation set, plus downstream zero‑shot tasks (e.g., cloze, QA) for qualitative sanity checks.
- Scaling law fitting – Power‑law fits of the form $L = A \cdot C^{-\beta} + B$, where $C$ is compute, are performed separately for each noise type (a fitting sketch follows this list).
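To make the interpolation concrete, here is a minimal corruption sketch in PyTorch. The function `corrupt`, its signature, and the way $\alpha$ routes each corrupted position to a mask token versus a uniform resample are illustrative assumptions; the paper's exact parameterization may differ.

```python
import torch

def corrupt(tokens, t, alpha, mask_id, vocab_size):
    """Corrupt an integer token tensor at noise level t in [0, 1].

    alpha = 0 recovers purely masked diffusion (corrupted positions become
    mask_id); alpha = 1 recovers purely uniform diffusion (corrupted
    positions are resampled uniformly over the vocabulary). Hypothetical
    parameterization for illustration only.
    """
    corrupt_pos = torch.rand(tokens.shape, device=tokens.device) < t
    use_uniform = torch.rand(tokens.shape, device=tokens.device) < alpha
    replacements = torch.where(
        use_uniform,
        torch.randint_like(tokens, vocab_size),   # uniform resample
        torch.full_like(tokens, mask_id),         # mask token
    )
    return torch.where(corrupt_pos, replacements, tokens)
```

At $\alpha = 0.3$, for instance, roughly 30% of the corrupted positions are uniformly resampled and the rest are masked, giving an intermediate diffusion type between the two extremes.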
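The batch‑size and learning‑rate axes can be made concrete with a small schedule helper: linear warm‑up into cosine decay, with the peak learning rate scaled linearly in batch size. `base_lr` and `ref_batch` below are hypothetical reference values, not the paper's tuned settings.

```python
import math

def lr_at_step(step, total_steps, warmup_steps,
               base_lr=3e-4, batch_size=2048, ref_batch=256):
    """Linear warm-up followed by cosine decay to zero.

    The peak learning rate scales linearly with batch size relative to a
    reference batch (the classic linear scaling rule). All reference
    values here are hypothetical, not the paper's tuned settings.
    """
    peak_lr = base_lr * batch_size / ref_batch            # linear scaling rule
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)      # linear warm-up
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))  # cosine decay
```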
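Fits of this form can be reproduced with SciPy. The compute/loss points below are hypothetical stand‑ins, not the paper's measurements; with real data, fitting in log‑compute space often improves conditioning.

```python
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(C, A, beta, B):
    """L = A * C^(-beta) + B, with C the training compute in FLOPs."""
    return A * np.power(C, -beta) + B

# Hypothetical (compute, loss) measurements; substitute real runs.
C = np.array([1e18, 1e19, 1e20, 1e21, 1e22])
L = np.array([3.90, 3.42, 3.07, 2.82, 2.65])

(A, beta, B), _ = curve_fit(scaling_law, C, L, p0=(100.0, 0.06, 2.0), maxfev=20000)
print(f"fitted: L = {A:.3g} * C^(-{beta:.3g}) + {B:.3g}")
```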
Results & Findings
| Noise type | Compute‑bound scaling (loss) | Data‑bound scaling (loss) | Parameter‑efficiency | Data‑efficiency |
|---|---|---|---|---|
| Masked | Steeper decline with more data; plateaus earlier with compute | Needs more data to reach low loss | Favors smaller models when data is abundant | Less favorable in data‑scarce regimes |
| Uniform | Flatter curve; similar asymptotic loss across sizes | Better loss with fewer data points, given enough parameters | Benefits from larger models even with limited data | More data‑efficient in compute‑constrained settings |
| Interpolated (mid‑range) | Behaves between the two extremes | Shows transitional behavior; no clear advantage over extremes | — | — |
- The 10B uniform diffusion model achieved a validation loss within 2% of the best‑performing ALM of comparable size, confirming that the predicted scaling law holds at the billion‑parameter scale.
- Uniform diffusion models required ~30% fewer training tokens to hit the same loss as masked diffusion under identical compute budgets.
- Batch‑size scaling followed the classic “linear scaling rule” (peak learning rate proportional to batch size) up to ≈4096, after which diminishing returns appeared, especially for masked diffusion.
Practical Implications
- Compute‑constrained startups can opt for uniform diffusion LMs: invest in larger models but train on smaller curated datasets, reducing data acquisition costs.
- Edge‑device fine‑tuning: Since uniform diffusion is more data‑efficient, developers can adapt a pre‑trained 10B diffusion model with a modest on‑device dataset, potentially yielding better sample efficiency than fine‑tuning an autoregressive counterpart.
- Training pipelines: The paper’s batch‑size and learning‑rate recommendations can be directly plugged into existing transformer training scripts (e.g., DeepSpeed, Megatron‑LM) to accelerate diffusion‑LM experiments.
- Research tooling: Open‑source checkpoints enable benchmarking diffusion models on downstream tasks (code generation, summarization) without the massive compute overhead of training from scratch.
- Hybrid architectures: The smooth interpolation between noise types suggests a new design space in which a model could dynamically switch diffusion regimes based on available compute or data, offering adaptive efficiency (a minimal schedule sketch follows this list).
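As a minimal sketch of such regime switching, the interpolation parameter $\alpha$ could be annealed from masked ($\alpha = 0$) toward uniform ($\alpha = 1$) over training; the linear ramp below is a hypothetical choice, not a schedule evaluated in the paper.

```python
def alpha_schedule(step, total_steps, start=0.0, end=1.0):
    """Hypothetical linear curriculum over the noise-interpolation
    parameter: begin with masked corruption (alpha = 0) and anneal
    toward uniform corruption (alpha = 1) as training progresses."""
    frac = min(1.0, step / max(1, total_steps))
    return start + (end - start) * frac
```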
Limitations & Future Work
- Task coverage – Evaluation is limited to language modeling loss and a few zero‑shot benchmarks; more extensive downstream task suites (e.g., reasoning, coding) are needed to gauge real‑world utility.
- Hardware diversity – Experiments were run on NVIDIA A100 GPUs; scaling behavior on TPUs or newer GPU architectures may differ.
- Energy considerations – While FLOPs are reported, actual energy consumption and carbon impact were not measured.
- Theoretical grounding – The observed noise‑dependent scaling laws are empirically derived; a deeper theoretical explanation (e.g., information‑theoretic analysis) remains open.
- Hybrid diffusion – Future work could explore adaptive or curriculum‑based noise schedules that transition from masked to uniform diffusion during training, potentially combining the strengths of both regimes.
Authors
- Dimitri von Rütte
- Janis Fluri
- Omead Pooladzandi
- Bernhard Schölkopf
- Thomas Hofmann
- Antonio Orvieto
Paper Information
- arXiv ID: 2512.10858v1
- Categories: cs.LG
- Published: December 11, 2025
- PDF: https://arxiv.org/pdf/2512.10858v1