[Paper] Scaling Behavior of Discrete Diffusion Language Models

Published: December 11, 2025 at 12:54 PM EST
4 min read
Source: arXiv


Overview

This paper investigates how Discrete Diffusion Language Models (DLMs) scale compared to the dominant autoregressive language models (ALMs). By systematically varying the diffusion noise (from masked to uniform) and tuning key hyper‑parameters, the authors uncover distinct scaling regimes that could make diffusion‑based models more compute‑ or data‑efficient in real‑world settings.

Key Contributions

  • Comprehensive scaling study of DLMs across a spectrum of diffusion noises (masked ↔ uniform).
  • Identification of noise‑dependent scaling laws: uniform diffusion favors parameter‑rich, data‑light regimes, while masked diffusion behaves oppositely.
  • Empirical validation of the predicted laws by training a 10‑billion‑parameter uniform diffusion model on ~10^22 FLOPs, the largest publicly reported uniform diffusion LM.
  • Practical guidance on batch‑size and learning‑rate schedules for diffusion LMs, filling a gap left by prior work.
  • Open‑source release of training scripts and checkpoints, enabling reproducibility and community extensions.

Methodology

  1. Model family – The authors use the same transformer backbone for all experiments, swapping only the diffusion objective (masked, uniform, or interpolations).
  2. Noise interpolation – A scalar α smoothly blends masked and uniform corruption, allowing a continuous sweep of diffusion types (a minimal corruption sketch follows this list).
  3. Training regimes – Two primary axes are explored:
    • Compute‑bound: fixed FLOP budget, varying model size and data volume.
    • Data‑bound: fixed dataset size, scaling up parameters and compute.
  4. Hyper‑parameter sweeps – Systematic grid searches over batch size (from 256 to 8192) and learning‑rate schedules (linear warm‑up + cosine decay) to isolate their impact on scaling curves.
  5. Metrics – Standard cross‑entropy loss on a held‑out validation set, plus downstream zero‑shot tasks (e.g., cloze, QA) for qualitative sanity checks.
  6. Scaling law fitting – Power‑law fits of the form L = A · C^(-β) + B, where C is compute, are performed separately for each noise type (a fitting sketch follows the results table below).
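
To make the noise‑interpolation idea concrete, here is a minimal sketch of an α‑interpolated forward corruption process: each token is independently corrupted with a probability set by the diffusion time, and a corrupted position becomes either a uniformly random token (with probability α) or the [MASK] token (with probability 1 − α). The function name and parameterization are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def corrupt_tokens(tokens, t, alpha, vocab_size, mask_id):
    """Alpha-interpolated corruption sketch (illustrative, not the paper's exact process).

    tokens:     (batch, seq) integer token ids
    t:          diffusion time in [0, 1]; higher t corrupts more positions
    alpha:      0.0 -> purely masked diffusion, 1.0 -> purely uniform diffusion
    vocab_size: number of ordinary tokens (uniform replacements are drawn from these)
    mask_id:    id of the [MASK] token
    """
    corrupt = torch.rand(tokens.shape, device=tokens.device) < t        # which positions get corrupted
    uniform = torch.rand(tokens.shape, device=tokens.device) < alpha    # uniform vs. mask replacement
    random_ids = torch.randint_like(tokens, vocab_size)                 # uniformly random replacement tokens
    replacement = torch.where(uniform, random_ids, torch.full_like(tokens, mask_id))
    return torch.where(corrupt, replacement, tokens)

# alpha = 0 recovers masked diffusion, alpha = 1 recovers uniform diffusion.
x = torch.randint(0, 32000, (2, 16))
x_noisy = corrupt_tokens(x, t=0.5, alpha=0.3, vocab_size=32000, mask_id=32000)
```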

Results & Findings

| Noise type | Compute‑bound scaling (loss) | Data‑bound scaling (loss) | Parameter‑efficiency | Data‑efficiency |
|---|---|---|---|---|
| Masked | Steeper decline with more data; plateaus earlier with compute | Needs more data to reach low loss | Favors smaller models when data is abundant | Less favorable in data‑scarce regimes |
| Uniform | Flatter curve; similar asymptotic loss across sizes | Better loss with fewer data points, given enough parameters | Benefits from larger models even with limited data | More data‑efficient in compute‑constrained settings |
| Interpolated (mid‑range) | Behaves between the two extremes | Shows transitional behavior; no clear advantage over extremes | | |
  • The 10B uniform diffusion model achieved a validation loss within 2 % of the best‑performing ALM of comparable size, confirming that the predicted scaling law holds at the billion‑parameter scale.
  • Uniform diffusion models required ~30 % fewer training tokens to hit the same loss as masked diffusion under identical compute budgets.
  • Batch size scaling followed the classic “linear scaling rule” up to a point (≈4096), after which diminishing returns appeared, especially for masked diffusion.
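
For reference, fitting the power‑law form from the methodology can be done with a standard least‑squares routine. The sketch below uses scipy and made‑up (compute, loss) points purely to show the mechanics; the fitted constants have no connection to the paper's actual results.

```python
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(c, A, beta, B):
    """L = A * c**(-beta) + B, with c = compute normalized by a reference budget."""
    return A * np.power(c, -beta) + B

# Hypothetical (compute, loss) points for illustration only.
compute = np.array([1e19, 3e19, 1e20, 3e20, 1e21])   # FLOPs
loss = np.array([3.10, 2.95, 2.82, 2.74, 2.68])

c = compute / compute.min()                            # normalize compute for numerical stability
(A, beta, B), _ = curve_fit(scaling_law, c, loss, p0=(0.5, 0.3, 2.5), maxfev=10_000)
print(f"A={A:.3f}, beta={beta:.3f}, B={B:.3f}")
```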

Practical Implications

  • Compute‑constrained startups can opt for uniform diffusion LMs: invest in larger models but train on smaller curated datasets, reducing data acquisition costs.
  • Edge‑device fine‑tuning: Since uniform diffusion tolerates less data, developers can adapt a pre‑trained 10B diffusion model with a modest on‑device dataset, potentially yielding better sample efficiency than fine‑tuning an autoregressive counterpart.
  • Training pipelines: The paper’s batch‑size and learning‑rate recommendations can be directly plugged into existing transformer training scripts (e.g., DeepSpeed, Megatron‑LM) to accelerate diffusion‑LM experiments; a minimal schedule sketch follows this list.
  • Research tooling: Open‑source checkpoints enable benchmarking diffusion models on downstream tasks (code generation, summarization) without the massive compute overhead of training from scratch.
  • Hybrid architectures: The smooth interpolation between noise types suggests a new design space where a model could dynamically switch diffusion regimes based on available compute or data, offering adaptive efficiency.
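
As a rough illustration of the kind of schedule the paper sweeps over, here is a linear warm‑up plus cosine decay learning‑rate function. The step counts and peak value below are placeholders, not the paper's recommended settings.

```python
import math

def lr_at_step(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    """Linear warm-up to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps                          # linear warm-up
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)   # 0 -> 1 over the decay phase
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Placeholder settings: a 100k-step run with 2k warm-up steps and a 3e-4 peak.
lrs = [lr_at_step(s, 100_000, 2_000, peak_lr=3e-4) for s in (0, 1_000, 2_000, 50_000, 100_000)]
```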

Limitations & Future Work

  • Task coverage – Evaluation is limited to language modeling loss and a few zero‑shot benchmarks; more extensive downstream task suites (e.g., reasoning, coding) are needed to gauge real‑world utility.
  • Hardware diversity – Experiments were run on NVIDIA A100 GPUs; scaling behavior on TPUs or newer GPU architectures may differ.
  • Energy considerations – While FLOPs are reported, actual energy consumption and carbon impact were not measured.
  • Theoretical grounding – The observed noise‑dependent scaling laws are empirically derived; a deeper theoretical explanation (e.g., information‑theoretic analysis) remains open.
  • Hybrid diffusion – Future work could explore adaptive or curriculum‑based noise schedules that transition from masked to uniform diffusion during training, potentially combining the strengths of both regimes.

Authors

  • Dimitri von Rütte
  • Janis Fluri
  • Omead Pooladzandi
  • Bernhard Schölkopf
  • Thomas Hofmann
  • Antonio Orvieto

Paper Information

  • arXiv ID: 2512.10858v1
  • Categories: cs.LG
  • Published: December 11, 2025