[Paper] Scaling Behavior of Discrete Diffusion Language Models
Source: arXiv - 2512.10858v1
Overview
This paper investigates how Discrete Diffusion Language Models (DLMs) scale compared to the dominant autoregressive language models (ALMs). By systematically varying the diffusion noise (from masked to uniform) and tuning key hyper‑parameters, the authors uncover distinct scaling regimes that could make diffusion‑based models more compute‑ or data‑efficient in real‑world settings.
Key Contributions
- Comprehensive scaling study of DLMs across a spectrum of diffusion noises (masked ↔ uniform).
- Identification of noise‑dependent scaling laws: uniform diffusion favors parameter‑rich, data‑light regimes, while masked diffusion behaves oppositely.
- Empirical validation of the predicted laws by training a 10‑billion‑parameter uniform diffusion model with a budget of $\sim 10^{22}$ FLOPs, the largest publicly reported uniform diffusion LM.
- Practical guidance on batch‑size and learning‑rate schedules for diffusion LMs, filling a gap left by prior work.
- Open‑source release of training scripts and checkpoints, enabling reproducibility and community extensions.
Methodology
- Model family – The authors use the same transformer backbone for all experiments, swapping only the diffusion objective (masked, uniform, or interpolations).
- Noise interpolation – A scalar $\alpha$ smoothly blends masked and uniform corruption, allowing a continuous sweep of diffusion types (a corruption sketch follows this list).
- Training regimes – Two primary axes are explored:
  - Compute‑bound: fixed FLOP budget, varying model size and data volume.
  - Data‑bound: fixed dataset size, scaling up parameters and compute.
- Hyper‑parameter sweeps – Systematic grid searches over batch size (256 to 8192) and learning‑rate schedules (linear warm‑up + cosine decay) to isolate their impact on scaling curves (a schedule sketch follows this list).
- Metrics – Standard cross‑entropy loss on a held‑out validation set, plus downstream zero‑shot tasks (e.g., cloze, QA) for qualitative sanity checks.
- Scaling law fitting – Power‑law fits of the form $L = A \cdot C^{-\beta} + B$, where $C$ is compute, are performed separately for each noise type (a fitting sketch follows this list).
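To make the interpolation concrete, here is a minimal corruption sketch in PyTorch. The function `corrupt`, its signature, and the way $\alpha$ routes each corrupted position to a mask token versus a uniform resample are illustrative assumptions; the paper's exact parameterization may differ.

```python
import torch

def corrupt(tokens, t, alpha, mask_id, vocab_size):
    """Corrupt an integer token tensor at noise level t in [0, 1].

    alpha = 0 recovers purely masked diffusion (corrupted positions become
    mask_id); alpha = 1 recovers purely uniform diffusion (corrupted
    positions are resampled uniformly over the vocabulary). Hypothetical
    parameterization for illustration only.
    """
    corrupt_pos = torch.rand(tokens.shape, device=tokens.device) < t
    use_uniform = torch.rand(tokens.shape, device=tokens.device) < alpha
    replacements = torch.where(
        use_uniform,
        torch.randint_like(tokens, vocab_size),   # uniform resample
        torch.full_like(tokens, mask_id),         # mask token
    )
    return torch.where(corrupt_pos, replacements, tokens)
```

At $\alpha = 0.3$, for instance, roughly 30% of the corrupted positions are uniformly resampled and the rest are masked, giving an intermediate diffusion type between the two extremes.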
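The batch‑size and learning‑rate axes can be made concrete with a small schedule helper: linear warm‑up into cosine decay, with the peak learning rate scaled linearly in batch size. `base_lr` and `ref_batch` below are hypothetical reference values, not the paper's tuned settings.

```python
import math

def lr_at_step(step, total_steps, warmup_steps,
               base_lr=3e-4, batch_size=2048, ref_batch=256):
    """Linear warm-up followed by cosine decay to zero.

    The peak learning rate scales linearly with batch size relative to a
    reference batch (the classic linear scaling rule). All reference
    values here are hypothetical, not the paper's tuned settings.
    """
    peak_lr = base_lr * batch_size / ref_batch            # linear scaling rule
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)      # linear warm-up
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))  # cosine decay
```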
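Fits of this form can be reproduced with SciPy. The compute/loss points below are hypothetical stand‑ins, not the paper's measurements; with real data, fitting in log‑compute space often improves conditioning.

```python
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(C, A, beta, B):
    """L = A * C^(-beta) + B, with C the training compute in FLOPs."""
    return A * np.power(C, -beta) + B

# Hypothetical (compute, loss) measurements; substitute real runs.
C = np.array([1e18, 1e19, 1e20, 1e21, 1e22])
L = np.array([3.90, 3.42, 3.07, 2.82, 2.65])

(A, beta, B), _ = curve_fit(scaling_law, C, L, p0=(100.0, 0.06, 2.0), maxfev=20000)
print(f"fitted: L = {A:.3g} * C^(-{beta:.3g}) + {B:.3g}")
```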
Results & Findings
| Noise type | Compute‑bound scaling (loss) | Data‑bound scaling (loss) | Parameter‑efficiency | Data‑efficiency |
|---|---|---|---|---|
| Masked | Steeper decline with more data; plateaus earlier with compute | Needs more data to reach low loss | Favors smaller models when data is abundant | Less favorable in data‑scarce regimes |
| Uniform | Flatter curve; similar asymptotic loss across sizes | Better loss with fewer data points, given enough parameters | Benefits from larger models even with limited data | More data‑efficient in compute‑constrained settings |
| Interpolated (mid‑range) | Behaves between the two extremes | Shows transitional behavior; no clear advantage over extremes | — | — |
- The 10B uniform diffusion model achieved a validation loss within 2% of the best‑performing ALM of comparable size, confirming that the predicted scaling law holds at the billion‑parameter scale.
- Uniform diffusion models required ~30% fewer training tokens to hit the same loss as masked diffusion under identical compute budgets.
- Batch‑size scaling followed the classic “linear scaling rule” (peak learning rate proportional to batch size) up to ≈4096, after which diminishing returns appeared, especially for masked diffusion.
Practical Implications
- Compute‑constrained startups can opt for uniform diffusion LMs: invest in larger models but train on smaller curated datasets, reducing data acquisition costs.
- Edge‑device fine‑tuning: Since uniform diffusion is more data‑efficient, developers can adapt a pre‑trained 10B diffusion model with a modest on‑device dataset, potentially yielding better sample efficiency than fine‑tuning an autoregressive counterpart.
- Training pipelines: The paper’s batch‑size and learning‑rate recommendations can be directly plugged into existing transformer training scripts (e.g., DeepSpeed, Megatron‑LM) to accelerate diffusion‑LM experiments.
- Research tooling: Open‑source checkpoints enable benchmarking diffusion models on downstream tasks (code generation, summarization) without the massive compute overhead of training from scratch.
- Hybrid architectures: The smooth interpolation between noise types suggests a new design space in which a model could dynamically switch diffusion regimes based on available compute or data, offering adaptive efficiency (a minimal schedule sketch follows this list).
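As a minimal sketch of such regime switching, the interpolation parameter $\alpha$ could be annealed from masked ($\alpha = 0$) toward uniform ($\alpha = 1$) over training; the linear ramp below is a hypothetical choice, not a schedule evaluated in the paper.

```python
def alpha_schedule(step, total_steps, start=0.0, end=1.0):
    """Hypothetical linear curriculum over the noise-interpolation
    parameter: begin with masked corruption (alpha = 0) and anneal
    toward uniform corruption (alpha = 1) as training progresses."""
    frac = min(1.0, step / max(1, total_steps))
    return start + (end - start) * frac
```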
Limitations & Future Work
- Task coverage – Evaluation is limited to language modeling loss and a few zero‑shot benchmarks; more extensive downstream task suites (e.g., reasoning, coding) are needed to gauge real‑world utility.
- Hardware diversity – Experiments were run on NVIDIA A100 GPUs; scaling behavior on TPUs or newer GPU architectures may differ.
- Energy considerations – While FLOPs are reported, actual energy consumption and carbon impact were not measured.
- Theoretical grounding – The observed noise‑dependent scaling laws are empirically derived; a deeper theoretical explanation (e.g., information‑theoretic analysis) remains open.
- Hybrid diffusion – Future work could explore adaptive or curriculum‑based noise schedules that transition from masked to uniform diffusion during training, potentially combining the strengths of both regimes.
Authors
- Dimitri von Rütte
- Janis Fluri
- Omead Pooladzandi
- Bernhard Schölkopf
- Thomas Hofmann
- Antonio Orvieto
Paper Information
- arXiv ID: 2512.10858v1
- Categories: cs.LG
- Published: December 11, 2025
- PDF: https://arxiv.org/pdf/2512.10858v1