[Paper] On the origin of neural scaling laws: from random graphs to natural language

Published: January 15, 2026 at 01:46 PM EST
4 min read
Source: arXiv - 2601.10684v1

Overview

The paper investigates why neural networks, and transformer language models in particular, exhibit predictable scaling laws: performance improves smoothly as data, compute, or parameters increase. By stripping language down to its barest form (random walks on graphs and simplified generative models), the authors show that scaling behavior emerges even when the data lack the heavy-tailed, power-law structure often credited as its cause. The work bridges theoretical physics (random graphs) and practical AI, offering insight into how and when scaling laws can be trusted.

Key Contributions

  • Demonstrated scaling without power‑law data: Showed that transformer models trained on random‑walk bigrams from Erdős‑Rényi and Barabási‑Albert graphs still obey neural scaling laws.
  • Systematic complexity sweep: Trained transformers on a hierarchy of language generators (4‑layer → 2‑layer → 1‑layer → bigram models) and observed a monotonic change in scaling exponents.
  • Reproduced classic language‑model scaling with tiny models: Achieved comparable scaling curves using 2‑layer transformers with a context length of 50 tokens, dramatically reducing the compute needed for experimental validation.
  • Critical review of fitting practices: Highlighted pitfalls in common curve‑fitting methods and proposed a more robust way to extract compute‑optimal trade‑offs.
  • Preliminary evidence for maximal update parameterization (μP): Suggested that μP can be more parameter‑efficient than the standard parameterization used in most scaling studies.

Methodology

  1. Synthetic Graph Experiments

    • Generated random graphs from two ensembles:
      • Erdős‑Rényi (ER): edges placed uniformly at random.
      • Barabási‑Albert (BA): preferential attachment yielding scale‑free degree distributions.
    • Ran random walks on these graphs and recorded consecutive node pairs (bigrams) as training sequences.
    • Trained transformer models of varying depth/width on these bigram streams, sweeping model size, dataset size, and compute budget (a data-generation sketch follows at the end of this list).
  2. Language Complexity Ladder

    • Built a cascade of generative language models:
      • Full‑scale 4‑layer transformer LM → 2‑layer LM → 1‑layer LM → simple bigram model.
    • Sampled sequences from each generator and trained a fixed 2‑layer transformer on them, again varying data and model scale.
  3. Scaling Curve Extraction

    • Measured validation loss (cross‑entropy) across a grid of (N, D, C): number of parameters (N), training tokens (D), and compute (FLOPs, C).

    • Fit power‑law relationships of the form

      \[ L \approx A \cdot N^{-\alpha} + B \cdot D^{-\beta} + E \cdot C^{-\gamma}, \]

      where A, B, and E are fitted constants, testing multiple regression techniques and Bayesian model comparison (see the fitting sketch in the Results & Findings section).

  4. Compute‑Optimal Analysis

    • Compared the classic “Pareto‑optimal” curves (the frontier along which loss cannot be reduced without spending more compute) with an alternative derived from the fitted exponents, showing where prior literature may have over‑ or under‑estimated optimal budgets.
  5. Parameterization Test (μP vs. Standard)

    • Re‑trained a subset of experiments using maximal update parameterization, tracking how quickly loss improves per parameter added.
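
The sketch below illustrates the kind of synthetic-data pipeline described in step 1: sample Erdős-Rényi and Barabási-Albert graphs, run uniform random walks, and record consecutive node pairs as bigram training sequences. Graph sizes, walk lengths, and helper names are illustrative assumptions, not the paper's exact settings.

```python
# Sketch of the random-walk bigram data generation (illustrative settings).
import networkx as nx
import numpy as np

rng = np.random.default_rng(0)

def random_walk(graph, length, rng):
    """Uniform random walk over an undirected graph, as a list of node ids."""
    node = rng.choice(list(graph.nodes))
    walk = [node]
    for _ in range(length - 1):
        neighbors = list(graph.neighbors(node))
        if not neighbors:            # isolated node: stop the walk early
            break
        node = neighbors[rng.integers(len(neighbors))]
        walk.append(node)
    return walk

def bigram_stream(graph, n_walks, walk_len, rng):
    """Consecutive node pairs (bigrams) collected from many walks."""
    pairs = []
    for _ in range(n_walks):
        w = random_walk(graph, walk_len, rng)
        pairs.extend(zip(w[:-1], w[1:]))
    return np.array(pairs)

# The two graph ensembles used in the paper.
er = nx.erdos_renyi_graph(n=1000, p=0.01, seed=0)    # uniform edge probability
ba = nx.barabasi_albert_graph(n=1000, m=5, seed=0)   # preferential attachment

er_bigrams = bigram_stream(er, n_walks=200, walk_len=50, rng=rng)
ba_bigrams = bigram_stream(ba, n_walks=200, walk_len=50, rng=rng)
print(er_bigrams.shape, ba_bigrams.shape)            # (num_pairs, 2) each
```

The node ids double as the token vocabulary, so the resulting pair stream can be fed to a small transformer trained with next-token prediction, sweeping model size, dataset size, and compute as described above.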

Results & Findings

| Experiment | Scaling Law Observed | Exponent Trend | Notable Insight |
| --- | --- | --- | --- |
| Random walks on ER graphs | \(L \propto N^{-0.31} \cdot D^{-0.27} \cdot C^{-0.22}\) | Exponents stable across graph densities | Scaling emerges despite completely uniform edge probabilities. |
| Random walks on BA graphs | Similar power-law form, slightly steeper exponents (≈ -0.35) | Reflects higher structural heterogeneity | Even scale-free topology does not change the qualitative law. |
| Language-complexity ladder | Exponents gradually increase from bigram (≈ -0.20) to 4-layer LM (≈ -0.33) | Monotonic relationship between data complexity and scaling strength | Suggests scaling exponents encode intrinsic data "richness." |
| Tiny 2-layer transformer (context = 50) | Replicates classic LM scaling curves within 5% error | Demonstrates that large-scale experiments can be approximated with modest resources | Enables rapid prototyping of scaling hypotheses. |
| μP vs. standard parameterization | μP achieves comparable loss with ~30% fewer parameters | Parameter-efficiency gain | Points to a practical re-parameterization for future scaling studies. |

Overall, the authors confirm that neural scaling laws are a robust emergent phenomenon, not merely a by‑product of power‑law data statistics.
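
As a concrete illustration of the curve-fitting step in the Methodology section, the sketch below fits the three-term power law \(L \approx A \cdot N^{-\alpha} + B \cdot D^{-\beta} + E \cdot C^{-\gamma}\) to a synthetic loss grid with scipy. The grid values, constants, and initial guesses are placeholders; the paper's regression comparisons and Bayesian model selection are not reproduced here.

```python
# Sketch: extract scaling exponents by least-squares fitting (placeholder data).
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(X, A, alpha, B, beta, E, gamma):
    N, D, C = X
    return A * N**(-alpha) + B * D**(-beta) + E * C**(-gamma)

# Synthetic (N, D, C) grid standing in for the measured sweep.
N, D, C = np.meshgrid(np.logspace(6, 8, 5),      # parameters
                      np.logspace(8, 10, 5),     # training tokens
                      np.logspace(15, 19, 5))    # FLOPs
N, D, C = N.ravel(), D.ravel(), C.ravel()

true = (400.0, 0.31, 600.0, 0.27, 2e4, 0.22)     # hypothetical ground truth
loss = scaling_law((N, D, C), *true)
loss += 0.01 * loss * np.random.default_rng(0).standard_normal(loss.size)

popt, _ = curve_fit(scaling_law, (N, D, C), loss,
                    p0=[300, 0.3, 500, 0.3, 1e4, 0.2], maxfev=50000)
print(dict(zip(["A", "alpha", "B", "beta", "E", "gamma"], popt.round(3))))
```

In practice the fitted exponents should be checked across several initializations and held-out loss measurements, which is the kind of robustness concern raised by the paper's critique of common curve-fitting practices.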

Practical Implications

  • Rapid Scaling Experiments: Developers can now test scaling hypotheses on cheap 2‑layer models with short contexts, saving compute budgets while still obtaining reliable exponent estimates.
  • Resource Allocation Planning: The refined compute‑optimal curves give clearer guidance on whether to invest in more data, larger models, or faster hardware for a given performance target (see the allocation sketch after this list).
  • Model Design Choices: The evidence that maximal update parameterization yields better parameter efficiency suggests a low‑overhead switch for training pipelines, especially in research labs with limited GPU memory.
  • Benchmarking Simplified Tasks: Random‑walk bigram tasks provide a lightweight sandbox for debugging scaling‑related bugs (e.g., learning‑rate schedules, optimizer stability) before scaling up to full‑language corpora.
  • Interpretability of Scaling Exponents: Since exponents correlate with data complexity, monitoring how they shift when adding new data domains (code, multimodal text, etc.) could serve as an early indicator of diminishing returns.
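
To make the resource-allocation point concrete, here is a minimal sketch of a compute-optimal split between parameters and tokens, assuming the common C ≈ 6·N·D FLOPs approximation and a two-term loss L(N, D) = A·N^-alpha + B·D^-beta; the constants are hypothetical placeholders, not the paper's fitted values or its proposed extraction method. Setting the derivative of A·N^-alpha + B·(C/(6N))^-beta to zero at fixed compute gives N* proportional to C^(beta/(alpha+beta)), which the closed form below implements.

```python
# Sketch: compute-optimal N/D split under C ~ 6*N*D (placeholder constants).
import numpy as np

A, alpha = 400.0, 0.31    # hypothetical parameter-scaling term
B, beta = 600.0, 0.27     # hypothetical data-scaling term

def optimal_allocation(C):
    """Analytic minimizer of A*N**-alpha + B*(C/(6*N))**-beta at fixed compute C."""
    k = (alpha * A / (beta * B * 6**beta)) ** (1.0 / (alpha + beta))
    N_opt = k * C ** (beta / (alpha + beta))
    D_opt = C / (6.0 * N_opt)
    return N_opt, D_opt

for C in [1e18, 1e20, 1e22]:
    N_opt, D_opt = optimal_allocation(C)
    loss = A * N_opt**-alpha + B * D_opt**-beta
    print(f"C={C:.0e}: N*={N_opt:.2e}, D*={D_opt:.2e}, L*={loss:.3f}")
```

How far such closed forms can be trusted depends on the exponents staying constant across scales, which is exactly the assumption flagged under Limitations below.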

Limitations & Future Work

  • Synthetic vs. Real‑World Data: While random walks and simplified language models capture essential dynamics, they omit many linguistic phenomena (syntax, long‑range dependencies) that may affect scaling at larger model sizes.
  • Model Architecture Scope: The study focuses on vanilla transformers; it remains open how architectural tweaks (e.g., retrieval‑augmented models, sparsity) alter the observed laws.
  • Compute‑Optimal Derivation Assumptions: The alternative optimal curves rely on fitted exponents staying constant across orders of magnitude—a hypothesis that needs validation on truly massive models.
  • μP Generalization: Preliminary results are promising, but broader experiments across diverse tasks (vision, reinforcement learning) are required to confirm the universality of maximal update parameterization.

The authors propose extending the graph‑based framework to heterogeneous graphs (e.g., knowledge graphs) and exploring multimodal scaling as next steps.

Authors

  • Maissam Barkeshli
  • Alberto Alfarano
  • Andrey Gromov

Paper Information

  • arXiv ID: 2601.10684v1
  • Categories: cs.LG, cond-mat.dis-nn, cs.AI, stat.ML
  • Published: January 15, 2026