[Paper] On the origin of neural scaling laws: from random graphs to natural language

Published: January 15, 2026 at 01:46 PM EST
4 min read
Source: arXiv - 2601.10684v1

Overview

The paper investigates why neural networks, and transformer language models in particular, exhibit predictable scaling laws: performance improves smoothly as data, compute, or parameters increase. By stripping language down to its barest form (random walks on graphs and simplified generative models), the authors show that scaling behavior emerges even when the data lack the heavy-tailed, power-law structure often credited as its cause. The work bridges theoretical physics (random graphs) and practical AI, offering insight into how and when scaling laws can be trusted.

Key Contributions

  • Demonstrated scaling without power‑law data: Showed that transformer models trained on random‑walk bigrams from Erdős‑Rényi and Barabási‑Albert graphs still obey neural scaling laws.
  • Systematic complexity sweep: Trained transformers on a hierarchy of language generators (4‑layer → 2‑layer → 1‑layer → bigram models) and observed a monotonic change in scaling exponents.
  • Reproduced classic language‑model scaling with tiny models: Achieved comparable scaling curves using 2‑layer transformers with a context length of 50 tokens, dramatically reducing the compute needed for experimental validation.
  • Critical review of fitting practices: Highlighted pitfalls in common curve‑fitting methods and proposed a more robust way to extract compute‑optimal trade‑offs.
  • Preliminary evidence for maximal update parameterization (μP): Suggested that μP can be more parameter‑efficient than the standard parameterization used in most scaling studies.

Methodology

  1. Synthetic Graph Experiments

    • Generated random graphs from two ensembles:
      • Erdős‑Rényi (ER): edges placed uniformly at random.
      • Barabási‑Albert (BA): preferential attachment yielding scale‑free degree distributions.
    • Ran random walks on these graphs and recorded consecutive node pairs (bigrams) as training sequences.
    • Trained transformer models of varying depth/width on these bigram streams, sweeping model size, dataset size, and compute budget (a data-generation sketch follows at the end of this list).
  2. Language Complexity Ladder

    • Built a cascade of generative language models:
      • Full‑scale 4‑layer transformer LM → 2‑layer LM → 1‑layer LM → simple bigram model.
    • Sampled sequences from each generator and trained a fixed 2‑layer transformer on them, again varying data and model scale.
  3. Scaling Curve Extraction

    • Measured validation loss (cross‑entropy) across a grid of (N, D, C): number of parameters (N), training tokens (D), and compute (FLOPs, C).

    • Fit power‑law relationships of the form

      \[ L \approx A \cdot N^{-\alpha} + B \cdot D^{-\beta} + E \cdot C^{-\gamma}, \]

      where A, B, and E are fitted constants, testing multiple regression techniques and Bayesian model comparison (see the fitting sketch in the Results & Findings section).

  4. Compute‑Optimal Analysis

    • Compared the classic “Pareto‑optimal” curves (the frontier along which loss cannot be reduced without spending more compute) with an alternative derived from the fitted exponents, showing where prior literature may have over‑ or under‑estimated optimal budgets.
  5. Parameterization Test (μP vs. Standard)

    • Re‑trained a subset of experiments using maximal update parameterization, tracking how quickly loss improves per parameter added.
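
The sketch below illustrates the kind of synthetic-data pipeline described in step 1: sample Erdős-Rényi and Barabási-Albert graphs, run uniform random walks, and record consecutive node pairs as bigram training sequences. Graph sizes, walk lengths, and helper names are illustrative assumptions, not the paper's exact settings.

```python
# Sketch of the random-walk bigram data generation (illustrative settings).
import networkx as nx
import numpy as np

rng = np.random.default_rng(0)

def random_walk(graph, length, rng):
    """Uniform random walk over an undirected graph, as a list of node ids."""
    node = rng.choice(list(graph.nodes))
    walk = [node]
    for _ in range(length - 1):
        neighbors = list(graph.neighbors(node))
        if not neighbors:            # isolated node: stop the walk early
            break
        node = neighbors[rng.integers(len(neighbors))]
        walk.append(node)
    return walk

def bigram_stream(graph, n_walks, walk_len, rng):
    """Consecutive node pairs (bigrams) collected from many walks."""
    pairs = []
    for _ in range(n_walks):
        w = random_walk(graph, walk_len, rng)
        pairs.extend(zip(w[:-1], w[1:]))
    return np.array(pairs)

# The two graph ensembles used in the paper.
er = nx.erdos_renyi_graph(n=1000, p=0.01, seed=0)    # uniform edge probability
ba = nx.barabasi_albert_graph(n=1000, m=5, seed=0)   # preferential attachment

er_bigrams = bigram_stream(er, n_walks=200, walk_len=50, rng=rng)
ba_bigrams = bigram_stream(ba, n_walks=200, walk_len=50, rng=rng)
print(er_bigrams.shape, ba_bigrams.shape)            # (num_pairs, 2) each
```

The node ids double as the token vocabulary, so the resulting pair stream can be fed to a small transformer trained with next-token prediction, sweeping model size, dataset size, and compute as described above.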

Results & Findings

| Experiment | Scaling Law Observed | Exponent Trend | Notable Insight |
| --- | --- | --- | --- |
| Random walks on ER graphs | \(L \propto N^{-0.31} \cdot D^{-0.27} \cdot C^{-0.22}\) | Exponents stable across graph densities | Scaling emerges despite completely uniform edge probabilities. |
| Random walks on BA graphs | Similar power-law form, slightly steeper exponents (≈ -0.35) | Reflects higher structural heterogeneity | Even scale-free topology does not change the qualitative law. |
| Language-complexity ladder | Exponents gradually increase from bigram (≈ -0.20) to 4-layer LM (≈ -0.33) | Monotonic relationship between data complexity and scaling strength | Suggests scaling exponents encode intrinsic data "richness." |
| Tiny 2-layer transformer (context = 50) | Replicates classic LM scaling curves within 5% error | Demonstrates that large-scale experiments can be approximated with modest resources | Enables rapid prototyping of scaling hypotheses. |
| μP vs. standard parameterization | μP achieves comparable loss with ~30% fewer parameters | Parameter-efficiency gain | Points to a practical re-parameterization for future scaling studies. |

Overall, the authors confirm that neural scaling laws are a robust emergent phenomenon, not merely a by‑product of power‑law data statistics.
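
As a concrete illustration of the curve-fitting step in the Methodology section, the sketch below fits the three-term power law \(L \approx A \cdot N^{-\alpha} + B \cdot D^{-\beta} + E \cdot C^{-\gamma}\) to a synthetic loss grid with scipy. The grid values, constants, and initial guesses are placeholders; the paper's regression comparisons and Bayesian model selection are not reproduced here.

```python
# Sketch: extract scaling exponents by least-squares fitting (placeholder data).
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(X, A, alpha, B, beta, E, gamma):
    N, D, C = X
    return A * N**(-alpha) + B * D**(-beta) + E * C**(-gamma)

# Synthetic (N, D, C) grid standing in for the measured sweep.
N, D, C = np.meshgrid(np.logspace(6, 8, 5),      # parameters
                      np.logspace(8, 10, 5),     # training tokens
                      np.logspace(15, 19, 5))    # FLOPs
N, D, C = N.ravel(), D.ravel(), C.ravel()

true = (400.0, 0.31, 600.0, 0.27, 2e4, 0.22)     # hypothetical ground truth
loss = scaling_law((N, D, C), *true)
loss += 0.01 * loss * np.random.default_rng(0).standard_normal(loss.size)

popt, _ = curve_fit(scaling_law, (N, D, C), loss,
                    p0=[300, 0.3, 500, 0.3, 1e4, 0.2], maxfev=50000)
print(dict(zip(["A", "alpha", "B", "beta", "E", "gamma"], popt.round(3))))
```

In practice the fitted exponents should be checked across several initializations and held-out loss measurements, which is the kind of robustness concern raised by the paper's critique of common curve-fitting practices.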

Practical Implications

  • Rapid Scaling Experiments: Developers can now test scaling hypotheses on cheap 2‑layer models with short contexts, saving compute budgets while still obtaining reliable exponent estimates.
  • Resource Allocation Planning: The refined compute‑optimal curves give clearer guidance on whether to invest in more data, larger models, or faster hardware for a given performance target (see the allocation sketch after this list).
  • Model Design Choices: The evidence that maximal update parameterization yields better parameter efficiency suggests a low‑overhead switch for training pipelines, especially in research labs with limited GPU memory.
  • Benchmarking Simplified Tasks: Random‑walk bigram tasks provide a lightweight sandbox for debugging scaling‑related bugs (e.g., learning‑rate schedules, optimizer stability) before scaling up to full‑language corpora.
  • Interpretability of Scaling Exponents: Since exponents correlate with data complexity, monitoring how they shift when adding new data domains (code, multimodal text, etc.) could serve as an early indicator of diminishing returns.
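
To make the resource-allocation point concrete, here is a minimal sketch of a compute-optimal split between parameters and tokens, assuming the common C ≈ 6·N·D FLOPs approximation and a two-term loss L(N, D) = A·N^-alpha + B·D^-beta; the constants are hypothetical placeholders, not the paper's fitted values or its proposed extraction method. Setting the derivative of A·N^-alpha + B·(C/(6N))^-beta to zero at fixed compute gives N* proportional to C^(beta/(alpha+beta)), which the closed form below implements.

```python
# Sketch: compute-optimal N/D split under C ~ 6*N*D (placeholder constants).
import numpy as np

A, alpha = 400.0, 0.31    # hypothetical parameter-scaling term
B, beta = 600.0, 0.27     # hypothetical data-scaling term

def optimal_allocation(C):
    """Analytic minimizer of A*N**-alpha + B*(C/(6*N))**-beta at fixed compute C."""
    k = (alpha * A / (beta * B * 6**beta)) ** (1.0 / (alpha + beta))
    N_opt = k * C ** (beta / (alpha + beta))
    D_opt = C / (6.0 * N_opt)
    return N_opt, D_opt

for C in [1e18, 1e20, 1e22]:
    N_opt, D_opt = optimal_allocation(C)
    loss = A * N_opt**-alpha + B * D_opt**-beta
    print(f"C={C:.0e}: N*={N_opt:.2e}, D*={D_opt:.2e}, L*={loss:.3f}")
```

How far such closed forms can be trusted depends on the exponents staying constant across scales, which is exactly the assumption flagged under Limitations below.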

Limitations & Future Work

  • Synthetic vs. Real‑World Data: While random walks and simplified language models capture essential dynamics, they omit many linguistic phenomena (syntax, long‑range dependencies) that may affect scaling at larger model sizes.
  • Model Architecture Scope: The study focuses on vanilla transformers; it remains open how architectural tweaks (e.g., retrieval‑augmented models, sparsity) alter the observed laws.
  • Compute‑Optimal Derivation Assumptions: The alternative optimal curves rely on fitted exponents staying constant across orders of magnitude—a hypothesis that needs validation on truly massive models.
  • μP Generalization: Preliminary results are promising, but broader experiments across diverse tasks (vision, reinforcement learning) are required to confirm the universality of maximal update parameterization.

The authors propose extending the graph‑based framework to heterogeneous graphs (e.g., knowledge graphs) and exploring multimodal scaling as next steps.

Authors

  • Maissam Barkeshli
  • Alberto Alfarano
  • Andrey Gromov

Paper Information

  • arXiv ID: 2601.10684v1
  • Categories: cs.LG, cond-mat.dis-nn, cs.AI, stat.ML
  • Published: January 15, 2026