[Paper] Universality of high-dimensional scaling limits of stochastic gradient descent
Source: arXiv - 2512.13634v1
Overview
The paper investigates why stochastic gradient descent (SGD) behaves so predictably on high‑dimensional learning problems, even when the data distribution deviates from the classic Gaussian assumption. By proving that the limiting ordinary differential equations (ODEs) governing SGD’s macroscopic dynamics are universal across a broad class of data models, the authors give developers a solid theoretical foundation for the robustness they often observe in practice.
Key Contributions
- Universality theorem: Shows that the ODE limits of SGD’s summary statistics hold for any data drawn from product‑measure mixtures whose first two moments match those of an isotropic Gaussian, provided the initialization and ground‑truth vectors are sufficiently “delocalized” across coordinates.
- Broad applicability: Covers common tasks such as classification with cross‑entropy loss on one‑ and two‑layer neural nets, and learning single‑ and multi‑index models with shallow networks.
- Non‑universality counter‑examples: Demonstrates that if the initialization aligns with coordinate axes, the ODE limit can change, and that stochastic fluctuations (the SDE limits) are not universal.
- Rigorous high‑dimensional scaling: Provides a mathematically precise regime where dimension → ∞, sample size → ∞, and learning rate → 0 at compatible rates, yielding deterministic ODE dynamics.
Methodology
- Problem setup – The loss depends only on the projection of data onto a low‑dimensional subspace spanned by the model parameters and a few “ground‑truth” vectors. This abstraction captures many neural‑network training scenarios.
- Data model – Instead of assuming a Gaussian mixture, the authors consider product‑measure mixtures (e.g., independent coordinates with arbitrary marginal distributions) that share the same mean and covariance as the Gaussian case.
- Delocalized initialization – They require the initial weight vectors to have their mass spread over many coordinates (no single coordinate dominates). This mimics common random initializations (e.g., i.i.d. Gaussian or uniform).
- Mean‑field scaling – As the ambient dimension d and the number of samples n grow proportionally, and the step size η shrinks like 1/d, the evolution of a finite set of summary statistics (inner products between weights and ground‑truth vectors) can be tracked; a minimal simulation sketch follows this list.
- Convergence to ODE – Using martingale techniques and concentration inequalities, they prove that the stochastic updates converge in probability to the solution of an autonomous ODE.
- Non‑universality analysis – By constructing specific aligned initializations and examining the fluctuation SDEs, they identify scenarios where the universal ODE fails.
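As a rough illustration of this scaling regime, the sketch below runs online SGD on a single‑index model with step size η = c/d and records the summary statistics m = ⟨w, θ⟩ and q = ⟨w, w⟩ once per unit of rescaled time t = steps/d. The link function, constants, and variable names here are illustrative choices, not the paper's notation; the point is only that these low‑dimensional statistics trace a nearly deterministic curve when d is large.

```python
import numpy as np

# Minimal online-SGD sketch for a single-index model y = sigma(<theta, x>),
# illustrating the scaling regime described above: step size eta = c/d, one
# fresh sample per step, and "time" measured as t = steps / d. The tracked
# summary statistics are m = <w, theta> (overlap with the ground truth) and
# q = <w, w> (squared weight norm). All names (sigma, c, d, n_steps) are
# illustrative assumptions, not quantities defined in the paper.

rng = np.random.default_rng(0)

d = 2000                      # ambient dimension
c = 0.5                       # learning-rate constant; eta = c / d
eta = c / d
sigma = np.tanh               # link function (example choice)
dsigma = lambda z: 1.0 - np.tanh(z) ** 2

theta = np.ones(d) / np.sqrt(d)          # delocalized ground-truth direction
w = rng.standard_normal(d) / np.sqrt(d)  # delocalized random initialization

n_steps = 20 * d              # run for t = 20 units of ODE time
history = []                  # (t, m, q) snapshots

for k in range(n_steps):
    x = rng.standard_normal(d)           # fresh isotropic sample each step
    y = sigma(x @ theta)                 # noiseless teacher label
    pre = x @ w
    err = sigma(pre) - y                 # residual of squared loss 0.5 * err^2
    w -= eta * err * dsigma(pre) * x     # online SGD update

    if k % d == 0:                       # sample once per unit of ODE time
        history.append((k / d, w @ theta, w @ w))

for t, m, q in history:
    print(f"t = {t:5.1f}   overlap m = {m:+.4f}   norm^2 q = {q:.4f}")
```

Rerunning with a different seed, or with a larger d, should leave the (t, m, q) curve essentially unchanged; that concentration onto a deterministic trajectory is exactly what the ODE limit formalizes.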
Results & Findings
| Aspect | What the paper shows |
|---|---|
| ODE limit | The same deterministic ODE describes SGD dynamics for any product‑measure mixture matching the Gaussian’s first two moments, under delocalized initialization. |
| Practical tasks | The result holds for cross‑entropy classification with shallow nets and for learning index models, meaning many real‑world training pipelines fall under the theorem. |
| Failure modes | If the weight vector is aligned with a coordinate axis (e.g., a one‑hot initialization), the ODE changes—highlighting the importance of random, spread‑out initializations. |
| Fluctuations | The stochastic differential equation (SDE) that captures finite‑dimensional noise around the ODE’s fixed points is not universal; its coefficients depend on higher‑order moments of the data distribution. |
| Empirical alignment | Simulations (provided in the supplementary material) confirm that the ODE predictions match SGD trajectories for both Gaussian and non‑Gaussian product mixtures, as long as the delocalization condition holds. |
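A minimal version of such a comparison, under assumed model and hyperparameter choices (not the paper's experimental setup), is sketched below: the same online‑SGD loop is run on Gaussian data and on Rademacher (±1) product data, which shares the Gaussian's first two moments, and the overlap trajectories are compared.

```python
import numpy as np

# A rough universality check in the spirit of the paper's simulations (the
# actual experiments are in the supplementary material): run identical
# online-SGD loops on Gaussian data and on Rademacher (+/-1) product data,
# then compare the summary-statistic trajectories. Function and variable
# names are illustrative, not taken from the paper.

sigma = np.tanh
dsigma = lambda z: 1.0 - np.tanh(z) ** 2

def run_sgd(sampler, d=2000, c=0.5, t_max=20, seed=0):
    """Online SGD on a single-index model; returns overlaps m(t) = <w, theta>."""
    rng = np.random.default_rng(seed)
    eta = c / d
    theta = np.ones(d) / np.sqrt(d)                # delocalized ground truth
    w = rng.standard_normal(d) / np.sqrt(d)        # delocalized random init
    overlaps = []
    for k in range(t_max * d):
        x = sampler(rng, d)
        pre = x @ w
        err = sigma(pre) - sigma(x @ theta)
        w -= eta * err * dsigma(pre) * x
        if k % d == 0:
            overlaps.append(w @ theta)
    return np.array(overlaps)

gaussian = lambda rng, d: rng.standard_normal(d)             # N(0, 1) coordinates
rademacher = lambda rng, d: rng.choice([-1.0, 1.0], size=d)  # +/-1 coordinates: same mean and variance

m_gauss = run_sgd(gaussian)
m_rad = run_sgd(rademacher)
print("max trajectory gap:", np.max(np.abs(m_gauss - m_rad)))
```

For moderate d the two trajectories differ only by finite‑size fluctuations, and the gap shrinks as d grows, which is the behavior the universality theorem predicts under delocalized initialization.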
Practical Implications
- Confidence in standard initializations – Random, isotropic initializations (e.g., Xavier, He) automatically satisfy the delocalization requirement, so developers can expect the same macroscopic training dynamics across a wide range of data distributions.
- Robustness to data preprocessing – Even if raw features are not Gaussian, as long as they are independent across dimensions and have matching first two moments, the high‑level SGD behavior remains predictable. This explains why many pipelines work “out‑of‑the‑box” after simple whitening or standardization.
- Design of synthetic data for testing – When benchmarking algorithms, one can safely replace costly Gaussian mixture generators with simpler product‑measure generators without altering the theoretical training dynamics.
- Guidance for curriculum learning – Since the ODE limit is insensitive to higher‑order moments, curriculum strategies that only affect skewness/kurtosis of the data will not change the overall convergence path, allowing developers to focus on altering the loss landscape instead.
- Understanding failure cases – The non‑universality result warns against pathological initializations (e.g., sparse one‑hot vectors) that can lead to unexpected training dynamics, a useful diagnostic when training stalls.
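As a quick diagnostic for the delocalization condition discussed above, one can compare ‖w‖∞/‖w‖₂ across initializations. This ratio is a heuristic proxy of our own choosing, not a quantity defined in the paper: it is roughly 1/√d (up to logarithmic factors) for i.i.d. random initializations, but equals 1 for a one‑hot vector.

```python
import numpy as np

# Heuristic check of the "delocalized initialization" condition: no single
# coordinate should carry a macroscopic fraction of the weight vector's mass.
# The ratio ||w||_inf / ||w||_2 used here is an illustrative proxy, not a
# quantity from the paper.

rng = np.random.default_rng(0)
d = 4096

def delocalization_ratio(w):
    return np.max(np.abs(w)) / np.linalg.norm(w)

gaussian_init = rng.standard_normal(d)              # He/Xavier-style random init
one_hot_init = np.zeros(d)                          # axis-aligned, pathological
one_hot_init[0] = 1.0

print("1/sqrt(d)           :", 1 / np.sqrt(d))
print("Gaussian init ratio :", delocalization_ratio(gaussian_init))
print("one-hot init ratio  :", delocalization_ratio(one_hot_init))
```

Ratios near 1/√d indicate mass spread across many coordinates; ratios near 1 flag the axis‑aligned initializations for which the universal ODE limit is not guaranteed.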
Limitations & Future Work
- Delocalization requirement – The universality hinges on the weight vectors being spread across many coordinates. Highly sparse or structured initializations (common in pruning or lottery‑ticket experiments) fall outside the proven regime.
- Product‑measure assumption – Real‑world data often exhibits correlations across features; extending the theory to dependent coordinates remains an open challenge.
- Finite‑dimensional effects – The ODE limit is asymptotic; the paper provides convergence rates but does not fully characterize how large d must be for the approximation to be accurate in practice.
- Beyond shallow networks – While the analysis covers one‑ and two‑layer nets, extending universality to deep architectures with non‑linear activations is a natural next step.
- Fluctuation non‑universality – Understanding how the non‑universal SDE terms affect generalization and escape from saddle points is left for future investigation.
Bottom line: For most everyday deep‑learning workflows that use random initializations and operate on high‑dimensional, roughly independent data, the macroscopic dynamics of SGD are governed by a universal ODE—regardless of whether the data is truly Gaussian. This theoretical guarantee helps developers trust that their training curves are not an artifact of a hidden Gaussian assumption, and it points the way toward more robust initialization and data‑generation practices.
Authors
- Reza Gheissari
- Aukosh Jagannath
Paper Information
- arXiv ID: 2512.13634v1
- Categories: stat.ML, cs.LG, math.PR, math.ST
- Published: December 15, 2025