[Paper] Universality of high-dimensional scaling limits of stochastic gradient descent
Source: arXiv - 2512.13634v1
Overview
The paper investigates why stochastic gradient descent (SGD) behaves so predictably on high‑dimensional learning problems, even when the data distribution deviates from the classic Gaussian assumption. By proving that the limiting ordinary differential equations (ODEs) governing SGD’s macroscopic dynamics are universal across a broad class of data models, the authors give developers a solid theoretical foundation for the robustness they often observe in practice.
Key Contributions
- Universality theorem: Shows that the ODE limits of SGD’s summary statistics hold for any data drawn from product‑measure mixtures whose first two moments match those of an isotropic Gaussian, provided the initialization and ground‑truth vectors are sufficiently “delocalized” across coordinates.
- Broad applicability: Covers common tasks such as classification with cross‑entropy loss on one‑ and two‑layer neural nets, and learning single‑ and multi‑index models with shallow networks.
- Non‑universality counter‑examples: Demonstrates that if the initialization aligns with coordinate axes, the ODE limit can change, and that stochastic fluctuations (the SDE limits) are not universal.
- Rigorous high‑dimensional scaling: Provides a mathematically precise regime where dimension → ∞, sample size → ∞, and learning rate → 0 at compatible rates, yielding deterministic ODE dynamics.
Methodology
- Problem setup – The loss depends only on the projection of data onto a low‑dimensional subspace spanned by the model parameters and a few “ground‑truth” vectors. This abstraction captures many neural‑network training scenarios.
- Data model – Instead of assuming a Gaussian mixture, the authors consider product‑measure mixtures (e.g., independent coordinates with arbitrary marginal distributions) that share the same mean and covariance as the Gaussian case.
- Delocalized initialization – They require the initial weight vectors to have their mass spread over many coordinates (no single coordinate dominates). This mimics common random initializations (e.g., i.i.d. Gaussian or uniform).
- Mean‑field scaling – As the ambient dimension d and the number of samples n grow proportionally, and the step size η shrinks like 1/d, the evolution of a finite set of summary statistics (inner products between weights and ground‑truth vectors) can be tracked; a minimal simulation sketch follows this list.
- Convergence to ODE – Using martingale techniques and concentration inequalities, they prove that the stochastic updates converge in probability to the solution of an autonomous ODE.
- Non‑universality analysis – By constructing specific aligned initializations and examining the fluctuation SDEs, they identify scenarios where the universal ODE fails.
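As a rough illustration of this scaling regime, the sketch below runs online SGD on a single‑index model with step size η = c/d and records the summary statistics m = ⟨w, θ⟩ and q = ⟨w, w⟩ once per unit of rescaled time t = steps/d. The link function, constants, and variable names here are illustrative choices, not the paper's notation; the point is only that these low‑dimensional statistics trace a nearly deterministic curve when d is large.

```python
import numpy as np

# Minimal online-SGD sketch for a single-index model y = sigma(<theta, x>),
# illustrating the scaling regime described above: step size eta = c/d, one
# fresh sample per step, and "time" measured as t = steps / d. The tracked
# summary statistics are m = <w, theta> (overlap with the ground truth) and
# q = <w, w> (squared weight norm). All names (sigma, c, d, n_steps) are
# illustrative assumptions, not quantities defined in the paper.

rng = np.random.default_rng(0)

d = 2000                      # ambient dimension
c = 0.5                       # learning-rate constant; eta = c / d
eta = c / d
sigma = np.tanh               # link function (example choice)
dsigma = lambda z: 1.0 - np.tanh(z) ** 2

theta = np.ones(d) / np.sqrt(d)          # delocalized ground-truth direction
w = rng.standard_normal(d) / np.sqrt(d)  # delocalized random initialization

n_steps = 20 * d              # run for t = 20 units of ODE time
history = []                  # (t, m, q) snapshots

for k in range(n_steps):
    x = rng.standard_normal(d)           # fresh isotropic sample each step
    y = sigma(x @ theta)                 # noiseless teacher label
    pre = x @ w
    err = sigma(pre) - y                 # residual of squared loss 0.5 * err^2
    w -= eta * err * dsigma(pre) * x     # online SGD update

    if k % d == 0:                       # sample once per unit of ODE time
        history.append((k / d, w @ theta, w @ w))

for t, m, q in history:
    print(f"t = {t:5.1f}   overlap m = {m:+.4f}   norm^2 q = {q:.4f}")
```

Rerunning with a different seed, or with a larger d, should leave the (t, m, q) curve essentially unchanged; that concentration onto a deterministic trajectory is exactly what the ODE limit formalizes.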
Results & Findings
| Aspect | What the paper shows |
|---|---|
| ODE limit | The same deterministic ODE describes SGD dynamics for any product‑measure mixture matching the Gaussian’s first two moments, under delocalized initialization. |
| Practical tasks | The result holds for cross‑entropy classification with shallow nets and for learning index models, meaning many real‑world training pipelines fall under the theorem. |
| Failure modes | If the weight vector is aligned with a coordinate axis (e.g., a one‑hot initialization), the ODE changes—highlighting the importance of random, spread‑out initializations. |
| Fluctuations | The stochastic differential equation (SDE) that captures finite‑dimensional noise around the ODE’s fixed points is not universal; its coefficients depend on higher‑order moments of the data distribution. |
| Empirical alignment | Simulations (provided in the supplementary material) confirm that the ODE predictions match SGD trajectories for both Gaussian and non‑Gaussian product mixtures, as long as the delocalization condition holds. |
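A minimal version of such a comparison, under assumed model and hyperparameter choices (not the paper's experimental setup), is sketched below: the same online‑SGD loop is run on Gaussian data and on Rademacher (±1) product data, which shares the Gaussian's first two moments, and the overlap trajectories are compared.

```python
import numpy as np

# A rough universality check in the spirit of the paper's simulations (the
# actual experiments are in the supplementary material): run identical
# online-SGD loops on Gaussian data and on Rademacher (+/-1) product data,
# then compare the summary-statistic trajectories. Function and variable
# names are illustrative, not taken from the paper.

sigma = np.tanh
dsigma = lambda z: 1.0 - np.tanh(z) ** 2

def run_sgd(sampler, d=2000, c=0.5, t_max=20, seed=0):
    """Online SGD on a single-index model; returns overlaps m(t) = <w, theta>."""
    rng = np.random.default_rng(seed)
    eta = c / d
    theta = np.ones(d) / np.sqrt(d)                # delocalized ground truth
    w = rng.standard_normal(d) / np.sqrt(d)        # delocalized random init
    overlaps = []
    for k in range(t_max * d):
        x = sampler(rng, d)
        pre = x @ w
        err = sigma(pre) - sigma(x @ theta)
        w -= eta * err * dsigma(pre) * x
        if k % d == 0:
            overlaps.append(w @ theta)
    return np.array(overlaps)

gaussian = lambda rng, d: rng.standard_normal(d)             # N(0, 1) coordinates
rademacher = lambda rng, d: rng.choice([-1.0, 1.0], size=d)  # +/-1 coordinates: same mean and variance

m_gauss = run_sgd(gaussian)
m_rad = run_sgd(rademacher)
print("max trajectory gap:", np.max(np.abs(m_gauss - m_rad)))
```

For moderate d the two trajectories differ only by finite‑size fluctuations, and the gap shrinks as d grows, which is the behavior the universality theorem predicts under delocalized initialization.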
Practical Implications
- Confidence in standard initializations – Random, isotropic initializations (e.g., Xavier, He) automatically satisfy the delocalization requirement, so developers can expect the same macroscopic training dynamics across a wide range of data distributions.
- Robustness to data preprocessing – Even if raw features are not Gaussian, as long as they are independent across dimensions and have matching first two moments, the high‑level SGD behavior remains predictable. This explains why many pipelines work “out‑of‑the‑box” after simple whitening or standardization.
- Design of synthetic data for testing – When benchmarking algorithms, one can safely replace costly Gaussian mixture generators with simpler product‑measure generators without altering the theoretical training dynamics.
- Guidance for curriculum learning – Since the ODE limit is insensitive to higher‑order moments, curriculum strategies that only affect skewness/kurtosis of the data will not change the overall convergence path, allowing developers to focus on altering the loss landscape instead.
- Understanding failure cases – The non‑universality result warns against pathological initializations (e.g., sparse one‑hot vectors) that can lead to unexpected training dynamics, a useful diagnostic when training stalls.
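As a quick diagnostic for the delocalization condition discussed above, one can compare ‖w‖∞/‖w‖₂ across initializations. This ratio is a heuristic proxy of our own choosing, not a quantity defined in the paper: it is roughly 1/√d (up to logarithmic factors) for i.i.d. random initializations, but equals 1 for a one‑hot vector.

```python
import numpy as np

# Heuristic check of the "delocalized initialization" condition: no single
# coordinate should carry a macroscopic fraction of the weight vector's mass.
# The ratio ||w||_inf / ||w||_2 used here is an illustrative proxy, not a
# quantity from the paper.

rng = np.random.default_rng(0)
d = 4096

def delocalization_ratio(w):
    return np.max(np.abs(w)) / np.linalg.norm(w)

gaussian_init = rng.standard_normal(d)              # He/Xavier-style random init
one_hot_init = np.zeros(d)                          # axis-aligned, pathological
one_hot_init[0] = 1.0

print("1/sqrt(d)           :", 1 / np.sqrt(d))
print("Gaussian init ratio :", delocalization_ratio(gaussian_init))
print("one-hot init ratio  :", delocalization_ratio(one_hot_init))
```

Ratios near 1/√d indicate mass spread across many coordinates; ratios near 1 flag the axis‑aligned initializations for which the universal ODE limit is not guaranteed.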
Limitations & Future Work
- Delocalization requirement – The universality hinges on the weight vectors being spread across many coordinates. Highly sparse or structured initializations (common in pruning or lottery‑ticket experiments) fall outside the proven regime.
- Product‑measure assumption – Real‑world data often exhibits correlations across features; extending the theory to dependent coordinates remains an open challenge.
- Finite‑dimensional effects – The ODE limit is asymptotic; the paper provides convergence rates but does not fully characterize how large d must be for the approximation to be accurate in practice.
- Beyond shallow networks – While the analysis covers one‑ and two‑layer nets, extending universality to deep architectures with non‑linear activations is a natural next step.
- Fluctuation non‑universality – Understanding how the non‑universal SDE terms affect generalization and escape from saddle points is left for future investigation.
Bottom line: For most everyday deep‑learning workflows that use random initializations and operate on high‑dimensional, roughly independent data, the macroscopic dynamics of SGD are governed by a universal ODE—regardless of whether the data is truly Gaussian. This theoretical guarantee helps developers trust that their training curves are not an artifact of a hidden Gaussian assumption, and it points the way toward more robust initialization and data‑generation practices.
Authors
- Reza Gheissari
- Aukosh Jagannath
Paper Information
- arXiv ID: 2512.13634v1
- Categories: stat.ML, cs.LG, math.PR, math.ST
- Published: December 15, 2025