[Paper] Uniform-in-time concentration in two-layer neural networks via transportation inequalities
Source: arXiv - 2603.01842v1
Overview
The paper shows that the predictions of a two‑layer neural network trained with stochastic gradient descent (SGD) stay uniformly close to their mean‑field (infinite‑width) limit over the entire training horizon, and it does so with explicit, high‑probability guarantees. By proving new transportation‑inequality bounds for the distribution of SGD parameters, the authors obtain dimension‑free concentration results that translate directly into tight prediction‑error estimates.
Key Contributions
- Uniform‑in‑time concentration: Guarantees that the empirical distribution of network parameters never drifts far from its mean‑field limit, regardless of how many SGD steps are taken.
- Transportation inequalities for SGD: Establishes $T_p$ (for $p=1,2$) inequalities with constants independent of the iteration index, a novel tool for analyzing stochastic optimization dynamics.
- Wasserstein-1 and sliced-Wasserstein bounds: Provides explicit rates for the distance between the empirical parameter measure and its limit in both $W_1$ and the dimension-free sliced-$W_1$ metric.
- Prediction-error translation: Shows how the Wasserstein concentration directly bounds the error of network predictions against any fixed test function $\Phi$.
- Explicit constants: All bounds come with concrete constants (depending only on problem data such as loss curvature, regularization strength, and step size), making the results amenable to practical interpretation.
Methodology
- Mean‑field formulation: The authors view the two‑layer network as a particle system where each hidden neuron is a particle. In the infinite‑width limit, the empirical particle distribution converges to a deterministic measure governed by a McKean‑Vlasov PDE.
- SGD dynamics as a Markov chain: The discrete SGD updates are written as a stochastic recursion for the particle parameters. By treating each iteration as a transition of a Markov kernel, they can study the law of the whole parameter vector.
- Transportation-inequality proof: Using a combination of log-Sobolev and Poincaré inequalities tailored to the SGD kernel, they prove that the law of the parameters satisfies a $T_p$ inequality with a constant that does not grow with the iteration count.
- Concentration via martingale arguments: With the $T_p$ inequality in hand, they apply standard concentration-of-measure tools (e.g., Herbst's argument) to bound deviations of empirical measures from their expectation uniformly over time.
- Wasserstein distance analysis: The concentration bounds are expressed in terms of the $W_1$ distance between the empirical parameter measure and its mean-field limit. For sliced-$W_1$, they integrate over random one-dimensional projections, which removes any dependence on the ambient dimension.
- Error translation: Finally, they use the Lipschitz property of the network's output functional (with respect to the parameter measure) to convert the Wasserstein bounds into concrete prediction-error guarantees for any test function $\Phi$.
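To make the random-projection idea concrete, here is a minimal sketch (not the authors' code; the function name and defaults are illustrative) of a Monte Carlo estimator for the sliced-$W_1$ distance between two equal-size particle clouds, such as the hidden-neuron parameters of two networks:

```python
import numpy as np

def sliced_w1(particles_a, particles_b, n_projections=100, rng=None):
    """Monte Carlo estimate of the sliced Wasserstein-1 distance between
    two empirical measures, each given as an (N, d) array of particles."""
    rng = np.random.default_rng(rng)
    d = particles_a.shape[1]
    total = 0.0
    for _ in range(n_projections):
        # Draw a random direction uniformly on the unit sphere.
        theta = rng.standard_normal(d)
        theta /= np.linalg.norm(theta)
        # Project both particle clouds onto the line spanned by theta.
        proj_a = np.sort(particles_a @ theta)
        proj_b = np.sort(particles_b @ theta)
        # In 1-D, W_1 between equal-size empirical measures is the
        # mean absolute difference of the sorted samples.
        total += np.abs(proj_a - proj_b).mean()
    return total / n_projections
```

Because each projection reduces the problem to a one-dimensional transport computation, the estimator's cost and accuracy do not degrade with the ambient dimension $d$, which mirrors the dimension-free nature of the paper's sliced-$W_1$ bounds.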
Results & Findings
- Uniform concentration: With probability at least $1-\delta$, for all SGD steps $k$ up to any horizon $T$,
$$ W_1\big(\mu_k^{\text{emp}}, \mu_k^{\text{MF}}\big) \le C \sqrt{\frac{\log(1/\delta)}{N}}, $$
where $N$ is the number of hidden neurons and $C$ is an explicit constant independent of $k$.
- Dimension-free sliced-$W_1$ bound: The same rate holds for the sliced-$W_1$ distance, eliminating any curse of dimensionality.
- Prediction error: For any Lipschitz test function $\Phi$, the network's output error satisfies
$$ \big|\mathbb{E}_{\text{SGD}}[\Phi(f_{\theta_k})] - \Phi\big(f_{\mu_k^{\text{MF}}}\big)\big| \le L_\Phi\, C \sqrt{\frac{\log(1/\delta)}{N}}, $$
where $L_\Phi$ is the Lipschitz constant of $\Phi$.
- Explicit dependence on hyper-parameters: The constants capture the learning rate, ridge regularization strength, and the smoothness of the quadratic loss, enabling practitioners to see how tuning these parameters affects concentration.
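The step from Wasserstein concentration to prediction error is an instance of Kantorovich–Rubinstein duality: any functional that is $L$-Lipschitz with respect to $W_1$ inherits the concentration rate. Under the summary's notation, the chain of inequalities is:

```latex
\big| F(\mu_k^{\text{emp}}) - F(\mu_k^{\text{MF}}) \big|
\;\le\; L_\Phi \, W_1\big(\mu_k^{\text{emp}}, \mu_k^{\text{MF}}\big)
\;\le\; L_\Phi \, C \sqrt{\frac{\log(1/\delta)}{N}},
```

where $F(\mu) = \Phi(f_\mu)$ denotes the (assumed $L_\Phi$-Lipschitz) composition of the test function with the network's output functional.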
Practical Implications
- Confidence in wide‑network training: Developers can now claim, with quantitative backing, that a sufficiently wide two‑layer network trained by SGD will behave almost exactly like its infinite‑width counterpart throughout training—not just asymptotically.
- Guidance for network sizing: The $1/\sqrt{N}$ rate tells engineers how many hidden units are needed to achieve a target prediction-error tolerance, given a desired confidence level.
- Hyper-parameter selection: Since the constants are explicit, one can analytically assess the trade-off between learning rate, regularization, and convergence speed, potentially reducing the need for extensive grid searches.
- Robustness to dimensionality: The sliced-$W_1$ result means that even for high-dimensional input data, the concentration guarantees remain tight, supporting the use of wide shallow nets in domains like computer vision or genomics where input dimensions are large.
- Foundation for algorithmic extensions: The transportation‑inequality framework can be adapted to other stochastic optimizers (e.g., Adam, RMSProp) or to deeper architectures, opening a path for provable performance guarantees in more realistic settings.
Limitations & Future Work
- Two‑layer restriction: The analysis is limited to shallow networks; extending the uniform‑in‑time concentration to deep architectures remains an open challenge.
- Quadratic loss & ridge regularization: The proofs rely heavily on the convexity and smoothness of the quadratic loss; handling classification losses (e.g., cross‑entropy) would require new techniques.
- Discrete‑time vs. continuous‑time: While the results hold for the discrete SGD iterates, they assume a fixed step size and do not address adaptive learning‑rate schedules commonly used in practice.
- Finite‑sample constants: Although explicit, the constants can be conservative; tighter, data‑dependent bounds could make the theory more directly actionable.
- Beyond mean‑field: Investigating whether similar uniform concentration holds when the mean‑field limit itself evolves (e.g., due to non‑stationary data streams) is a promising direction.
Overall, the paper delivers a rigorous, developer‑friendly toolkit for understanding how wide shallow networks trained with SGD stay close to their idealized mean‑field behavior over time, and it paves the way for more robust, theoretically grounded deep‑learning practice.
Authors
- Arnaud Guillin
- Boris Nectoux
- Paul Stos
Paper Information
- arXiv ID: 2603.01842v1
- Categories: cs.NE, math.PR
- Published: March 2, 2026