[Paper] Estimating the expected output of wide random MLPs more efficiently than sampling
Source: arXiv - 2605.05179v1
Overview
The paper introduces a new way to compute the expected output of a wide, randomly‑initialized multilayer perceptron (MLP) without having to run a massive Monte‑Carlo simulation. By analytically tracking the distribution of activations layer‑by‑layer, the authors can estimate means, variances, and even rare‑event probabilities with far fewer floating‑point operations than traditional sampling. This opens the door to faster diagnostics, safer model initialization, and more efficient training pipelines.
Key Contributions
- Closed‑form estimator for the expected output of wide random MLPs under Gaussian inputs, built on cumulants and Hermite polynomial expansions.
- Theoretical guarantees that the estimator’s mean‑squared error (MSE) decays faster than the $O(1/N)$ rate of Monte‑Carlo sampling with $N$ samples when the network width $w$ is large.
- Empirical validation showing orders‑of‑magnitude FLOP savings across several architectures (e.g., 2‑layer, 4‑layer, and deep residual MLPs).
- Rare‑event analysis demonstrating accurate tail‑probability estimates (e.g., probability that a neuron output exceeds a high threshold) where sampling would need billions of draws.
- Proof‑of‑concept training where the estimator replaces the forward pass in a small‑scale loss‑gradient computation, yielding comparable convergence with reduced compute.
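For reference, the sampling baseline that the contributions are measured against can be sketched as follows. This is a minimal illustration, not the paper's code: the helper names (`init_mlp`, `forward`, `mc_expected_output`), the bias‑free ReLU architecture, the `2/fan_in` weight variance, and the choice to fix the weights once and average over Gaussian inputs are all assumptions made for the example.

```python
import numpy as np

def init_mlp(widths, rng):
    """Draw Gaussian weights with variance 2/fan_in (an assumed, He-style scale)."""
    return [rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))
            for n_in, n_out in zip(widths[:-1], widths[1:])]

def forward(weights, x):
    """Bias-free ReLU MLP; the last layer is linear."""
    h = x
    for W in weights[:-1]:
        h = np.maximum(h @ W, 0.0)
    return h @ weights[-1]

def mc_expected_output(weights, n_samples, rng):
    """Average the network output over n_samples standard-normal inputs."""
    x = rng.normal(size=(n_samples, weights[0].shape[0]))
    return forward(weights, x).mean(axis=0)

rng = np.random.default_rng(0)
weights = init_mlp([32, 64, 64, 1], rng)  # weights are drawn once and then held fixed

# Repeat the estimate many times to measure its spread at two sample sizes.
small = np.array([mc_expected_output(weights, 100, rng)[0] for _ in range(200)])
large = np.array([mc_expected_output(weights, 1600, rng)[0] for _ in range(200)])

# With 16x more samples the variance should shrink by roughly 16x (the 1/N rate).
ratio = small.var() / large.var()
print(ratio)
```

The printed ratio fluctuates from run to run, but it should sit near 16. That is the $O(1/N)$ behaviour the estimator is claimed to beat.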
Methodology
- Gaussian Input Assumption – The entries of the input vector are modeled as i.i.d. standard normal. This matches common practice for weight‑initialization analysis and simplifies the mathematics.
- Layer‑wise Distribution Propagation – Starting from the input distribution, the authors propagate approximate moments through each affine‑ReLU (or other activation) block:
  - Cumulants (e.g., mean, variance, skewness) are updated analytically using known formulas for linear transforms and for the ReLU’s effect on Gaussian variables.
  - Hermite expansions express the non‑linear activation’s effect as a series of orthogonal polynomials, truncating after a few terms because higher‑order coefficients vanish quickly in wide networks.
- Wide‑Network Approximation – When the hidden width $w \to \infty$, the law of large numbers forces the empirical activation distribution to concentrate around its theoretical moment‑based approximation. The estimator leverages this concentration to ignore sampling noise.
- Error Control – The authors bound the truncation error of the Hermite series and the deviation caused by finite width, yielding an overall MSE bound that scales as $\tilde{O}(1/w^2)$.
- Implementation – The estimator is a lightweight routine that only requires matrix multiplications for the linear parts and a few scalar updates for the moment terms, with no sampled forward passes through the full network.
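The layer‑wise idea can be illustrated with a much simpler scheme than the one summarized above: propagate only a mean and a variance per coordinate, treating every pre‑activation as an independent Gaussian. This is a sketch of plain Gaussian moment matching, not the paper's cumulant/Hermite estimator. The function names and the independence assumption are choices made for this example, and the closed‑form ReLU moments are the standard ones for a Gaussian variable.

```python
import math
import numpy as np

def relu_gaussian_moments(mu, var):
    """Mean and variance of max(z, 0) for z ~ N(mu, var), elementwise (var > 0)."""
    sigma = np.sqrt(var)
    alpha = mu / sigma
    pdf = np.exp(-0.5 * alpha**2) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(alpha / math.sqrt(2.0)))
    mean = mu * cdf + sigma * pdf
    second = (mu**2 + var) * cdf + mu * sigma * pdf
    return mean, np.maximum(second - mean**2, 0.0)

def propagate(weights, biases, mean, var):
    """Push (mean, var) through affine-ReLU blocks, then a final affine layer.

    ASSUMPTION: coordinates are treated as independent Gaussians at every
    layer, so correlations between neurons are ignored.
    """
    for i, (W, b) in enumerate(zip(weights, biases)):
        mean = mean @ W + b          # mean of the affine map
        var = var @ (W**2)           # variance under the independence assumption
        if i < len(weights) - 1:
            mean, var = relu_gaussian_moments(mean, var)
    return mean, var

rng = np.random.default_rng(0)
widths = [16, 64, 1]
weights = [rng.normal(0.0, np.sqrt(2.0 / a), size=(a, b))
           for a, b in zip(widths[:-1], widths[1:])]
biases = [np.zeros(b) for b in widths[1:]]

# Inputs are i.i.d. standard normal: zero mean, unit variance per coordinate.
est_mean, est_var = propagate(weights, biases, np.zeros(16), np.ones(16))
print(est_mean, est_var)
```

For a single hidden layer the mean is exact, because each first‑layer pre‑activation is exactly Gaussian and the output layer is linear. The variance, and anything deeper, is only an approximation; that gap is what higher‑order cumulants and Hermite terms are meant to close.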
Results & Findings
| Experiment | Baseline (Monte‑Carlo) | Proposed Estimator | FLOP Reduction | MSE (Target) |
|---|---|---|---|---|
| 2‑layer MLP, width = 1024 | 10⁶ samples | 1 000 Hermite terms | ≈ 200× | ≤ 10⁻⁴ |
| 4‑layer MLP, width = 2048 | 10⁶ samples | 2 500 terms | ≈ 350× | ≤ 5·10⁻⁵ |
| Tail probability (output > 3σ) | 10⁹ samples needed for stable estimate | 5 000 terms | ≈ 10⁶× | ≤ 10⁻³ |
- Accuracy: Across depths and widths, the estimator matches Monte‑Carlo averages within the prescribed MSE, even for highly non‑linear activations (ReLU, GELU).
- Rare‑Event Estimation: For events with probability < 10⁻⁶, the estimator remains stable, whereas sampling variance dominates unless billions of draws are used.
- Training Demo: Replacing the forward pass with the estimator in a simple regression task on synthetic data yields comparable loss curves while cutting per‑epoch compute by ~70 %.
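The sample counts quoted for rare events follow from the binomial variance of an empirical frequency: for an event of probability $p$ estimated from $N$ independent draws, the relative standard error is $\sqrt{(1-p)/(pN)}$. The short sketch below turns that into a sample budget. The function names and the target relative errors are illustrative assumptions, not values from the paper.

```python
import math

def mc_relative_error(p, n_samples):
    """Relative standard error of the empirical frequency of an event with probability p."""
    return math.sqrt((1.0 - p) / (p * n_samples))

def samples_needed(p, rel_err):
    """Smallest sample count whose relative standard error is at most rel_err."""
    return math.ceil((1.0 - p) / (p * rel_err**2))

# An event with probability 1e-6, estimated to 10% and to 3% relative error.
print(samples_needed(1e-6, 0.10))  # on the order of 1e8
print(samples_needed(1e-6, 0.03))  # on the order of 1e9
```

Under these assumptions, a few‑percent estimate of a $10^{-6}$ event already needs around a billion draws, which is why plain sampling becomes impractical in the tails.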
Practical Implications
- Faster Model Auditing: Engineers can quickly evaluate expected loss, gradient norms, or activation statistics of a freshly‑initialized network without costly forward passes—useful for hyper‑parameter sweeps or architecture search.
- Safety‑Critical Systems: Accurate tail‑risk estimates enable early detection of catastrophic failure modes (e.g., extreme activations that could cause overflow or saturate downstream components).
- Resource‑Constrained Training: In settings where GPU time is at a premium (edge devices, large‑scale hyper‑parameter optimization), the estimator can replace expensive Monte‑Carlo validation loops.
- Theoretical Tooling: The moment‑propagation framework can be extended to other architectures (CNNs, transformers) that exhibit width‑wise concentration, providing a new analytical lens for initialization strategies.
Limitations & Future Work
- Gaussian Input Restriction: The current derivation assumes i.i.d. normal inputs; extending to structured data (images, text embeddings) will require additional approximations.
- Activation Diversity: While ReLU and GELU are handled, exotic activations (swish, softmax) may need higher‑order Hermite terms, increasing computational overhead.
- Finite‑Width Effects: For narrow layers (≤ 64 units) the concentration guarantees weaken, and the estimator’s error grows; hybrid sampling‑plus‑analytic schemes could bridge this gap.
- Scalability to Deep Nets: The paper demonstrates up to 8 layers; deeper networks may accumulate truncation errors, suggesting adaptive term selection or variance‑reduction tricks.
- Training Integration: The proof‑of‑concept training experiment is preliminary; future work should explore back‑propagation through the moment‑based estimator and its impact on convergence dynamics.
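To gauge how many Hermite terms a given activation might need (the concern raised under Activation Diversity), one can compute its Hermite coefficients numerically and check how much of $E[f(Z)^2]$ the truncated series captures. This sketch uses the standard probabilists' Hermite expansion $f(z) = \sum_k c_k \mathrm{He}_k(z)$ with $c_k = E[f(Z)\,\mathrm{He}_k(Z)]/k!$. The helper names, the grid‑based integration, and the truncation order are assumptions for illustration, not the paper's procedure.

```python
import math
import numpy as np
from numpy.polynomial import hermite_e as He

def hermite_coeffs(f, order, half_width=12.0, n_grid=240001):
    """c_k = E[f(Z) He_k(Z)] / k! for Z ~ N(0, 1), via a Riemann sum on a uniform grid."""
    z = np.linspace(-half_width, half_width, n_grid)
    dz = z[1] - z[0]
    weight = np.exp(-0.5 * z**2) / math.sqrt(2.0 * math.pi) * dz
    fz = f(z)
    coeffs = []
    for k in range(order + 1):
        basis = He.hermeval(z, [0.0] * k + [1.0])  # He_k evaluated on the grid
        coeffs.append(float(np.sum(fz * basis * weight)) / math.factorial(k))
    return np.array(coeffs)

def captured_energy(coeffs):
    """Partial Parseval sum over k of k! * c_k^2, which approaches E[f(Z)^2]."""
    return float(sum(math.factorial(k) * c**2 for k, c in enumerate(coeffs)))

def relu(z):
    return np.maximum(z, 0.0)

c = hermite_coeffs(relu, order=8)
print(c[:3])                # leading coefficients of the ReLU expansion
print(captured_energy(c))   # compare against E[ReLU(Z)^2] = 0.5
```

For ReLU the first two coefficients are $1/\sqrt{2\pi}$ and $1/2$, and the partial sum approaches $0.5$ from below as the order grows. Swapping in another activation for `relu` shows how quickly its series converges by the same measure.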
Bottom line: By turning the forward pass of a wide random MLP into a tractable moment‑calculation, this work offers a practical shortcut for developers who need fast, reliable estimates of model behavior—especially when probing the tails of the output distribution where traditional sampling is prohibitively expensive.
Authors
- Wilson Wu
- Victor Lecomte
- Michael Winer
- George Robinson
- Jacob Hilton
- Paul Christiano
Paper Information
- arXiv ID: 2605.05179v1
- Categories: cs.LG, cond-mat.dis-nn, stat.ML
- Published: May 6, 2026