[Paper] A Noise Sensitivity Exponent Controls Large Statistical-to-Computational Gaps in Single- and Multi-Index Models
Source: arXiv - 2603.17896v1
Overview
The paper uncovers a surprisingly simple predictor of when high‑dimensional learning problems are statistically feasible but computationally intractable. By focusing on single‑ and multi‑index models—canonical testbeds for feature discovery—the authors show that a single number, the Noise Sensitivity Exponent (NSE), derived from the activation function, dictates the size of the statistical‑to‑computational gap across a wide range of settings.
Key Contributions
- Unified metric: Introduces the Noise Sensitivity Exponent (NSE) as a single scalar that captures the interplay between noise robustness and algorithmic hardness.
- Single‑index models: Proves that, when additive noise is large, the NSE exactly characterizes the point at which polynomial‑time algorithms (e.g., gradient descent, spectral methods) fail despite the existence of an information‑theoretic estimator.
- Multi‑index specialization: Shows that in separable multi‑index models the same NSE controls a specialization transition: components become learnable one by one, creating a hierarchy of statistical‑computational gaps.
- Hierarchical learning rates: Demonstrates that in hierarchical multi‑index models the NSE determines the optimal computational rate at which different latent directions can be sequentially recovered.
- Broad applicability: Results hold for a wide class of activation functions (ReLU, sigmoid, polynomial, etc.) and for both Gaussian and sub‑Gaussian data distributions.
Methodology
- Model setup – The authors consider data generated as
  $$ y = f(\langle w_1, x\rangle, \dots, \langle w_k, x\rangle) + \xi, $$
  where $x\in\mathbb{R}^d$ is high‑dimensional, the $w_i$ are unknown “index” vectors, $f$ is a separable activation, and $\xi$ is additive noise.
- Defining NSE – For a scalar activation $\phi$, the NSE is defined as
  $$ \mathrm{NSE}(\phi) = \frac{\mathbb{E}[\phi'(Z)^2]}{\mathbb{E}[\phi(Z)^2]}, $$
  with $Z\sim\mathcal{N}(0,1)$. Intuitively, it measures how strongly the output responds to small perturbations of the input.
- Statistical‑computational analysis –
  - The information‑theoretic threshold is derived via replica‑type calculations and the conditional mutual information between $y$ and the index vectors given the data.
  - The algorithmic threshold is obtained by analyzing the dynamics of the approximate message passing (AMP) algorithm and by constructing reductions to known hard problems (e.g., planted clique).
- Phase diagram construction – By varying the noise variance and the NSE, the authors map out regions where learning is easy, hard, or impossible, for both single‑ and multi‑index settings.
- Extension to hierarchies – They introduce a multi‑scale version of AMP that learns one direction at a time, and prove that the speed of each stage is inversely proportional to the NSE of the corresponding activation.
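The NSE defined above is straightforward to estimate numerically. A minimal sketch (not code from the paper) using Monte Carlo over $Z\sim\mathcal{N}(0,1)$, checked here on ReLU, whose NSE the paper reports as 1:

```python
import numpy as np

def nse(phi, dphi, n_samples=1_000_000, seed=0):
    """Monte Carlo estimate of NSE(phi) = E[phi'(Z)^2] / E[phi(Z)^2], Z ~ N(0, 1)."""
    z = np.random.default_rng(seed).standard_normal(n_samples)
    return np.mean(dphi(z) ** 2) / np.mean(phi(z) ** 2)

# ReLU: phi(z) = max(z, 0), with derivative 1{z > 0}
relu_nse = nse(lambda z: np.maximum(z, 0.0), lambda z: (z > 0).astype(float))
print(f"NSE(ReLU) ≈ {relu_nse:.3f}")  # close to 1, matching the paper's example
```

Both expectations for ReLU equal $1/2$ in closed form, so the estimate concentrates near 1; any differentiable (or a.e.-differentiable) activation can be plugged in the same way.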
Results & Findings
| Setting | Statistical threshold | Computational threshold | Gap condition |
|---|---|---|---|
| Single‑index, large noise | $\sigma^2 > \sigma_{\text{info}}^2 \propto 1/\mathrm{NSE}$ | $\sigma^2 > \sigma_{\text{alg}}^2 \propto 1/\mathrm{NSE}^{2}$ | Gap appears when NSE is small (e.g., ReLU) |
| Separable multi‑index | Each component becomes learnable when its signal‑to‑noise ratio exceeds $1/\mathrm{NSE}$ | Algorithmic learning of component $i$ requires a signal‑to‑noise ratio $> 1/\mathrm{NSE}^{2}$ | Leads to a stepwise gap across components |
| Hierarchical multi‑index | Same statistical limits as above | Optimal learning rate for component $i$ is $\Theta(\mathrm{NSE}^{-1})$ | The NSE dictates the tempo of sequential learning |
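For intuition, the single‑index row can be turned into a toy phase classifier. This sketch is not from the paper: the proportionality constants are unspecified in this summary and are set to 1 here, and with these constants the hard window is nonempty only when the two thresholds separate:

```python
def regime(sigma2: float, nse: float) -> str:
    """Toy single-index phase classifier per the table's scalings,
    taking the (unspecified) proportionality constants to be 1."""
    sigma2_info = 1.0 / nse       # statistical threshold  ~ 1/NSE
    sigma2_alg = 1.0 / nse ** 2   # algorithmic threshold  ~ 1/NSE^2
    if sigma2 > sigma2_info:
        return "impossible"       # above the information-theoretic limit
    if sigma2 > sigma2_alg:
        return "hard"             # statistically solvable, no known poly-time algorithm
    return "easy"                 # efficient algorithms (e.g., AMP) succeed

print(regime(sigma2=0.2, nse=3.0))  # sits between 1/9 and 1/3: "hard"
```

Sweeping `sigma2` and `nse` over a grid with this function reproduces the easy/hard/impossible regions of the paper's phase diagram, up to the unknown constants.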
Interpretation:
- Low NSE (e.g., piecewise‑linear activations such as ReLU) → the output responds weakly to input perturbations, so the problem remains statistically solvable even at high noise but becomes hard computationally.
- High NSE (e.g., smooth sigmoids) → the statistical and computational thresholds move together, shrinking the gap.
The authors also provide concrete examples: for ReLU, $\mathrm{NSE}=1$ yields a quadratic gap between the information‑theoretic and algorithmic noise levels; for a cubic activation, $\mathrm{NSE}=3$ reduces the gap dramatically.
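The cubic value can also be checked numerically. The summary does not pin down which cubic is meant; the centered (Hermite) cubic $\phi(z)=z^3-3z$ gives exactly 3 under the stated definition, so it is assumed here:

```python
import numpy as np

# Monte Carlo check of NSE(phi) = E[phi'(Z)^2] / E[phi(Z)^2] for the
# centered cubic phi(z) = z^3 - 3z (an assumption: the summary does not
# specify the cubic, and this choice yields the stated value 3 exactly).
z = np.random.default_rng(42).standard_normal(2_000_000)
phi = z**3 - 3 * z       # phi(z)  = z^3 - 3z
dphi = 3 * z**2 - 3      # phi'(z) = 3z^2 - 3
nse_cubic = np.mean(dphi**2) / np.mean(phi**2)
print(f"NSE(cubic) ≈ {nse_cubic:.2f}")  # close to 3
```

In closed form, $\mathbb{E}[(3Z^2-3)^2]=18$ and $\mathbb{E}[(Z^3-3Z)^2]=6$, so the ratio is exactly 3, matching the example in the text.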
Practical Implications
- Model selection: When designing deep or shallow networks for high‑dimensional tasks (e.g., genomics, recommendation systems), the NSE offers a quick diagnostic: choose activations with higher NSE to avoid hidden computational bottlenecks.
- Algorithm design: The analysis suggests that spectral or AMP‑style methods are optimal only when the NSE is large; otherwise, one may need to resort to more sophisticated non‑convex optimization (e.g., tensor methods) or accept sub‑optimal performance.
- Feature engineering: In multi‑task or multi‑output settings, the specialization transition indicates that some outputs may be learnable far earlier than others. Practitioners can prioritize data collection or regularization for the “hard” components identified by a low NSE.
- Benchmarking: The NSE can be used to construct synthetic datasets that deliberately exhibit a statistical‑computational gap, providing a rigorous stress test for new learning algorithms.
- Hardware considerations: Since the optimal computational rate in hierarchical models scales with $1/\mathrm{NSE}$, hardware‑accelerated pipelines (e.g., GPUs) can be tuned to allocate more resources to low‑NSE layers, where learning is slower.
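The benchmarking point above can be made concrete. A minimal generator (interface and names are illustrative, not from the paper) that draws data from the single‑index special case of the model in the Methodology section:

```python
import numpy as np

def single_index_dataset(n, d, phi, noise_std, seed=0):
    """Draw (X, y) from y = phi(<w, x>) + xi with isotropic Gaussian
    inputs, a hidden unit-norm direction w, and Gaussian noise.
    The Gaussian design follows the paper's setup; the function
    signature itself is an illustrative choice."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(d)
    w /= np.linalg.norm(w)           # hidden index direction, ||w|| = 1
    X = rng.standard_normal((n, d))  # isotropic Gaussian features
    y = phi(X @ w) + noise_std * rng.standard_normal(n)
    return X, y, w

# Example: a ReLU single-index instance at moderate noise; raising
# noise_std past the thresholds above pushes it into the hard regime.
X, y, w = single_index_dataset(n=2000, d=500,
                               phi=lambda t: np.maximum(t, 0.0),
                               noise_std=0.5)
```

Varying `noise_std` and the activation's NSE then lets one place synthetic instances on either side of the statistical and algorithmic thresholds when stress-testing a new algorithm.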
Limitations & Future Work
- Assumptions on data distribution: The theory relies on Gaussian or sub‑Gaussian input features; extending to heavy‑tailed or structured data (e.g., images) remains open.
- Finite‑sample effects: The analysis is asymptotic in the high‑dimensional limit; empirical validation on moderate‑size datasets shows promising trends, but finite‑sample corrections remain to be quantified.
- Algorithmic scope: While AMP captures a broad class of iterative methods, many practical optimizers (Adam, SGD with momentum) fall outside the current proof techniques. Bridging this gap could yield tighter computational thresholds for real‑world training.
- Beyond separable activations: The current NSE definition assumes separable (component‑wise) activations. Generalizing to deep, non‑separable architectures is a natural next step.
- Adaptive NSE: Investigating whether training dynamics can increase the effective NSE (e.g., via learned activation functions) could lead to algorithms that self‑regularize against computational hardness.
Authors
- Leonardo Defilippis
- Florent Krzakala
- Bruno Loureiro
- Antoine Maillard
Paper Information
- arXiv ID: 2603.17896v1
- Categories: stat.ML, cs.LG
- Published: March 18, 2026