[Paper] Provably robust learning of regression neural networks using $β$-divergences
Source: arXiv - 2602.08933v1
Overview
The paper introduces rRNet, a new training framework for regression‑type neural networks that is provably robust to outliers and contaminated data. By leveraging the β‑divergence (also known as density‑power divergence), the authors replace the usual mean‑squared‑error loss with a family of loss functions that can automatically down‑weight suspicious samples while still retaining the familiar maximum‑likelihood case as a special setting.
Key Contributions
- β‑divergence loss for regression NNs – a unified objective that works with smooth or non‑smooth activations and with a wide range of error distributions.
- Alternating optimization algorithm with provable convergence to stationary points under mild, checkable conditions.
- Theoretical robustness guarantees: bounded influence functions for both parameters and predictions, and an asymptotic breakdown point of 50 % for any β ∈ (0, 1].
- Recovery of classic MLE when β → 0, so existing pipelines can be switched to rRNet with a single hyper‑parameter tweak.
- Extensive empirical validation on synthetic benchmarks and real‑world regression tasks, showing superior performance over standard MSE training and several ad‑hoc robust tricks (e.g., Huber loss, data clipping).
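The MLE recovery as β → 0 follows from a standard density‑power‑divergence identity, sketched here in generic DPD notation (which may differ from the paper's exact scaling):

```latex
\[
  \lim_{\beta \to 0} \frac{f_\theta(y_i)^{\beta} - 1}{\beta} = \log f_\theta(y_i)
\]
% Hence a per-sample beta-divergence loss proportional to
% -(f_theta(y_i)^beta - 1)/beta tends to the negative log-likelihood
% -log f_theta(y_i): ordinary maximum-likelihood training is the
% beta -> 0 limit.
```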
Methodology
- Loss formulation – Instead of minimizing the mean‑squared error
  $$\frac{1}{n}\sum_{i}(y_i-\hat y_i)^2,$$
  rRNet minimizes the β‑divergence between the empirical data distribution and the model‑implied distribution:
  $$L_\beta(\theta)=\frac{1}{\beta(\beta+1)}\Big[\,\sum_i f_\theta(y_i)^{\beta+1} - (\beta+1)\sum_i f_\theta(y_i)^\beta\Big],$$
  where $f_\theta$ is the conditional density implied by the NN output and β > 0 controls the degree of robustness.
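As a concrete sketch, the loss can be evaluated in closed form under a Gaussian error model, following the standard density‑power‑divergence objective of Basu et al.; the function name, Gaussian assumption, and scaling are illustrative, not the paper's exact implementation:

```python
import numpy as np

def beta_divergence_loss(y, mu, sigma, beta):
    """Density-power-divergence (beta-divergence) loss under a Gaussian
    error model N(mu, sigma^2). As beta -> 0 this recovers the negative
    log-likelihood (standard MLE) up to additive/multiplicative constants."""
    var = sigma ** 2
    # Gaussian conditional density evaluated at the observations
    f = np.exp(-0.5 * (y - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    # Closed form of the integral term  ∫ f^(1+beta) dy  for a Gaussian
    integral = 1.0 / (np.sqrt(1.0 + beta) * (2 * np.pi * var) ** (beta / 2))
    # DPD objective: integral term minus (1 + 1/beta) * f(y_i)^beta
    return np.mean(integral - (1.0 + 1.0 / beta) * f ** beta)
```

Because outliers enter only through the bounded term $f(y_i)^\beta$, a grossly wrong observation contributes almost nothing to the loss, unlike the unbounded squared error.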
- Alternating optimization – The loss is not jointly convex in the network weights and the auxiliary scaling variables introduced by the β‑divergence, so the authors split the problem into two blocks:
  - Weight update: a gradient‑based step on the NN parameters.
  - Auxiliary variable update: a closed‑form solution derived from the β‑divergence.
  Alternating these two steps monotonically decreases the objective and converges to a stationary point.
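The two-block scheme can be illustrated on a one-parameter linear model. The per-sample weights $f^\beta$ and the weighted scale update below are a common re-weighting reading of the β-divergence, not the paper's exact derivation:

```python
import numpy as np

def alternating_fit(x, y, beta=0.3, lr=0.05, epochs=200):
    """Illustrative two-block scheme for a linear model y ≈ w*x.
    Block 1: gradient step on the model parameter w under per-sample
    weights f^beta (outliers receive near-zero weight).
    Block 2: closed-form weighted re-estimate of the noise scale."""
    w, sigma = 0.0, 1.0
    for _ in range(epochs):
        r = y - w * x
        f = np.exp(-0.5 * r**2 / sigma**2) / np.sqrt(2 * np.pi * sigma**2)
        wt = f ** beta                       # robustness weights
        # Block 1: weighted gradient step on w
        grad = -np.sum(wt * r * x) / np.sum(wt)
        w -= lr * grad
        # Block 2: closed-form weighted update of the scale
        sigma = np.sqrt(np.sum(wt * r**2) / np.sum(wt))
    return w, sigma
```

With a gross outlier in the data, the fitted slope stays close to the true value because the outlier's weight underflows to zero after the first few iterations.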
- Robustness analysis – Using classical influence‑function calculus, the paper shows that for a properly chosen β the derivative of the estimator with respect to an infinitesimal contamination is bounded. This translates into an asymptotic breakdown point of 50 %: the estimator tolerates up to half the data being arbitrarily corrupted before it breaks down.
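The bounded-influence property can be checked numerically on a simple location problem; the iterative re-weighting below is a generic minimum-β-divergence estimator, used here only to illustrate the contrast with the sample mean:

```python
import numpy as np

def dpd_location(y, sigma=1.0, beta=0.3, iters=50):
    """Minimum beta-divergence location estimate via iterative
    re-weighting with weights f^beta. A single gross outlier barely
    moves the estimate, whereas the sample mean shifts arbitrarily."""
    mu = np.median(y)  # robust starting point
    for _ in range(iters):
        f = np.exp(-0.5 * (y - mu)**2 / sigma**2) / np.sqrt(2 * np.pi * sigma**2)
        w = f ** beta
        mu = np.sum(w * y) / np.sum(w)
    return mu
```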
- Implementation details – The authors provide a lightweight PyTorch‑compatible module that wraps any existing regression NN architecture. The only new hyper‑parameter is β (typically chosen in the range 0.1–0.5).
Results & Findings
| Experiment | Baseline (MSE) | Huber loss | rRNet (β=0.3) | Relative error reduction |
|---|---|---|---|---|
| Synthetic 1‑D regression with 30 % outliers | 1.42 | 1.08 | 0.71 | 50 % |
| UCI Boston Housing (10 % label noise) | 3.12 | 2.87 | 2.31 | 26 % |
| Time‑series demand forecasting (real‑world, sensor glitches) | 5.6 % MAPE | 5.1 % | 4.2 % | 25 % |
- Convergence: The alternating scheme reaches a stationary point within 30–50 epochs for typical network sizes, comparable to standard SGD on MSE.
- Influence functions: Empirically measured sensitivities match the theoretical bounded curves, confirming the robustness claim.
- Ablation on β: Smaller β (≈0.1) behaves like MLE (high variance under contamination); larger β (≈0.7) can under‑weight legitimate data, slightly increasing bias. The sweet spot around 0.3–0.5 works well across tasks.
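The ablation behaviour is visible directly in the effective per-sample weight $f^\beta$; the Gaussian form below is an illustrative assumption, not taken from the paper:

```python
import numpy as np

def dpd_weight(residual, sigma=1.0, beta=0.3):
    """Effective per-sample weight f(y)^beta under a Gaussian error model.
    beta near 0 gives weight ~1 for every point (MLE behaviour); larger
    beta shrinks the weight of large-residual points toward zero."""
    f = np.exp(-0.5 * residual**2 / sigma**2) / np.sqrt(2 * np.pi * sigma**2)
    return f ** beta
```

For a 5σ residual, β = 0.1 still keeps a noticeable weight while β = 0.7 suppresses the point almost entirely, matching the reported bias/variance trade-off.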
Practical Implications
- Outlier‑prone pipelines – Data‑driven services (e.g., sensor analytics, financial forecasting, A/B test result modeling) can swap their loss function for rRNet and gain automatic protection against corrupted entries without hand‑crafting data‑cleaning rules.
- Minimal code changes – Because rRNet is a drop‑in replacement for the loss term, existing PyTorch/TensorFlow models need only a single import and a β hyper‑parameter.
- Safety‑critical ML – In domains where a single bad observation could trigger catastrophic decisions (autonomous driving perception, medical dosage prediction), the 50 % breakdown guarantee offers a formal safety margin that most current NN training regimes lack.
- Model‑agnostic robustness – The framework works with ReLU, leaky‑ReLU, tanh, or even piecewise‑linear activations, and does not require smooth error density assumptions, making it suitable for modern deep regression architectures (e.g., residual nets, transformer‑based regressors).
Limitations & Future Work
- Local optimality – The convergence proof guarantees reaching a stationary point, not a global optimum; like any non‑convex NN training, the final solution can depend on initialization.
- Choice of β – While the authors provide theoretical guidance based on the assumed error density, selecting β in practice still involves a modest validation sweep.
- Scalability to massive datasets – The alternating scheme introduces an extra per‑batch update step; the overhead is modest for medium‑scale data but could become noticeable for billions of samples.
- Extension to classification – The current theory is limited to regression with continuous outputs; adapting β‑divergence robustness to classification (e.g., softmax outputs) is an open direction.
Overall, rRNet offers a theoretically grounded, easy‑to‑integrate tool for making regression neural networks resilient to noisy, adversarial, or simply messy data—a frequent pain point for developers building real‑world ML systems.
Authors
- Abhik Ghosh
- Suryasis Jana
Paper Information
- arXiv ID: 2602.08933v1
- Categories: stat.ML, cs.LG, cs.NE, stat.ME
- Published: February 9, 2026