[Paper] Provably robust learning of regression neural networks using $β$-divergences
Source: arXiv - 2602.08933v1
Overview
The paper introduces rRNet, a new training framework for regression‑type neural networks that is provably robust to outliers and contaminated data. By leveraging the β‑divergence (also known as density‑power divergence), the authors replace the usual mean‑squared‑error loss with a family of loss functions that can automatically down‑weight suspicious samples while still retaining the familiar maximum‑likelihood case as a special setting.
Key Contributions
- β‑divergence loss for regression NNs – a unified objective that works with smooth or non‑smooth activations and with a wide range of error distributions.
- Alternating optimization algorithm with provable convergence to stationary points under mild, checkable conditions.
- Theoretical robustness guarantees: bounded influence functions for both parameters and predictions, and an asymptotic breakdown point of 50 % for any β ∈ (0, 1].
- Recovery of classic MLE when β → 0, so existing pipelines can be switched to rRNet with a single hyper‑parameter tweak.
- Extensive empirical validation on synthetic benchmarks and real‑world regression tasks, showing superior performance over standard MSE training and several ad‑hoc robust tricks (e.g., Huber loss, data clipping).
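The MLE recovery as β → 0 follows from a standard density‑power‑divergence identity, sketched here in generic DPD notation (which may differ from the paper's exact scaling):

```latex
\[
  \lim_{\beta \to 0} \frac{f_\theta(y_i)^{\beta} - 1}{\beta} = \log f_\theta(y_i)
\]
% Hence a per-sample beta-divergence loss proportional to
% -(f_theta(y_i)^beta - 1)/beta tends to the negative log-likelihood
% -log f_theta(y_i): ordinary maximum-likelihood training is the
% beta -> 0 limit.
```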
Methodology
- Loss formulation – Instead of minimizing the mean‑squared error
  $$\frac{1}{n}\sum_{i}(y_i-\hat y_i)^2,$$
  rRNet minimizes the β‑divergence between the empirical data distribution and the model‑implied distribution:
  $$L_\beta(\theta)=\frac{1}{\beta(\beta+1)}\Big[\,\sum_i f_\theta(y_i)^{\beta+1} - (\beta+1)\sum_i f_\theta(y_i)^\beta\Big],$$
  where $f_\theta$ is the conditional density implied by the NN output and β > 0 controls the degree of robustness.
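As a concrete sketch, the loss can be evaluated in closed form under a Gaussian error model, following the standard density‑power‑divergence objective of Basu et al.; the function name, Gaussian assumption, and scaling are illustrative, not the paper's exact implementation:

```python
import numpy as np

def beta_divergence_loss(y, mu, sigma, beta):
    """Density-power-divergence (beta-divergence) loss under a Gaussian
    error model N(mu, sigma^2). As beta -> 0 this recovers the negative
    log-likelihood (standard MLE) up to additive/multiplicative constants."""
    var = sigma ** 2
    # Gaussian conditional density evaluated at the observations
    f = np.exp(-0.5 * (y - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    # Closed form of the integral term  ∫ f^(1+beta) dy  for a Gaussian
    integral = 1.0 / (np.sqrt(1.0 + beta) * (2 * np.pi * var) ** (beta / 2))
    # DPD objective: integral term minus (1 + 1/beta) * f(y_i)^beta
    return np.mean(integral - (1.0 + 1.0 / beta) * f ** beta)
```

Because outliers enter only through the bounded term $f(y_i)^\beta$, a grossly wrong observation contributes almost nothing to the loss, unlike the unbounded squared error.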
- Alternating optimization – The loss is not jointly convex in the network weights and the auxiliary scaling variables introduced by the β‑divergence, so the authors split the problem into two blocks:
  - Weight update: a gradient‑based step on the NN parameters.
  - Auxiliary variable update: a closed‑form solution derived from the β‑divergence.
  Alternating these two steps monotonically decreases the objective and converges to a stationary point.
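The two-block scheme can be illustrated on a one-parameter linear model. The per-sample weights $f^\beta$ and the weighted scale update below are a common re-weighting reading of the β-divergence, not the paper's exact derivation:

```python
import numpy as np

def alternating_fit(x, y, beta=0.3, lr=0.05, epochs=200):
    """Illustrative two-block scheme for a linear model y ≈ w*x.
    Block 1: gradient step on the model parameter w under per-sample
    weights f^beta (outliers receive near-zero weight).
    Block 2: closed-form weighted re-estimate of the noise scale."""
    w, sigma = 0.0, 1.0
    for _ in range(epochs):
        r = y - w * x
        f = np.exp(-0.5 * r**2 / sigma**2) / np.sqrt(2 * np.pi * sigma**2)
        wt = f ** beta                       # robustness weights
        # Block 1: weighted gradient step on w
        grad = -np.sum(wt * r * x) / np.sum(wt)
        w -= lr * grad
        # Block 2: closed-form weighted update of the scale
        sigma = np.sqrt(np.sum(wt * r**2) / np.sum(wt))
    return w, sigma
```

With a gross outlier in the data, the fitted slope stays close to the true value because the outlier's weight underflows to zero after the first few iterations.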
- Robustness analysis – Using classical influence‑function calculus, the paper shows that for a properly chosen β the derivative of the estimator with respect to an infinitesimal contamination is bounded. This translates into an asymptotic breakdown point of 50 %: the estimator tolerates up to half the data being arbitrarily corrupted before it breaks down.
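The bounded-influence property can be checked numerically on a simple location problem; the iterative re-weighting below is a generic minimum-β-divergence estimator, used here only to illustrate the contrast with the sample mean:

```python
import numpy as np

def dpd_location(y, sigma=1.0, beta=0.3, iters=50):
    """Minimum beta-divergence location estimate via iterative
    re-weighting with weights f^beta. A single gross outlier barely
    moves the estimate, whereas the sample mean shifts arbitrarily."""
    mu = np.median(y)  # robust starting point
    for _ in range(iters):
        f = np.exp(-0.5 * (y - mu)**2 / sigma**2) / np.sqrt(2 * np.pi * sigma**2)
        w = f ** beta
        mu = np.sum(w * y) / np.sum(w)
    return mu
```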
- Implementation details – The authors provide a lightweight PyTorch‑compatible module that wraps any existing regression NN architecture. The only new hyper‑parameter is β (typically chosen in the range 0.1–0.5).
Results & Findings
| Experiment | Baseline (MSE) | Huber loss | rRNet (β=0.3) | Relative error reduction |
|---|---|---|---|---|
| Synthetic 1‑D regression with 30 % outliers | 1.42 | 1.08 | 0.71 | 50 % |
| UCI Boston Housing (10 % label noise) | 3.12 | 2.87 | 2.31 | 26 % |
| Time‑series demand forecasting (real‑world, sensor glitches) | 5.6 % MAPE | 5.1 % | 4.2 % | 25 % |
- Convergence: The alternating scheme reaches a stationary point within 30–50 epochs for typical network sizes, comparable to standard SGD on MSE.
- Influence functions: Empirically measured sensitivities match the theoretical bounded curves, confirming the robustness claim.
- Ablation on β: Smaller β (≈0.1) behaves like MLE (high variance under contamination); larger β (≈0.7) can under‑weight legitimate data, slightly increasing bias. The sweet spot around 0.3–0.5 works well across tasks.
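The ablation behaviour is visible directly in the effective per-sample weight $f^\beta$; the Gaussian form below is an illustrative assumption, not taken from the paper:

```python
import numpy as np

def dpd_weight(residual, sigma=1.0, beta=0.3):
    """Effective per-sample weight f(y)^beta under a Gaussian error model.
    beta near 0 gives weight ~1 for every point (MLE behaviour); larger
    beta shrinks the weight of large-residual points toward zero."""
    f = np.exp(-0.5 * residual**2 / sigma**2) / np.sqrt(2 * np.pi * sigma**2)
    return f ** beta
```

For a 5σ residual, β = 0.1 still keeps a noticeable weight while β = 0.7 suppresses the point almost entirely, matching the reported bias/variance trade-off.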
Practical Implications
- Outlier‑prone pipelines – Data‑driven services (e.g., sensor analytics, financial forecasting, A/B test result modeling) can swap their loss function for rRNet and gain automatic protection against corrupted entries without hand‑crafting data‑cleaning rules.
- Minimal code changes – Because rRNet is a drop‑in replacement for the loss term, existing PyTorch/TensorFlow models need only a single import and a β hyper‑parameter.
- Safety‑critical ML – In domains where a single bad observation could trigger catastrophic decisions (autonomous driving perception, medical dosage prediction), the 50 % breakdown guarantee offers a formal safety margin that most current NN training regimes lack.
- Model‑agnostic robustness – The framework works with ReLU, leaky‑ReLU, tanh, or even piecewise‑linear activations, and does not require smooth error density assumptions, making it suitable for modern deep regression architectures (e.g., residual nets, transformer‑based regressors).
Limitations & Future Work
- Local optimality – The convergence proof guarantees reaching a stationary point, not a global optimum; like any non‑convex NN training, the final solution can depend on initialization.
- Choice of β – While the authors provide theoretical guidance based on the assumed error density, selecting β in practice still involves a modest validation sweep.
- Scalability to massive datasets – The alternating scheme introduces an extra per‑batch update step; the overhead is modest for medium‑scale data but could become noticeable for billions of samples.
- Extension to classification – The current theory is limited to regression with continuous outputs; adapting β‑divergence robustness to classification (e.g., softmax outputs) is an open direction.
Overall, rRNet offers a theoretically grounded, easy‑to‑integrate tool for making regression neural networks resilient to noisy, adversarial, or simply messy data—a frequent pain point for developers building real‑world ML systems.
Authors
- Abhik Ghosh
- Suryasis Jana
Paper Information
- arXiv ID: 2602.08933v1
- Categories: stat.ML, cs.LG, cs.NE, stat.ME
- Published: February 9, 2026