[Paper] Sample Complexity Bounds for Robust Mean Estimation with Mean-Shift Contamination

Published: February 25, 2026 at 12:21 PM EST
5 min read

Source: arXiv - 2602.22130v1

Overview

This paper tackles a fundamental problem for data‑driven systems: estimating the average (mean) of a distribution when a small fraction of the collected data has been deliberately corrupted. Unlike the classic Huber contamination model, the mean‑shift model lets an adversary replace clean samples with points drawn from the same underlying distribution but shifted by an arbitrary vector, which may differ from point to point. The authors close a long‑standing open question by characterizing exactly how many samples are needed to recover the true mean for a broad class of distributions, using tools from Fourier analysis.

Key Contributions

  • General Sample‑Complexity Bounds: Tight upper and lower bounds on the number of samples required for robust mean estimation under mean‑shift contamination for any distribution whose characteristic function satisfies mild spectral conditions.
  • Fourier‑Witness Technique: A novel “Fourier witness” construct that captures how contamination manifests in the frequency domain, enabling both algorithm design and lower‑bound proofs.
  • Algorithmic Solution: A computationally efficient estimator that achieves the optimal sample complexity and works for multivariate distributions.
  • Bridging Theory Gaps: Extends prior results that were limited to Gaussian and Laplace families, showing that consistent estimation is possible far beyond those special cases.

Methodology

  1. Model Formalization:

    • Clean data: i.i.d. draws from a base distribution \(P\) with unknown mean \(\mu\).
    • Contamination: An adversary may replace a constant fraction \(\epsilon\) of the samples with draws from \(P\) shifted by arbitrary vectors (a different shift for each corrupted point).
  2. Spectral Condition on the Base Distribution:

    • The characteristic function \(\phi_P(t)=\mathbb{E}[e^{i t^\top X}]\) must not vanish too quickly; formally, \(|\phi_P(t)|\) stays bounded away from zero on a ball of radius proportional to \(1/\sqrt{\epsilon}\). This condition holds for most “well‑behaved” distributions (e.g., sub‑Gaussian, bounded‑support, many heavy‑tailed families).
  3. Fourier Witness Construction:

    • By taking the empirical characteristic function of the observed data, the algorithm isolates a frequency \(t^*\) where the contamination’s effect is maximally detectable.
    • The “witness” is essentially the phase shift induced by the corrupted points at \(t^*\), which can be estimated reliably from a modest number of samples.
  4. Estimator Design:

    • Use the identified witness to solve a linear system that recovers the unknown shift vectors’ aggregate effect, then subtract this bias from the empirical mean.
    • The procedure runs in polynomial time (dominated by FFT‑style operations on the empirical characteristic function).
  5. Lower‑Bound Argument:

    • Construct a family of hard instances where any estimator must distinguish between two means that differ by a small amount, yet the contaminated samples can mask this difference.
    • By analyzing the information loss in the Fourier domain, the authors prove that any algorithm needs at least the same order of samples as their upper bound.
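The phase-based intuition behind steps 3–4 can be illustrated with a minimal one-dimensional sketch. This is not the authors' algorithm: it assumes a symmetric (Gaussian) base distribution and a hand-picked probe frequency `t`, both illustrative choices. It shows why the phase of the empirical characteristic function resists mean-shift contamination while the naive average does not.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (not from the paper): 1-D Gaussian base distribution.
n, eps, true_mean, shift = 20_000, 0.10, 2.0, 50.0

x = rng.normal(true_mean, 1.0, size=n)
k = int(eps * n)
x[:k] += shift                      # adversary mean-shifts an eps-fraction

naive = x.mean()                    # biased by roughly eps * shift = 5

# For a symmetric base distribution, phi(t) = e^{i t mu} * phi_0(t) with
# phi_0(t) real, so the phase of the empirical characteristic function at a
# small frequency t recovers mu.  The shifted points can perturb that phase
# by at most arcsin(eps / (1 - eps)), no matter how large the shift is.
t = 0.5
phi_hat = np.exp(1j * t * x).mean()
cf_est = np.angle(phi_hat) / t      # valid while |t * mu| < pi

print(f"naive mean:    {naive:.2f}")
print(f"CF-phase mean: {cf_est:.2f}")
```

Here the naive mean lands near 7 while the phase-based estimate stays close to the true mean of 2; the full algorithm in the paper additionally searches for a good frequency and corrects the residual bias.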

Results & Findings

| Quantity | Upper Bound (Algorithm) | Lower Bound (Information‑theoretic) |
| --- | --- | --- |
| Sample complexity to achieve error \(\delta\) | \(O\!\left(\frac{d}{\epsilon^2}\log\frac{1}{\delta}\right)\) (up to polylog factors) | \(\Omega\!\left(\frac{d}{\epsilon^2}\log\frac{1}{\delta}\right)\) |
| Runtime | \(\tilde{O}(nd + d^2)\) (near‑linear in data size) | |
  • Interpretation: For a constant contamination level \(\epsilon\), the number of required samples scales linearly with the dimension \(d\) and inversely with \(\epsilon^2\), matching the intuition from classical robust statistics but now valid for a far richer class of distributions.
  • Robustness: The estimator remains consistent (error \(\to 0\) as \(n \to \infty\)) even when the adversary chooses the worst possible shift vectors, a property that fails under Huber’s model for many distributions.
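As a back-of-envelope reading of the bound, the sample count can be evaluated numerically. Constants and polylogarithmic factors are omitted, so the figures below are order-of-magnitude only, not predictions from the paper:

```python
import math

def sample_complexity(d: int, eps: float, delta: float) -> int:
    """Order-of-magnitude n from the d / eps^2 * log(1/delta) scaling,
    with all constants and polylog factors dropped."""
    return math.ceil(d / eps**2 * math.log(1.0 / delta))

n1 = sample_complexity(100, 0.10, 0.01)  # d=100, 10% contamination
n2 = sample_complexity(100, 0.05, 0.01)  # halving eps quadruples n
print(n1, n2)
```

The second call illustrates the \(1/\epsilon^2\) dependence: halving the contamination tolerance roughly quadruples the required sample size.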

Practical Implications

  • Data Cleaning Pipelines: Engineers can embed the Fourier‑witness estimator as a drop‑in replacement for naïve averaging when ingesting sensor streams, logs, or user‑generated data that may contain systematic drifts or malicious injections.
  • Federated Learning & Edge Analytics: In distributed settings where a few devices may be compromised (e.g., sending shifted gradients), the algorithm provides a lightweight, provably optimal way to aggregate updates without heavy cryptographic overhead.
  • Anomaly Detection: The frequency‑domain witness can double as a diagnostic signal: unusually large witness magnitudes flag batches with suspicious contamination, enabling automated alerts.
  • Scalable Implementation: Because the core operations are FFT‑like, the method can be parallelized on GPUs or distributed clusters, making it suitable for large‑scale telemetry or log‑analysis pipelines.
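The anomaly-detection idea can be sketched as a batch-level deviation score on the empirical characteristic function. The probe frequency, reference batch, and flagging setup below are illustrative assumptions, not details from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
t = 0.5  # probe frequency (illustrative choice)

def ecf(batch: np.ndarray, t: float) -> complex:
    """Empirical characteristic function of a 1-D batch at frequency t."""
    return np.exp(1j * t * batch).mean()

# Reference profile built from data assumed clean.
reference = rng.normal(0.0, 1.0, 50_000)
phi_ref = ecf(reference, t)

clean = rng.normal(0.0, 1.0, 5_000)
tainted = rng.normal(0.0, 1.0, 5_000)
tainted[:500] += 40.0   # 10% of the batch is mean-shifted

scores = {}
for name, batch in [("clean", clean), ("tainted", tainted)]:
    scores[name] = abs(ecf(batch, t) - phi_ref)  # witness-style deviation
    print(name, round(scores[name], 3))
```

A batch whose score exceeds the sampling-noise level of clean batches can be flagged for inspection before it ever reaches the mean estimator.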

Limitations & Future Work

  • Spectral Condition Requirement: Guarantees hinge on the characteristic function not vanishing too quickly. While this holds for many common distributions, exotic heavy‑tailed or multimodal data may violate the condition, requiring alternative analysis.
  • Constant‑Factor Gaps: The theoretical bounds hide polylogarithmic factors; tightening these constants could improve sample efficiency in practice.
  • Extension to Other Robust Tasks: The Fourier‑witness concept may be adaptable to robust covariance estimation, clustering, or regression under mean‑shift contamination—an avenue the authors suggest but leave open.
  • Adaptive Contamination Levels: The current analysis assumes a known contamination fraction \(\epsilon\). Developing estimators that adapt to unknown or varying \(\epsilon\) would broaden applicability.

Bottom line: By marrying Fourier analysis with robust statistics, this work delivers a practically implementable, theoretically optimal solution for mean estimation when data can be arbitrarily shifted—a scenario increasingly common in modern, adversarial‑prone data pipelines.

Authors

  • Ilias Diakonikolas
  • Giannis Iakovidis
  • Daniel M. Kane
  • Sihan Liu

Paper Information

  • arXiv ID: 2602.22130v1
  • Categories: cs.LG, cs.DS
  • Published: February 25, 2026