[Paper] Sample Complexity Bounds for Robust Mean Estimation with Mean-Shift Contamination

Published: February 25, 2026 at 12:21 PM EST
5 min read

Source: arXiv - 2602.22130v1

Overview

This paper tackles a fundamental problem for data‑driven systems: estimating the average (mean) of a distribution when a small fraction of the collected data has been deliberately corrupted. Unlike the classic Huber contamination model, the mean‑shift model lets an adversary replace clean samples with points drawn from the same underlying distribution but shifted by an arbitrary vector, which may differ from point to point. The authors close a long‑standing open question by characterizing exactly how many samples are needed to recover the true mean for a broad class of distributions, using tools from Fourier analysis.

Key Contributions

  • General Sample‑Complexity Bounds: Tight upper and lower bounds on the number of samples required for robust mean estimation under mean‑shift contamination for any distribution whose characteristic function satisfies mild spectral conditions.
  • Fourier‑Witness Technique: A novel “Fourier witness” construct that captures how contamination manifests in the frequency domain, enabling both algorithm design and lower‑bound proofs.
  • Algorithmic Solution: A computationally efficient estimator that achieves the optimal sample complexity and works for multivariate distributions.
  • Bridging Theory Gaps: Extends prior results that were limited to Gaussian and Laplace families, showing that consistent estimation is possible far beyond those special cases.

Methodology

  1. Model Formalization:

    • Clean data: i.i.d. draws from a base distribution \(P\) with unknown mean \(\mu\).
    • Contamination: An adversary may replace a constant fraction \(\epsilon\) of the samples with draws from \(P\) shifted by arbitrary vectors (a different shift for each corrupted point).
  2. Spectral Condition on the Base Distribution:

    • The characteristic function \(\phi_P(t)=\mathbb{E}[e^{i t^\top X}]\) must not vanish too quickly; formally, \(|\phi_P(t)|\) stays bounded away from zero on a ball of radius proportional to \(1/\sqrt{\epsilon}\). This condition holds for most “well‑behaved” distributions (e.g., sub‑Gaussian, bounded‑support, many heavy‑tailed families).
  3. Fourier Witness Construction:

    • By taking the empirical characteristic function of the observed data, the algorithm isolates a frequency \(t^*\) where the contamination’s effect is maximally detectable.
    • The “witness” is essentially the phase shift induced by the corrupted points at \(t^*\), which can be estimated reliably from a modest number of samples.
  4. Estimator Design:

    • Use the identified witness to solve a linear system that recovers the unknown shift vectors’ aggregate effect, then subtract this bias from the empirical mean.
    • The procedure runs in polynomial time (dominated by FFT‑style operations on the empirical characteristic function).
  5. Lower‑Bound Argument:

    • Construct a family of hard instances where any estimator must distinguish between two means that differ by a small amount, yet the contaminated samples can mask this difference.
    • By analyzing the information loss in the Fourier domain, the authors prove that any algorithm needs at least the same order of samples as their upper bound.
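The phase-based intuition behind steps 3–4 can be illustrated with a minimal one-dimensional sketch. This is not the authors' algorithm: it assumes a symmetric (Gaussian) base distribution and a hand-picked probe frequency `t`, both illustrative choices. It shows why the phase of the empirical characteristic function resists mean-shift contamination while the naive average does not.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (not from the paper): 1-D Gaussian base distribution.
n, eps, true_mean, shift = 20_000, 0.10, 2.0, 50.0

x = rng.normal(true_mean, 1.0, size=n)
k = int(eps * n)
x[:k] += shift                      # adversary mean-shifts an eps-fraction

naive = x.mean()                    # biased by roughly eps * shift = 5

# For a symmetric base distribution, phi(t) = e^{i t mu} * phi_0(t) with
# phi_0(t) real, so the phase of the empirical characteristic function at a
# small frequency t recovers mu.  The shifted points can perturb that phase
# by at most arcsin(eps / (1 - eps)), no matter how large the shift is.
t = 0.5
phi_hat = np.exp(1j * t * x).mean()
cf_est = np.angle(phi_hat) / t      # valid while |t * mu| < pi

print(f"naive mean:    {naive:.2f}")
print(f"CF-phase mean: {cf_est:.2f}")
```

Here the naive mean lands near 7 while the phase-based estimate stays close to the true mean of 2; the full algorithm in the paper additionally searches for a good frequency and corrects the residual bias.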

Results & Findings

| Quantity | Upper Bound (Algorithm) | Lower Bound (Information‑theoretic) |
| --- | --- | --- |
| Sample complexity to achieve error \(\delta\) | \(O\!\left(\frac{d}{\epsilon^2}\log\frac{1}{\delta}\right)\) (up to polylog factors) | \(\Omega\!\left(\frac{d}{\epsilon^2}\log\frac{1}{\delta}\right)\) |
| Runtime | \(\tilde{O}(nd + d^2)\) (near‑linear in data size) | |
  • Interpretation: For a constant contamination level \(\epsilon\), the number of required samples scales linearly with the dimension \(d\) and inversely with \(\epsilon^2\), matching the intuition from classical robust statistics but now valid for a far richer class of distributions.
  • Robustness: The estimator remains consistent (error \(\to 0\) as \(n \to \infty\)) even when the adversary chooses the worst possible shift vectors, a property that fails under Huber’s model for many distributions.
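As a back-of-envelope reading of the bound, the sample count can be evaluated numerically. Constants and polylogarithmic factors are omitted, so the figures below are order-of-magnitude only, not predictions from the paper:

```python
import math

def sample_complexity(d: int, eps: float, delta: float) -> int:
    """Order-of-magnitude n from the d / eps^2 * log(1/delta) scaling,
    with all constants and polylog factors dropped."""
    return math.ceil(d / eps**2 * math.log(1.0 / delta))

n1 = sample_complexity(100, 0.10, 0.01)  # d=100, 10% contamination
n2 = sample_complexity(100, 0.05, 0.01)  # halving eps quadruples n
print(n1, n2)
```

The second call illustrates the \(1/\epsilon^2\) dependence: halving the contamination tolerance roughly quadruples the required sample size.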

Practical Implications

  • Data Cleaning Pipelines: Engineers can embed the Fourier‑witness estimator as a drop‑in replacement for naïve averaging when ingesting sensor streams, logs, or user‑generated data that may contain systematic drifts or malicious injections.
  • Federated Learning & Edge Analytics: In distributed settings where a few devices may be compromised (e.g., sending shifted gradients), the algorithm provides a lightweight, provably optimal way to aggregate updates without heavy cryptographic overhead.
  • Anomaly Detection: The frequency‑domain witness can double as a diagnostic signal: unusually large witness magnitudes flag batches with suspicious contamination, enabling automated alerts.
  • Scalable Implementation: Because the core operations are FFT‑like, the method can be parallelized on GPUs or distributed clusters, making it suitable for large‑scale telemetry or log‑analysis pipelines.
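The anomaly-detection idea can be sketched as a batch-level deviation score on the empirical characteristic function. The probe frequency, reference batch, and flagging setup below are illustrative assumptions, not details from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
t = 0.5  # probe frequency (illustrative choice)

def ecf(batch: np.ndarray, t: float) -> complex:
    """Empirical characteristic function of a 1-D batch at frequency t."""
    return np.exp(1j * t * batch).mean()

# Reference profile built from data assumed clean.
reference = rng.normal(0.0, 1.0, 50_000)
phi_ref = ecf(reference, t)

clean = rng.normal(0.0, 1.0, 5_000)
tainted = rng.normal(0.0, 1.0, 5_000)
tainted[:500] += 40.0   # 10% of the batch is mean-shifted

scores = {}
for name, batch in [("clean", clean), ("tainted", tainted)]:
    scores[name] = abs(ecf(batch, t) - phi_ref)  # witness-style deviation
    print(name, round(scores[name], 3))
```

A batch whose score exceeds the sampling-noise level of clean batches can be flagged for inspection before it ever reaches the mean estimator.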

Limitations & Future Work

  • Spectral Condition Requirement: Guarantees hinge on the characteristic function not vanishing too quickly. While this holds for many common distributions, exotic heavy‑tailed or multimodal data may violate the condition, requiring alternative analysis.
  • Constant‑Factor Gaps: The theoretical bounds hide polylogarithmic factors; tightening these constants could improve sample efficiency in practice.
  • Extension to Other Robust Tasks: The Fourier‑witness concept may be adaptable to robust covariance estimation, clustering, or regression under mean‑shift contamination—an avenue the authors suggest but leave open.
  • Adaptive Contamination Levels: The current analysis assumes a known contamination fraction \(\epsilon\). Developing estimators that adapt to unknown or varying \(\epsilon\) would broaden applicability.

Bottom line: By marrying Fourier analysis with robust statistics, this work delivers a practically implementable, theoretically optimal solution for mean estimation when data can be arbitrarily shifted—a scenario increasingly common in modern, adversarial‑prone data pipelines.

Authors

  • Ilias Diakonikolas
  • Giannis Iakovidis
  • Daniel M. Kane
  • Sihan Liu

Paper Information

  • arXiv ID: 2602.22130v1
  • Categories: cs.LG, cs.DS
  • Published: February 25, 2026