[Paper] High-Dimensional Gaussian Mean Estimation under Realizable Contamination

Published: March 17, 2026 at 01:04 PM EDT
5 min read

Source: arXiv - 2603.16798v1

Overview

This paper tackles a subtle but common problem: estimating the mean of a high‑dimensional Gaussian when some data points are missing in a controlled way. The authors study the realizable ε‑contamination model, where an adversary can hide each sample with a probability up to ε that may depend on the sample itself—an intermediate setting between the “completely at random” and the fully “not at random” missing‑data regimes. Their main finding is that, unlike the information‑theoretic optimum, any algorithm faces a trade‑off: it must either collect many more samples than the statistical optimum or run in time exponential in the dimension, establishing a concrete information‑computation gap.

Key Contributions

  • Formalized the computational hardness of Gaussian mean estimation under realizable contamination using the Statistical Query (SQ) framework.
  • Proved an SQ lower bound that forces a trade‑off: either the sample complexity grows by a factor of roughly $1/\epsilon^2$, or the runtime becomes exponential in the dimension $d$.
  • Showed that the same lower bound extends to low‑degree polynomial and polynomial‑threshold‑function (PTF) tests, covering a broad class of algorithmic techniques.
  • Designed a near‑optimal algorithm that matches the lower bound up to polylogarithmic factors, offering a concrete sample‑time trade‑off curve.
  • Provided the first qualitative characterization of the computational landscape for this intermediate missing‑data model, complementing earlier information‑theoretic results that required exponential time.
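To make the SQ framework concrete, here is a toy simulation of a statistical query oracle: an algorithm in this model only sees noisy expectations $E[f(X)]$ up to a tolerance τ, never raw samples. The function name and the uniform noise model below are illustrative conveniences, not constructions from the paper (an SQ adversary may perturb answers arbitrarily within ±τ):

```python
import numpy as np

def sq_oracle(samples, f, tau, rng=None):
    """Simulate a statistical query oracle: return E[f(X)] up to tolerance tau.

    In the SQ model, algorithms access data only through such noisy
    expectation queries; the lower bound counts how many queries (and how
    small a tolerance) any algorithm needs.
    """
    rng = np.random.default_rng() if rng is None else rng
    empirical = np.mean([f(x) for x in samples])
    # The model allows adversarial noise within +/- tau; modeled here as
    # uniform noise purely for illustration.
    return empirical + rng.uniform(-tau, tau)

# Example: query the first coordinate's mean of 3-D Gaussian samples.
rng = np.random.default_rng(0)
data = rng.normal(loc=1.0, size=(5000, 3))
est = sq_oracle(data, lambda x: x[0], tau=0.01, rng=rng)
```

The answer `est` lands near the true mean 1.0, but any SQ algorithm must tolerate the worst-case ±τ slack, which is exactly the lever the hardness construction exploits.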

Methodology

  1. Statistical Query Reduction – The authors recast the estimation problem as a set of statistical queries (expectations of bounded functions) and then apply known SQ lower‑bound techniques (e.g., constructing hard distributions that are indistinguishable under a limited number of queries).
  2. Hard Instance Construction – They build a family of Gaussian mixtures where the contaminating mechanism subtly skews the observed data, making it impossible for any low‑degree polynomial or SQ algorithm to separate the true mean without many queries.
  3. Algorithmic Upper Bound – Using a filter‑based approach, the algorithm iteratively discards samples that appear inconsistent with the current mean estimate, while carefully controlling the probability of discarding good points. The filter is implemented via a series of statistical queries, yielding a runtime that scales polynomially in the number of queries and the dimension.
  4. Trade‑off Analysis – By varying the filter’s aggressiveness, the authors derive a continuum of algorithms that interpolate between low sample complexity (but high runtime) and high sample complexity (but polynomial runtime), matching the SQ lower bound up to logarithmic factors.
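A minimal sketch of the filter idea in steps 3–4, assuming identity covariance and a crude eigenvalue-based stopping rule; the threshold `1 + 5 * eps` and the quantile-based trim are simplifying choices of this sketch, not the paper's actual filter:

```python
import numpy as np

def filtered_mean(X, eps, max_iters=20):
    """Sketch of a filter-style robust mean estimator.

    Iteratively removes the samples farthest from the current estimate
    along the direction of largest empirical variance, mimicking the
    filter step described above (simplified illustration only).
    """
    X = np.asarray(X, dtype=float)
    for _ in range(max_iters):
        mu = X.mean(axis=0)
        centered = X - mu
        # Top eigenpair of the empirical covariance.
        cov = centered.T @ centered / len(X)
        eigvals, eigvecs = np.linalg.eigh(cov)
        top_var, v = eigvals[-1], eigvecs[:, -1]
        # For an identity-covariance Gaussian the top variance should be
        # close to 1; if it is inflated, contamination is skewing the data,
        # so trim the most extreme points along v.
        if top_var <= 1 + 5 * eps:
            break
        scores = np.abs(centered @ v)
        X = X[scores <= np.quantile(scores, 1 - eps)]
    return X.mean(axis=0)

# Usage: clean N(0, I) samples in d=10 plus a 5% cluster of planted outliers.
rng = np.random.default_rng(1)
d, n, eps = 10, 2000, 0.05
clean = rng.normal(size=(n, d))
outliers = rng.normal(loc=5.0, size=(int(eps * n), d))
mu_hat = filtered_mean(np.vstack([clean, outliers]), eps)
```

The point of the design is that each filter round can itself be expressed as a handful of statistical queries (mean, top variance, quantile), which is what keeps the runtime polynomial in the query count and the dimension.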

Results & Findings

| Metric | Information‑theoretic optimum | SQ lower bound (hardness) | Achievable algorithm |
| --- | --- | --- | --- |
| Sample complexity | $O\!\left(\frac{d}{\epsilon^2}\right)$ | $\Omega\!\left(\frac{d}{\epsilon^2}\right)$, or exponential time | $\tilde{O}\!\left(\frac{d}{\epsilon^2} \cdot \mathrm{poly}(1/\epsilon)\right)$ in polynomial time |
| Runtime | Exponential in $d$ (previous work) | Must be exponential if the sample size stays at the optimum | Near‑optimal trade‑off: $\text{time} = \tilde{O}\!\left( \exp\!\big( \Theta(\epsilon^2 n / d) \big) \right)$ for sample budget $n$ |

In plain terms, achieving the statistically optimal number of samples is impossible in polynomial time: you must either accept a polynomial‑time algorithm that uses significantly more samples (by a factor of roughly $1/\epsilon^2$) or resort to exponential‑time methods. The proposed algorithm lets practitioners pick the point on this curve that best fits their constraints.
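Ignoring constants and logarithmic factors, the gap described above can be made concrete with a back‑of‑the‑envelope calculation; the specific values of d and ε below are arbitrary:

```python
# Rough, constants-free illustration of the sample-complexity gap:
# the statistical optimum is n ~ d / eps^2, while per the summary above a
# polynomial-time algorithm needs roughly a 1/eps^2 factor more samples.
d, eps = 100, 0.01

n_optimal = d / eps**2    # information-theoretic optimum (up to constants)
n_polytime = d / eps**4   # rough poly-time requirement (optimum * 1/eps^2)

print(f"optimal n    ~ {n_optimal:.0e}")                    # ~ 1e+06
print(f"poly-time n  ~ {n_polytime:.0e}")                   # ~ 1e+10
print(f"inflation    ~ {n_polytime / n_optimal:.0e}")       # 1/eps^2 ~ 1e+04
```

Even at a moderate contamination rate, the computational price of staying polynomial‑time is a four‑orders‑of‑magnitude increase in data.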

Practical Implications

  • Data‑pipeline design – When building systems that ingest high‑dimensional sensor or telemetry data with occasional drop‑outs, engineers now have a clear guideline: if missingness can depend on the data (e.g., packet loss correlated with signal strength), achieving statistically optimal estimates will be computationally expensive.
  • Robust analytics libraries – The filter‑based algorithm can be incorporated into ML preprocessing tools (e.g., scikit‑learn, TensorFlow Data Validation) as a “robust mean estimator” that automatically balances sample size vs. runtime based on a user‑specified ε.
  • Resource budgeting – Cloud services that charge per compute cycle can use the derived trade‑off to decide whether to allocate extra compute (to keep sample size low) or to collect more data points (to stay within budget).
  • Security & adversarial settings – The model captures an adversary that selectively hides data. Understanding the computational limits helps security engineers assess how much extra data they need to counteract potential data‑poisoning attacks that exploit missingness.

Overall, the work warns developers that seemingly clean statistical guarantees can carry substantial computational costs when data loss is not completely at random.

Limitations & Future Work

  • The lower bounds are proved in the Statistical Query model; while this covers many practical algorithms, it does not rule out exotic non‑SQ methods that might circumvent the gap.
  • The analysis assumes identity covariance; extending the results to unknown or anisotropic covariances remains open.
  • The constants hidden in the proposed algorithm’s $\tilde{O}$ notation can be large for very small ε, so practical tuning may be required.
  • Future research could explore adaptive sampling strategies, tighter lower bounds for specific algorithm families, or empirical evaluations on real‑world missing‑data workloads.

Authors

  • Ilias Diakonikolas
  • Daniel M. Kane
  • Thanasis Pittas

Paper Information

  • arXiv ID: 2603.16798v1
  • Categories: cs.LG, cs.DS, math.ST, stat.ML
  • Published: March 17, 2026
  • PDF: Download PDF
