[Paper] High-Dimensional Gaussian Mean Estimation under Realizable Contamination

Published: March 17, 2026 at 01:04 PM EDT
5 min read

Source: arXiv - 2603.16798v1

Overview

This paper tackles a subtle but common problem: estimating the mean of a high‑dimensional Gaussian when some data points are missing in a controlled way. The authors study the realizable ε‑contamination model, where an adversary can hide each sample with a probability up to ε that may depend on the sample itself—an intermediate setting between the “completely at random” and the fully “not at random” missing‑data regimes. Their main finding is that, unlike the information‑theoretic optimum, any algorithm faces a trade‑off: it must either collect many more samples than the statistical optimum or run in time exponential in the dimension, establishing a concrete information‑computation gap.

Key Contributions

  • Formalized the computational hardness of Gaussian mean estimation under realizable contamination using the Statistical Query (SQ) framework.
  • Proved an SQ lower bound that forces a trade‑off: either the sample complexity grows by a factor of roughly $1/\epsilon^2$, or the runtime becomes exponential in the dimension $d$.
  • Showed that the same lower bound extends to low‑degree polynomial and polynomial‑threshold‑function (PTF) tests, covering a broad class of algorithmic techniques.
  • Designed a near‑optimal algorithm that matches the lower bound up to polylogarithmic factors, offering a concrete sample‑time trade‑off curve.
  • Provided the first qualitative characterization of the computational landscape for this intermediate missing‑data model, complementing earlier information‑theoretic results that required exponential time.
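To make the SQ framework concrete, here is a toy simulation of a statistical query oracle: an algorithm in this model only sees noisy expectations $E[f(X)]$ up to a tolerance τ, never raw samples. The function name and the uniform noise model below are illustrative conveniences, not constructions from the paper (an SQ adversary may perturb answers arbitrarily within ±τ):

```python
import numpy as np

def sq_oracle(samples, f, tau, rng=None):
    """Simulate a statistical query oracle: return E[f(X)] up to tolerance tau.

    In the SQ model, algorithms access data only through such noisy
    expectation queries; the lower bound counts how many queries (and how
    small a tolerance) any algorithm needs.
    """
    rng = np.random.default_rng() if rng is None else rng
    empirical = np.mean([f(x) for x in samples])
    # The model allows adversarial noise within +/- tau; modeled here as
    # uniform noise purely for illustration.
    return empirical + rng.uniform(-tau, tau)

# Example: query the first coordinate's mean of 3-D Gaussian samples.
rng = np.random.default_rng(0)
data = rng.normal(loc=1.0, size=(5000, 3))
est = sq_oracle(data, lambda x: x[0], tau=0.01, rng=rng)
```

The answer `est` lands near the true mean 1.0, but any SQ algorithm must tolerate the worst-case ±τ slack, which is exactly the lever the hardness construction exploits.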

Methodology

  1. Statistical Query Reduction – The authors recast the estimation problem as a set of statistical queries (expectations of bounded functions) and then apply known SQ lower‑bound techniques (e.g., constructing hard distributions that are indistinguishable under a limited number of queries).
  2. Hard Instance Construction – They build a family of Gaussian mixtures where the contaminating mechanism subtly skews the observed data, making it impossible for any low‑degree polynomial or SQ algorithm to separate the true mean without many queries.
  3. Algorithmic Upper Bound – Using a filter‑based approach, the algorithm iteratively discards samples that appear inconsistent with the current mean estimate, while carefully controlling the probability of discarding good points. The filter is implemented via a series of statistical queries, yielding a runtime that scales polynomially in the number of queries and the dimension.
  4. Trade‑off Analysis – By varying the filter’s aggressiveness, the authors derive a continuum of algorithms that interpolate between low sample complexity (but high runtime) and high sample complexity (but polynomial runtime), matching the SQ lower bound up to logarithmic factors.
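A minimal sketch of the filter idea in steps 3–4, assuming identity covariance and a crude eigenvalue-based stopping rule; the threshold `1 + 5 * eps` and the quantile-based trim are simplifying choices of this sketch, not the paper's actual filter:

```python
import numpy as np

def filtered_mean(X, eps, max_iters=20):
    """Sketch of a filter-style robust mean estimator.

    Iteratively removes the samples farthest from the current estimate
    along the direction of largest empirical variance, mimicking the
    filter step described above (simplified illustration only).
    """
    X = np.asarray(X, dtype=float)
    for _ in range(max_iters):
        mu = X.mean(axis=0)
        centered = X - mu
        # Top eigenpair of the empirical covariance.
        cov = centered.T @ centered / len(X)
        eigvals, eigvecs = np.linalg.eigh(cov)
        top_var, v = eigvals[-1], eigvecs[:, -1]
        # For an identity-covariance Gaussian the top variance should be
        # close to 1; if it is inflated, contamination is skewing the data,
        # so trim the most extreme points along v.
        if top_var <= 1 + 5 * eps:
            break
        scores = np.abs(centered @ v)
        X = X[scores <= np.quantile(scores, 1 - eps)]
    return X.mean(axis=0)

# Usage: clean N(0, I) samples in d=10 plus a 5% cluster of planted outliers.
rng = np.random.default_rng(1)
d, n, eps = 10, 2000, 0.05
clean = rng.normal(size=(n, d))
outliers = rng.normal(loc=5.0, size=(int(eps * n), d))
mu_hat = filtered_mean(np.vstack([clean, outliers]), eps)
```

The point of the design is that each filter round can itself be expressed as a handful of statistical queries (mean, top variance, quantile), which is what keeps the runtime polynomial in the query count and the dimension.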

Results & Findings

| Metric | Information‑theoretic optimum | SQ lower bound (hardness) | Achievable algorithm |
| --- | --- | --- | --- |
| Sample complexity | $O\!\left(\frac{d}{\epsilon^2}\right)$ | $\Omega\!\left(\frac{d}{\epsilon^2}\right)$, or exponential time | $\tilde{O}\!\left(\frac{d}{\epsilon^2} \cdot \mathrm{poly}(1/\epsilon)\right)$ in polynomial time |
| Runtime | Exponential in $d$ (previous work) | Must be exponential if the sample size stays at the optimum | Near‑optimal trade‑off: $\text{time} = \tilde{O}\!\left( \exp\!\big( \Theta(\epsilon^2 n / d) \big) \right)$ for sample budget $n$ |

In plain terms, achieving the statistically optimal number of samples is impossible in polynomial time: you must either accept a polynomial‑time algorithm that uses significantly more samples (by a factor of roughly $1/\epsilon^2$) or resort to exponential‑time methods. The proposed algorithm lets practitioners pick the point on this curve that best fits their constraints.
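Ignoring constants and logarithmic factors, the gap described above can be made concrete with a back‑of‑the‑envelope calculation; the specific values of d and ε below are arbitrary:

```python
# Rough, constants-free illustration of the sample-complexity gap:
# the statistical optimum is n ~ d / eps^2, while per the summary above a
# polynomial-time algorithm needs roughly a 1/eps^2 factor more samples.
d, eps = 100, 0.01

n_optimal = d / eps**2    # information-theoretic optimum (up to constants)
n_polytime = d / eps**4   # rough poly-time requirement (optimum * 1/eps^2)

print(f"optimal n    ~ {n_optimal:.0e}")                    # ~ 1e+06
print(f"poly-time n  ~ {n_polytime:.0e}")                   # ~ 1e+10
print(f"inflation    ~ {n_polytime / n_optimal:.0e}")       # 1/eps^2 ~ 1e+04
```

Even at a moderate contamination rate, the computational price of staying polynomial‑time is a four‑orders‑of‑magnitude increase in data.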

Practical Implications

  • Data‑pipeline design – When building systems that ingest high‑dimensional sensor or telemetry data with occasional drop‑outs, engineers now have a clear guideline: if missingness can depend on the data (e.g., packet loss correlated with signal strength), achieving statistically optimal estimates will be computationally expensive.
  • Robust analytics libraries – The filter‑based algorithm can be incorporated into ML preprocessing tools (e.g., scikit‑learn, TensorFlow Data Validation) as a “robust mean estimator” that automatically balances sample size vs. runtime based on a user‑specified ε.
  • Resource budgeting – Cloud services that charge per compute cycle can use the derived trade‑off to decide whether to allocate extra compute (to keep sample size low) or to collect more data points (to stay within budget).
  • Security & adversarial settings – The model captures an adversary that selectively hides data. Understanding the computational limits helps security engineers assess how much extra data they need to counteract potential data‑poisoning attacks that exploit missingness.

Overall, the work warns developers that seemingly clean statistical guarantees can carry substantial computational costs when data loss is not completely at random.

Limitations & Future Work

  • The lower bounds are proved in the Statistical Query model; while this covers many practical algorithms, it does not rule out exotic non‑SQ methods that might circumvent the gap.
  • The analysis assumes identity covariance; extending the results to unknown or anisotropic covariances remains open.
  • The constants hidden in the proposed algorithm’s $\tilde{O}$ notation can be large for very small ε, so practical tuning may be required.
  • Future research could explore adaptive sampling strategies, tighter lower bounds for specific algorithm families, or empirical evaluations on real‑world missing‑data workloads.

Authors

  • Ilias Diakonikolas
  • Daniel M. Kane
  • Thanasis Pittas

Paper Information

  • arXiv ID: 2603.16798v1
  • Categories: cs.LG, cs.DS, math.ST, stat.ML
  • Published: March 17, 2026
  • PDF: Download PDF
