[Paper] Mean Estimation from Coarse Data: Characterizations and Efficient Algorithms

Published: February 26, 2026 at 01:47 PM EST
5 min read
Source: arXiv


Overview

This paper tackles a surprisingly common problem: estimating the mean of a multivariate Gaussian when you never see the raw data points, only coarse “buckets” that tell you which region of space each point fell into. Think of sensor readings that are rounded to the nearest grid cell, or financial data that’s reported in ranges rather than exact values. The authors give a complete picture of when the true mean can be uniquely recovered under convex partitions, and they provide polynomial‑time algorithms that achieve optimal statistical accuracy in those cases.

Key Contributions

  • Identifiability Characterization: Precise necessary and sufficient conditions (in terms of the geometry of the convex partition) that guarantee the Gaussian mean is uniquely determined from coarse observations.
  • Efficient Estimation Algorithms: Two polynomial‑time procedures—one based on convex optimization and another on a carefully designed iterative refinement—that achieve the minimax optimal sample complexity under the identified conditions.
  • Hardness Boundary: Formal proof that once the convexity requirement is dropped, the mean‑estimation problem becomes NP‑hard, sharpening earlier results.
  • Robustness Guarantees: Extensions showing the algorithms tolerate a modest amount of adversarial noise and mild misspecification of the partition.
  • Empirical Validation: Synthetic experiments confirming that the proposed methods match theoretical predictions and dramatically outperform naïve baselines (e.g., treating bucket centroids as raw samples).

Methodology

  1. Problem Formalization:

    • True data: $x \sim \mathcal{N}(\mu, I_d)$.
    • The space $\mathbb{R}^d$ is partitioned into convex cells $\{C_1, \dots, C_m\}$.
    • For each draw we only observe the index $i$ such that $x \in C_i$.
  2. Identifiability Analysis:

    • The authors study the mapping $\mu \mapsto \{\Pr[x \in C_i]\}_{i=1}^m$.
    • Using tools from convex geometry and properties of Gaussian measures, they prove that the Jacobian of this mapping has full rank iff the convex cells satisfy a “strict separation” condition (no cell is a convex combination of others in the sense of Gaussian moments).
  3. Algorithmic Design:

    • Maximum Likelihood via Convex Programming: The log‑likelihood of the coarse data is a concave function of $\mu$ under the identified conditions, allowing a straightforward gradient‑based optimizer.
    • Iterative Moment Matching: Starting from an arbitrary guess, the algorithm repeatedly solves a linear program that aligns the predicted bucket probabilities with the empirical frequencies, shrinking the error geometrically.
  4. Statistical Analysis:

    • Leveraging concentration of measure for Gaussians, they bound the deviation between empirical bucket frequencies and their expectations, yielding a sample complexity of
      $$O\!\left(\frac{d \log m}{\varepsilon^2}\right)$$
      to achieve $\|\hat\mu - \mu\|_2 \le \varepsilon$.
  5. Hardness Proof:

    • By reduction from Exact Set Cover, they construct non‑convex partitions in which any accurate estimator would implicitly solve an NP‑hard decision problem, establishing the computational barrier.
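As a concrete (and heavily simplified) illustration of the maximum‑likelihood route, the sketch below estimates a one‑dimensional Gaussian mean from interval buckets, where every cell is trivially convex. The cell edges, sample size, and the use of `scipy.optimize.minimize_scalar` are illustrative choices, not the paper's actual implementation:

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)

# Hypothetical 1-D instance: the cells are intervals, which are trivially convex.
inner_edges = np.array([-1.0, 0.0, 1.0])
edges = np.concatenate(([-np.inf], inner_edges, [np.inf]))  # 4 cells
mu_true = 0.3
n = 20_000

x = rng.normal(mu_true, 1.0, size=n)
idx = np.searchsorted(inner_edges, x)        # only the cell index is observed
counts = np.bincount(idx, minlength=len(edges) - 1)

def neg_log_lik(mu):
    # Pr[x in C_i] = Phi(b_i - mu) - Phi(a_i - mu); the log-likelihood is concave in mu.
    probs = np.diff(norm.cdf(edges - mu))
    return -np.sum(counts * np.log(probs))

mu_hat = minimize_scalar(neg_log_lik, bounds=(-5.0, 5.0), method="bounded").x
print(f"true mean {mu_true}, coarse-data MLE {mu_hat:.3f}")
```

Even though the raw samples are discarded, the bucket counts pin down the mean because the bucket probabilities respond injectively to shifts in $\mu$ under the identifiability condition.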

Results & Findings

| Setting | Sample Complexity | Runtime | Accuracy ($\|\hat\mu - \mu\|_2$) |
| --- | --- | --- | --- |
| Convex partitions satisfying identifiability | $O\bigl(\frac{d \log m}{\varepsilon^2}\bigr)$ | Polynomial (≈ $O(md^2)$ per iteration) | $\varepsilon$ with high probability |
| Non‑convex partitions (hard case) | | NP‑hard (no poly‑time guarantee) | |
| Noisy bucket labels (≤ 5 % adversarial) | Same order, with an extra $\log(1/\delta)$ factor | Still polynomial | Degradation bounded by the noise level |

Empirically, the convex‑programming estimator converges in < 30 iterations for dimensions up to 100 and partitions with thousands of cells, matching the theoretical error curves.
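This convergence behavior can be mimicked in one dimension with a Gauss–Newton step that aligns predicted bucket probabilities with empirical frequencies; the toy loop below is our own simplification, not the paper's linear‑program‑based procedure, but it too settles well within 30 iterations:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
inner_edges = np.array([-1.0, 0.0, 1.0])
edges = np.concatenate(([-np.inf], inner_edges, [np.inf]))  # 4 interval cells
mu_true = 0.3

x = rng.normal(mu_true, 1.0, size=20_000)
f_hat = np.bincount(np.searchsorted(inner_edges, x), minlength=4) / len(x)

mu = 0.0                                     # arbitrary initial guess
for _ in range(30):
    p = np.diff(norm.cdf(edges - mu))        # predicted bucket probabilities
    J = -np.diff(norm.pdf(edges - mu))       # dp_i/dmu
    mu += J @ (f_hat - p) / (J @ J)          # least-squares (Gauss-Newton) step
print(f"estimate after 30 iterations: {mu:.3f}")
```

Each step shrinks the gap between predicted and observed frequencies, mirroring the geometric error decay the authors prove for their refinement scheme.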

Practical Implications

  • Sensor Networks & IoT: Devices that report quantized measurements (e.g., temperature ranges) can now feed their coarse data into a central estimator that recovers the underlying mean temperature with provable guarantees, eliminating the need for costly high‑resolution hardware.
  • Privacy‑Preserving Analytics: Coarse bucketing is a common primitive for differential privacy. This work shows that, under convex bucketing schemes, analysts can still obtain accurate population means without compromising privacy budgets.
  • Financial & Economic Modeling: When macro‑economic indicators are released in bands (e.g., “inflation between 2‑3 %”), the algorithms enable precise inference of underlying trends, aiding policy simulations.
  • Machine‑Learning Pipelines: Pre‑processing steps that discretize continuous features (e.g., histogram binning) can be treated as coarse observations; the proposed methods allow downstream models to correct for the introduced bias rather than discarding the data.
  • Robust Data Integration: In heterogeneous data‑fusion scenarios where some sources provide exact values and others only ranges, the convex‑optimization framework naturally blends both types of information.
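For the data‑fusion point in particular, blending exact and range‑reported observations amounts to summing their log‑likelihood contributions. A minimal one‑dimensional sketch (the unit‑wide ranges and sample sizes are made up for illustration, not an interface from the paper):

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(2)
mu_true = 1.5
exact = rng.normal(mu_true, 1.0, size=200)     # sources reporting exact values
coarse = rng.normal(mu_true, 1.0, size=2000)
lo = np.floor(coarse)                          # sources reporting unit-wide ranges
hi = lo + 1.0

def neg_log_lik(mu):
    ll_exact = norm.logpdf(exact - mu).sum()                        # exact points
    ll_range = np.log(norm.cdf(hi - mu) - norm.cdf(lo - mu)).sum()  # range reports
    return -(ll_exact + ll_range)

mu_hat = minimize_scalar(neg_log_lik, bounds=(-2.0, 5.0), method="bounded").x
print(f"fused estimate: {mu_hat:.3f}")
```

Exact points contribute Gaussian log‑densities while ranges contribute log bucket probabilities, so both source types inform a single estimate within one objective.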

Limitations & Future Work

  • Assumption of Known Partition: The algorithms require the exact geometry of the convex cells. In practice, partitions may be only approximately known; extending the theory to learned partitions is an open direction.
  • Identity Covariance: The analysis focuses on isotropic Gaussians. Handling unknown or anisotropic covariance matrices would broaden applicability to real‑world sensor noise models.
  • Scalability to Massive Partitions: While polynomial, the runtime scales linearly with the number of cells (m). For ultra‑fine bucketing (e.g., millions of cells), specialized stochastic or distributed solvers are needed.
  • Beyond Gaussian Distributions: Many practical signals are heavy‑tailed or multimodal. Adapting the identifiability criteria and algorithms to other exponential families is a promising research avenue.

Authors

  • Alkis Kalavasis
  • Anay Mehrotra
  • Manolis Zampetakis
  • Felix Zhou
  • Ziyu Zhu

Paper Information

  • arXiv ID: 2602.23341v1
  • Categories: cs.LG, cs.DS, math.ST, stat.ML
  • Published: February 26, 2026
  • PDF: Download PDF
