[Paper] Statistical Query Lower Bounds for Smoothed Agnostic Learning

Published: February 24, 2026 at 01:46 PM EST
5 min read
Source: arXiv


Overview

This paper investigates smoothed agnostic learning, a setting where the data distribution is slightly “blurred” by Gaussian noise before the learner tries to fit a model. Focusing on the classic problem of learning half‑spaces (linear classifiers) under sub‑Gaussian inputs, the authors prove that the best known algorithm, based on \(L_1\) polynomial regression, is essentially optimal: any algorithm that relies only on statistical queries must incur a runtime that grows exponentially in \(1/\sigma^2\) (where \(\sigma\) is the amount of smoothing) and in \(\log(1/\varepsilon)\) (where \(\varepsilon\) is the target excess error).
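
As a rough illustration (not the paper's formal model, which allows an adversarial label distribution; the base distribution, weight vector, and labeling rule below are invented for the sketch), \(\sigma\)-smoothing can be pictured as adding Gaussian noise to each input before a half‑space labels it:

```python
import numpy as np

rng = np.random.default_rng(0)

def smoothed_samples(n, d, sigma, w):
    """Illustrative sketch only: draw sub-Gaussian inputs, blur them with
    Gaussian noise of scale sigma, and label with the halfspace sign(<w, x>)."""
    x = rng.standard_normal((n, d))                       # base inputs
    x_smoothed = x + sigma * rng.standard_normal((n, d))  # Gaussian smoothing
    y = np.sign(x_smoothed @ w)                           # halfspace labels
    return x_smoothed, y

X, y = smoothed_samples(1000, 5, 0.1, np.ones(5) / np.sqrt(5))
```

The smoothing scale \(\sigma\) is exactly the quantity that ends up in the exponent of the bounds below.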

Key Contributions

  • SQ Lower Bound for Smoothed Agnostic Learning – Shows that any statistical‑query algorithm needs at least \(d^{\Omega(1/\sigma^{2} + \log(1/\varepsilon))}\) time/queries to learn half‑spaces in the smoothed model.
  • Near‑Matching Upper Bound – Confirms that the existing \(L_1\)-polynomial‑regression algorithm, with complexity \(d^{\tilde O(1/\sigma^{2}) \cdot \log(1/\varepsilon)}\), is close to optimal.
  • Moment‑Matching Hard Distribution Construction – Uses linear‑programming duality to build a distribution that fools low‑degree polynomial approximators, establishing the lower bound.
  • Bridges Approximation Theory and Learning Theory – Demonstrates that the difficulty of the learning problem is tied to the degree needed to approximate the smoothed target function with a low‑degree polynomial.
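
The \(L_1\)-polynomial-regression algorithm referred to above (in the spirit of the KKMS approach) can be sketched in one dimension; the LP encoding and the fixed degree here are illustrative assumptions, not the paper's implementation:

```python
import numpy as np
from scipy.optimize import linprog

def l1_poly_regression(x, y, degree):
    """Fit polynomial coefficients c minimizing sum_i |p(x_i) - y_i| via a
    linear program (the core step of L1 polynomial regression); one then
    classifies with sign(p(x))."""
    A = np.vander(x, degree + 1)   # monomial features x^degree, ..., x, 1
    n, m = A.shape
    # Variables: [c (m, free), s (n, >= 0)]; minimize sum(s)
    # subject to A c - s <= y and -A c - s <= -y, i.e. |A c - y| <= s.
    obj = np.concatenate([np.zeros(m), np.ones(n)])
    A_ub = np.block([[A, -np.eye(n)], [-A, -np.eye(n)]])
    b_ub = np.concatenate([y, -y])
    bounds = [(None, None)] * m + [(0, None)] * n
    res = linprog(obj, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
    return res.x[:m]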

Methodology

  1. Statistical Query (SQ) Model – The authors restrict the learner to access the data only through expectations of bounded functions (statistical queries). This model captures many practical algorithms (e.g., gradient‑based methods, moment estimators) and is a standard way to prove lower bounds.

  2. Hard Distribution via Moment Matching

    • Formulate a linear program whose primal variables describe a candidate data distribution and whose constraints enforce that low‑degree moments match those of a “nice” reference distribution (Gaussian).
    • The dual of this program yields a low‑degree polynomial that would separate the two distributions if it existed.
  3. Approximation‑Degree Argument

    • Prove that any polynomial that approximates the smoothed half‑space indicator must have degree at least \(\Omega(1/\sigma^{2} + \log(1/\varepsilon))\).
    • Because SQ algorithms can be simulated by low‑degree polynomial estimators, this degree lower bound translates directly into an SQ query‑complexity lower bound.
  4. Reduction to Known Upper Bound

    • Compare the derived lower bound with the runtime of the best known \(L_1\)-polynomial‑regression algorithm, showing they match up to polylogarithmic factors.
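
A toy version of the oracle from step 1 (the tolerance model and the uniform-noise adversary here are illustrative assumptions, not the paper's formalism):

```python
import numpy as np

rng = np.random.default_rng(0)

def sq_oracle(query, samples, tau):
    """Toy statistical-query oracle: the learner supplies a bounded function
    query(x, y); the oracle may return any value within tau of the true
    expectation. We simulate the adversary with uniform noise in [-tau, tau]."""
    empirical = float(np.mean([query(x, y) for x, y in samples]))
    return empirical + rng.uniform(-tau, tau)

# Example: estimating the correlation E[x_0 * y] through the oracle.
samples = [(rng.standard_normal(3), 1.0) for _ in range(200)]
estimate = sq_oracle(lambda x, y: x[0] * y, samples, tau=0.01)
```

The lower bound counts how many such query/answer rounds any learner needs, no matter how cleverly the queries are chosen.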

Results & Findings

| Parameter | Upper bound (known algorithm) | Lower bound (this work) |
| --- | --- | --- |
| Runtime / query complexity | \(d^{\tilde O(1/\sigma^{2}) \cdot \log(1/\varepsilon)}\) | \(d^{\Omega(1/\sigma^{2} + \log(1/\varepsilon))}\) |
| Setting | Gaussian or any sub‑Gaussian input, smoothed by Gaussian noise of variance \(\sigma^{2}\) | Same setting (even for pure Gaussian marginals) |
| Core insight | \(L_1\)-polynomial regression essentially captures the best possible low‑degree approximation of the smoothed target | Any SQ algorithm must implicitly construct such an approximation, and the required degree cannot be lower |

In plain terms, the paper proves that you cannot beat the existing algorithm by more than a polylogarithmic factor if you stay within the SQ framework. The difficulty stems from the need to approximate a “blurred” step function (the half‑space) with a low‑degree polynomial, which becomes harder as the smoothing gets smaller (\(\sigma \to 0\)) or as you demand higher accuracy (\(\varepsilon \to 0\)).
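
The Gaussian-smoothed step is essentially the error function \(\mathrm{erf}(t/(\sigma\sqrt{2}))\). A quick numerical check (least-squares Chebyshev fits, an illustrative stand-in for the paper's minimax-degree argument) shows the required degree climbing as \(\sigma\) shrinks:

```python
import numpy as np
from numpy.polynomial import chebyshev
from math import erf

def degree_needed(sigma, eps, grid=2001):
    """Smallest Chebyshev-fit degree whose sup-norm error on [-1, 1] against
    the Gaussian-smoothed step erf(t / (sigma * sqrt(2))) is below eps.
    Empirical illustration only; it does not prove the degree bound."""
    t = np.linspace(-1.0, 1.0, grid)
    target = np.array([erf(v / (sigma * np.sqrt(2))) for v in t])
    for deg in range(1, 200):
        coeffs = chebyshev.chebfit(t, target, deg)
        err = np.max(np.abs(chebyshev.chebval(t, coeffs) - target))
        if err < eps:
            return deg
    return None
```

Running this with a fixed accuracy and shrinking \(\sigma\) shows the fitted degree growing, mirroring the \(\Omega(1/\sigma^{2} + \log(1/\varepsilon))\) behavior.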

Practical Implications

  • Algorithm Design Guidance – For practitioners building robust linear classifiers under slight input noise (e.g., sensor jitter, privacy‑preserving perturbations), the paper suggests that investing in more sophisticated SQ‑style tricks will not dramatically improve over polynomial‑regression‑based methods.
  • Resource Planning – The exponential dependence on \(1/\sigma^{2}\) warns that very low‑noise regimes are computationally expensive. If your application can tolerate a fixed amount of smoothing (e.g., \(\sigma = 0.1\)), the exponent is a constant and the runtime is polynomial in \(d\), though that constant can be large; as \(\sigma \to 0\), expect a steep cost.
  • Benchmarking – The lower bound provides a theoretical baseline against which new algorithms (perhaps using non‑SQ techniques like direct access to raw samples or quantum queries) can be measured.
  • Feature Engineering – Since the hardness originates from the need to approximate a smoothed step, feature transformations that make the decision boundary smoother (e.g., adding calibrated noise, using soft margins) could reduce the effective degree needed, leading to faster learning.
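
To make the resource-planning point concrete, here is the back-of-the-envelope exponent \(1/\sigma^{2}\) for a few noise levels (hidden constants in the bounds treated as 1, an assumption for illustration only):

```python
# Illustrative arithmetic: the exponent of d in a d^{1/sigma^2}-type bound.
def d_exponent(sigma: float) -> float:
    return 1.0 / sigma ** 2

for s in (1.0, 0.5, 0.1):
    print(f"sigma = {s}: exponent of d ~ {d_exponent(s):g}")
```

Even a "modest" \(\sigma = 0.1\) already puts \(1/\sigma^{2} = 100\) in the exponent, so the polynomial, while fixed-degree, is far from cheap.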

Limitations & Future Work

  • SQ‑Only Scope – The lower bound applies to algorithms that can be expressed as statistical queries. It does not rule out non‑SQ methods (e.g., algorithms that exploit higher‑order moments directly or use interactive sampling) from achieving better runtimes.
  • Specific to Half‑Spaces – While half‑spaces are a cornerstone model, extending the analysis to more complex hypothesis classes (e.g., neural nets, decision trees) remains open.
  • Gaussian Smoothing Assumption – The proof hinges on Gaussian perturbations; other smoothing kernels (Laplace, uniform) might exhibit different complexities.
  • Tightness of Polylog Factors – The upper and lower bounds match up to polylogarithmic terms. Closing this gap (removing the tilde notation) could sharpen our understanding of the exact trade‑off between \(\sigma\) and \(\varepsilon\).

Authors

  • Ilias Diakonikolas
  • Daniel M. Kane

Paper Information

  • arXiv ID: 2602.21191v1
  • Categories: cs.LG, cs.DS, stat.ML
  • Published: February 24, 2026