[Paper] Claim against Measurement: Statistical Artefacts in Quantum Error Mitigation Benchmarks

Published: May 28, 2026 at 08:56 AM EDT
Source: arXiv - 2605.29872v1

Overview

The paper “Claim against Measurement: Statistical Artefacts in Quantum Error Mitigation Benchmarks” examines how the community evaluates quantum error mitigation (QEM) techniques, especially zero‑noise extrapolation (ZNE), and uncovers systematic statistical shortcomings that can inflate reported performance gains. By reviewing 81 recent QEM studies, running a 132‑configuration parameter sweep, and conducting a 72‑hour drift study on real hardware, the authors show that many benchmark results are fragile, often depending on undisclosed parameter choices or temporal drift of the hardware.

Key Contributions

  • Systematic literature audit – Applied an eight‑criterion framework (statistical rigor, reproducibility, reporting quality, etc.) to 81 QEM papers; only 25 % used proper inferential statistics.
  • Parameter‑sensitivity case study – Ran a 132‑configuration sweep of ZNE on a superconducting device, demonstrating that choices such as scale factors, extrapolation method, and calibration dramatically swing outcomes from “significant improvement” to “significant degradation.”
  • Drift‑induced illusion experiment – Conducted a 72‑hour longitudinal run on real hardware, revealing that temporal drift can inflate the apparent effect size of the same ZNE configuration by more than 3× and reduce the effective number of independent samples.
  • Practical reporting checklist – Proposed a minimum‑standards guideline for QEM benchmarking (parameter documentation, robustness checks, drift assessment, effect‑size reporting, etc.).
  • Open‑source tooling – Released scripts and data used for the sweep and drift study, enabling other groups to reproduce and extend the analysis.
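
To make the parameter-sensitivity point concrete, here is a minimal sketch of how the choice of extrapolation method alone changes a ZNE estimate. It is not the authors' code: the function names and the exponential-decay test curve are assumptions chosen for illustration, and only the scale factors (1.0, 1.5, 2.0, 3.0) come from the summary above.

```python
import math
from typing import Sequence


def linear_extrapolate(scales: Sequence[float], values: Sequence[float]) -> float:
    """Least-squares straight-line fit, evaluated at scale factor 0."""
    n = len(scales)
    mean_x = sum(scales) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in scales)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(scales, values))
    slope = sxy / sxx
    return mean_y - slope * mean_x  # intercept = estimate at zero noise


def richardson_extrapolate(scales: Sequence[float], values: Sequence[float]) -> float:
    """Polynomial through all points (Lagrange form), evaluated at scale factor 0."""
    estimate = 0.0
    for i, (xi, yi) in enumerate(zip(scales, values)):
        weight = 1.0
        for j, xj in enumerate(scales):
            if j != i:
                weight *= xj / (xj - xi)  # Lagrange basis polynomial at x = 0
        estimate += weight * yi
    return estimate


# Synthetic stand-in for measured expectation values (NOT data from the paper):
# an ideal value of 1.0 that decays exponentially with the noise scale.
scale_factors = (1.0, 1.5, 2.0, 3.0)
noisy_values = [math.exp(-0.3 * s) for s in scale_factors]

print("linear:    ", linear_extrapolate(scale_factors, noisy_values))
print("richardson:", richardson_extrapolate(scale_factors, noisy_values))
```

Even on this noise-free synthetic curve the two methods return different zero-noise estimates. On real hardware, shot noise adds a further trade-off, since higher-order extrapolation tends to amplify statistical variance.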

Methodology

  1. Literature review – The authors built an eight‑criterion rubric covering: (a) use of inferential statistics, (b) uncertainty quantification, (c) reproducibility of code/data, (d) description of hardware calibration, (e) parameter disclosure, (f) robustness analysis, (g) longitudinal stability checks, and (h) effect‑size reporting. Each paper was scored against the rubric.
  2. Experimental benchmark – They selected ZNE as a representative QEM technique because it is widely adopted and relatively simple to implement. Using a 5‑qubit superconducting device, they varied three key knobs:
    • Scale factors (1.0, 1.5, 2.0, 3.0)
    • Extrapolation method (linear, quadratic, Richardson)
    • Calibration strategy (static vs. refreshed before each run)
      This produced 4 × 3 × 11 = 132 distinct configurations. For each configuration they ran a standard circuit (e.g., a depth‑10 variational ansatz) 30 times, recording raw outcomes and post‑processed mitigated results.
  3. Longitudinal drift study – Over 72 hours, the same ZNE configuration was executed repeatedly while the underlying hardware parameters (gate error rates, T1/T2 times) were logged. The authors applied statistical time‑series analysis to quantify how drift altered the measured error‑mitigation gain.
  4. Statistical analysis – They used both descriptive statistics (means, standard deviations) and inferential tests (paired t‑tests, bootstrap confidence intervals) to assess whether observed improvements were statistically significant, and they reported Cohen’s d as an effect‑size metric.
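
The statistical step can be sketched with the Python standard library alone. This is an illustrative assumption about how such an analysis might look, not the paper's released tooling: the helper names are hypothetical, the data are synthetic, and the paired variant of Cohen's d (mean difference divided by the standard deviation of the differences) is one of several conventions, so the paper may use a different one. The 30 repetitions match the run count mentioned above.

```python
import random
import statistics
from typing import List, Sequence, Tuple


def paired_cohens_d(raw_err: Sequence[float], mitigated_err: Sequence[float]) -> float:
    """Paired effect size: mean of (raw - mitigated) over its sample std dev.

    Positive values mean mitigation reduced the error.
    """
    diffs = [r - m for r, m in zip(raw_err, mitigated_err)]
    return statistics.fmean(diffs) / statistics.stdev(diffs)


def bootstrap_ci(
    diffs: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean difference."""
    rng = random.Random(seed)
    n = len(diffs)
    means: List[float] = sorted(
        statistics.fmean(rng.choices(diffs, k=n)) for _ in range(n_resamples)
    )
    low = means[int((alpha / 2) * n_resamples)]
    high = means[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high


# Synthetic errors for 30 paired repetitions (NOT data from the paper).
rng = random.Random(42)
raw_err = [rng.gauss(0.10, 0.02) for _ in range(30)]
mitigated_err = [rng.gauss(0.08, 0.02) for _ in range(30)]
diffs = [r - m for r, m in zip(raw_err, mitigated_err)]

low, high = bootstrap_ci(diffs)
print("mean error reduction:", statistics.fmean(diffs))
print("95% bootstrap CI:    ", (low, high))
print("paired Cohen's d:    ", paired_cohens_d(raw_err, mitigated_err))
```

Reporting the interval and the effect size together shows both whether an improvement is distinguishable from zero and how large it is in practice. Note that this resampling scheme assumes independent repetitions, which the drift study calls into question.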

Results & Findings

  • Literature audit: Only 15 of the 81 papers (≈ 25 %) performed any inferential testing; 25 papers (≈ 42 %) reported uncertainties descriptively (e.g., “error bars”) but never tested significance. The remaining papers omitted uncertainty altogether.
  • Parameter sensitivity: The 132‑configuration sweep showed that the same mitigation algorithm could be judged either “effective” (p < 0.05, d ≈ 0.8) or “harmful” (p < 0.05, d ≈ ‑0.6) solely by changing the scale factor or extrapolation method. In 38 % of configurations the statistical conclusion flipped when the calibration schedule was altered.
  • Drift illusion: During the 72‑hour run, the measured mitigation gain varied from a modest 5 % improvement to a spurious 15 % improvement, solely due to hardware drift. The effective sample size dropped by ~ 30 % because successive measurements were correlated, violating the independence assumption of many statistical tests.
  • Effect‑size compression: When proper effect‑size reporting was applied, many “significant” improvements shrank to small or negligible practical impact (Cohen’s d < 0.2).
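
One common way to quantify the loss of independent samples is a first-order autoregressive, AR(1), approximation, in which n_eff = n (1 − ρ) / (1 + ρ) for lag-1 autocorrelation ρ. The summary does not say which estimator the authors used, so the sketch below is an assumption, with hypothetical function names. Under this model, a drop of about 30 % would correspond to ρ ≈ 0.18.

```python
import statistics
from typing import Sequence


def lag1_autocorrelation(series: Sequence[float]) -> float:
    """Sample autocorrelation between each measurement and the next one."""
    mean = statistics.fmean(series)
    num = sum((a - mean) * (b - mean) for a, b in zip(series, series[1:]))
    den = sum((x - mean) ** 2 for x in series)
    return num / den


def effective_sample_size(series: Sequence[float]) -> float:
    """AR(1) approximation: n_eff = n * (1 - rho) / (1 + rho)."""
    rho = lag1_autocorrelation(series)
    return len(series) * (1 - rho) / (1 + rho)


# A steadily drifting toy series (NOT hardware data): strongly correlated,
# so far fewer than 10 effectively independent samples remain.
drifting = [0.050, 0.052, 0.055, 0.057, 0.060, 0.061, 0.064, 0.066, 0.069, 0.070]
print("lag-1 autocorrelation:", lag1_autocorrelation(drifting))
print("effective sample size:", effective_sample_size(drifting))
```

Feeding n_eff instead of n into a standard-error calculation widens the resulting confidence intervals, which is the direction of correction the drift finding calls for.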

Overall, the study demonstrates that current QEM benchmark practices can overstate the robustness and utility of mitigation techniques.

Practical Implications

  • For developers: When integrating QEM into a quantum software stack (e.g., Qiskit, Cirq, Braket), expose all mitigation hyper‑parameters (scale factors, extrapolation order) and provide well‑documented defaults.
  • For hardware providers: Offering a stable calibration API and periodic drift diagnostics can help users understand when mitigation results are trustworthy.
  • For benchmarking suites: Incorporate longitudinal runs and robustness sweeps as part of the standard test harness; automatically compute confidence intervals and effect sizes rather than just reporting raw averages.
  • For product roadmaps: Companies planning to market “error‑mitigated” quantum services need to back performance claims with statistically sound evidence; otherwise they risk overpromising to customers.
  • For open‑source contributions: The authors’ checklist can be adopted as a contribution guideline for repositories that host QEM experiments, ensuring reproducibility and transparent reporting.
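
As one possible reading of the advice to expose and document every mitigation hyper-parameter, the sketch below bundles them into a single record that can be logged next to each result. The class name `ZNEConfig`, its fields, and its defaults are hypothetical and do not correspond to any Qiskit, Cirq, or Braket API; the allowed values mirror the options listed in the methodology.

```python
import json
from dataclasses import asdict, dataclass
from typing import Tuple

ALLOWED_EXTRAPOLATIONS = ("linear", "quadratic", "richardson")
ALLOWED_CALIBRATIONS = ("static", "refreshed")


@dataclass(frozen=True)
class ZNEConfig:
    """Hypothetical record of every knob that influences a ZNE result."""

    scale_factors: Tuple[float, ...] = (1.0, 1.5, 2.0, 3.0)
    extrapolation: str = "linear"
    calibration: str = "static"
    repetitions: int = 30

    def __post_init__(self) -> None:
        # Reject undocumented settings early instead of silently accepting them.
        if self.extrapolation not in ALLOWED_EXTRAPOLATIONS:
            raise ValueError(f"unknown extrapolation: {self.extrapolation!r}")
        if self.calibration not in ALLOWED_CALIBRATIONS:
            raise ValueError(f"unknown calibration: {self.calibration!r}")
        if self.repetitions < 2:
            raise ValueError("at least 2 repetitions are needed for a variance")

    def to_json(self) -> str:
        """Serialize the full configuration for inclusion in a results file."""
        return json.dumps(asdict(self), sort_keys=True)


config = ZNEConfig(extrapolation="richardson", calibration="refreshed")
print(config.to_json())
```

Storing the serialized configuration with every reported number makes it possible to rerun a benchmark under the same settings and to see which choices were never varied.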

Limitations & Future Work

  • Scope limited to ZNE: While ZNE is representative, other mitigation families (e.g., probabilistic error cancellation, Clifford data regression) may exhibit different sensitivities; extending the analysis to those methods is an open task.
  • Single hardware platform: Experiments were performed on one superconducting device; ion‑trap or photonic platforms might show distinct drift patterns.
  • Configuration granularity: The sweep covered a practical subset of hyper‑parameters; exhaustive exploration of all possible scale‑factor sequences could reveal additional edge cases.
  • Statistical models: The study relied on classical frequentist methods (paired t‑tests and bootstrap confidence intervals); future work could explore Bayesian hierarchical models that better capture temporal correlations in hardware drift.

By highlighting these gaps, the authors invite the community to adopt more rigorous, reproducible evaluation pipelines, paving the way for reliable quantum error mitigation in the near‑term era.

Authors

  • Dominik Köster
  • Wolfgang Mauerer

Paper Information

  • arXiv ID: 2605.29872v1
  • Categories: quant-ph, cs.SE
  • Published: May 28, 2026