[Paper] Claim against Measurement: Statistical Artefacts in Quantum Error Mitigation Benchmarks

Published: May 28, 2026 at 08:56 AM EDT

Source: arXiv - 2605.29872v1

Overview

The paper "Claim against Measurement: Statistical Artefacts in Quantum Error Mitigation Benchmarks" examines how the community evaluates quantum error mitigation (QEM) techniques, especially zero‑noise extrapolation (ZNE), and identifies systematic statistical shortcomings that can inflate reported performance gains. By reviewing 81 recent QEM studies and running their own hardware experiments, the authors show that many benchmark results are fragile: they often depend on undisclosed parameter choices or on temporal drift of the hardware.

Key Contributions

Systematic literature audit – Applied an eight‑criterion framework (statistical rigor, reproducibility, reporting quality, etc.) to 81 QEM papers; only 25 % used proper inferential statistics.
Parameter‑sensitivity case study – Ran a 132‑configuration sweep of ZNE on a superconducting device, demonstrating that choices such as scale factors, extrapolation method, and calibration dramatically swing outcomes from “significant improvement” to “significant degradation.”
Drift‑induced illusion experiment – Conducted a 72‑hour longitudinal run on real hardware, revealing that temporal drift can inflate the apparent effect size of the same ZNE configuration by more than 3× and reduce the effective number of independent samples.
Practical reporting checklist – Proposed a minimum‑standards guideline for QEM benchmarking (parameter documentation, robustness checks, drift assessment, effect‑size reporting, etc.).
Open‑source tooling – Released scripts and data used for the sweep and drift study, enabling other groups to reproduce and extend the analysis.

Methodology

Literature review – The authors built an eight‑criterion rubric covering: (a) use of inferential statistics, (b) uncertainty quantification, (c) reproducibility of code/data, (d) description of hardware calibration, (e) parameter disclosure, (f) robustness analysis, (g) longitudinal stability checks, and (h) effect‑size reporting. Each paper was scored against the rubric.
Experimental benchmark – They selected ZNE as a representative QEM technique because it is widely adopted and relatively simple to implement. Using a 5‑qubit superconducting device, they varied three key knobs:
- Scale factors (1.0, 1.5, 2.0, 3.0)
- Extrapolation method (linear, quadratic, Richardson)
- Calibration strategy (static vs. refreshed before each run)
  This produced 4 × 3 × 11 = 132 distinct configurations. For each configuration they ran a standard circuit (e.g., a depth‑10 variational ansatz) 30 times, recording raw outcomes and post‑processed mitigated results.
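
To make the extrapolation knob concrete, here is a minimal sketch of two of the listed methods (linear and Richardson) in plain Python. It is an illustration, not the authors' released tooling: the function names and the synthetic expectation values are hypothetical, and Richardson extrapolation is written as the zero-scale value of the interpolating polynomial through all points.

```python
from typing import Sequence


def linear_extrapolate(scales: Sequence[float], values: Sequence[float]) -> float:
    """Fit a least-squares line to (scale, value) pairs and return its value at scale 0."""
    n = len(scales)
    mean_x = sum(scales) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in scales)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(scales, values))
    slope = sxy / sxx
    return mean_y - slope * mean_x  # intercept = estimate at zero noise


def richardson_extrapolate(scales: Sequence[float], values: Sequence[float]) -> float:
    """Return the value at scale 0 of the polynomial passing through every point."""
    estimate = 0.0
    for i, (x_i, y_i) in enumerate(zip(scales, values)):
        weight = 1.0  # Lagrange basis polynomial i evaluated at 0
        for j, x_j in enumerate(scales):
            if j != i:
                weight *= x_j / (x_j - x_i)
        estimate += weight * y_i
    return estimate


if __name__ == "__main__":
    scales = [1.0, 1.5, 2.0, 3.0]  # scale factors listed in the summary
    # Hypothetical noisy expectation values with a mild curvature (assumption).
    values = [1.0 - 0.1 * s + 0.01 * s ** 2 for s in scales]
    print("linear    :", linear_extrapolate(scales, values))
    print("richardson:", richardson_extrapolate(scales, values))
```

On this synthetic curve the two methods return different zero-noise estimates from identical inputs, which is the kind of method dependence the sweep is designed to expose.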
Longitudinal drift study – Over 72 hours, the same ZNE configuration was executed repeatedly while the underlying hardware parameters (gate error rates, T1/T2 times) were logged. The authors applied statistical time‑series analysis to quantify how drift altered the measured error‑mitigation gain.
Statistical analysis – They used both descriptive statistics (means, standard deviations) and inferential tests (paired t‑tests, bootstrap confidence intervals) to assess whether observed improvements were statistically significant, and they reported Cohen’s d as an effect‑size metric.
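
The statistical quantities named above can be sketched with the Python standard library alone. This is an assumption-laden illustration rather than the authors' analysis code: the helper names are hypothetical, Cohen's d is computed in its paired form (mean difference divided by the standard deviation of the differences), and the bootstrap uses the simple percentile method. The summary does not say which variants the paper uses.

```python
import math
import random
import statistics
from typing import List, Sequence, Tuple


def paired_differences(raw_err: Sequence[float], mitigated_err: Sequence[float]) -> List[float]:
    """Per-run error reduction; positive values mean mitigation helped."""
    return [r - m for r, m in zip(raw_err, mitigated_err)]


def cohens_d_paired(diffs: Sequence[float]) -> float:
    """Paired-sample effect size: mean difference over the SD of differences."""
    return statistics.fmean(diffs) / statistics.stdev(diffs)


def paired_t_statistic(diffs: Sequence[float]) -> float:
    """t statistic of a paired t-test (n - 1 degrees of freedom)."""
    n = len(diffs)
    return statistics.fmean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))


def bootstrap_ci(diffs: Sequence[float], n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean difference."""
    rng = random.Random(seed)
    n = len(diffs)
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


if __name__ == "__main__":
    rng = random.Random(1)
    # Synthetic errors for 30 repetitions (assumption, not data from the paper).
    raw = [rng.gauss(0.20, 0.03) for _ in range(30)]
    mitigated = [rng.gauss(0.17, 0.03) for _ in range(30)]
    diffs = paired_differences(raw, mitigated)
    print("t statistic:", paired_t_statistic(diffs))
    print("Cohen's d  :", cohens_d_paired(diffs))
    print("95% CI     :", bootstrap_ci(diffs))
```

A p-value would additionally need the t distribution, which the standard library does not provide; `scipy.stats.ttest_rel` is one option if SciPy is available. Both the t-test and this bootstrap assume independent repetitions, which is the assumption the drift study calls into question.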

Results & Findings

Literature audit: Only about a quarter of the 81 papers (≈ 25 %) performed any inferential testing; a further ≈ 42 % reported uncertainties descriptively (e.g., error bars) but never tested significance. The remaining papers omitted uncertainty altogether.
Parameter sensitivity: The 132‑configuration sweep showed that the same mitigation algorithm could be judged either “effective” (p < 0.05, d ≈ 0.8) or “harmful” (p < 0.05, d ≈ ‑0.6) solely by changing the scale factor or extrapolation method. In 38 % of configurations the statistical conclusion flipped when the calibration schedule was altered.
Drift illusion: During the 72‑hour run, the measured mitigation gain varied from a modest 5 % improvement to a spurious 15 % improvement, solely due to hardware drift. The effective sample size dropped by ~ 30 % because successive measurements were correlated, violating the independence assumption of many statistical tests.
Effect‑size compression: When proper effect‑size reporting was applied, many “significant” improvements shrank to small or negligible practical impact (Cohen’s d < 0.2).
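
The drop in effective sample size reported for the drift run can be illustrated with a common first-order approximation, n_eff ≈ n (1 − ρ) / (1 + ρ), where ρ is the lag‑1 autocorrelation of successive measurements. The sketch below assumes an AR(1)-like correlation structure and uses hypothetical helper names; the summary does not state which estimator the authors used.

```python
from typing import Sequence


def lag1_autocorrelation(series: Sequence[float]) -> float:
    """Sample lag-1 autocorrelation of a measurement series."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    if denom == 0.0:
        return 0.0  # constant series: treat as uncorrelated
    num = sum((series[t] - mean) * (series[t + 1] - mean) for t in range(n - 1))
    return num / denom


def effective_sample_size(series: Sequence[float]) -> float:
    """Approximate number of independent samples under an AR(1) assumption."""
    rho = max(lag1_autocorrelation(series), 0.0)  # clip so n_eff never exceeds n
    n = len(series)
    return n * (1.0 - rho) / (1.0 + rho)


if __name__ == "__main__":
    # Hypothetical slowly drifting gain measurements (assumption).
    drifting = [0.05 + 0.002 * t for t in range(50)]
    print("rho  :", lag1_autocorrelation(drifting))
    print("n_eff:", effective_sample_size(drifting))
```

Under this approximation, a reduction of about 30 % corresponds to ρ ≈ 0.18, so even weak correlation between successive runs noticeably narrows the evidence behind a reported p-value.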

Overall, the study demonstrates that current QEM benchmark practices can overstate the robustness and utility of mitigation techniques.

Practical Implications

For developers: When integrating QEM into a quantum software stack (e.g., Qiskit, Cirq, Braket), you should expose all mitigation hyper‑parameters (scale factors, extrapolation order) and provide default, well‑documented choices.
For hardware providers: Offering a stable calibration API and periodic drift diagnostics can help users understand when mitigation results are trustworthy.
For benchmarking suites: Incorporate longitudinal runs and robustness sweeps as part of the standard test harness; automatically compute confidence intervals and effect sizes rather than just reporting raw averages.
For product roadmaps: Companies planning to market “error‑mitigated” quantum services need to back performance claims with statistically sound evidence; otherwise they risk overpromising to customers.
For open‑source contributions: The authors’ checklist can be adopted as a contribution guideline for repositories that host QEM experiments, ensuring reproducibility and transparent reporting.
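
One way to follow the developer recommendation above is to make every mitigation hyper-parameter part of an explicit, serializable configuration object that is stored with each result. The following is a hypothetical sketch: the class name, field names, and defaults are illustrative assumptions and do not correspond to an API in Qiskit, Cirq, or Braket.

```python
import json
from dataclasses import asdict, dataclass
from typing import Tuple

ALLOWED_EXTRAPOLATIONS = ("linear", "quadratic", "richardson")
ALLOWED_CALIBRATIONS = ("static", "refreshed")


@dataclass(frozen=True)
class ZNEReportConfig:
    """Hypothetical record of every knob that influenced a ZNE result."""
    scale_factors: Tuple[float, ...] = (1.0, 1.5, 2.0, 3.0)
    extrapolation: str = "linear"
    calibration: str = "static"
    repetitions: int = 30

    def __post_init__(self) -> None:
        # Reject undocumented settings instead of silently accepting them.
        if self.extrapolation not in ALLOWED_EXTRAPOLATIONS:
            raise ValueError(f"unknown extrapolation: {self.extrapolation!r}")
        if self.calibration not in ALLOWED_CALIBRATIONS:
            raise ValueError(f"unknown calibration: {self.calibration!r}")
        if self.repetitions < 2:
            raise ValueError("need at least 2 repetitions to estimate variance")

    def to_json(self) -> str:
        """Serialize the configuration so it can be stored next to the results."""
        return json.dumps(asdict(self), sort_keys=True)


if __name__ == "__main__":
    config = ZNEReportConfig(extrapolation="richardson", calibration="refreshed")
    print(config.to_json())
```

Freezing the dataclass keeps a configuration from changing after results have been recorded against it, which makes parameter disclosure a by-product of running the experiment.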

Limitations & Future Work

Scope limited to ZNE: While ZNE is representative, other mitigation families (e.g., probabilistic error cancellation, Clifford data regression) may exhibit different sensitivities; extending the analysis to those methods is an open task.
Single hardware platform: Experiments were performed on one superconducting device; ion‑trap or photonic platforms might show distinct drift patterns.
Configuration granularity: The sweep covered a practical subset of hyper‑parameters; exhaustive exploration of all possible scale‑factor sequences could reveal additional edge cases.
Statistical models: The study relied on classical frequentist methods (paired t‑tests, bootstrap confidence intervals); future work could explore Bayesian hierarchical models that better capture temporal correlations in hardware drift.

By highlighting these gaps, the authors invite the community to adopt more rigorous, reproducible evaluation pipelines as a prerequisite for reliable quantum error mitigation on near‑term hardware.

Authors

Dominik Köster
Wolfgang Mauerer

Paper Information

arXiv ID: 2605.29872v1
Categories: quant-ph, cs.SE
Published: May 28, 2026
