[Paper] Replay-buffer engineering for noise-robust quantum circuit optimization
Source: arXiv - 2604.21863v1
Overview
The paper tackles a practical pain point for anyone trying to use deep reinforcement learning (RL) to design and optimize quantum circuits: the way experiences (state‑action‑reward tuples) are stored and reused can make or break learning efficiency, especially when real‑world hardware noise is involved. By re‑thinking the replay buffer—a core component of most RL pipelines—the authors deliver dramatic speed‑ups and more compact quantum programs, while also showing how to transfer knowledge from noiseless simulations to noisy hardware without costly retraining.
Key Contributions
- ReaPER⁺ (Reliability‑aware Prioritized Experience Replay) – an annealed replay rule that starts with classic TD‑error prioritization and gradually shifts to sampling based on the reliability of value estimates, yielding 4‑32× better sample efficiency.
- OptCRLQAS (Optimized Curriculum RL for Quantum‑Architecture Search) – a curriculum‑learning scheme that batches expensive quantum‑classical evaluations across multiple architecture edits, cutting per‑episode wall‑clock time by up to 67.5 % on a 12‑qubit benchmark.
- Lightweight Replay‑Buffer Transfer – a method to warm‑start learning under noisy hardware by re‑using noiseless trajectories directly from the buffer (no weight copying, no ε‑greedy pre‑training), slashing steps to chemical accuracy by 85‑90 % and reducing final energy error by ~90 % on molecular tasks.
- Domain‑agnostic validation – the same replay‑buffer ideas improve performance on a classic RL benchmark (LunarLander‑v3), confirming that the technique is not limited to quantum problems.
Methodology
- Replay‑buffer redesign – Traditional Prioritized Experience Replay (PER) samples experiences in proportion to their TD error, on the assumption that larger errors signal more learning value. The authors observe that as training progresses, TD errors become increasingly noisy indicators of true learning potential. ReaPER⁺ therefore anneals the sampling rule: early epochs use TD‑error prioritization, while later epochs shift to a reliability score derived from the variance of value predictions across recent updates.
- Curriculum‑based architecture search – Instead of evaluating a new circuit after every single edit (which requires a full quantum‑classical simulation), OptCRLQAS groups a batch of edits, runs a single expensive evaluation, and propagates the resulting reward to all buffered experiences generated during that batch. This amortizes the cost.
- Transfer via buffer reuse – When moving from a noiseless simulator to a noisy quantum device, the method simply copies the trajectory entries (states, actions, rewards) from the noiseless buffer into the noisy buffer. The RL agent continues learning with the same network weights, letting the noisy environment naturally re‑weight the experiences through the new reliability‑aware sampling.
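The annealed sampling rule described above can be sketched as a blend of two normalized scores whose mixing weight moves from TD error to reliability over training. The linear schedule, the variance-based reliability score, and the function name below are illustrative assumptions; the paper's exact formulation may differ.

```python
import numpy as np

def annealed_priorities(td_errors, value_history, step, total_steps, eps=1e-6):
    """Blend TD-error priorities with a reliability-based score.

    td_errors:     TD error per buffered transition, shape (N,)
    value_history: value predictions for each transition across the K
                   most recent network updates, shape (K, N)
    step:          current training step (drives the anneal)
    """
    beta = min(1.0, step / total_steps)  # 0 -> pure TD error, 1 -> pure reliability
    td_score = np.abs(td_errors) + eps
    # Low variance of the value estimate across recent updates means the
    # estimate is stable, so we treat that transition as more reliable.
    reliability = 1.0 / (np.var(value_history, axis=0) + eps)
    # Normalize each signal before blending so neither dominates by scale.
    td_score /= td_score.sum()
    reliability /= reliability.sum()
    probs = (1.0 - beta) * td_score + beta * reliability
    return probs / probs.sum()

# Sampling a minibatch from a 100-transition buffer halfway through training:
rng = np.random.default_rng(0)
td = rng.exponential(size=100)
vh = rng.normal(size=(5, 100))
p = annealed_priorities(td, vh, step=500, total_steps=1000)
batch_idx = rng.choice(100, size=32, p=p)
```

At `step == 0` the distribution reduces to pure TD-error prioritization; at `step == total_steps` it is driven entirely by the reliability score.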
All three components are integrated into a standard deep Q‑learning loop (or its policy‑gradient variants) with minimal changes to the underlying neural architecture.
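The batched-evaluation idea in OptCRLQAS can be illustrated as follows: buffer a group of architecture edits, pay for one expensive quantum-classical evaluation, and write the resulting reward back to every transition in the batch. The data layout, function names, and batch size here are hypothetical; the sketch only shows the amortization pattern.

```python
from dataclasses import dataclass

@dataclass
class Transition:
    state: tuple
    action: int
    reward: float = 0.0  # filled in after the batched evaluation

def run_batched_episode(proposed_edits, evaluate_circuit, batch_size=4):
    """Apply architecture edits in batches, paying for one expensive
    quantum-classical evaluation per batch instead of one per edit.

    proposed_edits:    iterable of (state, action) pairs from the agent
    evaluate_circuit:  expensive callable returning a scalar reward
    """
    buffer, pending = [], []
    for state, action in proposed_edits:
        pending.append(Transition(state, action))
        if len(pending) == batch_size:
            # One expensive evaluation, amortized over the whole batch.
            reward = evaluate_circuit([t.action for t in pending])
            for t in pending:  # propagate the reward to every buffered edit
                t.reward = reward
            buffer.extend(pending)
            pending = []
    return buffer
```

With `batch_size=4`, the evaluator is called once per four edits, a 4× reduction in simulator calls at the cost of a coarser per-edit reward signal.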
Results & Findings
| Benchmark | Metric | Baseline | ReaPER⁺ | OptCRLQAS | Transfer (Noisy) |
|---|---|---|---|---|---|
| Quantum compilation (12‑qubit) | Sample efficiency (episodes to target depth) | 1.0× | 4–32× improvement | – (same RL core) | – |
| QAS (Quantum Architecture Search) | Wall‑clock time per episode | 1.0× | – | 67.5 % reduction | – |
| Molecular energy (6‑, 8‑, 12‑qubit) | Steps to chemical accuracy | 1.0× | – | – | 85–90 % reduction |
| LunarLander‑v3 (classical RL) | Average reward after 500k steps | 200 | +12 % | – | – |
- More compact circuits: Across all quantum compilation tasks, ReaPER⁺ consistently discovers circuits with fewer gates and lower depth than uniform or fixed‑PER replay.
- Robustness to noise: The transfer scheme brings the noisy‑hardware performance within ~10 % of the noiseless optimum, a huge gain given that hardware noise typically inflates energy errors dramatically.
- Scalability: The wall‑clock savings from OptCRLQAS become more pronounced as qubit count grows, indicating the approach will stay beneficial for near‑term devices (20‑30 qubits) and beyond.
Practical Implications
- Faster prototyping for quantum software engineers – By slashing the number of simulator calls, developers can iterate on circuit optimizations in hours rather than days, making RL‑based compilers viable for production pipelines.
- Cost‑effective hardware experiments – The buffer‑transfer method means you can train a policy on cheap noiseless simulators and then “drop it in” to a real quantum processor with minimal extra training, saving precious quantum‑hardware time (which is often billed by the minute).
- Cross‑domain RL improvements – Since the annealed replay rule helped on LunarLander, any RL system that suffers from evolving TD‑error reliability (e.g., robotics, autonomous driving) could adopt ReaPER⁺ without quantum‑specific changes.
- Tooling integration – The techniques are lightweight enough to be added as plug‑ins to existing RL libraries (e.g., Stable‑Baselines3, RLlib). Developers only need to expose a reliability estimator for their value network and adjust the replay‑buffer sampling schedule.
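The "drop it in" transfer above amounts to seeding a fresh buffer on the noisy device with the noiseless trajectories before resuming training. The `ReplayBuffer` class below is a minimal stand-in, not the API of any particular RL library, and `warm_start` is a hypothetical helper name.

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal FIFO replay buffer; a stand-in for a library implementation."""
    def __init__(self, capacity=10_000):
        self.storage = deque(maxlen=capacity)

    def add(self, transition):
        self.storage.append(transition)

    def sample(self, k):
        return random.sample(list(self.storage), min(k, len(self.storage)))

def warm_start(noisy_buffer, noiseless_buffer):
    """Seed the noisy-hardware buffer with noiseless trajectories.

    No network weights are copied and no exploratory pre-training is run:
    the agent keeps its weights and resumes learning, letting the
    reliability-aware sampler down-weight stale noiseless experiences
    as noisy-hardware data accumulates.
    """
    for transition in noiseless_buffer.storage:
        noisy_buffer.add(transition)
    return noisy_buffer
```

Because only `(state, action, reward)` tuples cross the simulator-to-hardware boundary, the transfer cost is a single buffer copy rather than a retraining run.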
Limitations & Future Work
- Reliability estimator overhead – Computing variance‑based reliability adds a small per‑step cost; on extremely high‑throughput environments this could become a bottleneck.
- Curriculum batch size tuning – OptCRLQAS requires choosing how many architecture edits to group before evaluation; sub‑optimal batch sizes can either waste computation or degrade learning signal.
- Hardware‑specific noise models – The transfer experiments used a generic depolarizing noise model. Real devices exhibit correlated and non‑Markovian errors, so further validation on actual quantum hardware is needed.
- Extending beyond Q‑learning – While the paper demonstrates the approach with deep Q‑networks, applying the annealed replay rule to actor‑critic or policy‑gradient methods remains an open research direction.
Overall, the work shows that “how we store and reuse experience” is as crucial as the neural architecture itself for scaling RL‑driven quantum circuit design, opening a clear path for developers to bring these methods into real‑world quantum software stacks.
Authors
- Akash Kundu
- Sebastian Feld
Paper Information
- arXiv ID: 2604.21863v1
- Categories: quant-ph, cs.AI, cs.ET, cs.LG
- Published: April 23, 2026