[Paper] High-performance Vector-length Agnostic Quantum Circuit Simulations on ARM Processors
Source: arXiv:2602.09604v1
Overview
This paper investigates whether quantum‑state‑vector simulations—a core workload for quantum‑computing research—can run efficiently on emerging vector‑length agnostic (VLA) architectures such as:
- ARM’s Scalable Vector Extension (SVE)
- RISC‑V’s Vector Extension (RVV)
By designing a single‑source implementation that automatically adapts to any vector length, the authors demonstrate sizable speedups on three modern ARM‑based CPUs:
- NVIDIA Grace
- AWS Graviton 3
- Fujitsu A64FX
The results show that high‑performance portability is within reach for quantum‑simulation tools.
Key Contributions
VLA‑aware design for quantum‑state simulations – a single code base that scales with any vector length without recompilation.
Four novel optimization techniques tailored to VLA hardware:
- VLEN‑adaptive memory layout – reorganizes state‑vector data to match the runtime vector width.
- Load buffering – hides memory latency by pre‑fetching and staging vector loads.
- Fine‑grained loop control – splits loops to keep vector lanes busy even when the problem size is not a multiple of VLEN.
- Gate‑fusion‑based arithmetic‑intensity adaptation – merges consecutive quantum gates to increase the compute‑to‑memory ratio when vector lanes are under‑utilized.
Instrumentation framework – new performance‑monitoring‑unit (PMU) events and metrics that expose vectorization activity on ARM SVE/RVV.
Empirical evaluation – integrated with Google’s Qsim simulator and benchmarked on five circuits up to 36 qubits, achieving up to:
- 4.5× speedup on A64FX,
- 2.5× speedup on Grace,
- 1.5× speedup on Graviton 3.
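To make the gate‑fusion idea concrete, here is a minimal Python sketch (not the paper’s implementation; all function names are invented for illustration): two single‑qubit gates acting on the same qubit are pre‑multiplied into one 2×2 matrix, so the state vector is traversed once instead of twice — the traversal saved is exactly where arithmetic intensity rises.

```python
def matmul2(a, b):
    """Multiply two 2x2 matrices (row-major nested lists)."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def apply_1q(state, gate, target, n_qubits):
    """Apply a 2x2 gate to the `target` qubit of a 2**n_qubits state vector."""
    out = state[:]
    step = 1 << target
    for block in range(0, 1 << n_qubits, step << 1):
        for j in range(block, block + step):
            a0, a1 = state[j], state[j + step]
            out[j] = gate[0][0] * a0 + gate[0][1] * a1
            out[j + step] = gate[1][0] * a0 + gate[1][1] * a1
    return out

s = 2 ** -0.5
H = [[s, s], [s, -s]]      # Hadamard
Z = [[1, 0], [0, -1]]      # Pauli-Z

state = [1.0, 0.0, 0.0, 0.0]                        # |00> on 2 qubits
seq = apply_1q(apply_1q(state, H, 0, 2), Z, 0, 2)   # two sweeps over the state
fused = apply_1q(state, matmul2(Z, H), 0, 2)        # one sweep, fused gate
```

The two results agree, but the fused version touches each amplitude half as often — the same trade the paper exploits when vector lanes would otherwise sit idle waiting on memory.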
Methodology
- Baseline – The authors start from Qsim’s existing implementation, which assumes a fixed SIMD width (e.g., AVX‑512).
- Vector‑Length‑Agnostic Refactor – Replace all SIMD intrinsics with SVE/RVV‑compatible intrinsics that query the hardware’s current VLEN at runtime.
- Memory‑Layout Adaptation – Store the quantum‑state vector (2ⁿ complex amplitudes) in a layout that can be tiled to match the runtime vector width, reducing stride penalties.
- Load Buffering & Loop Splitting – Use a double‑buffer scheme to pre‑load the next chunk of amplitudes while the current chunk is being processed; split loops into “full‑vector” and “tail” parts to keep lanes busy.
- Gate Fusion – Combine sequences of single‑qubit and two‑qubit gates that act on the same qubits into a single matrix multiplication, raising arithmetic intensity when vector lanes would otherwise be idle.
- Instrumentation – Add custom PMU counters (e.g., `vector-load-issued`, `vector-store-completed`) to quantify how often the hardware’s vector engine is utilized.
- Evaluation – Run the optimized Qsim on three ARM CPUs, each with a different native VLEN (128‑bit, 256‑bit, 512‑bit). Measure performance in wall‑clock time, FLOPs, and the new VLA metrics.
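The loop‑splitting step above can be sketched in plain Python (a scalar model of the idea, not SVE code; the function name and default VLEN are invented for illustration): the bulk of the array is processed in chunks that fill every lane, and only the remainder takes a separate tail path.

```python
def scale_amplitudes(amps, factor, vlen=8):
    """Scale every amplitude, modelling the full-vector / tail loop split."""
    out = [0.0] * len(amps)
    full = len(amps) - len(amps) % vlen
    # "Full-vector" part: each chunk occupies all vlen lanes.
    for base in range(0, full, vlen):
        for lane in range(vlen):
            out[base + lane] = amps[base + lane] * factor
    # "Tail" part: fewer than vlen elements remain; on SVE this would run
    # under a predicate mask rather than as a scalar remainder loop.
    for i in range(full, len(amps)):
        out[i] = amps[i] * factor
    return out
```

Because `vlen` is a runtime parameter rather than a compile‑time constant, the same function works for any vector width — the essence of the vector‑length‑agnostic refactor.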
Results & Findings
| Processor | Native VLEN | Speedup vs. baseline | Key driver |
|---|---|---|---|
| Fujitsu A64FX | 512‑bit (SVE) | 4.5× | Aggressive memory‑layout adaptation + gate fusion |
| NVIDIA Grace | 128‑bit (SVE2) | 2.5× | Load buffering + fine‑grained loop control |
| AWS Graviton 3 | 256‑bit (SVE) | 1.5× | VLEN‑adaptive memory layout |
- Scalability: The largest speedup appears on A64FX, the processor with the widest vectors, indicating that the VLA design extracts more parallelism as vector width grows.
- Portability: A single source file compiled for each target yields near‑optimal performance, eliminating the need for hand‑tuned, architecture‑specific kernels.
- Instrumentation insights: The new PMU events reveal that on A64FX more than 80 % of vector lanes stay busy throughout the simulation, while on Graviton 3 utilization hovers around 45 %.
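As a rough illustration of what such a lane‑utilization figure measures (the formula and counter names here are assumptions for exposition, not the paper’s exact PMU definition):

```python
def lane_utilization(active_lane_ops, vector_instructions, vlen_lanes):
    """Fraction of available lanes doing useful work: 1.0 when every
    vector instruction retires with all lanes active; predicated tail
    iterations pull the ratio below 1.0."""
    return active_lane_ops / (vector_instructions * vlen_lanes)

# Example: a 36-iteration loop on 8-lane vectors needs 5 vector
# instructions (4 full + 1 predicated tail with 4 active lanes).
util = lane_utilization(36, 5, 8)  # 36 / 40 = 0.9
```

Ratios like this make under‑filled lanes visible, which is what motivates the gate‑fusion and loop‑splitting optimizations above.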
Practical Implications
- Quantum‑software developers can now target a broader range of ARM‑based cloud instances (e.g., AWS Graviton, Azure Arm) without maintaining separate SIMD code paths.
- Tool vendors (e.g., Qiskit, Cirq, Google Qsim) can integrate the VLA kernels to offer an “ARM‑optimized” simulation mode, giving users cost‑effective access to larger qubit counts on commodity hardware.
- Hardware architects gain concrete evidence that VLA‑friendly memory layouts and gate fusion are essential for extracting performance from SVE/RVV, informing future ISA extensions and micro‑architectural optimizations.
- Performance engineers can adopt the presented PMU metrics to profile other VLA workloads (e.g., deep‑learning kernels, scientific simulations) and identify vector‑utilization bottlenecks early in the development cycle.
Limitations & Future Work
- Circuit‑size ceiling: Experiments stop at 36 qubits; scaling beyond 40 qubits will stress memory bandwidth and may require out‑of‑core techniques.
- VLEN discovery overhead: Querying VLEN at runtime adds a small constant cost. The authors suggest compile‑time specialization for known VLENs as a possible improvement.
- RISC‑V RVV support: The study focuses on ARM SVE; extending the implementation to RISC‑V RVV (with its different predication model) is left for future work.
- Dynamic workload adaptation: The current gate‑fusion strategy is static; a runtime scheduler that decides when to fuse based on observed vector utilization could yield further gains.
Overall, the paper demonstrates that a thoughtfully engineered VLA approach can deliver portable, high‑performance quantum‑state simulations on today’s ARM processors—an encouraging sign for both the quantum‑computing and heterogeneous‑computing communities.
Authors
- Maya Gokhale
- Andreas Herten
- Pei‑Hung Lin
- Ivy Peng
- Gabin Schieffer
- Ruimin Shi
Paper Information
| Item | Details |
|---|---|
| arXiv ID | 2602.09604v1 |
| Categories | cs.DC |
| Published | February 10, 2026 |