[Paper] High-performance Vector-length Agnostic Quantum Circuit Simulations on ARM Processors

Published: February 10, 2026 at 04:55 AM EST
5 min read
Source: arXiv

Source: arXiv:2602.09604v1

Overview

This paper investigates whether quantum‑state‑vector simulations—a core workload for quantum‑computing research—can run efficiently on emerging vector‑length agnostic (VLA) architectures such as:

  • ARM’s Scalable Vector Extension (SVE)
  • RISC‑V’s Vector Extension (RVV)

By designing a single‑source implementation that automatically adapts to any vector length, the authors demonstrate sizable speedups on three modern ARM‑based CPUs:

  1. NVIDIA Grace
  2. AWS Graviton 3
  3. Fujitsu A64FX

The results show that high‑performance portability is within reach for quantum‑simulation tools.

Key Contributions

  • VLA‑aware design for quantum‑state simulations – a single code base that scales with any vector length without recompilation.

  • Four novel optimization techniques tailored to VLA hardware:

    1. VLEN‑adaptive memory layout – reorganizes state‑vector data to match the runtime vector width.
    2. Load buffering – hides memory latency by pre‑fetching and staging vector loads.
    3. Fine‑grained loop control – splits loops to keep vector lanes busy even when the problem size is not a multiple of VLEN.
    4. Gate‑fusion‑based arithmetic‑intensity adaptation – merges consecutive quantum gates to increase the compute‑to‑memory ratio when vector lanes are under‑utilized.
  • Instrumentation framework – new performance‑monitoring‑unit (PMU) events and metrics that expose vectorization activity on ARM SVE/RVV.

  • Empirical evaluation – integrated with Google’s Qsim simulator and benchmarked on five circuits up to 36 qubits, achieving up to:

    • 4.5× speedup on A64FX,
    • 2.5× speedup on Grace,
    • 1.5× speedup on Graviton 3.
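
The fine‑grained loop control listed above can be sketched in portable C. This is a hypothetical illustration, not the paper's actual kernel: `vlen` stands in for the lane count that a real SVE kernel would obtain from the hardware (and SVE would typically absorb the tail with predication rather than a scalar loop):

```c
#include <stddef.h>

/* Sketch of "fine-grained loop control": split a loop into a
 * full-vector body that advances in chunks of the runtime vector
 * length, plus a scalar tail for the leftover elements when the
 * problem size is not a multiple of vlen. */
double sum_split(const double *x, size_t n, size_t vlen) {
    double acc = 0.0;
    size_t i = 0;
    size_t full = n - (n % vlen);   /* elements covered by full vectors */
    for (; i < full; i += vlen)     /* full-vector body */
        for (size_t lane = 0; lane < vlen; ++lane)
            acc += x[i + lane];
    for (; i < n; ++i)              /* tail: fewer than vlen elements */
        acc += x[i];
    return acc;
}
```

Because `vlen` is read at runtime, the same binary adapts to 128‑, 256‑, or 512‑bit vectors, which is the essence of the VLA design.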

Methodology

  1. Baseline – The authors start from Qsim’s existing implementation, which assumes a fixed SIMD width (e.g., AVX‑512).
  2. Vector‑Length‑Agnostic Refactor – Replace all SIMD intrinsics with SVE/RVV‑compatible intrinsics that query the hardware’s current VLEN at runtime.
  3. Memory‑Layout Adaptation – Store the quantum‑state vector (2ⁿ complex amplitudes for an n‑qubit circuit) in a layout that can be tiled to match the vector width, reducing stride penalties.
  4. Load Buffering & Loop Splitting – Use a double‑buffer scheme to pre‑load the next chunk of amplitudes while the current chunk is being processed; split loops into “full‑vector” and “tail” parts to keep lanes busy.
  5. Gate Fusion – Combine sequences of single‑qubit and two‑qubit gates that act on the same qubits into a single matrix multiplication, raising arithmetic intensity when vector lanes would otherwise be idle.
  6. Instrumentation – Add custom PMU counters (e.g., vector-load-issued, vector-store-completed) to quantify how often the hardware’s vector engine is utilized.
  7. Evaluation – Run the optimized Qsim on three ARM CPUs, each with a different native VLEN (128‑bit, 256‑bit, 512‑bit). Measure performance in wall‑clock time, FLOPs, and the new VLA metrics.

Results & Findings

Speedup vs. baseline by processor:

  • Fujitsu A64FX – native VLEN 512‑bit (SVE) – 4.5× – key driver: aggressive memory‑layout adaptation + gate fusion
  • NVIDIA Grace – native VLEN 256‑bit (SVE) – 2.5× – key driver: load buffering + fine‑grained loop control
  • AWS Graviton 3 – native VLEN 128‑bit (SVE) – 1.5× – key driver: VLEN‑adaptive layout (limited by the smaller vector width)

  • Scalability: Speedups grow with vector width, confirming that the VLA design extracts more parallelism on wider vectors.
  • Portability: A single source file compiled for each target yields near‑optimal performance, eliminating the need for hand‑tuned, architecture‑specific kernels.
  • Instrumentation insights: The new PMU events reveal that on the A64FX more than 80 % of vector lanes stay busy throughout the simulation, while on Graviton 3 utilization hovers around 45 % due to the smaller VLEN.
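
A lane‑utilization figure like the ones above can be derived from two PMU‑style counters. The counter names here are assumptions for illustration, not actual SVE event names, and the ratio is a simplified stand‑in for the paper's metrics:

```c
/* Sketch: fraction of vector lanes doing useful work, from
 * hypothetical counters "lanes active across all vector ops" and
 * "vector ops retired", given the machine's lane count. */
double lane_utilization(unsigned long long active_lane_ops,
                        unsigned long long vector_ops,
                        unsigned vlen_lanes) {
    if (vector_ops == 0 || vlen_lanes == 0)
        return 0.0;
    return (double)active_lane_ops / ((double)vector_ops * vlen_lanes);
}
```

A value near 1.0 means almost every lane is busy on every vector operation (the A64FX regime reported above); values well below 1.0 flag predication or tail‑loop overhead.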

Practical Implications

  • Quantum‑software developers can now target a broader range of ARM‑based cloud instances (e.g., AWS Graviton, Azure Arm) without maintaining separate SIMD code paths.
  • Tool vendors (e.g., Qiskit, Cirq, Google Qsim) can integrate the VLA kernels to offer an “ARM‑optimized” simulation mode, giving users cost‑effective access to larger qubit counts on commodity hardware.
  • Hardware architects gain concrete evidence that VLA‑friendly memory layouts and gate fusion are essential for extracting performance from SVE/RVV, informing future ISA extensions and micro‑architectural optimizations.
  • Performance engineers can adopt the presented PMU metrics to profile other VLA workloads (e.g., deep‑learning kernels, scientific simulations) and identify vector‑utilization bottlenecks early in the development cycle.

Limitations & Future Work

  • Circuit‑size ceiling: Experiments stop at 36 qubits; scaling beyond 40 qubits will stress memory bandwidth and may require out‑of‑core techniques.
  • VLEN discovery overhead: Querying VLEN at runtime adds a small constant cost. The authors suggest compile‑time specialization for known VLENs as a possible improvement.
  • RISC‑V RVV support: The study focuses on ARM SVE; extending the implementation to RISC‑V RVV (with its different predication model) is left for future work.
  • Dynamic workload adaptation: The current gate‑fusion strategy is static; a runtime scheduler that decides when to fuse based on observed vector utilization could yield further gains.

Overall, the paper demonstrates that a thoughtfully engineered VLA approach can deliver portable, high‑performance quantum‑state simulations on today’s ARM processors—an encouraging sign for both the quantum‑computing and heterogeneous‑computing communities.

Authors

  • Maya Gokhale
  • Andreas Herten
  • Pei‑Hung Lin
  • Ivy Peng
  • Gabin Schieffer
  • Ruimin Shi

Paper Information

  • arXiv ID: 2602.09604v1
  • Categories: cs.DC
  • Published: February 10, 2026