[Paper] High-performance Vector-length Agnostic Quantum Circuit Simulations on ARM Processors
Source: arXiv:2602.09604v1
Overview
This paper investigates whether quantum‑state‑vector simulations—a core workload for quantum‑computing research—can run efficiently on emerging vector‑length agnostic (VLA) architectures such as:
- ARM’s Scalable Vector Extension (SVE)
- RISC‑V’s Vector Extension (RVV)
By designing a single‑source implementation that automatically adapts to any vector length, the authors demonstrate sizable speedups on three modern ARM‑based CPUs:
- NVIDIA Grace
- AWS Graviton 3
- Fujitsu A64FX
The results show that high‑performance portability is within reach for quantum‑simulation tools.
Key Contributions
VLA‑aware design for quantum‑state simulations – a single code base that scales with any vector length without recompilation.
Four novel optimization techniques tailored to VLA hardware:
- VLEN‑adaptive memory layout – reorganizes state‑vector data to match the runtime vector width.
- Load buffering – hides memory latency by pre‑fetching and staging vector loads.
- Fine‑grained loop control – splits loops to keep vector lanes busy even when the problem size is not a multiple of VLEN.
- Gate‑fusion‑based arithmetic‑intensity adaptation – merges consecutive quantum gates to increase the compute‑to‑memory ratio when vector lanes are under‑utilized.
Instrumentation framework – new performance‑monitoring‑unit (PMU) events and metrics that expose vectorization activity on ARM SVE/RVV.
Empirical evaluation – integrated with Google’s Qsim simulator and benchmarked on five circuits up to 36 qubits, achieving up to:
- 4.5× speedup on A64FX,
- 2.5× speedup on Grace,
- 1.5× speedup on Graviton 3.
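To make the gate‑fusion idea concrete, here is a minimal Python sketch (not the paper’s implementation; all function names are invented for illustration): two single‑qubit gates acting on the same qubit are pre‑multiplied into one 2×2 matrix, so the state vector is traversed once instead of twice — the traversal saved is exactly where arithmetic intensity rises.

```python
def matmul2(a, b):
    """Multiply two 2x2 matrices (row-major nested lists)."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def apply_1q(state, gate, target, n_qubits):
    """Apply a 2x2 gate to the `target` qubit of a 2**n_qubits state vector."""
    out = state[:]
    step = 1 << target
    for block in range(0, 1 << n_qubits, step << 1):
        for j in range(block, block + step):
            a0, a1 = state[j], state[j + step]
            out[j] = gate[0][0] * a0 + gate[0][1] * a1
            out[j + step] = gate[1][0] * a0 + gate[1][1] * a1
    return out

s = 2 ** -0.5
H = [[s, s], [s, -s]]      # Hadamard
Z = [[1, 0], [0, -1]]      # Pauli-Z

state = [1.0, 0.0, 0.0, 0.0]                        # |00> on 2 qubits
seq = apply_1q(apply_1q(state, H, 0, 2), Z, 0, 2)   # two sweeps over the state
fused = apply_1q(state, matmul2(Z, H), 0, 2)        # one sweep, fused gate
```

The two results agree, but the fused version touches each amplitude half as often — the same trade the paper exploits when vector lanes would otherwise sit idle waiting on memory.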
Methodology
- Baseline – The authors start from Qsim’s existing implementation, which assumes a fixed SIMD width (e.g., AVX‑512).
- Vector‑Length‑Agnostic Refactor – Replace all SIMD intrinsics with SVE/RVV‑compatible intrinsics that query the hardware’s current VLEN at runtime.
- Memory‑Layout Adaptation – Store the quantum‑state vector (2ⁿ complex amplitudes) in a layout that can be tiled to match the runtime vector width, reducing stride penalties.
- Load Buffering & Loop Splitting – Use a double‑buffer scheme to pre‑load the next chunk of amplitudes while the current chunk is being processed; split loops into “full‑vector” and “tail” parts to keep lanes busy.
- Gate Fusion – Combine sequences of single‑qubit and two‑qubit gates that act on the same qubits into a single matrix multiplication, raising arithmetic intensity when vector lanes would otherwise be idle.
- Instrumentation – Add custom PMU counters (e.g., `vector-load-issued`, `vector-store-completed`) to quantify how often the hardware’s vector engine is utilized.
- Evaluation – Run the optimized Qsim on three ARM CPUs, each with a different native VLEN (128‑bit, 256‑bit, 512‑bit). Measure performance in wall‑clock time, FLOPs, and the new VLA metrics.
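The loop‑splitting step above can be sketched in plain Python (a scalar model of the idea, not SVE code; the function name and default VLEN are invented for illustration): the bulk of the array is processed in chunks that fill every lane, and only the remainder takes a separate tail path.

```python
def scale_amplitudes(amps, factor, vlen=8):
    """Scale every amplitude, modelling the full-vector / tail loop split."""
    out = [0.0] * len(amps)
    full = len(amps) - len(amps) % vlen
    # "Full-vector" part: each chunk occupies all vlen lanes.
    for base in range(0, full, vlen):
        for lane in range(vlen):
            out[base + lane] = amps[base + lane] * factor
    # "Tail" part: fewer than vlen elements remain; on SVE this would run
    # under a predicate mask rather than as a scalar remainder loop.
    for i in range(full, len(amps)):
        out[i] = amps[i] * factor
    return out
```

Because `vlen` is a runtime parameter rather than a compile‑time constant, the same function works for any vector width — the essence of the vector‑length‑agnostic refactor.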
Results & Findings
| Processor | Native VLEN | Speedup vs. baseline | Key driver |
|---|---|---|---|
| Fujitsu A64FX | 512‑bit (SVE) | 4.5× | Aggressive memory‑layout adaptation + gate fusion |
| NVIDIA Grace | 128‑bit (SVE2) | 2.5× | Load buffering + fine‑grained loop control |
| AWS Graviton 3 | 256‑bit (SVE) | 1.5× | VLEN‑adaptive memory layout |
- Scalability: The largest speedup appears on A64FX, the processor with the widest vectors, indicating that the VLA design extracts more parallelism as vector width grows.
- Portability: A single source file compiled for each target yields near‑optimal performance, eliminating the need for hand‑tuned, architecture‑specific kernels.
- Instrumentation insights: The new PMU events reveal that on A64FX more than 80 % of vector lanes stay busy throughout the simulation, while on Graviton 3 utilization hovers around 45 %.
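As a rough illustration of what such a lane‑utilization figure measures (the formula and counter names here are assumptions for exposition, not the paper’s exact PMU definition):

```python
def lane_utilization(active_lane_ops, vector_instructions, vlen_lanes):
    """Fraction of available lanes doing useful work: 1.0 when every
    vector instruction retires with all lanes active; predicated tail
    iterations pull the ratio below 1.0."""
    return active_lane_ops / (vector_instructions * vlen_lanes)

# Example: a 36-iteration loop on 8-lane vectors needs 5 vector
# instructions (4 full + 1 predicated tail with 4 active lanes).
util = lane_utilization(36, 5, 8)  # 36 / 40 = 0.9
```

Ratios like this make under‑filled lanes visible, which is what motivates the gate‑fusion and loop‑splitting optimizations above.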
Practical Implications
- Quantum‑software developers can now target a broader range of ARM‑based cloud instances (e.g., AWS Graviton, Azure Arm) without maintaining separate SIMD code paths.
- Tool vendors (e.g., Qiskit, Cirq, Google Qsim) can integrate the VLA kernels to offer an “ARM‑optimized” simulation mode, giving users cost‑effective access to larger qubit counts on commodity hardware.
- Hardware architects gain concrete evidence that VLA‑friendly memory layouts and gate fusion are essential for extracting performance from SVE/RVV, informing future ISA extensions and micro‑architectural optimizations.
- Performance engineers can adopt the presented PMU metrics to profile other VLA workloads (e.g., deep‑learning kernels, scientific simulations) and identify vector‑utilization bottlenecks early in the development cycle.
Limitations & Future Work
- Circuit‑size ceiling: Experiments stop at 36 qubits; scaling beyond 40 qubits will stress memory bandwidth and may require out‑of‑core techniques.
- VLEN discovery overhead: Querying VLEN at runtime adds a small constant cost. The authors suggest compile‑time specialization for known VLENs as a possible improvement.
- RISC‑V RVV support: The study focuses on ARM SVE; extending the implementation to RISC‑V RVV (with its different predication model) is left for future work.
- Dynamic workload adaptation: The current gate‑fusion strategy is static; a runtime scheduler that decides when to fuse based on observed vector utilization could yield further gains.
Overall, the paper demonstrates that a thoughtfully engineered VLA approach can deliver portable, high‑performance quantum‑state simulations on today’s ARM processors—an encouraging sign for both the quantum‑computing and heterogeneous‑computing communities.
Authors
- Maya Gokhale
- Andreas Herten
- Pei‑Hung Lin
- Ivy Peng
- Gabin Schieffer
- Ruimin Shi
Paper Information
| Item | Details |
|---|---|
| arXiv ID | 2602.09604v1 |
| Categories | cs.DC |
| Published | February 10, 2026 |