[Paper] Parallelizing Large-Scale Tensor Network Contraction on Multiple GPUs

Published: June 1, 2026 at 04:02 AM EDT

Source: arXiv - 2606.01852v1

Overview

The paper introduces a new multi‑GPU framework for exact tensor‑network contraction, a core operation behind quantum‑circuit simulators, quantum‑error‑correction tools, and many combinatorial‑optimization algorithms. By moving away from the traditional “slicing” parallelism and instead distributing intermediate tensors across GPUs with explicit communication, the authors achieve dramatic speedups both on a single DGX‑H100 node and on clusters with up to 1 024 GPUs.

Key Contributions

Communication‑aware distribution model: Converts a static contraction path into a schedule that balances compute and inter‑GPU data movement.
GEMM‑oriented mode reordering: Reorders tensor dimensions to maximize the use of high‑throughput matrix‑multiply kernels (GEMM) on GPUs.
Scalable multi‑GPU implementation: Demonstrates up to 173× speedup on a single 8‑GPU node and up to 67 869× extra speedup when scaling to 1 024 H100 GPUs, far beyond what slicing can achieve.
Comprehensive evaluation: Benchmarks four large‑scale tensor‑network workloads (quantum circuit simulation, error‑correction decoding, combinatorial optimization, many‑body dynamics) on NVLink‑connected DGX systems and InfiniBand clusters.
Open‑source reference implementation: Released alongside the paper; integrates with existing tensor‑network libraries.

Methodology

Fixed contraction path – The authors start with a contraction order generated by a standard path optimizer (e.g., opt_einsum).
Mode (dimension) analysis – Each tensor’s axes are classified as local (can stay on the same GPU) or remote (must be communicated).
GEMM‑oriented reordering – Axes are permuted so that the bulk of each contraction becomes a dense matrix‑multiply, which GPUs execute extremely efficiently.
Communication‑aware distribution planning – Using a lightweight cost model (compute vs. NVLink/InfiniBand bandwidth), the framework decides how to split intermediate tensors across GPUs, minimizing the amount of data that must travel.
Explicit data movement – The schedule inserts NCCL‑based AllGather, ReduceScatter, or point‑to‑point transfers at the right points, ensuring that each GPU always has the pieces it needs for the next GEMM.
Execution engine – A thin runtime orchestrates the compute kernels and communication steps, overlapping them where possible to hide latency.
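
The GEMM‑oriented reordering step above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the helper `contract_as_gemm` and the single‑character mode labels are hypothetical, chosen only to show how permuting axes turns a pairwise contraction into one dense matrix multiply.

```python
import numpy as np

def contract_as_gemm(a, a_modes, b, b_modes):
    """Contract two tensors over their shared modes with a single matrix multiply.

    Modes are labeled by single characters, e.g. a_modes="iakb" (illustrative convention).
    """
    shared = [m for m in a_modes if m in b_modes]
    a_free = [m for m in a_modes if m not in shared]
    b_free = [m for m in b_modes if m not in shared]

    # Permute so A is laid out as (free..., shared...) and B as (shared..., free...).
    a_t = np.transpose(a, [a_modes.index(m) for m in a_free + shared])
    b_t = np.transpose(b, [b_modes.index(m) for m in shared + b_free])

    a_free_shape = a_t.shape[:len(a_free)]
    b_free_shape = b_t.shape[len(shared):]
    k = int(np.prod(a_t.shape[len(a_free):]))  # fused size of all shared modes

    # Flatten both operands to 2-D and run one GEMM.
    out = a_t.reshape(-1, k) @ b_t.reshape(k, -1)
    return out.reshape(a_free_shape + b_free_shape), "".join(a_free + b_free)

rng = np.random.default_rng(0)
a = rng.standard_normal((2, 3, 4, 5))  # modes i, a, k, b
b = rng.standard_normal((5, 6, 3))     # modes b, j, a
result, out_modes = contract_as_gemm(a, "iakb", b, "bja")

# Reference contraction over the same shared modes (a and b).
reference = np.einsum("iakb,bja->ikj", a, b)
```

On a GPU the transposes are themselves memory traffic, which is presumably why the choice of mode order matters: a layout that already matches the GEMM shape avoids the extra permutation pass.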

The approach is deliberately hardware‑aware: on a single DGX node the ultra‑fast NVLink makes communication cheap, while on larger clusters the planner accounts for slower InfiniBand links to keep the overall runtime dominated by computation.
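
To make the compute‑versus‑bandwidth trade‑off concrete, here is a rough sketch of the kind of estimate such a planner might make. The paper's actual cost model is not described in this summary; the function names are hypothetical and the throughput and bandwidth figures are placeholders, not measured hardware numbers.

```python
from dataclasses import dataclass

@dataclass
class Link:
    name: str
    bandwidth_bytes_per_s: float  # hypothetical effective bandwidth, not a measured value

def gemm_time(m, n, k, flops_per_s):
    # A dense (m x k) @ (k x n) multiply costs about 2*m*n*k floating-point operations.
    return 2.0 * m * n * k / flops_per_s

def allgather_time(total_bytes, num_gpus, link):
    # In a ring all-gather each GPU receives (P - 1) / P of the full buffer.
    return total_bytes * (num_gpus - 1) / num_gpus / link.bandwidth_bytes_per_s

# Placeholder parameters (assumptions for illustration only).
FLOPS_PER_S = 1e14
fast_link = Link("intra-node", 4e11)
slow_link = Link("inter-node", 5e10)

m = n = 2**14
k = 2**12
bytes_per_element = 8              # e.g. complex64
result_bytes = m * n * bytes_per_element

compute = gemm_time(m, n, k, FLOPS_PER_S)
fast_comm = allgather_time(result_bytes, 8, fast_link)
slow_comm = allgather_time(result_bytes, 8, slow_link)

for link, comm in [(fast_link, fast_comm), (slow_link, slow_comm)]:
    bound = "compute" if compute >= comm else "communication"
    print(f"{link.name}: compute={compute:.4f}s comm={comm:.4f}s -> {bound}-bound")
```

With these placeholder numbers the same contraction is compute‑bound on the fast link and communication‑bound on the slow one, which is the situation a distribution planner has to detect and plan around.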

Results & Findings

Platform	GPUs	Workload	Speedup vs. slicing*	Compute capture
DGX‑H100	8	Quantum circuit (≈30‑qubit)	7–173×	87–101 %
H100 Cluster	1 024	Quantum error correction	42×	90 %
H100 Cluster	1 024	Combinatorial optimization	67 869×	99 %
…	…	Many‑body dynamics	≈10⁴×	95 %

*Baseline = embarrassingly parallel slicing (each slice runs independently on a GPU).
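
For readers unfamiliar with the slicing baseline, the following small NumPy sketch shows the idea on a toy contraction. The tensors and mode names are made up for illustration: fixing one index to each of its values yields independent, smaller contractions whose results are summed.

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.standard_normal((4, 8, 16))  # modes i, s, k
b = rng.standard_normal((16, 8, 5))  # modes k, s, j

# Full contraction over both s and k.
full = np.einsum("isk,ksj->ij", a, b)

# Slicing over s: each slice fixes s to one value and is fully independent,
# so every slice could run on a different GPU with no communication.
partial = [
    np.einsum("ik,kj->ij", a[:, s, :], b[:, s, :])
    for s in range(a.shape[1])
]

# The only combination step is a final sum over the slices.
sliced = sum(partial)
```

Each slice needs less memory than the full contraction, but slicing several indices multiplies the slice count: d sliced indices of dimension 2 give 2^d independent tasks.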

Key takeaways:

Communication is no longer the bottleneck on modern NVLink‑connected nodes; the framework extracts almost the full theoretical compute reduction.
On large clusters, the extra speedup grows super‑linearly: the slice count grows exponentially with the number of sliced indices, multiplying the work that slicing must repeat, whereas distribution keeps each GPU’s share of the intermediate tensors manageable without that overhead.
The method works for heterogeneous contraction graphs, not just regular lattice‑type networks, showing broad applicability.
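
The distribution alternative can be sketched on simulated devices with plain NumPy. This is an illustration under stated assumptions, not the paper's scheduler: `NUM_DEVICES` is arbitrary, and the concatenation and sum below merely stand in for the all‑gather and reduce‑scatter collectives that a real multi‑GPU run would issue through a library such as NCCL.

```python
import numpy as np

NUM_DEVICES = 4  # simulated devices (assumption for illustration)

rng = np.random.default_rng(2)
a = rng.standard_normal((8, 12))  # modes i, k
b = rng.standard_normal((12, 6))  # modes k, j
reference = a @ b

# Case 1: shard the free mode i. Each device computes its own rows of the
# output; assembling the full result corresponds to an all-gather.
a_row_shards = np.array_split(a, NUM_DEVICES, axis=0)
row_blocks = [shard @ b for shard in a_row_shards]
gathered = np.concatenate(row_blocks, axis=0)

# Case 2: shard the contracted mode k. Each device produces a full-size
# partial sum, so the results must be reduced across devices.
a_col_shards = np.array_split(a, NUM_DEVICES, axis=1)
b_row_shards = np.array_split(b, NUM_DEVICES, axis=0)
partials = [x @ y for x, y in zip(a_col_shards, b_row_shards)]
reduced = np.sum(partials, axis=0)

# A reduce-scatter would leave each device with one block of the reduced result.
scattered = np.array_split(reduced, NUM_DEVICES, axis=0)
```

Which mode is sharded determines which collective is needed afterwards, which is why the placement of each intermediate tensor and the communication schedule have to be planned together.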

Practical Implications

Quantum‑circuit simulators (e.g., Qiskit Aer, Cirq) can push beyond the 30‑qubit barrier on commodity GPU clusters without resorting to approximate methods.
Error‑correction research gains the ability to simulate full‑scale surface‑code decoders in realistic time, accelerating the design of fault‑tolerant architectures.
Combinatorial‑optimization solvers that encode problems as tensor networks (e.g., Max‑Cut, SAT) can exploit massive parallelism, opening the door to near‑real‑time solutions for industry‑scale instances.
Software developers can integrate the framework via a thin API that mirrors existing tensor‑network libraries, reusing their contraction path planners while gaining multi‑GPU speedups automatically.
The communication‑aware scheduling ideas are transferable to other GPU‑heavy workloads (large‑scale linear algebra, deep‑learning model parallelism), suggesting a broader impact beyond quantum‑focused domains.

Limitations & Future Work

Memory pressure: Distributing large intermediate tensors still requires each GPU to hold a sizable chunk; extremely deep networks may exceed per‑GPU memory even after distribution.
Static planning: The schedule is generated once before execution; dynamic load imbalance (e.g., due to hardware throttling) is not currently mitigated.
Hardware dependence: The biggest gains rely on high‑bandwidth NVLink; on clusters with only PCIe or slower interconnects the communication model would need retuning.
Future directions proposed by the authors include: adaptive runtime re‑balancing, integration with automatic contraction‑path optimizers that co‑optimize for distribution, and extending the framework to heterogeneous accelerators (e.g., AMD GPUs, TPUs).

Bottom line: By treating tensor‑network contraction as a communication‑aware distributed GEMM problem, the authors unlock orders‑of‑magnitude speedups on modern GPU hardware. A traditionally memory‑bound, slice‑heavy workflow becomes a scalable, compute‑driven pipeline that developers of quantum‑simulation and optimization tools can adopt.

Authors

Feng Pan
Hanfeng Gu
Paul Springer
Xipeng Li

Paper Information

arXiv ID: 2606.01852v1
Categories: cs.DC, quant-ph
Published: June 1, 2026