[Paper] Parallelizing Large-Scale Tensor Network Contraction on Multiple GPUs

Published: June 1, 2026 at 04:02 AM EDT
Source: arXiv - 2606.01852v1

Overview

The paper introduces a new multi‑GPU framework for exact tensor‑network contraction, a core operation behind quantum‑circuit simulators, quantum‑error‑correction tools, and many combinatorial‑optimization algorithms. By moving away from the traditional “slicing” parallelism and instead distributing intermediate tensors across GPUs with explicit communication, the authors achieve dramatic speedups on both a single DGX‑H100 node and on clusters with up to 1 024 GPUs.

Key Contributions

  • Communication‑aware distribution model: Converts a static contraction path into a schedule that balances compute and inter‑GPU data movement.
  • GEMM‑oriented mode reordering: Reorders tensor dimensions to maximize the use of high‑throughput matrix‑multiply kernels (GEMM) on GPUs.
  • Scalable multi‑GPU implementation: Demonstrates up to 173× speedup over slicing on a single 8‑GPU DGX‑H100 node and up to 67 869× extra speedup over slicing when scaling to 1 024 H100 GPUs.
  • Comprehensive evaluation: Benchmarks four large‑scale tensor‑network workloads (quantum circuit simulation, error‑correction decoding, combinatorial optimization, many‑body dynamics) on NVLink‑connected DGX systems and InfiniBand clusters.
  • Open‑source reference implementation (released alongside the paper) that integrates with existing tensor‑network libraries.
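
To make the GEMM‑oriented mode reordering idea concrete, here is a minimal NumPy sketch, not the authors' implementation. It permutes each tensor so that free modes and contracted modes are grouped together, flattens both into matrices, and performs a single matrix multiply. The function name `contract_as_gemm` and the single‑letter mode labels are assumptions made for illustration, and the sketch assumes every shared mode is summed over (no batch modes).

```python
import numpy as np

def contract_as_gemm(a, a_modes, b, b_modes):
    """Contract two tensors over all shared modes with one matrix multiply.

    a_modes / b_modes are strings of single-letter mode labels, e.g. "ijk".
    Hypothetical helper for illustration only.
    """
    shared = [m for m in a_modes if m in b_modes]
    a_free = [m for m in a_modes if m not in shared]
    b_free = [m for m in b_modes if m not in shared]

    # Reorder A to (free..., shared...) and B to (shared..., free...).
    a_perm = np.transpose(a, [a_modes.index(m) for m in a_free + shared])
    b_perm = np.transpose(b, [b_modes.index(m) for m in shared + b_free])

    a_free_shape = a_perm.shape[:len(a_free)]
    b_free_shape = b_perm.shape[len(shared):]
    # Size of the flattened contracted dimension (1 if nothing is shared).
    k = int(np.prod(a_perm.shape[len(a_free):], dtype=np.int64))

    # Flatten to 2-D so the whole contraction is a single GEMM.
    out = a_perm.reshape(-1, k) @ b_perm.reshape(k, -1)
    return out.reshape(a_free_shape + b_free_shape), "".join(a_free + b_free)

rng = np.random.default_rng(0)
a = rng.standard_normal((2, 3, 4))   # modes i, j, k
b = rng.standard_normal((4, 5, 3))   # modes k, l, j
out, out_modes = contract_as_gemm(a, "ijk", b, "klj")
```

Here `out` has modes `"il"` and matches `np.einsum("ijk,klj->il", a, b)`. On a GPU the same reshaping lets the bulk of the work run through a dense matrix‑multiply kernel.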

Methodology

  1. Fixed contraction path – The authors start with a contraction order generated by a standard optimizer (e.g., opt_einsum).
  2. Mode (dimension) analysis – Each tensor’s axes are classified as local (can stay on the same GPU) or remote (must be communicated).
  3. GEMM‑oriented reordering – Axes are permuted so that the bulk of each contraction becomes a dense matrix‑multiply, which GPUs execute extremely efficiently.
  4. Communication‑aware distribution planning – Using a lightweight cost model (compute vs. NVLink/InfiniBand bandwidth), the framework decides how to split intermediate tensors across GPUs, minimizing the amount of data that must travel.
  5. Explicit data movement – The schedule inserts NCCL‑based AllGather, ReduceScatter, or point‑to‑point transfers at the right points, ensuring that each GPU always has the pieces it needs for the next GEMM.
  6. Execution engine – A thin runtime orchestrates the compute kernels and communication steps, overlapping them where possible to hide latency.
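
Steps 4 and 5 can be illustrated with a single‑process emulation. The sketch below is an assumption‑laden toy, not the paper's scheduler and not real NCCL code: it splits the contracted dimension of one matrix multiply across `num_devices` hypothetical devices, lets each compute a local partial product, and then mimics a ReduceScatter (sum the partials, leave each device one row block) followed by an AllGather. The function names are made up for this example.

```python
import numpy as np

def distributed_matmul_k_split(a, b, num_devices):
    """Emulate C = A @ B with the contracted dimension split across devices."""
    # Each "device" holds a column block of A and the matching row block of B.
    a_shards = np.array_split(a, num_devices, axis=1)
    b_shards = np.array_split(b, num_devices, axis=0)

    # Local GEMM per device: each result is a full-size partial sum of C.
    partials = [a_s @ b_s for a_s, b_s in zip(a_shards, b_shards)]

    # Emulated ReduceScatter: sum the partials, then give each device
    # one block of rows of the reduced result.
    reduced = np.sum(partials, axis=0)
    return np.array_split(reduced, num_devices, axis=0)

def all_gather(shards):
    """Emulated AllGather: reassemble the row blocks into the full tensor."""
    return np.concatenate(shards, axis=0)

rng = np.random.default_rng(1)
a = rng.standard_normal((6, 10))
b = rng.standard_normal((10, 3))
c_shards = distributed_matmul_k_split(a, b, num_devices=4)
c_full = all_gather(c_shards)
```

After the emulated ReduceScatter each device holds only its own row block of the result, which is the property that keeps per‑device memory bounded; `c_full` equals `a @ b` up to floating‑point rounding.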

The approach is deliberately hardware‑aware: on a single DGX node the ultra‑fast NVLink makes communication cheap, while on larger clusters the planner accounts for slower InfiniBand links to keep the overall runtime dominated by computation.
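
The summary does not spell out the paper's cost model, so the following is only a roofline‑style sketch of how such a compute‑versus‑bandwidth trade‑off might be estimated. All names (`Link`, `step_time`) and all numeric values are hypothetical placeholders, not measured hardware figures or numbers from the paper.

```python
from dataclasses import dataclass

@dataclass
class Link:
    name: str
    bandwidth_bytes_per_s: float  # placeholder value, not a measured figure

def gemm_flops(m, n, k, complex_valued=True):
    # A real multiply-add counts as 2 flops; a complex one as 8.
    return (8 if complex_valued else 2) * m * n * k

def step_time(m, n, k, bytes_moved, peak_flops, link, overlap=True):
    """Estimate compute time, communication time, and total time for one step."""
    compute = gemm_flops(m, n, k) / peak_flops
    comm = bytes_moved / link.bandwidth_bytes_per_s
    # With perfect overlap the slower of the two dominates; otherwise they add.
    total = max(compute, comm) if overlap else compute + comm
    return compute, comm, total

fast_link = Link("intra-node (placeholder)", 4e11)
slow_link = Link("inter-node (placeholder)", 5e10)
PEAK_FLOPS = 1e14  # placeholder sustained throughput

for link in (fast_link, slow_link):
    compute, comm, total = step_time(4096, 4096, 4096, 1e9, PEAK_FLOPS, link)
    bound = "compute" if compute >= comm else "communication"
    print(f"{link.name}: {bound}-bound, total {total * 1e3:.2f} ms")
```

With these placeholder numbers the same step is compute‑bound on the faster link and communication‑bound on the slower one, which is the kind of distinction a hardware‑aware planner has to account for.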

Results & Findings

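Platform     | GPUs         | Workload                    | Speedup vs. slicing* | Compute capture
-------------|--------------|-----------------------------|----------------------|----------------
DGX‑H100     | 8            | Quantum circuit (≈30‑qubit) | 7–173×               | 87–101 %
H100 Cluster | 1 024        | Quantum error correction    | 42×                  | 90 %
H100 Cluster | 1 024        | Combinatorial optimization  | 67 869×              | 99 %
(not listed) | (not listed) | Many‑body dynamics          | ≈10⁴×                | 95 %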

*Baseline = embarrassingly parallel slicing (each slice runs independently on a GPU).
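
For readers unfamiliar with the slicing baseline, here is a small NumPy sketch of the idea under simplifying assumptions: one index of a toy three‑tensor network is fixed to each of its values in turn, each resulting slice is contracted independently, and the slice results are summed. The network, shapes, and function name are invented for illustration. In a real slicing setup each slice would run on its own GPU; here they run sequentially on the CPU.

```python
import numpy as np

rng = np.random.default_rng(2)
a = rng.standard_normal((4, 8))      # modes i, s
b = rng.standard_normal((8, 6, 5))   # modes s, j, k
c = rng.standard_normal((5, 4))      # modes k, i

def contract_sliced(a, b, c):
    """Contract the toy network by slicing over the shared index s."""
    num_slices = a.shape[1]
    # Each term fixes s to one value; the terms share no data dependencies,
    # so they could be computed independently and summed at the end.
    return sum(
        np.einsum("i,jk,ki->j", a[:, s], b[s], c) for s in range(num_slices)
    )

sliced = contract_sliced(a, b, c)
full = np.einsum("is,sjk,ki->j", a, b, c)
```

`sliced` agrees with `full` up to rounding. Note that the tensor `c`, which does not carry the sliced index, is re‑read and re‑contracted in every slice; this kind of repeated work is what slicing trades for independence between slices.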

Key takeaways:

  • Communication is no longer the bottleneck on modern NVLink‑connected nodes; the framework extracts almost the full theoretical compute reduction (87–101 % compute capture in the single‑node results above).
  • On large clusters, the extra speedup grows super‑linearly because the slice count grows exponentially with the number of sliced indices, multiplying redundant work, whereas distribution keeps the per‑GPU tensor size manageable.
  • The method works for heterogeneous contraction graphs, not just regular lattice‑type networks, showing broad applicability.
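
The point about slice counts can be made concrete with a back‑of‑the‑envelope calculation. The sketch below assumes every index has dimension 2, that a distributed tensor is spread evenly over the GPUs sharing it, and that workspace and other tensors are ignored. The tensor size and memory budget are illustrative placeholders, not figures from the paper, and `indices_to_slice` is a hypothetical helper.

```python
def indices_to_slice(log2_elems, bytes_per_elem, mem_bytes_per_gpu, gpus_sharing=1):
    """Number of dimension-2 indices to slice so the largest tensor fits."""
    capacity_elems = (mem_bytes_per_gpu * gpus_sharing) // bytes_per_elem
    log2_capacity = capacity_elems.bit_length() - 1  # floor(log2(capacity))
    return max(0, log2_elems - log2_capacity)

GIB = 2**30
# Placeholder scenario: largest intermediate has 2**40 elements of 8 bytes,
# and each GPU may devote 64 GiB to it.
for gpus in (1, 8):
    s = indices_to_slice(40, 8, 64 * GIB, gpus)
    print(f"{gpus} GPU(s) per tensor: slice {s} indices -> {2**s} slices")
# 1 GPU(s) per tensor: slice 7 indices -> 128 slices
# 8 GPU(s) per tensor: slice 4 indices -> 16 slices
```

Under these assumptions, spreading the tensor over 8 GPUs removes 3 sliced indices and cuts the slice count by a factor of 8. Every sliced index that is avoided halves the number of independent contractions that have to be run.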

Practical Implications

  • Quantum‑circuit simulators (e.g., Qiskit‑Aer, Cirq) can now push beyond the 30‑qubit barrier on commodity GPU clusters without resorting to approximate methods.
  • Error‑correction research gains the ability to simulate full‑scale surface‑code decoders in realistic time, accelerating the design of fault‑tolerant architectures.
  • Combinatorial‑optimization solvers that encode problems as tensor networks (e.g., Max‑Cut, SAT) can exploit massive parallelism, opening the door to near‑real‑time solutions for industry‑scale instances.
  • Software developers can integrate the framework via a thin API that mirrors existing tensor‑network libraries, reusing their contraction path planners while gaining multi‑GPU speedups automatically.
  • The communication‑aware scheduling ideas are transferable to other GPU‑heavy workloads (large‑scale linear algebra, deep‑learning model parallelism), suggesting a broader impact beyond quantum‑focused domains.

Limitations & Future Work

  • Memory pressure: Distributing large intermediate tensors still requires each GPU to hold a sizable chunk; extremely deep networks may exceed per‑GPU memory even after distribution.
  • Static planning: The schedule is generated once before execution; dynamic load imbalance (e.g., due to hardware throttling) is not currently mitigated.
  • Hardware dependence: The biggest gains rely on high‑bandwidth NVLink; on clusters with only PCIe or slower interconnects the communication model would need retuning.
  • Future directions proposed by the authors include: adaptive runtime re‑balancing, integration with automatic contraction‑path optimizers that co‑optimize for distribution, and extending the framework to heterogeneous accelerators (e.g., AMD GPUs, TPUs).

Bottom line: By treating tensor‑network contraction as a communication‑aware distributed GEMM problem, the authors unlock orders‑of‑magnitude speedups on modern GPU hardware. A traditionally memory‑bound, slice‑heavy workflow becomes a scalable, compute‑driven pipeline that developers of next‑generation quantum‑simulation and optimization tools can adopt.

Authors

  • Feng Pan
  • Hanfeng Gu
  • Paul Springer
  • Xipeng Li

Paper Information

  • arXiv ID: 2606.01852v1
  • Categories: cs.DC, quant-ph
  • Published: June 1, 2026