[Paper] FastMPS: Revisit Data Parallel in Large-scale Matrix Product State Sampling

Published: December 23, 2025 at 12:33 AM EST
4 min read

Source: arXiv - 2512.20064v1

Overview

Fast‑MPS revives data parallelism for sampling from massive Matrix Product States (MPS) by marrying it with a new “tensor‑parallel” layer that splits work along the bond dimension. The authors demonstrate that this two‑level parallelism shatters the memory and I/O bottlenecks that have limited previous MPS simulators, enabling unprecedentedly large quantum‑sampling runs (8,000+ sites, χ = 10⁴) with more than a 10× speed‑up.

Key Contributions

  • Multi‑level parallel framework: combines classic data parallelism (across independent samples) with a novel tensor‑parallelism that distributes the heavy tensor contractions along the MPS bond dimension.
  • Memory‑ and I/O‑aware design: introduces on‑the‑fly compression and overlapped communication/computation to keep the per‑process memory footprint low and hide data movement latency.
  • Scalable implementation: built on MPI + NCCL (or similar high‑performance collectives) and runs efficiently on thousands of CPU/GPU processes.
  • Benchmark on Gaussian Boson Sampling (GBS): achieves >10× speed‑up over state‑of‑the‑art simulators and pushes the frontier to 8,176 lattice sites with χ = 10⁴.
  • Open‑source reference: the authors release a prototype library that can be plugged into existing tensor‑network toolkits.

Methodology

  1. Problem decomposition

    • Data parallelism: each data‑parallel group of ranks generates a subset of the total MPS samples independently.
    • Tensor parallelism: within each such group, the MPS tensors are split across a second set of processes, each holding a different slice of the bond index (the χ dimension).
  2. Compressed tensor storage

    • Before distribution, each tensor is quantized/compressed (e.g., using low‑rank approximation or block‑sparse encoding) to reduce memory usage without sacrificing sampling fidelity.
  3. Overlap of communication & computation

    • While one slice of the tensor is being contracted, the next slice is streamed in from remote ranks. This pipelining is orchestrated with non‑blocking MPI calls and CUDA streams (when GPUs are used).
  4. Collective contraction engine

    • A custom all‑reduce / all‑gather pattern aggregates partial results from the tensor‑parallel group, then proceeds to the next site in the MPS chain.
  5. Sampling loop

    • The algorithm walks through the MPS sites, performing a series of conditional probability calculations (via tensor contractions) to draw each physical index, exactly as in standard MPS sampling but now fully parallelized.
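
To make the distributed description concrete, here is a minimal single‑process sketch of this sequential sampling loop, assuming a right‑canonical MPS stored as a plain list of NumPy arrays. The function and variable names are illustrative, not taken from the FastMPS code; Fast‑MPS distributes exactly these contractions across χ‑slices and across independent samples.

```python
import numpy as np

def sample_mps(tensors, rng):
    """Draw one sample from an MPS assumed to be in right-canonical form.

    tensors: list of arrays of shape (chi_left, d, chi_right);
             the first has chi_left = 1 and the last has chi_right = 1.
    Returns one sampled physical index per site.
    """
    sample = []
    env = np.ones(1, dtype=tensors[0].dtype)        # left boundary vector (trivial bond)
    for A in tensors:
        site = np.tensordot(env, A, axes=(0, 0))    # contract left vector: shape (d, chi_right)
        probs = np.einsum('sr,sr->s', site, site.conj()).real
        probs /= probs.sum()                        # conditional probabilities P(s | earlier outcomes)
        s = rng.choice(len(probs), p=probs)
        sample.append(int(s))
        env = site[s] / np.linalg.norm(site[s])     # project onto the outcome, renormalize
    return sample

# Tiny usage example: a two-qubit product state (|0> + |1>)/sqrt(2) ⊗ |0>,
# which is trivially right-canonical (both bonds have dimension 1).
plus = np.array([[[2 ** -0.5], [2 ** -0.5]]])       # shape (1, 2, 1)
zero = np.array([[[1.0], [0.0]]])                   # shape (1, 2, 1)
print(sample_mps([plus, zero], np.random.default_rng(0)))
```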

The overall workflow can be visualized as a two‑dimensional grid of processes: rows handle different samples, columns handle different χ‑slices of the same sample.
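
As a sketch of how that process grid could be wired up with mpi4py, the snippet below splits MPI_COMM_WORLD into a tensor‑parallel “column” communicator and a data‑parallel “row” communicator, then sums χ‑sliced partial contractions with an Allreduce. The group size TP, the tensor shapes, and all variable names are illustrative assumptions rather than the paper’s actual layout, and the overlap of this communication with computation (non‑blocking calls, CUDA streams) is omitted for brevity.

```python
from mpi4py import MPI
import numpy as np

world = MPI.COMM_WORLD
rank, size = world.Get_rank(), world.Get_size()

TP = 4                              # illustrative tensor-parallel group size (χ sliced 4 ways)
assert size % TP == 0, "run with a multiple of TP ranks, e.g. mpirun -n 8"
dp_index, tp_index = divmod(rank, TP)

# Column communicator: ranks holding different χ-slices of the *same* sample stream.
tp_comm = world.Split(color=dp_index, key=tp_index)
# Row communicator: ranks holding the same χ-slice across *different* samples
# (useful for aggregating statistics at the end of the run).
dp_comm = world.Split(color=tp_index, key=dp_index)

chi, d = 1024, 2                    # illustrative full bond and physical dimensions
chi_local = chi // TP               # each tensor-parallel rank stores one χ-slice

# One site tensor, sliced along its left bond: shape (chi_local, d, chi).
A_slice = np.random.default_rng(rank).normal(size=(chi_local, d, chi))
# This rank's slice of the left boundary vector for the current bond.
env_slice = np.random.default_rng(size + rank).normal(size=(chi_local,))

# Contract the local slice, then sum partial results over the column group:
# the Allreduce reconstructs the full (d, chi) contraction the sampling step needs.
partial = np.tensordot(env_slice, A_slice, axes=(0, 0))
site = np.empty_like(partial)
tp_comm.Allreduce(partial, site, op=MPI.SUM)

probs = (site ** 2).sum(axis=1)
probs /= probs.sum()
if tp_index == 0:
    print(f"sample group {dp_index}: outcome distribution {probs}")
```

Run with, for example, `mpirun -n 8 python grid_sketch.py` (the file name is arbitrary): ranks 0–3 form one sample group and ranks 4–7 another, and each group independently reconstructs a full site contraction from its four χ‑slices.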

Results & Findings

| Metric | Prior art (e.g., QTensor, Quimb) | Fast‑MPS |
| --- | --- | --- |
| Speedup | Baseline (1×) | >10× on a 1,024‑GPU cluster |
| Scalability | Stalls beyond ~200 processes (memory pressure) | Linear up to 4,096 processes |
| Maximum problem size | ~4,000 sites, χ ≈ 2 × 10³ | 8,176 sites, χ = 10⁴ |
| Memory per rank | ~30 GB (GPU) | < 8 GB (thanks to compression) |
| I/O overhead | Dominates runtime (>30 %) | < 5 % (overlapped) |

The authors also show that the statistical properties of the sampled distributions (e.g., photon‑number histograms in GBS) match those of the exact MPS within negligible error, confirming that compression does not degrade scientific correctness.

Practical Implications

  • Quantum‑sampling research: Researchers can now simulate far larger GBS instances, aiding verification of near‑term quantum photonic devices and benchmarking quantum advantage claims.
  • Tensor‑network libraries: Fast‑MPS’s two‑level parallelism can be abstracted as a backend for popular Python packages (e.g., tensornetwork, quimb), giving developers a drop‑in performance boost for any MPS‑based workflow.
  • High‑performance ML: MPS is gaining traction as a compact model for sequence data; Fast‑MPS makes training/inference on massive datasets feasible on existing HPC clusters.
  • Memory‑constrained environments: The compression + overlap strategy can be adapted to other large‑tensor workloads (e.g., deep‑learning model parallelism, scientific simulations) where I/O is a bottleneck.
  • Scalable cloud deployments: Because the approach relies on standard MPI/NCCL primitives, it can be ported to cloud‑based GPU farms (AWS, Azure) without custom hardware, enabling on‑demand large‑scale tensor sampling as a service.

Limitations & Future Work

  • Compression trade‑offs: While the authors report minimal impact on sampling fidelity, aggressive compression may affect more delicate quantum‑state properties; a systematic error analysis is still needed.
  • Hardware heterogeneity: The current implementation assumes a fairly homogeneous CPU/GPU cluster; extending to mixed‑precision or heterogeneous node configurations could be non‑trivial.
  • Generality beyond MPS: Fast‑MPS is tailored to the linear chain structure of MPS; applying the same ideas to higher‑dimensional tensor networks (PEPS, MERA) will require new communication patterns.
  • Automation of parallel layout: Choosing the optimal split between data‑ and tensor‑parallel groups currently relies on manual tuning; an auto‑tuner could make the framework more user‑friendly.

Overall, Fast‑MPS opens a practical pathway to scale MPS sampling to problem sizes that were previously out of reach, and its design principles are likely to influence a broader class of high‑performance tensor computations.

Authors

  • Yaojian Chen
  • Si‑Qiu Gong
  • Lin Gan
  • Yanfei Liu
  • An Yang
  • Yinuo Wang
  • Chao‑yang Lu
  • Guangwen Yang

Paper Information

  • arXiv ID: 2512.20064v1
  • Categories: cs.DC
  • Published: December 23, 2025
