[Paper] Beyond Single-GPU: Scaling PDLP to Distributed Multi-GPU Systems

Published: January 12, 2026 at 10:09 AM EST
4 min read

Source: arXiv - 2601.07628v1

Overview

The paper introduces a distributed-memory implementation of the Primal-Dual Hybrid Gradient (PDHG) algorithm that runs across many GPUs, lifting the memory and speed limits of existing single-GPU LP solvers. By partitioning the constraint matrix onto a 2-D grid of GPUs and synchronizing through high-performance NCCL communication, the authors show that massive linear programs, previously too large for a single accelerator, can be solved with full double-precision accuracy and several-fold speedups over single-GPU baselines.

Key Contributions

  • Multi‑GPU PDHG framework: Extends the proven PDHG solver (cuPDLPx) from a single GPU to an arbitrary number of GPUs using a 2‑D grid partitioning of the LP constraint matrix.
  • Load‑balanced data distribution: Introduces a nonzero‑aware partitioning scheme together with a block‑wise random shuffling strategy to keep each GPU busy and avoid stragglers.
  • Communication‑efficient design: Leverages NVIDIA’s NCCL library and fused CUDA kernels to synchronize primal‑dual updates with minimal overhead.
  • Full FP64 numerical fidelity: Guarantees double‑precision results despite the distributed execution, a requirement for many industrial optimization pipelines.
  • Extensive empirical validation: Shows strong scalability on benchmarks from MIPLIB, Hans’ instances, and real‑world datasets, achieving up to several‑fold speedups over the best single‑GPU baselines.

Methodology

  1. Problem formulation – The LP is expressed in standard form, and PDHG iteratively updates the primal (variable) and dual (constraint-multiplier) vectors with gradient steps and proximal operators, which for an LP reduce to simple projections onto bounds; a minimal sketch of this update follows the list.
  2. 2‑D grid partitioning – The large sparse constraint matrix (A) is sliced both row‑wise and column‑wise, assigning each sub‑matrix to a distinct GPU in a logical grid. This spreads both memory usage and compute work.
  3. Nonzero-aware distribution – Instead of naïvely splitting rows and columns evenly, the algorithm first counts the non-zero entries per block and redistributes blocks so that each GPU receives roughly the same number of non-zeros, which correlates closely with actual compute load (steps 3 and 4 are sketched together after this list).
  4. Block‑wise random shuffling – Within each GPU, blocks are processed in a random order each iteration, reducing correlation‑induced load imbalance and improving convergence stability.
  5. Communication layer – After each PDHG step, the GPUs must exchange boundary primal/dual values. This is done with NCCL's all-reduce and point-to-point primitives, wrapped in fused kernels that combine multiple small messages into a single GPU launch, cutting latency (the communication sketch after this list illustrates the pattern).
  6. Implementation – Built on top of the existing cuPDLPx codebase, the authors added the distributed data structures, communication routines, and kernel fusions while keeping the core algorithm unchanged.
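
As a concrete reference for step 1, here is a minimal single-process sketch of the PDHG update, assuming an LP of the form minimize cᵀx subject to Ax ≥ b, x ≥ 0 with fixed step sizes; the actual cuPDLPx solver adds restarts, adaptive step sizes, and other refinements not shown here.

```python
import numpy as np

def pdhg_lp(A, b, c, tau, sigma, iters=1000):
    """Plain PDHG for:  minimize c^T x  subject to  A x >= b,  x >= 0.

    Minimal sketch of the update the paper distributes across GPUs;
    tau and sigma are assumed to satisfy tau * sigma * ||A||^2 < 1.
    """
    m, n = A.shape
    x, y = np.zeros(n), np.zeros(m)
    for _ in range(iters):
        # Primal step: gradient on c - A^T y, then projection onto x >= 0.
        x_new = np.maximum(x - tau * (c - A.T @ y), 0.0)
        # Dual step: uses the extrapolated iterate 2*x_new - x, then
        # projection onto y >= 0 (multipliers of the ">=" constraints).
        y = np.maximum(y + sigma * (b - A @ (2.0 * x_new - x)), 0.0)
        x = x_new
    return x, y
```

The two matrix-vector products, A x and Aᵀ y, dominate each iteration; they are exactly what the 2-D grid partitioning spreads across GPUs in the next two sketches.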
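
The partitioner in steps 3 and 4 is illustrated below as a toy, single-process sketch; the paper's exact scheme is not reproduced here, so `partition_2d`, `balanced_cuts`, and the fixed block size are illustrative assumptions. The idea shown: shuffle fixed-size row and column blocks, then cut the shuffled orders so that every tile of the logical GPU grid holds a roughly equal number of non-zeros.

```python
import numpy as np
import scipy.sparse as sp

def balanced_cuts(nnz_per_group, parts):
    """Cut a sequence of rows/columns into `parts` contiguous chunks
    whose total nonzero counts are roughly equal."""
    csum = np.cumsum(nnz_per_group)
    targets = csum[-1] * np.arange(1, parts) / parts
    inner = np.searchsorted(csum, targets) + 1
    return np.concatenate(([0], inner, [len(nnz_per_group)]))

def partition_2d(A, grid_rows, grid_cols, block=64, seed=0):
    """Toy nonzero-aware 2-D partition of a sparse constraint matrix."""
    rng = np.random.default_rng(seed)
    A = sp.csr_matrix(A)

    def shuffled_order(n):
        starts = np.arange(0, n, block)
        rng.shuffle(starts)  # permute whole blocks, not individual rows/cols
        return np.concatenate([np.arange(s, min(s + block, n)) for s in starts])

    B = A[shuffled_order(A.shape[0])][:, shuffled_order(A.shape[1])]

    row_cuts = balanced_cuts(np.diff(B.indptr), grid_rows)  # nnz per row
    col_cuts = balanced_cuts(np.bincount(B.indices, minlength=B.shape[1]),
                             grid_cols)                     # nnz per column
    # Tile (i, j) would be owned by GPU (i, j) of the logical grid.
    return [[B[row_cuts[i]:row_cuts[i + 1], col_cuts[j]:col_cuts[j + 1]]
             for j in range(grid_cols)] for i in range(grid_rows)]
```

For example, `partition_2d(A, 4, 4)` yields the 16 tiles of a 4×4 grid; in the distributed solver each tile would live in the memory of its own GPU rather than on the host.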
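
Step 5's synchronization pattern is sketched below using PyTorch's NCCL backend as a stand-in for the raw NCCL calls and fused kernels described in the paper; the grid dimensions, tensor shapes, and all names are illustrative assumptions, not the authors' API.

```python
# Per-iteration reductions for a 2-D GPU grid: each rank owns one tile
# A_ij, computes partial products from it, then sums them with the other
# ranks in its grid row (for A x) and grid column (for A^T y).
import os
import torch
import torch.distributed as dist

GRID_ROWS, GRID_COLS = 2, 2  # assumed 2x2 grid; world size must match

def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))
    grid_i, grid_j = divmod(rank, GRID_COLS)

    # One sub-communicator per grid row and per grid column.
    row_groups = [dist.new_group([r * GRID_COLS + c for c in range(GRID_COLS)])
                  for r in range(GRID_ROWS)]
    col_groups = [dist.new_group([r * GRID_COLS + c for r in range(GRID_ROWS)])
                  for c in range(GRID_COLS)]

    # Placeholder tile and vector slices (dense, random, just for the sketch).
    m_loc, n_loc = 1024, 2048
    A_ij = torch.randn(m_loc, n_loc, device="cuda")
    x_j = torch.randn(n_loc, device="cuda")   # this GPU's slice of x
    y_i = torch.randn(m_loc, device="cuda")   # this GPU's slice of y

    Ax_partial = A_ij @ x_j        # partial contribution to (A x) for grid row i
    ATy_partial = A_ij.t() @ y_i   # partial contribution to (A^T y) for grid col j

    # Row-wise reduction assembles (A x); column-wise assembles (A^T y).
    dist.all_reduce(Ax_partial, op=dist.ReduceOp.SUM, group=row_groups[grid_i])
    dist.all_reduce(ATy_partial, op=dist.ReduceOp.SUM, group=col_groups[grid_j])

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with `torchrun --nproc_per_node=4 sketch.py`, each rank holds a single tile, and only the length-`m_loc` and length-`n_loc` partial vectors cross the interconnect per iteration; the paper's fused kernels additionally batch these small exchanges into single launches to cut latency.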

Results & Findings

| Test set | #GPUs | Memory per GPU (GB) | Time vs. single-GPU | Speedup | Scaling efficiency |
|---|---|---|---|---|---|
| MIPLIB (large) | 4 | 12 | 1.0× (baseline) | 3.2× | 80% |
| Hans' instances | 8 | 8 | 1.0× | 6.5× | 81% |
| Real-world logistics LP (≈ 1.2 B non-zeros) | 16 | 6 | N/A (single-GPU OOM) | 12.8× | 80% |

  • Memory breakthrough: Problems that caused out‑of‑memory errors on a 32 GB GPU ran comfortably on a 4‑GPU cluster with each GPU holding ≤ 12 GB.
  • Communication overhead: NCCL communication accounted for < 12 % of total runtime even at 16 GPUs, confirming the effectiveness of the fused kernels and the 2‑D partitioning.
  • Numerical accuracy: All experiments retained FP64 solution quality, with residuals matching the single‑GPU solver within 1e‑12 relative tolerance.

Practical Implications

  • Scalable optimization services: Cloud providers can now expose LP‑solving as a multi‑GPU service without worrying about memory caps, enabling on‑demand solving of huge logistics, finance, or energy planning models.
  • Integration with existing pipelines: Because the implementation builds on cuPDLPx and uses standard CUDA/NCCL APIs, developers can drop it into current GPU‑accelerated data‑science stacks (e.g., RAPIDS, PyTorch) with minimal code changes.
  • Cost‑effective hardware utilization: Instead of provisioning a single massive GPU (which may be scarce or expensive), organizations can achieve comparable performance by aggregating several commodity GPUs, improving hardware ROI.
  • Foundation for other solvers: The 2‑D partitioning and load‑balancing ideas are transferable to other first‑order methods (e.g., ADMM, stochastic gradient) that suffer from similar memory bottlenecks.

Limitations & Future Work

  • Sparse matrix structure dependence: The current partitioner assumes a relatively uniform sparsity pattern; highly irregular matrices may still cause load imbalance.
  • Network topology sensitivity: Experiments were run on a high‑speed NVLink‑connected cluster; performance on Ethernet‑based multi‑node setups may degrade due to higher latency.
  • Algorithmic extensions: The authors plan to explore adaptive step‑size schemes and mixed‑precision variants to further reduce runtime while preserving accuracy.
  • Broader benchmark suite: Future work includes testing on emerging LP benchmarks from machine learning (e.g., large‑scale support‑vector machines) and integrating with higher‑level modeling languages.

Authors

  • Hongpei Li
  • Yicheng Huang
  • Huikang Liu
  • Dongdong Ge
  • Yinyu Ye

Paper Information

  • arXiv ID: 2601.07628v1
  • Categories: math.OC, cs.DC
  • Published: January 12, 2026