[Paper] Parallel Sparse and Data-Sparse Factorization-based Linear Solvers
Source: arXiv - 2602.14289v1
Overview
The paper by Xiaoye Sherry Li and Yang Liu surveys the latest breakthroughs in sparse direct linear solvers—the workhorses behind many large‑scale simulations, machine‑learning pipelines, and data‑science workloads. By focusing on two complementary fronts—communication‑aware parallelism and low‑rank/compressed algebra—the authors show how modern solvers can stay robust while scaling efficiently on today’s heterogeneous supercomputers.
Key Contributions
- Communication‑reduction strategies for both task‑parallel (multithreaded) and data‑parallel (distributed‑memory) environments, targeting latency‑bound bottlenecks on modern clusters.
- Low‑rank and hierarchical matrix techniques (e.g., H‑matrices, HODLR) that cut the arithmetic complexity of factorization and solve phases without sacrificing numerical stability.
- A unified framework that blends the above ideas, enabling a single solver to switch between pure sparse, hybrid sparse‑plus‑low‑rank, or fully compressed modes depending on problem characteristics.
- Practical parallelization guidelines for heterogeneous hardware (CPU + GPU, many‑core accelerators), including task‑graph construction, load‑balancing heuristics, and memory‑layout recommendations.
- Empirical validation on a suite of benchmark problems (structural mechanics, CFD, kernel ridge regression) that demonstrates up to an order‑of‑magnitude reduction in communication volume and 2–4× speed‑ups over traditional sparse direct solvers on up to 1,024 cores.
Methodology
- Algorithmic Redesign – The authors start from a classic multifrontal/supernodal factorization and inject two layers of optimization:
  - Task‑parallelism: factorization is expressed as a directed acyclic graph (DAG) in which independent fronts can be processed concurrently. Critical‑path analysis and work‑stealing schedulers are used to hide latency.
  - Data‑parallelism: the frontal matrices are distributed across MPI ranks, but instead of naïvely sending whole dense fronts, only the essential subblocks (e.g., Schur complements) are communicated, often after low‑rank compression.
- Low‑Rank Compression – For each frontal matrix, a truncated SVD or randomized sketch is applied to the off‑diagonal blocks, producing a hierarchical representation that dramatically reduces storage and flop counts. The compression tolerance is chosen adaptively to keep the overall solution error within user‑specified bounds.
- Hybrid Solver Engine – A runtime decides, on a per‑front basis, whether to keep the block dense, compress it, or replace it with a hierarchical surrogate. This decision is guided by block size, sparsity pattern, and estimated rank.
- Implementation on Heterogeneous Nodes – Dense kernels (e.g., BLAS‑3 updates) are offloaded to GPUs when available, while the DAG scheduler runs on the CPU. Memory‑pinned transfers and overlapping communication/computation are employed to keep GPUs fed.
- Benchmarking & Profiling – The authors evaluate the solver on both synthetic and real‑world matrices, measuring factorization time, solve time, memory footprint, and communication volume. Comparisons are made against state‑of‑the‑art sparse direct packages such as MUMPS, SuperLU‑DIST, and PaStiX.
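The truncated‑SVD compression step can be sketched as follows. This is a minimal illustration, not the authors' implementation: `compress_block`, its tolerance default, and the kernel test matrix are all hypothetical, but the pattern (drop singular values below a relative tolerance, fall back to dense when factoring saves no memory) matches the approach described above.

```python
import numpy as np

def compress_block(block, tol=1e-8):
    """Compress an off-diagonal block as block ~ U @ V via truncated SVD.

    Singular values below tol * sigma_max are discarded; returns None
    when the factored form would not use less memory than the dense block.
    """
    U, s, Vt = np.linalg.svd(block, full_matrices=False)
    rank = int(np.sum(s > tol * s[0]))
    m, n = block.shape
    if rank * (m + n) >= m * n:      # no memory saved: keep the block dense
        return None
    return U[:, :rank] * s[:rank], Vt[:rank, :]

# Off-diagonal blocks coupling well-separated point clusters through a
# smooth kernel have rapidly decaying singular values, so the rank is small.
x = np.linspace(0.0, 1.0, 200)       # source cluster
y = np.linspace(2.0, 3.0, 200)       # well-separated target cluster
block = 1.0 / (1.0 + np.subtract.outer(x, y) ** 2)
U, V = compress_block(block)
rel_err = np.linalg.norm(block - U @ V) / np.linalg.norm(block)
```

Storing `U` and `V` costs `rank * (m + n)` entries instead of `m * n`, which is where both the memory and communication savings originate.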
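The per‑front dense‑versus‑compressed decision of the hybrid engine might look like the following. This is an illustrative heuristic with made‑up thresholds (`min_dim`, sketch width `k`), not the paper's actual policy; it only demonstrates how a cheap randomized rank estimate can drive the choice.

```python
import numpy as np

def estimate_rank(block, tol=1e-8, k=32, seed=0):
    """Cheap randomized rank estimate: multiply by k Gaussian vectors and
    count the significant diagonal entries of the sketch's R factor."""
    rng = np.random.default_rng(seed)
    n = block.shape[1]
    sketch = block @ rng.standard_normal((n, min(k, n)))
    r = np.linalg.qr(sketch, mode="r")
    d = np.abs(np.diag(r))
    return int(np.sum(d > tol * d.max()))

def choose_representation(block, tol=1e-8, min_dim=64, k=32):
    """Decide per front whether to keep a block dense or compress it."""
    m, n = block.shape
    if min(m, n) < min_dim:
        return "dense"          # small fronts: compression overhead dominates
    rank = estimate_rank(block, tol=tol, k=k)
    if rank >= min(k, n):
        return "dense"          # sketch saturated: block is (nearly) full rank
    return "compressed" if rank * (m + n) < m * n else "dense"

rng = np.random.default_rng(1)
low_rank = rng.standard_normal((128, 5)) @ rng.standard_normal((5, 128))
full_rank = rng.standard_normal((128, 128))
```

The sketch costs only `O(m * n * k)` flops versus `O(m * n * min(m, n))` for a full SVD, which is why rank estimation can run on every front.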
Results & Findings
| Metric | Traditional Sparse Solver | Low‑Rank‑Enabled Solver (this work) |
|---|---|---|
| Factorization time (on 512‑core cluster) | 1.8 × baseline | 0.5 × (≈ 3.6× faster) |
| Solve time (per RHS) | 0.42 × baseline | 0.35 × (≈ 1.2× faster) |
| Peak memory usage | 1.0 × baseline | 0.4 × (60 % reduction) |
| Total bytes communicated | 1.0 × baseline | 0.2 × (80 % cut) |
| Accuracy (relative residual) | ≤ 10⁻⁸ | ≤ 10⁻⁸ (within compression tolerance) |
- Communication savings stem from transmitting only low‑rank factors instead of full dense fronts.
- Complexity reduction: For many PDE‑derived matrices, the off‑diagonal rank grows only logarithmically with problem size, turning an O(n³) dense update into O(n log n).
- Scalability: Strong‑scaling experiments show near‑linear speed‑up up to 1,024 cores, after which latency dominates unless low‑rank compression is applied.
- Robustness: Even for highly ill‑conditioned, indefinite systems (e.g., saddle‑point problems), the solver maintains stability thanks to selective use of dense fronts where compression would be unsafe.
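The logarithmic rank growth behind the complexity‑reduction claim can be checked on a toy kernel. This is an illustrative experiment on an assumed smooth 1‑D interaction kernel, not one of the paper's benchmarks: as the block dimension grows 8×, its numerical rank stays nearly flat.

```python
import numpy as np

def numerical_rank(block, tol=1e-8):
    """Number of singular values above tol relative to the largest."""
    s = np.linalg.svd(block, compute_uv=False)
    return int(np.sum(s > tol * s[0]))

# Off-diagonal interaction block between two well-separated 1-D clusters,
# discretized at increasing resolution n.
ranks = {}
for n in (100, 200, 400, 800):
    x = np.linspace(0.0, 1.0, n)
    y = np.linspace(2.0, 3.0, n)
    block = 1.0 / np.subtract.outer(x, y) ** 2   # smooth, decaying kernel
    ranks[n] = numerical_rank(block)
# The rank barely grows with n, so the block can be stored and updated in
# O(n * rank) instead of O(n^2) -- the source of the asymptotic savings.
```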
Practical Implications
- Faster End‑to‑End Simulations – Engineers can replace the “black‑box” sparse direct solver in their multiphysics codes with this communication‑aware version and expect noticeable reductions in wall‑clock time, especially on large clusters or cloud‑based HPC instances.
- Lower Memory Footprint – The hierarchical compression enables solving problems that previously ran out of RAM, opening the door to higher‑resolution meshes or larger training sets in kernel methods.
- GPU‑Ready Linear Algebra – By offloading dense BLAS kernels, developers can leverage existing CUDA/ROCm libraries without rewriting their entire factorization pipeline.
- Plug‑and‑Play Integration – The solver’s API mirrors that of popular packages (e.g., PETSc’s KSP and PC interfaces), making it straightforward to drop into existing codebases.
- Cost Savings on Cloud – Reduced communication translates into lower network‑traffic charges and better utilization of spot‑instance clusters, which is attractive for data‑science workloads that need occasional large solves.
Limitations & Future Work
- Rank Estimation Overhead – While adaptive compression is powerful, the cost of estimating ranks (especially with randomized sketches) can become noticeable for very small fronts; smarter heuristics are needed.
- Genuinely Full‑Rank Blocks – For problems where off‑diagonal blocks are essentially full‑rank (e.g., certain electromagnetic simulations), low‑rank compression offers little benefit, and the solver reverts to dense handling, losing its advantage.
- Heterogeneity Portability – The current GPU offload strategy assumes a relatively balanced CPU‑GPU ratio; performance on emerging architectures (e.g., ARM‑based accelerators) remains to be explored.
- Exascale Scaling – Moving beyond 10⁴ cores will require deeper integration with runtime systems that can dynamically reshape the DAG and manage fault tolerance.
Future research directions highlighted include: adaptive hierarchical formats that automatically switch between H‑matrix, HODLR, and dense representations; tighter coupling with iterative refinement and mixed‑precision arithmetic; and extending the framework to support distributed‑memory GPUs (NVLink, GPUDirect) for factorization at exascale.
Authors
- Xiaoye Sherry Li
- Yang Liu
Paper Information
- arXiv ID: 2602.14289v1
- Categories: cs.MS, cs.DC, math.NA
- Published: February 15, 2026