[Paper] pMSz: A Distributed Parallel Algorithm for Correcting Extrema and Morse‑Smale Segmentations in Lossy Compression

Published: January 4, 2026 at 11:45 PM EST
4 min read

Source: arXiv - 2601.01787v1

Overview

Lossy compression is a go‑to technique for shrinking massive scientific datasets, but the inevitable approximation can corrupt subtle topological features that downstream analyses rely on. This paper introduces pMSz, a distributed‑memory, GPU‑accelerated algorithm that restores the correctness of piecewise‑linear Morse‑Smale segmentations (PLMSS) after compression, scaling to 128 GPUs with minimal overhead.

Key Contributions

  • Distributed PLMSS correction: Extends the single‑GPU MSz method to run efficiently across many nodes, enabling correction on petascale data.
  • Communication‑light integral‑path handling: Replaces explicit integral‑path computation with a strategy that preserves steepest ascent/descent directions, dramatically cutting inter‑process traffic.
  • Relaxed synchronization scheme: Introduces a lightweight coordination protocol that maintains correctness while avoiding costly global barriers.
  • High parallel efficiency: Demonstrates >90 % scaling efficiency on up to 128 GPUs on the Perlmutter supercomputer for real‑world scientific datasets.
  • Negligible storage impact: Adds only a tiny amount of auxiliary data (direction fields) to the compressed payload.

Methodology

  1. Problem formulation – After lossy compression, the scalar field’s critical points (minima/maxima) and the associated Morse‑Smale segmentation can become inconsistent. The goal is to adjust the field so that every voxel’s steepest ascent and descent follow the same “integral paths” as in the original, uncompressed data.
  2. Simplified direction preservation – Instead of tracing full integral paths (which would require each GPU to exchange long chains of voxels), pMSz records, for every grid point, the local steepest‑up and steepest‑down neighbor indices. These direction fields are compact and can be communicated in bulk with far fewer messages; a minimal sketch of this direction‑field idea follows the list.
  3. Distributed correction loop – Each GPU locally updates its sub‑domain by following the stored directions until it reaches a critical point, correcting the scalar values on the fly. When a path crosses a domain boundary, only the direction information (not the whole path) is exchanged.
  4. Relaxed synchronization – The algorithm allows GPUs to proceed asynchronously, only synchronizing at well‑defined checkpoints where boundary direction data must be consistent. This reduces idle time compared with a strict bulk‑synchronous model.
  5. Implementation details – Built on CUDA for intra‑node parallelism and MPI for inter‑node communication; leverages GPU‑direct RDMA where available to further shrink latency.
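
To make step 2 concrete, here is a minimal, serial NumPy sketch of the bookkeeping it describes: recording, for each voxel, which neighbor lies in the steepest‑ascent and steepest‑descent direction, and then counting where a decompressed block disagrees with those stored directions. The 6‑connected stencil, the function names (`direction_fields`, `direction_mismatches`), and the toy noise model are illustrative assumptions, not the authors' code; the real implementation runs as CUDA kernels over MPI‑distributed sub‑domains, and the value‑adjustment loop that repairs mismatched voxels within the compressor's error bound is omitted here.

```python
import numpy as np

# 6-connected stencil on a 3D grid; the paper's actual neighborhood and
# boundary handling may differ (illustrative assumption).
OFFSETS = [(1, 0, 0), (-1, 0, 0),
           (0, 1, 0), (0, -1, 0),
           (0, 0, 1), (0, 0, -1)]

def direction_fields(field):
    """For every interior voxel, record the index (0..5) of the neighbor
    reached by steepest ascent and by steepest descent; -1 marks a voxel
    that is a local maximum (resp. minimum) under this stencil."""
    core = (slice(1, -1),) * 3
    up = np.full(field.shape, -1, dtype=np.int8)
    down = np.full(field.shape, -1, dtype=np.int8)
    best_up = np.zeros_like(field[core])
    best_down = np.zeros_like(field[core])
    for k, (dx, dy, dz) in enumerate(OFFSETS):
        # shifted[i, j, k] holds the value of the neighbor at offset (dx, dy, dz)
        shifted = np.roll(field, shift=(-dx, -dy, -dz), axis=(0, 1, 2))
        diff = (shifted - field)[core]
        up[core] = np.where(diff > best_up, k, up[core])
        best_up = np.maximum(diff, best_up)
        down[core] = np.where(diff < best_down, k, down[core])
        best_down = np.minimum(diff, best_down)
    return up, down

def direction_mismatches(decompressed, up_ref, down_ref):
    """Count voxels whose steepest ascent/descent neighbor in the decompressed
    field disagrees with the stored reference directions; these are the voxels
    a correction loop would need to adjust."""
    up, down = direction_fields(decompressed)
    return int(np.count_nonzero((up != up_ref) | (down != down_ref)))

# Example: a smooth field plus small bounded "compression" noise.
rng = np.random.default_rng(0)
x, y, z = np.meshgrid(*(np.linspace(0, 1, 32),) * 3, indexing="ij")
original = np.sin(6 * x) * np.cos(5 * y) + z ** 2
up_ref, down_ref = direction_fields(original)          # kept alongside the data
lossy = original + rng.uniform(-1e-3, 1e-3, original.shape)
print(direction_mismatches(lossy, up_ref, down_ref))    # voxels needing repair
```

Because each direction is just a small neighbor index, these two reference fields are what gets stored with the compressed data and exchanged at sub‑domain boundaries, rather than entire integral paths.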

Results & Findings

| Dataset (size) | GPUs | Speedup vs. single‑GPU MSz | Parallel efficiency | Correction error (post‑compression) |
| --- | --- | --- | --- | --- |
| Combustion (2 TB) | 64 | 58× | 91 % | < 0.5 % of original feature deviation |
| Cosmology (3.5 TB) | 128 | 112× | 93 % | < 0.3 % |
| Synthetic (5 TB) | 128 | 115× | 90 % | < 0.4 % |
  • Scalability: Near‑linear scaling up to 128 GPUs; communication overhead stays under 5 % of total runtime.
  • Accuracy: The corrected PLMSS matches the ground‑truth segmentation within the bounded error guarantees of the underlying compression scheme.
  • Memory footprint: The extra direction fields add ~2 bytes per voxel, a negligible increase relative to typical compressed payloads (see the back‑of‑the‑envelope sketch below).
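
As a rough check on that figure (the exact encoding is an assumption here, since the summary does not spell it out): in 3D each voxel has at most 26 neighbors, so a steepest‑ascent code and a steepest‑descent code each fit in one byte, giving two bytes per voxel.

```python
# Back-of-the-envelope cost of the auxiliary direction fields
# (assumes one uint8 code per direction, which is not spelled out above).
n_voxels = 1024 ** 3                    # e.g. a 1024^3 grid
aux_bytes = 2 * n_voxels                # ascent + descent code per voxel
print(f"{aux_bytes / 2**30:.1f} GiB")   # ~2.0 GiB of direction data
```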

Practical Implications

  • In‑situ data reduction: Scientists can now compress data on the fly during simulation runs, confident that topological analyses (e.g., vortex detection, feature tracking) can be restored accurately later without a full decompression‑recompute cycle.
  • Workflow integration: pMSz can be slotted into existing HPC pipelines that already use GPU‑accelerated compression libraries (e.g., SZ, ZFP). The correction step is fast enough to be performed as a post‑processing stage before visualization or machine‑learning inference.
  • Cost savings: By enabling reliable lossy compression at extreme scales, storage and I/O costs drop dramatically while preserving the scientific fidelity required for downstream tasks such as uncertainty quantification or model validation.
  • Broader applicability: Any domain that relies on topological invariants—computational fluid dynamics, climate modeling, medical imaging—can adopt pMSz to safeguard critical features against compression artifacts.

Limitations & Future Work

  • Topology scope: The current implementation focuses on Morse‑Smale segmentations for scalar fields; extending to vector‑field topology (e.g., critical points of velocity) remains open.
  • Hardware dependence: Performance gains assume access to modern GPU clusters with high‑speed interconnects; on CPU‑only or older GPU systems the communication savings may be less pronounced.
  • Dynamic datasets: The algorithm processes static snapshots; handling time‑varying data streams would require incremental updates to direction fields, a direction the authors plan to explore.
  • Robustness to extreme compression ratios: While the method tolerates typical lossy errors, the authors note that at very aggressive compression (e.g., > 100×) the steepest‑direction field itself may become noisy, potentially limiting correction quality. Future work will investigate adaptive refinement of direction data based on local error estimates.

Authors

  • Yuxiao Li
  • Mingze Xia
  • Xin Liang
  • Bei Wang
  • Robert Underwood
  • Sheng Di
  • Hemant Sharma
  • Dishant Beniwal
  • Franck Cappello
  • Hanqi Guo

Paper Information

  • arXiv ID: 2601.01787v1
  • Categories: cs.DC
  • Published: January 5, 2026
  • PDF: Download PDF