[Paper] Trace-based, time-resolved analysis of MPI application performance using standard metrics
Source: arXiv - 2512.01764v1
Overview
Performance engineers need to understand when and where an MPI program stalls, but traditional trace visualisers quickly become unwieldy as applications scale. This paper introduces a lightweight, post‑mortem analysis that slices MPI execution traces into time windows and computes familiar performance metrics (load‑balance, serialization, transfer efficiency) for each window. The result is a time‑resolved view that uncovers transient bottlenecks that would be hidden by a single aggregated number.
Key Contributions
- Time‑windowed metric computation: Extends classic MPI metrics (e.g., load‑balance, serialization) to a per‑segment basis, enabling “zoom‑in” on performance over the run’s timeline.
- Robust trace preprocessing: Handles common trace artefacts such as clock drift and missing or mismatched MPI events, and automatically reconstructs critical execution paths.
- Implementation on Paraver traces: A practical, open‑source toolchain that works on standard Paraver trace files without requiring instrumentation changes.
- Demonstrated scalability: Shows that the approach remains lightweight even for large‑scale runs where full visual inspection is infeasible.
- Empirical validation: Uses a synthetic benchmark plus two real‑world scientific codes (LaMEM and ls1‑MarDyn) to illustrate how transient inefficiencies are revealed.
Methodology
- Trace ingestion: The tool reads Paraver trace files after the application finishes (post‑mortem).
- Segmentation: Execution time is divided into fixed‑size windows or adaptive windows that grow/shrink based on activity density.
- Event normalization: Clock inconsistencies across MPI ranks are corrected, and unmatched MPI_Send/MPI_Recv pairs are reconciled using a heuristic matching algorithm.
- Metric calculation per window (a minimal sketch follows this list):
- Load‑balance: Ratio of the average to the maximum compute time across ranks, so values near 1 indicate evenly distributed work.
- Serialization: Fraction of time ranks sit idle waiting for the slowest rank to reach a synchronization point, such as a collective operation.
- Transfer efficiency: Amount of useful data transferred divided by total communication time.
- Path reconstruction: Critical execution paths (e.g., the longest‑running rank) are identified for each window, allowing developers to pinpoint the exact code region responsible for a spike.
- Output: A compact CSV/JSON file containing metric values per window, ready for plotting or further analysis.
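To make the per‑window computation concrete, the sketch below segments per‑rank compute and communication intervals into fixed‑size windows, derives a load‑balance figure and communication statistics for each window, and writes them to CSV. It is a minimal illustration, not the paper's toolchain: the event representation (rank, start, end, kind, bytes), the POP‑style load‑balance formula, and the use of a communication fraction in place of a full serialization analysis are all assumptions; the actual tool parses Paraver traces and performs clock correction and dependency reconstruction.

```python
# Minimal sketch: per-window MPI metrics from a simplified event list.
# Assumptions (not from the paper): events are (rank, start, end, kind, bytes)
# tuples with kind in {"compute", "comm"}; the load-balance formula follows the
# common POP-style definition (average / maximum compute time per window).
import csv
from collections import defaultdict

def overlap(a0, a1, b0, b1):
    """Length of the overlap between intervals [a0, a1] and [b0, b1]."""
    return max(0.0, min(a1, b1) - max(a0, b0))

def windowed_metrics(events, t_end, width):
    """Yield (window_start, load_balance, comm_fraction, effective_bandwidth)."""
    n_windows = int(t_end // width) + 1
    for w in range(n_windows):
        w0, w1 = w * width, (w + 1) * width
        compute = defaultdict(float)   # per-rank compute time in this window
        comm = defaultdict(float)      # per-rank communication time
        bytes_moved = 0.0
        for rank, start, end, kind, nbytes in events:
            ov = overlap(start, end, w0, w1)
            if ov <= 0.0:
                continue
            if kind == "compute":
                compute[rank] += ov
            else:
                comm[rank] += ov
                # attribute bytes proportionally to the overlapping fraction
                bytes_moved += nbytes * ov / (end - start)
        ranks = set(compute) | set(comm)
        if not ranks:
            continue
        avg_comp = sum(compute.values()) / len(ranks)
        max_comp = max(compute.values(), default=0.0)
        load_balance = avg_comp / max_comp if max_comp > 0 else 1.0
        total_comm = sum(comm.values())
        busy = sum(compute.values()) + total_comm
        comm_fraction = total_comm / busy if busy > 0 else 0.0
        bandwidth = bytes_moved / total_comm if total_comm > 0 else 0.0
        yield w0, load_balance, comm_fraction, bandwidth

def write_csv(path, rows):
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["window_start", "load_balance", "comm_fraction", "bandwidth"])
        writer.writerows(rows)

if __name__ == "__main__":
    # Two ranks; rank 1 spends most of the second window in communication.
    events = [
        (0, 0.0, 0.9, "compute", 0), (0, 0.9, 1.0, "comm", 1e6),
        (1, 0.0, 0.5, "compute", 0), (1, 0.5, 1.0, "comm", 1e6),
        (0, 1.0, 2.0, "compute", 0),
        (1, 1.0, 1.2, "compute", 0), (1, 1.2, 2.0, "comm", 2e6),
    ]
    write_csv("metrics.csv", windowed_metrics(events, t_end=2.0, width=1.0))
```

Plotting the load‑balance column against window start time is exactly the kind of view that exposes transient imbalances a single whole‑run average would flatten out.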
Results & Findings
- Synthetic benchmark: The time‑resolved metrics correctly identified a deliberately injected communication stall that disappeared in the global averages.
- LaMEM (geophysical simulation): A short‑lived load‑imbalance caused by an irregular mesh refinement step was visible only in the per‑window load‑balance plot, prompting a simple redistribution of work that cut total runtime by ~6 %.
- ls1‑MarDyn (molecular dynamics): Transfer efficiency dropped sharply during a specific phase where particle exchange patterns changed; the authors adjusted the domain decomposition, improving bandwidth utilization by ~12 %.
- Scalability: Processing a 200 GB trace (10 k MPI ranks) required < 15 minutes on a modest 8‑core workstation, demonstrating that the approach is far less resource‑hungry than full trace visualisation.
Practical Implications
- Fast “post‑mortem” debugging: Developers can run their usual profiling suite, then apply this tool to get a timeline of where MPI inefficiencies occur, without re‑instrumenting the code.
- Guided optimization: By correlating metric spikes with source‑level annotations (e.g., using #pragma markers), teams can prioritize the code sections that actually hurt performance, saving engineering time.
- Continuous integration: The lightweight CSV output can be fed into CI pipelines to detect regressions in MPI efficiency across builds (a minimal check is sketched after this list).
- Scalable HPC monitoring: System administrators can aggregate per‑window metrics from many jobs to spot systemic issues (e.g., network congestion at certain times of day) without storing massive visual trace archives.
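As a hedged illustration of the CI use case, the snippet below compares the per‑window CSV from the current build against a stored baseline and fails if any window's load balance drops by more than a chosen tolerance. The file names, column layout, and threshold are assumptions carried over from the earlier sketch, not part of the paper's toolchain.

```python
# Hypothetical CI gate: flag regressions in per-window load balance.
# "baseline.csv" / "metrics.csv" and the column names are illustrative
# assumptions matching the sketch above, not outputs of the paper's tool.
import csv
import sys

def load(path):
    with open(path, newline="") as f:
        return {row["window_start"]: float(row["load_balance"])
                for row in csv.DictReader(f)}

def check(baseline_path, current_path, tolerance=0.05):
    baseline, current = load(baseline_path), load(current_path)
    regressions = []
    for window, ref in baseline.items():
        cur = current.get(window)
        if cur is not None and cur < ref - tolerance:
            regressions.append((window, ref, cur))
    return regressions

if __name__ == "__main__":
    bad = check("baseline.csv", "metrics.csv")
    for window, ref, cur in bad:
        print(f"load balance regression at t={window}: {ref:.2f} -> {cur:.2f}")
    sys.exit(1 if bad else 0)
```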
Limitations & Future Work
- Dependence on Paraver format: The current prototype only parses Paraver traces; extending support to other popular formats (e.g., OTF2, TAU) would broaden adoption.
- Window granularity trade‑off: Very small windows increase noise, while large windows may still hide brief spikes; adaptive window strategies need further refinement.
- Limited to MPI‑only metrics: The method does not yet incorporate CPU‑side metrics (e.g., cache misses) or GPU activity, which are increasingly relevant in hybrid codes.
- Automation of root‑cause mapping: Future work could integrate static code analysis to automatically map metric anomalies to source lines, reducing manual annotation effort.
Bottom line: By turning massive MPI traces into a series of easy‑to‑interpret metric snapshots, this research gives developers a practical, scalable lens for spotting the “when” and “why” of performance hiccups—an essential tool as HPC applications continue to grow in size and complexity.
Authors
- Kingshuk Haldar
Paper Information
- arXiv ID: 2512.01764v1
- Categories: cs.DC
- Published: December 1, 2025