[Paper] Trace-based, time-resolved analysis of MPI application performance using standard metrics
Source: arXiv - 2512.01764v1
Overview
Performance engineers need to understand when and where an MPI program stalls, but traditional trace visualisers quickly become unwieldy as applications scale. This paper introduces a lightweight, post‑mortem analysis that slices MPI execution traces into time windows and computes familiar performance metrics (load‑balance, serialization, transfer efficiency) for each window. The result is a time‑resolved view that uncovers transient bottlenecks that would be hidden by a single aggregated number.
Key Contributions
- Time‑windowed metric computation: Extends classic MPI metrics (e.g., load‑balance, serialization) to a per‑segment basis, enabling “zoom‑in” on performance over the run’s timeline.
- Robust trace preprocessing: Handles common trace artefacts such as clock drift and missing or mismatched MPI events, and automatically reconstructs critical execution paths.
- Implementation on Paraver traces: A practical, open‑source toolchain that works on standard Paraver trace files without requiring instrumentation changes.
- Demonstrated scalability: Shows that the approach remains lightweight even for large‑scale runs where full visual inspection is infeasible.
- Empirical validation: Uses a synthetic benchmark plus two real‑world scientific codes (LaMEM and ls1‑MarDyn) to illustrate how transient inefficiencies are revealed.
Methodology
- Trace ingestion: The tool reads Paraver trace files after the application finishes (post‑mortem).
- Segmentation: Execution time is divided into fixed‑size windows or adaptive windows that grow/shrink based on activity density.
- Event normalization: Clock inconsistencies across MPI ranks are corrected, and unmatched MPI_Send/MPI_Recv pairs are reconciled using a heuristic matching algorithm.
- Metric calculation per window (a minimal sketch follows this list):
- Load‑balance: Ratio of the average to the maximum compute time across ranks, so values near 1 indicate evenly distributed work.
- Serialization: Fraction of time ranks sit idle waiting for the slowest rank to reach a synchronization point, such as a collective operation.
- Transfer efficiency: Amount of useful data transferred divided by total communication time.
- Path reconstruction: Critical execution paths (e.g., the longest‑running rank) are identified for each window, allowing developers to pinpoint the exact code region responsible for a spike.
- Output: A compact CSV/JSON file containing metric values per window, ready for plotting or further analysis.
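To make the per‑window computation concrete, the sketch below segments per‑rank compute and communication intervals into fixed‑size windows, derives a load‑balance figure and communication statistics for each window, and writes them to CSV. It is a minimal illustration, not the paper's toolchain: the event representation (rank, start, end, kind, bytes), the POP‑style load‑balance formula, and the use of a communication fraction in place of a full serialization analysis are all assumptions; the actual tool parses Paraver traces and performs clock correction and dependency reconstruction.

```python
# Minimal sketch: per-window MPI metrics from a simplified event list.
# Assumptions (not from the paper): events are (rank, start, end, kind, bytes)
# tuples with kind in {"compute", "comm"}; the load-balance formula follows the
# common POP-style definition (average / maximum compute time per window).
import csv
from collections import defaultdict

def overlap(a0, a1, b0, b1):
    """Length of the overlap between intervals [a0, a1] and [b0, b1]."""
    return max(0.0, min(a1, b1) - max(a0, b0))

def windowed_metrics(events, t_end, width):
    """Yield (window_start, load_balance, comm_fraction, effective_bandwidth)."""
    n_windows = int(t_end // width) + 1
    for w in range(n_windows):
        w0, w1 = w * width, (w + 1) * width
        compute = defaultdict(float)   # per-rank compute time in this window
        comm = defaultdict(float)      # per-rank communication time
        bytes_moved = 0.0
        for rank, start, end, kind, nbytes in events:
            ov = overlap(start, end, w0, w1)
            if ov <= 0.0:
                continue
            if kind == "compute":
                compute[rank] += ov
            else:
                comm[rank] += ov
                # attribute bytes proportionally to the overlapping fraction
                bytes_moved += nbytes * ov / (end - start)
        ranks = set(compute) | set(comm)
        if not ranks:
            continue
        avg_comp = sum(compute.values()) / len(ranks)
        max_comp = max(compute.values(), default=0.0)
        load_balance = avg_comp / max_comp if max_comp > 0 else 1.0
        total_comm = sum(comm.values())
        busy = sum(compute.values()) + total_comm
        comm_fraction = total_comm / busy if busy > 0 else 0.0
        bandwidth = bytes_moved / total_comm if total_comm > 0 else 0.0
        yield w0, load_balance, comm_fraction, bandwidth

def write_csv(path, rows):
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["window_start", "load_balance", "comm_fraction", "bandwidth"])
        writer.writerows(rows)

if __name__ == "__main__":
    # Two ranks; rank 1 spends most of the second window in communication.
    events = [
        (0, 0.0, 0.9, "compute", 0), (0, 0.9, 1.0, "comm", 1e6),
        (1, 0.0, 0.5, "compute", 0), (1, 0.5, 1.0, "comm", 1e6),
        (0, 1.0, 2.0, "compute", 0),
        (1, 1.0, 1.2, "compute", 0), (1, 1.2, 2.0, "comm", 2e6),
    ]
    write_csv("metrics.csv", windowed_metrics(events, t_end=2.0, width=1.0))
```

Plotting the load‑balance column against window start time is exactly the kind of view that exposes transient imbalances a single whole‑run average would flatten out.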
Results & Findings
- Synthetic benchmark: The time‑resolved metrics correctly identified a deliberately injected communication stall that disappeared in the global averages.
- LaMEM (geophysical simulation): A short‑lived load‑imbalance caused by an irregular mesh refinement step was visible only in the per‑window load‑balance plot, prompting a simple redistribution of work that cut total runtime by ~6 %.
- ls1‑MarDyn (molecular dynamics): Transfer efficiency dropped sharply during a specific phase where particle exchange patterns changed; the authors adjusted the domain decomposition, improving bandwidth utilization by ~12 %.
- Scalability: Processing a 200 GB trace (10 k MPI ranks) required < 15 minutes on a modest 8‑core workstation, demonstrating that the approach is far less resource‑hungry than full trace visualisation.
Practical Implications
- Fast “post‑mortem” debugging: Developers can run their usual profiling suite, then apply this tool to get a timeline of where MPI inefficiencies occur, without re‑instrumenting the code.
- Guided optimization: By correlating metric spikes with source‑level annotations (e.g., using #pragma markers), teams can prioritize the code sections that actually hurt performance, saving engineering time.
- Continuous integration: The lightweight CSV output can be fed into CI pipelines to detect regressions in MPI efficiency across builds (a minimal check is sketched after this list).
- Scalable HPC monitoring: System administrators can aggregate per‑window metrics from many jobs to spot systemic issues (e.g., network congestion at certain times of day) without storing massive visual trace archives.
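As a hedged illustration of the CI use case, the snippet below compares the per‑window CSV from the current build against a stored baseline and fails if any window's load balance drops by more than a chosen tolerance. The file names, column layout, and threshold are assumptions carried over from the earlier sketch, not part of the paper's toolchain.

```python
# Hypothetical CI gate: flag regressions in per-window load balance.
# "baseline.csv" / "metrics.csv" and the column names are illustrative
# assumptions matching the sketch above, not outputs of the paper's tool.
import csv
import sys

def load(path):
    with open(path, newline="") as f:
        return {row["window_start"]: float(row["load_balance"])
                for row in csv.DictReader(f)}

def check(baseline_path, current_path, tolerance=0.05):
    baseline, current = load(baseline_path), load(current_path)
    regressions = []
    for window, ref in baseline.items():
        cur = current.get(window)
        if cur is not None and cur < ref - tolerance:
            regressions.append((window, ref, cur))
    return regressions

if __name__ == "__main__":
    bad = check("baseline.csv", "metrics.csv")
    for window, ref, cur in bad:
        print(f"load balance regression at t={window}: {ref:.2f} -> {cur:.2f}")
    sys.exit(1 if bad else 0)
```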
Limitations & Future Work
- Dependence on Paraver format: The current prototype only parses Paraver traces; extending support to other popular formats (e.g., OTF2, TAU) would broaden adoption.
- Window granularity trade‑off: Very small windows increase noise, while large windows may still hide brief spikes; adaptive window strategies need further refinement.
- Limited to MPI‑only metrics: The method does not yet incorporate CPU‑side metrics (e.g., cache misses) or GPU activity, which are increasingly relevant in hybrid codes.
- Automation of root‑cause mapping: Future work could integrate static code analysis to automatically map metric anomalies to source lines, reducing manual annotation effort.
Bottom line: By turning massive MPI traces into a series of easy‑to‑interpret metric snapshots, this research gives developers a practical, scalable lens for spotting the “when” and “why” of performance hiccups—an essential tool as HPC applications continue to grow in size and complexity.
Authors
- Kingshuk Haldar
Paper Information
- arXiv ID: 2512.01764v1
- Categories: cs.DC
- Published: December 1, 2025