[Paper] Distributed Order Recording Techniques for Efficient Record-and-Replay of Multi-threaded Programs

Published: February 17, 2026 at 03:33 PM EST

Source: arXiv - 2602.15995v1

Overview

The paper tackles a long‑standing pain point in high‑performance computing: reliably recording and replaying the execution of OpenMP programs, which are notoriously nondeterministic due to their fine‑grained thread interleavings. By introducing two lightweight “distributed” recording schemes, the authors show how to cut down the massive synchronization overhead that has made scalable OpenMP replay impractical—delivering 2‑5× speedups on real HPC workloads and opening the door to seamless MPI + OpenMP replay.

Key Contributions

  • Distributed Clock (DC) recording – a per‑thread logical clock that captures ordering without forcing global barriers on every shared‑memory access.
  • Distributed Epoch (DE) recording – groups memory operations into epochs, allowing bulk synchronization only at epoch boundaries.
  • Integration with ReOMP – a prototype implementation that demonstrates the techniques on a suite of representative OpenMP benchmarks.
  • Hybrid MPI + OpenMP replay – seamless coupling of the new OpenMP recorder with ReMPI, a scalable MPI record‑and‑replay system, with only a tiny MPI‑scale‑independent overhead.
  • Empirical validation – thorough performance evaluation showing 2‑5× reduction in replay overhead compared with traditional per‑access synchronization approaches.
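The core idea behind Distributed Clock (DC) recording can be illustrated with a minimal Python sketch. Each thread owns a private logical clock and a private log, so recording a shared-memory operation never requires cross-thread synchronization. The class and variable names here are illustrative, not taken from the paper's ReOMP implementation:

```python
import threading

class DistributedClockRecorder:
    """Per-thread logical clocks: each thread stamps and logs its own
    operations locally, with no global lock on the recording path."""

    def __init__(self, num_threads):
        self.clocks = [0] * num_threads               # one logical clock per thread
        self.logs = [[] for _ in range(num_threads)]  # one private log per thread

    def record(self, tid, op):
        # No lock needed: thread `tid` only ever touches its own slot.
        self.clocks[tid] += 1
        self.logs[tid].append((self.clocks[tid], op))

rec = DistributedClockRecorder(num_threads=2)

def worker(tid, ops):
    for op in ops:
        rec.record(tid, op)

t0 = threading.Thread(target=worker, args=(0, ["load x", "store x"]))
t1 = threading.Thread(target=worker, args=(1, ["load y"]))
t0.start(); t1.start(); t0.join(); t1.join()

print(rec.logs[0])  # [(1, 'load x'), (2, 'store x')]
print(rec.logs[1])  # [(1, 'load y')]
```

The point of the sketch is the cost model: the recording path is entirely thread-local, and ordering across threads is reconciled later (at epoch boundaries in the DE scheme), rather than on every access.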

Methodology

  1. Logical Time per Thread – Each thread maintains its own logical clock (the “distributed clock”). When a thread performs a shared memory operation, it records the operation together with its current clock value, but it does not immediately synchronize with other threads.
  2. Epoch Formation – Threads periodically flush their recorded operations into an epoch (a batch). An epoch ends when a thread reaches a safe point (e.g., a barrier or a library call). At that moment, the thread’s clock is compared with the clocks of other threads to establish a partial order for the whole epoch.
  3. Replay Engine – During replay, the recorded epochs are re‑executed in the same logical order. Because the ordering information is coarse‑grained (epoch‑level) rather than per‑access, the replay engine can let threads run concurrently while still guaranteeing deterministic behavior.
  4. Hybrid Integration – The OpenMP recorder is wrapped inside ReMPI, which already handles MPI‑level nondeterminism. The two layers exchange minimal metadata, so the combined system can replay complex hybrid applications without blowing up runtime or memory usage.
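Steps 2 and 3 above can be sketched together: threads batch operations into an open epoch, a safe point (e.g., a barrier) closes and stamps the epoch, and replay then enforces ordering only between epochs, not between individual accesses. This is a simplified single-process model, not the paper's actual recorder:

```python
import threading
from collections import defaultdict

class EpochRecorder:
    """Batch per-thread operations into epochs; cross-thread ordering
    is established only at epoch boundaries, not per access."""

    def __init__(self):
        self.epoch = 0
        self.pending = defaultdict(list)  # tid -> ops in the open epoch
        self.closed = []                  # list of (epoch, {tid: ops})
        self.lock = threading.Lock()      # taken only at boundaries

    def record(self, tid, op):
        self.pending[tid].append(op)      # no cross-thread sync per access

    def safe_point(self):
        # Called at a barrier: close the current epoch and stamp it.
        with self.lock:
            self.closed.append((self.epoch, dict(self.pending)))
            self.pending = defaultdict(list)
            self.epoch += 1

def replay(closed_epochs):
    # Epochs replay strictly in order; within an epoch, per-thread op
    # lists could run concurrently since no finer ordering was recorded.
    schedule = []
    for epoch, per_thread in sorted(closed_epochs):
        for tid in sorted(per_thread):
            schedule.append((epoch, tid, per_thread[tid]))
    return schedule

rec = EpochRecorder()
rec.record(0, "store a"); rec.record(1, "load a")
rec.safe_point()                  # epoch 0 closed at a barrier
rec.record(0, "store b")
rec.safe_point()                  # epoch 1 closed
print(replay(rec.closed))
# [(0, 0, ['store a']), (0, 1, ['load a']), (1, 0, ['store b'])]
```

The coarse epoch-level schedule is what lets the replay engine keep threads running concurrently while still producing a deterministic outcome.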

Results & Findings

| Benchmark | Traditional per-access sync (baseline) | DC/DE approach | Speedup |
| --- | --- | --- | --- |
| Mini-Ghost (OpenMP) | 1.8× slowdown | 0.4× slowdown | 4.5× |
| LULESH (OpenMP) | 2.3× slowdown | 0.6× slowdown | 3.8× |
| Hybrid LAMMPS (MPI + OpenMP) | 3.1× slowdown | 0.9× slowdown (combined) | ≈3.4× |
  • Overhead: The added runtime for the distributed recording itself is under 5 % on most workloads.
  • Memory footprint: Epoch‑based batching reduces log size by 30‑50 % compared with per‑access logs.
  • Scalability: Overhead stays nearly flat as OpenMP thread counts grow (tested up to 64 threads), confirming the “distributed” nature of the approach.
  • MPI integration: The combined ReMPI + DC/DE system adds only ~2 % extra time beyond what ReMPI alone incurs, demonstrating that the two layers coexist peacefully.

Practical Implications

  • Faster Debugging Cycles – Developers can now record a problematic run of an OpenMP (or MPI + OpenMP) application and replay it deterministically without paying a prohibitive performance penalty, making root‑cause analysis more feasible in production environments.
  • Automated Testing – Continuous‑integration pipelines for HPC codes can incorporate record‑and‑replay checkpoints, catching nondeterministic bugs that would otherwise slip through.
  • Performance‑Sensitive Production Runs – Since the overhead is modest, the technique can be enabled in long‑running simulations to capture rare events (e.g., race conditions) without significantly extending wall‑clock time.
  • Toolchain Compatibility – The approach plugs into existing OpenMP runtimes (via ReOMP) and MPI recorders (via ReMPI), meaning teams can adopt it without rewriting large portions of their codebase.
  • Future‑Proofing – As exascale systems increasingly blend many‑core OpenMP with distributed MPI, having a unified, low‑overhead replay mechanism will be essential for both correctness verification and performance tuning.

Limitations & Future Work

  • Epoch Granularity Trade‑off – Choosing epoch boundaries is a heuristic; overly large epochs may hide subtle ordering bugs, while very small epochs reduce the performance benefit.
  • Support for All OpenMP Constructs – The prototype focuses on the most common parallel loops and barriers; more exotic features (tasking, reductions with user‑defined operators) need additional handling.
  • Portability to Non‑Linux Environments – The current implementation relies on Linux‑specific tracing facilities; extending it to Windows or macOS runtimes would require extra engineering.
  • Scalability Beyond 64 Threads – Preliminary tests show promising trends, but the authors note that ultra‑large core counts (hundreds of threads per node) may expose synchronization bottlenecks in epoch finalization.
  • Automated Epoch Tuning – Future work could explore adaptive algorithms that dynamically adjust epoch size based on runtime behavior, further minimizing overhead while preserving determinism.
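One simple form such adaptive tuning could take is a feedback rule that grows the epoch when measured recording overhead is above a target (fewer boundaries amortize the synchronization cost) and shrinks it when there is slack (keeping the recorded ordering fine-grained). This is a speculative sketch of the future-work idea, not anything the paper implements:

```python
def next_epoch_size(current_size, record_overhead, target=0.05,
                    min_size=64, max_size=65536):
    """Adjust epoch size (in operations) from the overhead observed
    during the last epoch, expressed as a fraction of runtime."""
    if record_overhead > target:
        # Too costly: double the epoch to amortize boundary work.
        return min(current_size * 2, max_size)
    # Cheap enough: halve the epoch for finer-grained ordering.
    return max(current_size // 2, min_size)

print(next_epoch_size(1024, 0.10))  # 2048
print(next_epoch_size(1024, 0.01))  # 512
```

The target threshold, doubling/halving policy, and size bounds are all assumptions chosen for illustration; a real controller would likely need smoothing to avoid oscillation.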

Authors

  • Xiang Fu
  • Shiman Meng
  • Weiping Zhang
  • Luanzheng Guo
  • Kento Sato
  • Dong H. Ahn
  • Ignacio Laguna
  • Gregory L. Lee
  • Martin Schulz

Paper Information

  • arXiv ID: 2602.15995v1
  • Categories: cs.DC
  • Published: February 17, 2026
