[Paper] Distributed Order Recording Techniques for Efficient Record-and-Replay of Multi-threaded Programs

Published: February 17, 2026
5 min read
Source: arXiv:2602.15995v1

Overview

The paper tackles a long‑standing pain point in high‑performance computing: reliably recording and replaying the execution of OpenMP programs, which are notoriously nondeterministic because of their fine‑grained thread interleavings.

Key contributions:

  • Two lightweight “distributed” recording schemes that dramatically reduce synchronization overhead.
  • Demonstrated 2‑5× speedups on real HPC workloads.
  • Enables seamless MPI + OpenMP replay, opening the door to scalable debugging and performance analysis.

Key Contributions

  • Distributed Clock (DC) recording – a per‑thread logical clock that captures ordering without imposing global barriers on every shared‑memory access.
  • Distributed Epoch (DE) recording – groups memory operations into epochs, enabling bulk synchronization only at epoch boundaries.
  • Integration with ReOMP – a prototype implementation that demonstrates the techniques on a suite of representative OpenMP benchmarks.
  • Hybrid MPI + OpenMP replay – seamless coupling of the new OpenMP recorder with ReMPI, a scalable MPI record‑and‑replay system, incurring only a tiny MPI‑scale‑independent overhead.
  • Empirical validation – thorough performance evaluation showing a 2–5× reduction in replay overhead compared with traditional per‑access synchronization approaches.

Methodology

  1. Logical Time per Thread – Each thread maintains its own logical clock (the “distributed clock”). When a thread performs a shared‑memory operation, it records the operation together with its current clock value, but it does not immediately synchronize with other threads.
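As a rough illustration of this step, the sketch below models a per-thread logical clock in Python's `threading` module. The class name and API are hypothetical, not the paper's implementation; the point is that `record()` touches only thread-local state, so no cross-thread synchronization happens on the recording fast path.

```python
import threading

class DistributedClockRecorder:
    """Sketch of per-thread logical-clock recording (hypothetical API,
    not the paper's actual OpenMP runtime implementation)."""

    def __init__(self):
        self._local = threading.local()   # per-thread clock and log
        self._logs = {}                   # thread name -> list of (clock, op)
        self._logs_lock = threading.Lock()

    def _state(self):
        # Lazily initialize this thread's clock and log on first use.
        if not hasattr(self._local, "clock"):
            self._local.clock = 0
            self._local.log = []
            with self._logs_lock:
                self._logs[threading.current_thread().name] = self._local.log
        return self._local

    def record(self, op):
        # Tick this thread's own clock and append to its private log;
        # crucially, no synchronization with other threads occurs here.
        st = self._state()
        st.clock += 1
        st.log.append((st.clock, op))

    def logs(self):
        with self._logs_lock:
            return dict(self._logs)
```

Because each thread only ever writes its own log, the recorder scales with thread count instead of serializing every shared-memory access.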

  2. Epoch Formation – Threads periodically flush their recorded operations into an epoch (a batch). An epoch ends when a thread reaches a safe point (e.g., a barrier or a library call). At that moment, the thread’s clock is compared with the clocks of other threads to establish a partial order for the whole epoch.

  3. Replay Engine – During replay, the recorded epochs are re‑executed in the same logical order. Because the ordering information is coarse‑grained (epoch‑level) rather than per‑access, the replay engine can let threads run concurrently while still guaranteeing deterministic behavior.
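Continuing the illustrative sketch (again a simplification, not the actual replay engine): replay walks the recorded epochs in logical order, so determinism is enforced only at epoch granularity rather than per access.

```python
# Hypothetical replay sketch: epochs are re-applied in their recorded
# logical order, preserving each thread's program order within an epoch.

def replay(epoch_log, apply_op):
    """Re-execute recorded epochs deterministically.

    epoch_log: list of (epoch_id, thread_name, ops) tuples as produced
    during recording; apply_op re-executes a single operation.
    """
    for epoch_id, thread, ops in sorted(epoch_log, key=lambda e: e[0]):
        # Within an epoch, the owning thread's program order is preserved;
        # across epochs, the recorded partial order is enforced.
        for op in ops:
            apply_op(thread, op)
```

In a real replay engine, independent epochs could run concurrently; this sequential version only shows the ordering guarantee.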

  4. Hybrid Integration – The OpenMP recorder is wrapped inside ReMPI, which already handles MPI‑level nondeterminism. The two layers exchange minimal metadata, allowing the combined system to replay complex hybrid applications without blowing up runtime or memory usage.

Results & Findings

| Benchmark | Traditional per-access sync (baseline) | DC/DE approach | Speedup |
| --- | --- | --- | --- |
| Mini-Ghost (OpenMP) | 1.8× slowdown | 0.4× slowdown | 4.5× |
| LULESH (OpenMP) | 2.3× slowdown | 0.6× slowdown | 3.8× |
| Hybrid LAMMPS (MPI + OpenMP) | 3.1× slowdown | 0.9× slowdown (combined) | ≈3.4× |

  • Overhead – The added runtime for the distributed recording itself is under 5 % on most workloads.
  • Memory footprint – Epoch‑based batching reduces log size by 30‑50 % compared with per‑access logs.
  • Scalability – Adding more OpenMP threads (up to 64) does not cause the overhead to explode, confirming the “distributed” nature of the approach.
  • MPI integration – The combined ReMPI + DC/DE system adds only ~2 % extra time beyond what ReMPI alone incurs, demonstrating that the two layers coexist peacefully.

Practical Implications

  • Faster Debugging Cycles – Developers can record a problematic run of an OpenMP (or MPI + OpenMP) application and replay it deterministically without a prohibitive performance penalty, making root‑cause analysis feasible even in production environments.

  • Automated Testing – Continuous‑integration pipelines for HPC codes can incorporate record‑and‑replay checkpoints, catching nondeterministic bugs that would otherwise slip through.

  • Performance‑Sensitive Production Runs – Because the overhead is modest, the technique can be enabled in long‑running simulations to capture rare events (e.g., race conditions) without significantly extending wall‑clock time.

  • Toolchain Compatibility – The approach plugs into existing OpenMP runtimes (via ReOMP) and MPI recorders (via ReMPI), allowing teams to adopt it without rewriting large portions of their codebase.

  • Future‑Proofing – As exascale systems increasingly blend many‑core OpenMP with distributed MPI, a unified, low‑overhead replay mechanism will be essential for both correctness verification and performance tuning.

Limitations & Future Work

  • Epoch Granularity Trade‑off – Choosing epoch boundaries is heuristic; overly large epochs may hide subtle ordering bugs, while very small epochs reduce the performance benefit.
  • Support for All OpenMP Constructs – The prototype focuses on the most common parallel loops and barriers; more exotic features (tasking, reductions with user‑defined operators) need additional handling.
  • Portability to Non‑Linux Environments – The current implementation relies on Linux‑specific tracing facilities; extending it to Windows or macOS runtimes would require extra engineering.
  • Scalability Beyond 64 Threads – Preliminary tests show promising trends, but ultra‑large core counts (hundreds of threads per node) may expose synchronization bottlenecks in epoch finalization.
  • Automated Epoch Tuning – Future work could explore adaptive algorithms that dynamically adjust epoch size based on runtime behavior, further minimizing overhead while preserving determinism.
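One way the adaptive-tuning idea above could look in practice (purely a hypothetical heuristic, not something the paper proposes or evaluates): grow the epoch when boundary finalization is expensive relative to recording, and shrink it when boundaries are cheap, since smaller epochs preserve finer ordering for debugging.

```python
# Hypothetical adaptive epoch-sizing heuristic, sketching the future-work
# direction: keep the fraction of time spent at epoch boundaries near a target.

def next_epoch_size(current_size, finalize_cost, record_cost,
                    min_size=64, max_size=65536, target_ratio=0.05):
    """Return the operation count for the next epoch.

    finalize_cost / record_cost approximates the share of time spent
    synchronizing at epoch boundaries during the last epoch.
    """
    ratio = finalize_cost / max(record_cost, 1e-9)
    if ratio > target_ratio:
        current_size *= 2          # boundaries too costly: batch more ops
    elif ratio < target_ratio / 2:
        current_size //= 2         # boundaries cheap: smaller epochs aid debugging
    return max(min_size, min(current_size, max_size))
```

The thresholds and the doubling/halving policy are arbitrary choices for illustration; a real implementation would need to validate them against the determinism guarantees.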

Authors

  • Xiang Fu
  • Shiman Meng
  • Weiping Zhang
  • Luanzheng Guo
  • Kento Sato
  • Dong H. Ahn
  • Ignacio Laguna
  • Gregory L. Lee
  • Martin Schulz

Paper Information

| Field | Details |
| --- | --- |
| arXiv ID | 2602.15995v1 |
| Categories | cs.DC |
| Published | February 17, 2026 |