[Paper] On the Challenges of Energy-Efficiency Analysis in HPC Systems: Evaluating Synthetic Benchmarks and Gromacs
Source: arXiv - 2512.03697v1
Overview
The paper investigates why measuring energy efficiency on modern HPC systems is far from straightforward. By comparing synthetic benchmarks with a real‑world scientific application (GROMACS) on two large clusters (Fritz and Alex), the authors expose hidden pitfalls in the way we collect and interpret power data on Intel Ice Lake/Sapphire Rapids CPUs and Nvidia A40/A100 GPUs. Their findings are a wake‑up call for anyone who wants to benchmark “green” performance in a reproducible way.
Key Contributions
- Systematic comparison of synthetic benchmark suites versus a production‑grade molecular‑dynamics code (GROMACS) on heterogeneous CPU‑GPU nodes.
- In‑depth analysis of measurement artefacts introduced by popular profiling tools (LIKWID for CPUs, Nvidia Nsight/PowerAPI for GPUs).
- Identification of common sources of error, such as mismatched sampling intervals, idle‑power baseline drift, and MPI‑level synchronization effects.
- Practical checklist of best‑practice recommendations for reliable energy‑efficiency experiments on current‑generation HPC hardware.
- Open‑source data set (raw power traces, benchmark configurations) released for reproducibility.
Methodology
- Hardware Platform – Experiments ran on two clusters:
  - Fritz: Dual-socket Intel Ice Lake CPUs + Nvidia A40 GPUs.
  - Alex: Dual-socket Intel Sapphire Rapids CPUs + Nvidia A100 GPUs.
- Software Stack –
  - MPI (OpenMPI) for parallel execution across full CPU sockets.
  - GROMACS 2023 (GPU-offloaded) as the real-world workload.
  - A set of synthetic benchmarks (STREAM, LINPACK, and a custom compute-bound kernel) to represent “ideal” workloads.
- Instrumentation – (a power-sampling sketch follows this list)
  - LIKWID (per-core power counters via RAPL) for CPU energy.
  - Nvidia profiling tools (NVML, Nsight Systems) for GPU power.
  - Measurements sampled at 1 kHz and aggregated per MPI rank.
- Experimental Design –
  - Run each benchmark at multiple problem sizes and MPI process counts (full socket, half socket, hyper-threaded).
  - Record wall-clock time, total energy, and derived metrics (performance-per-watt, Joules-per-step).
  - Perform “baseline” runs with idle nodes to quantify static power consumption.
- Analysis Pipeline – (a post-processing sketch follows this list)
  - Align timestamps across CPU and GPU logs.
  - Apply statistical outlier filtering (±2σ).
  - Compare synthetic vs. GROMACS energy profiles and compute efficiency ratios.
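The instrumentation is described only at a high level above, so here is a minimal sketch of how a per-node power trace could be collected in the same spirit: reading the CPU package energy counter that RAPL exposes through the Linux powercap sysfs interface (the same counters LIKWID reads) and polling GPU power through NVML. The sysfs path, the fixed-interval loop, and the `sample_power` helper are illustrative assumptions, not the authors' actual tooling.

```python
# Minimal sketch (assumption): sample CPU package energy via the Linux powercap
# interface (the counters RAPL/LIKWID expose) and GPU power via NVML.
# The authors use LIKWID and Nvidia tooling; this only illustrates the idea.
import time
import pynvml  # NVML bindings (nvidia-ml-py)

RAPL_PATH = "/sys/class/powercap/intel-rapl:0/energy_uj"  # package 0, assumed path

def read_cpu_energy_uj() -> int:
    """Cumulative package energy in microjoules (RAPL counter; wraps eventually)."""
    with open(RAPL_PATH) as f:
        return int(f.read().strip())

def sample_power(duration_s: float = 10.0, interval_s: float = 0.001):
    """Collect (timestamp, cpu_energy_uJ, gpu_power_mW) tuples at ~1/interval_s Hz."""
    pynvml.nvmlInit()
    gpu = pynvml.nvmlDeviceGetHandleByIndex(0)
    samples = []
    t_end = time.monotonic() + duration_s
    while time.monotonic() < t_end:
        ts = time.monotonic()
        cpu_uj = read_cpu_energy_uj()                 # cumulative energy counter
        gpu_mw = pynvml.nvmlDeviceGetPowerUsage(gpu)  # board power reported in mW
        samples.append((ts, cpu_uj, gpu_mw))
        time.sleep(interval_s)
    pynvml.nvmlShutdown()
    return samples
```

Note that the RAPL counter is cumulative energy while NVML reports power, and that NVML refreshes its reading at a driver-dependent rate, so a nominal 1 kHz poll may return repeated values; this is the kind of tool-specific behaviour the paper warns can distort energy totals.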
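To make the analysis pipeline concrete, the sketch below walks through the three steps in order: aligning the GPU trace to the CPU clock by interpolation, dropping ±2σ outliers, and computing the derived metrics (performance per watt expressed as FLOP per Joule, and Joules per MD step). The data layout and function names are assumptions for illustration; the paper does not prescribe this exact code.

```python
# Sketch of the post-processing steps listed above, assuming power traces are
# stored as NumPy arrays of timestamps (s) and power (W).
import numpy as np

def integrate_energy(ts, p_w):
    """Trapezoidal integral of P over t, i.e. energy in Joules."""
    ts, p_w = np.asarray(ts, float), np.asarray(p_w, float)
    return float(np.sum(0.5 * (p_w[1:] + p_w[:-1]) * np.diff(ts)))

def align_traces(ts_ref, ts_other, p_other):
    """Resample one power trace onto a reference clock (linear interpolation)."""
    return np.interp(ts_ref, ts_other, p_other)

def filter_outliers(values, n_sigma=2.0):
    """Drop samples outside ±n_sigma standard deviations (the paper uses ±2σ)."""
    v = np.asarray(values, float)
    mu, sigma = v.mean(), v.std()
    return v[np.abs(v - mu) <= n_sigma * sigma]

def derived_metrics(total_flop, md_steps, e_cpu_j, e_gpu_j):
    """Performance-per-watt as FLOP/J plus Joules per MD step."""
    e_total = e_cpu_j + e_gpu_j
    return {"flop_per_joule": total_flop / e_total,
            "joule_per_step": e_total / md_steps}
```

Since 1 GFLOP/s per W is numerically the same as 1 GFLOP/J, the `flop_per_joule` value maps directly onto the performance-per-watt figures reported in the results table below.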
Results & Findings
| Metric | Synthetic Benchmarks | GROMACS (GPU‑offloaded) |
|---|---|---|
| Peak Power (CPU) | ~210 W per socket | ~190 W per socket (lower due to GPU offload) |
| Peak Power (GPU) | N/A | ~250 W (A100) / ~180 W (A40) |
| Performance-per-Watt | 2.8 GFLOP/s per W (ideal) | 1.9 GFLOP/s per W (real) |
| Energy per MD step | — | 0.45 J (A100) vs. 0.58 J (A40) |
| Measurement variance | ±1 % (stable) | ±5 % (high variance due to asynchronous GPU kernels) |
- Synthetic benchmarks dramatically over‑estimate efficiency because they keep CPUs and GPUs fully busy, whereas GROMACS exhibits irregular compute/communication phases.
- Power sampling granularity matters: 1 kHz sampling captures GPU power spikes that 10 Hz sampling misses entirely, leading to up to 10 % error in energy totals (the toy example after this list illustrates the effect).
- MPI barrier placement can inflate idle power; removing unnecessary synchronizations reduced measured energy by ~3 % without affecting runtime.
- Static power drift (thermal throttling, background OS activity) contributed up to 8 % of total energy on long‑running runs, highlighting the need for baseline correction.
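The sampling-granularity finding is easy to reproduce on a toy trace: integrating the same bursty power signal at 1 kHz and at 10 Hz yields visibly different energy totals. The numbers below are made up purely for illustration and are not the paper's measurements.

```python
# Toy illustration: the same power trace, integrated at 1 kHz and at 10 Hz.
# Short power spikes are captured by the fine trace but distort the coarse one.
import numpy as np

t = np.arange(0.0, 10.0, 0.001)             # 10 s at 1 kHz
power = np.full_like(t, 120.0)              # 120 W baseline (made-up numbers)
power[(t % 0.5) < 0.02] = 300.0             # 20 ms spikes to 300 W every 0.5 s

def energy(ts, p):
    """Trapezoidal integral of P dt in Joules."""
    return float(np.sum(0.5 * (p[1:] + p[:-1]) * np.diff(ts)))

e_fine = energy(t, power)                   # reference total at 1 kHz
e_coarse = energy(t[::100], power[::100])   # same trace seen at 10 Hz
print(f"1 kHz: {e_fine:.0f} J, 10 Hz: {e_coarse:.0f} J, "
      f"deviation {100 * abs(e_coarse - e_fine) / e_fine:.1f} %")
```

Depending on where the coarse samples happen to land relative to the spikes, the 10 Hz total can over- or under-shoot the reference, which is the core of the granularity argument.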
Practical Implications
- Benchmark Selection – Relying solely on synthetic suites for “green” claims can be misleading. Developers should complement them with domain‑specific workloads (e.g., GROMACS, LAMMPS) that stress the same CPU‑GPU interaction patterns.
- Toolchain Awareness – Profilers must be configured for high‑frequency sampling and synchronized timestamps; otherwise, energy budgets derived from them may be off by a noticeable margin.
- Code Optimisation – Reducing unnecessary MPI barriers and overlapping communication with computation yields measurable energy savings, a low-effort win for many MPI-based codes (see the overlap sketch after this list).
- Capacity Planning – System administrators can use the provided baseline-correction methodology to predict power caps and cooling requirements for mixed CPU-GPU workloads more accurately (a baseline-correction sketch also follows this list).
- Vendor Comparisons – The side‑by‑side A40 vs. A100 results give developers concrete data to justify hardware upgrades when energy cost is a primary concern.
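To illustrate the overlap optimisation mentioned above, here is a minimal mpi4py sketch of a halo exchange that posts nonblocking sends and receives, does independent local work, and only then waits, instead of separating the phases with an explicit barrier. Buffer sizes, tags, and the placeholder computation are assumptions; GROMACS' real communication patterns are far more involved.

```python
# Sketch (assumptions: mpi4py, NumPy buffers): overlap a ring-style halo exchange
# with local computation using nonblocking calls instead of send/recv + barrier.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
right, left = (rank + 1) % size, (rank - 1) % size

send_buf = np.full(1024, rank, dtype=np.float64)
recv_buf = np.empty(1024, dtype=np.float64)

# Post the exchange first ...
reqs = [comm.Isend(send_buf, dest=right, tag=0),
        comm.Irecv(recv_buf, source=left, tag=0)]

# ... do independent local work while the messages are in flight ...
local = np.sin(np.arange(100_000, dtype=np.float64))  # placeholder computation
local_sum = local.sum()

# ... and wait only when the halo data is actually needed (no extra barrier).
MPI.Request.Waitall(reqs)
result = local_sum + recv_buf.sum()
```

Launched with, for example, `mpirun -n 4 python overlap_sketch.py` (a hypothetical file name), the pattern removes the idle time a barrier would add, which is in line with the ~3 % reduction the paper attributes to removing unnecessary synchronizations.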
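The baseline correction referenced above is stated to assume a linear drift of the idle (static) power; a minimal version of such a correction could look like the sketch below, where idle power is measured before and after the run and interpolated in between. Function and argument names are assumptions for illustration.

```python
# Sketch of a linear idle-power (baseline) correction: subtract an interpolated
# idle baseline from the measured trace before integrating to Joules.
import numpy as np

def corrected_energy(ts, power_w, idle_start_w, idle_end_w):
    """Energy of the above-idle power, assuming a linearly drifting baseline."""
    ts, power_w = np.asarray(ts, float), np.asarray(power_w, float)
    frac = (ts - ts[0]) / (ts[-1] - ts[0])                # 0 → 1 over the run
    baseline = idle_start_w + frac * (idle_end_w - idle_start_w)
    dynamic = np.clip(power_w - baseline, 0.0, None)      # never below zero
    return float(np.sum(0.5 * (dynamic[1:] + dynamic[:-1]) * np.diff(ts)))
```

As the Limitations section notes, a linear model is a simplification; thermal throttling and background OS activity can make the real drift nonlinear.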
Limitations & Future Work
- Hardware Scope – Only Intel Ice Lake/Sapphire Rapids CPUs and Nvidia A40/A100 GPUs were examined; results may differ on AMD EPYC or newer GPU architectures.
- Single Application – GROMACS is representative of molecular dynamics but not of all HPC domains (e.g., AI training, graph analytics). Extending the study to other codes would strengthen the conclusions.
- Static Power Modeling – The baseline correction assumes a linear drift, which may not hold under extreme thermal conditions; more sophisticated thermal‑power models are needed.
- Future Directions – The authors plan to (1) integrate power‑aware scheduling policies into the MPI runtime, (2) explore automated detection of measurement artefacts, and (3) release a portable “energy‑efficiency test harness” that can be plugged into CI pipelines for HPC software projects.
Authors
- Rafael Ravedutti Lucio Machado
- Jan Eitzinger
- Georg Hager
- Gerhard Wellein
Paper Information
- arXiv ID: 2512.03697v1
- Categories: cs.DC, cs.MS
- Published: December 3, 2025