[Paper] On the Challenges of Energy-Efficiency Analysis in HPC Systems: Evaluating Synthetic Benchmarks and Gromacs
Source: arXiv - 2512.03697v1
Overview
The paper investigates why measuring energy efficiency on modern HPC systems is far from straightforward. By comparing synthetic benchmarks with a real‑world scientific application (GROMACS) on two large clusters (Fritz and Alex), the authors expose hidden pitfalls in the way we collect and interpret power data on Intel Ice Lake/Sapphire Rapids CPUs and Nvidia A40/A100 GPUs. Their findings are a wake‑up call for anyone who wants to benchmark “green” performance in a reproducible way.
Key Contributions
- Systematic comparison of synthetic benchmark suites versus a production‑grade molecular‑dynamics code (GROMACS) on heterogeneous CPU‑GPU nodes.
- In‑depth analysis of measurement artefacts introduced by popular profiling tools (LIKWID for CPUs, Nvidia Nsight/PowerAPI for GPUs).
- Identification of common sources of error, such as mismatched sampling intervals, idle‑power baseline drift, and MPI‑level synchronization effects.
- Practical checklist of best‑practice recommendations for reliable energy‑efficiency experiments on current‑generation HPC hardware.
- Open‑source data set (raw power traces, benchmark configurations) released for reproducibility.
Methodology
- Hardware Platform – Experiments ran on two clusters:
  - Fritz: Dual-socket Intel Ice Lake CPUs + Nvidia A40 GPUs.
  - Alex: Dual-socket Intel Sapphire Rapids CPUs + Nvidia A100 GPUs.
- Software Stack –
  - MPI (OpenMPI) for parallel execution across full CPU sockets.
  - GROMACS 2023 (GPU-offloaded) as the real-world workload.
  - A set of synthetic benchmarks (STREAM, LINPACK, and a custom compute-bound kernel) to represent “ideal” workloads.
- Instrumentation – (a power-sampling sketch follows this list)
  - LIKWID (per-core power counters via RAPL) for CPU energy.
  - Nvidia profiling tools (NVML, Nsight Systems) for GPU power.
  - Measurements sampled at 1 kHz and aggregated per MPI rank.
- Experimental Design –
  - Run each benchmark at multiple problem sizes and MPI process counts (full socket, half socket, hyper-threaded).
  - Record wall-clock time, total energy, and derived metrics (performance-per-watt, Joules-per-step).
  - Perform “baseline” runs with idle nodes to quantify static power consumption.
- Analysis Pipeline – (a post-processing sketch follows this list)
  - Align timestamps across CPU and GPU logs.
  - Apply statistical outlier filtering (±2σ).
  - Compare synthetic vs. GROMACS energy profiles and compute efficiency ratios.
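The instrumentation is described only at a high level above, so here is a minimal sketch of how a per-node power trace could be collected in the same spirit: reading the CPU package energy counter that RAPL exposes through the Linux powercap sysfs interface (the same counters LIKWID reads) and polling GPU power through NVML. The sysfs path, the fixed-interval loop, and the `sample_power` helper are illustrative assumptions, not the authors' actual tooling.

```python
# Minimal sketch (assumption): sample CPU package energy via the Linux powercap
# interface (the counters RAPL/LIKWID expose) and GPU power via NVML.
# The authors use LIKWID and Nvidia tooling; this only illustrates the idea.
import time
import pynvml  # NVML bindings (nvidia-ml-py)

RAPL_PATH = "/sys/class/powercap/intel-rapl:0/energy_uj"  # package 0, assumed path

def read_cpu_energy_uj() -> int:
    """Cumulative package energy in microjoules (RAPL counter; wraps eventually)."""
    with open(RAPL_PATH) as f:
        return int(f.read().strip())

def sample_power(duration_s: float = 10.0, interval_s: float = 0.001):
    """Collect (timestamp, cpu_energy_uJ, gpu_power_mW) tuples at ~1/interval_s Hz."""
    pynvml.nvmlInit()
    gpu = pynvml.nvmlDeviceGetHandleByIndex(0)
    samples = []
    t_end = time.monotonic() + duration_s
    while time.monotonic() < t_end:
        ts = time.monotonic()
        cpu_uj = read_cpu_energy_uj()                 # cumulative energy counter
        gpu_mw = pynvml.nvmlDeviceGetPowerUsage(gpu)  # board power reported in mW
        samples.append((ts, cpu_uj, gpu_mw))
        time.sleep(interval_s)
    pynvml.nvmlShutdown()
    return samples
```

Note that the RAPL counter is cumulative energy while NVML reports power, and that NVML refreshes its reading at a driver-dependent rate, so a nominal 1 kHz poll may return repeated values; this is the kind of tool-specific behaviour the paper warns can distort energy totals.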
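To make the analysis pipeline concrete, the sketch below walks through the three steps in order: aligning the GPU trace to the CPU clock by interpolation, dropping ±2σ outliers, and computing the derived metrics (performance per watt expressed as FLOP per Joule, and Joules per MD step). The data layout and function names are assumptions for illustration; the paper does not prescribe this exact code.

```python
# Sketch of the post-processing steps listed above, assuming power traces are
# stored as NumPy arrays of timestamps (s) and power (W).
import numpy as np

def integrate_energy(ts, p_w):
    """Trapezoidal integral of P over t, i.e. energy in Joules."""
    ts, p_w = np.asarray(ts, float), np.asarray(p_w, float)
    return float(np.sum(0.5 * (p_w[1:] + p_w[:-1]) * np.diff(ts)))

def align_traces(ts_ref, ts_other, p_other):
    """Resample one power trace onto a reference clock (linear interpolation)."""
    return np.interp(ts_ref, ts_other, p_other)

def filter_outliers(values, n_sigma=2.0):
    """Drop samples outside ±n_sigma standard deviations (the paper uses ±2σ)."""
    v = np.asarray(values, float)
    mu, sigma = v.mean(), v.std()
    return v[np.abs(v - mu) <= n_sigma * sigma]

def derived_metrics(total_flop, md_steps, e_cpu_j, e_gpu_j):
    """Performance-per-watt as FLOP/J plus Joules per MD step."""
    e_total = e_cpu_j + e_gpu_j
    return {"flop_per_joule": total_flop / e_total,
            "joule_per_step": e_total / md_steps}
```

Since 1 GFLOP/s per W is numerically the same as 1 GFLOP/J, the `flop_per_joule` value maps directly onto the performance-per-watt figures reported in the results table below.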
Results & Findings
| Metric | Synthetic Benchmarks | GROMACS (GPU‑offloaded) |
|---|---|---|
| Peak Power (CPU) | ~210 W per socket | ~190 W per socket (lower due to GPU offload) |
| Peak Power (GPU) | N/A | ~250 W (A100) / ~180 W (A40) |
| Performance-per-Watt | 2.8 GFLOP/s per W (ideal) | 1.9 GFLOP/s per W (real) |
| Energy per MD step | — | 0.45 J (A100) vs. 0.58 J (A40) |
| Measurement variance | ±1 % (stable) | ±5 % (high variance due to asynchronous GPU kernels) |
- Synthetic benchmarks dramatically over‑estimate efficiency because they keep CPUs and GPUs fully busy, whereas GROMACS exhibits irregular compute/communication phases.
- Power sampling granularity matters: 1 kHz sampling captures GPU power spikes that 10 Hz sampling misses entirely, leading to up to 10 % error in energy totals (the toy example after this list illustrates the effect).
- MPI barrier placement can inflate idle power; removing unnecessary synchronizations reduced measured energy by ~3 % without affecting runtime.
- Static power drift (thermal throttling, background OS activity) contributed up to 8 % of total energy on long‑running runs, highlighting the need for baseline correction.
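The sampling-granularity finding is easy to reproduce on a toy trace: integrating the same bursty power signal at 1 kHz and at 10 Hz yields visibly different energy totals. The numbers below are made up purely for illustration and are not the paper's measurements.

```python
# Toy illustration: the same power trace, integrated at 1 kHz and at 10 Hz.
# Short power spikes are captured by the fine trace but distort the coarse one.
import numpy as np

t = np.arange(0.0, 10.0, 0.001)             # 10 s at 1 kHz
power = np.full_like(t, 120.0)              # 120 W baseline (made-up numbers)
power[(t % 0.5) < 0.02] = 300.0             # 20 ms spikes to 300 W every 0.5 s

def energy(ts, p):
    """Trapezoidal integral of P dt in Joules."""
    return float(np.sum(0.5 * (p[1:] + p[:-1]) * np.diff(ts)))

e_fine = energy(t, power)                   # reference total at 1 kHz
e_coarse = energy(t[::100], power[::100])   # same trace seen at 10 Hz
print(f"1 kHz: {e_fine:.0f} J, 10 Hz: {e_coarse:.0f} J, "
      f"deviation {100 * abs(e_coarse - e_fine) / e_fine:.1f} %")
```

Depending on where the coarse samples happen to land relative to the spikes, the 10 Hz total can over- or under-shoot the reference, which is the core of the granularity argument.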
Practical Implications
- Benchmark Selection – Relying solely on synthetic suites for “green” claims can be misleading. Developers should complement them with domain‑specific workloads (e.g., GROMACS, LAMMPS) that stress the same CPU‑GPU interaction patterns.
- Toolchain Awareness – Profilers must be configured for high‑frequency sampling and synchronized timestamps; otherwise, energy budgets derived from them may be off by a noticeable margin.
- Code Optimisation – Reducing unnecessary MPI barriers and overlapping communication with computation yields measurable energy savings, a low-effort win for many MPI-based codes (see the overlap sketch after this list).
- Capacity Planning – System administrators can use the provided baseline-correction methodology to predict power caps and cooling requirements for mixed CPU-GPU workloads more accurately (a baseline-correction sketch also follows this list).
- Vendor Comparisons – The side‑by‑side A40 vs. A100 results give developers concrete data to justify hardware upgrades when energy cost is a primary concern.
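To illustrate the overlap optimisation mentioned above, here is a minimal mpi4py sketch of a halo exchange that posts nonblocking sends and receives, does independent local work, and only then waits, instead of separating the phases with an explicit barrier. Buffer sizes, tags, and the placeholder computation are assumptions; GROMACS' real communication patterns are far more involved.

```python
# Sketch (assumptions: mpi4py, NumPy buffers): overlap a ring-style halo exchange
# with local computation using nonblocking calls instead of send/recv + barrier.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
right, left = (rank + 1) % size, (rank - 1) % size

send_buf = np.full(1024, rank, dtype=np.float64)
recv_buf = np.empty(1024, dtype=np.float64)

# Post the exchange first ...
reqs = [comm.Isend(send_buf, dest=right, tag=0),
        comm.Irecv(recv_buf, source=left, tag=0)]

# ... do independent local work while the messages are in flight ...
local = np.sin(np.arange(100_000, dtype=np.float64))  # placeholder computation
local_sum = local.sum()

# ... and wait only when the halo data is actually needed (no extra barrier).
MPI.Request.Waitall(reqs)
result = local_sum + recv_buf.sum()
```

Launched with, for example, `mpirun -n 4 python overlap_sketch.py` (a hypothetical file name), the pattern removes the idle time a barrier would add, which is in line with the ~3 % reduction the paper attributes to removing unnecessary synchronizations.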
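The baseline correction referenced above is stated to assume a linear drift of the idle (static) power; a minimal version of such a correction could look like the sketch below, where idle power is measured before and after the run and interpolated in between. Function and argument names are assumptions for illustration.

```python
# Sketch of a linear idle-power (baseline) correction: subtract an interpolated
# idle baseline from the measured trace before integrating to Joules.
import numpy as np

def corrected_energy(ts, power_w, idle_start_w, idle_end_w):
    """Energy of the above-idle power, assuming a linearly drifting baseline."""
    ts, power_w = np.asarray(ts, float), np.asarray(power_w, float)
    frac = (ts - ts[0]) / (ts[-1] - ts[0])                # 0 → 1 over the run
    baseline = idle_start_w + frac * (idle_end_w - idle_start_w)
    dynamic = np.clip(power_w - baseline, 0.0, None)      # never below zero
    return float(np.sum(0.5 * (dynamic[1:] + dynamic[:-1]) * np.diff(ts)))
```

As the Limitations section notes, a linear model is a simplification; thermal throttling and background OS activity can make the real drift nonlinear.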
Limitations & Future Work
- Hardware Scope – Only Intel Ice Lake/Sapphire Rapids CPUs and Nvidia A40/A100 GPUs were examined; results may differ on AMD EPYC or newer GPU architectures.
- Single Application – GROMACS is representative of molecular dynamics but not of all HPC domains (e.g., AI training, graph analytics). Extending the study to other codes would strengthen the conclusions.
- Static Power Modeling – The baseline correction assumes a linear drift, which may not hold under extreme thermal conditions; more sophisticated thermal‑power models are needed.
- Future Directions – The authors plan to (1) integrate power‑aware scheduling policies into the MPI runtime, (2) explore automated detection of measurement artefacts, and (3) release a portable “energy‑efficiency test harness” that can be plugged into CI pipelines for HPC software projects.
Authors
- Rafael Ravedutti Lucio Machado
- Jan Eitzinger
- Georg Hager
- Gerhard Wellein
Paper Information
- arXiv ID: 2512.03697v1
- Categories: cs.DC, cs.MS
- Published: December 3, 2025