[Paper] On the Challenges of Energy-Efficiency Analysis in HPC Systems: Evaluating Synthetic Benchmarks and Gromacs

Published: December 3, 2025
4 min read
Source: arXiv - 2512.03697v1

Overview

The paper investigates why measuring energy efficiency on modern HPC systems is far from straightforward. By comparing synthetic benchmarks with a real‑world scientific application (GROMACS) on two large clusters (Fritz and Alex), the authors expose hidden pitfalls in the way we collect and interpret power data on Intel Ice Lake/Sapphire Rapids CPUs and Nvidia A40/A100 GPUs. Their findings are a wake‑up call for anyone who wants to benchmark “green” performance in a reproducible way.

Key Contributions

  • Systematic comparison of synthetic benchmark suites versus a production‑grade molecular‑dynamics code (GROMACS) on heterogeneous CPU‑GPU nodes.
  • In‑depth analysis of measurement artefacts introduced by popular profiling tools (LIKWID for CPUs, Nvidia Nsight/PowerAPI for GPUs).
  • Identification of common sources of error, such as mismatched sampling intervals, idle‑power baseline drift, and MPI‑level synchronization effects.
  • Practical checklist of best‑practice recommendations for reliable energy‑efficiency experiments on current‑generation HPC hardware.
  • Open‑source data set (raw power traces, benchmark configurations) released for reproducibility.

Methodology

  1. Hardware Platform – Experiments ran on two clusters:

    • Fritz: Dual‑socket Intel Ice Lake CPUs + Nvidia A40 GPUs.
    • Alex: Dual‑socket Intel Sapphire Rapids CPUs + Nvidia A100 GPUs.
  2. Software Stack

    • MPI (OpenMPI) for parallel execution across full CPU sockets.
    • GROMACS 2023 (GPU‑offloaded) as the real‑world workload.
    • A set of synthetic benchmarks (STREAM, LINPACK, and a custom compute‑bound kernel) to represent “ideal” workloads.
  3. Instrumentation

    • LIKWID (per‑core power counters via RAPL) for CPU energy.
    • Nvidia profiling tools (NVML, Nsight Systems) for GPU power; a minimal NVML polling sketch follows this list.
    • Measurements sampled at 1 kHz and aggregated per MPI rank.
  4. Experimental Design

    • Run each benchmark at multiple problem sizes and MPI process counts (full socket, half socket, hyper‑threaded).
    • Record wall‑clock time, total energy, and derived metrics (performance‑per‑watt, Joules‑per‑step).
    • Perform “baseline” runs with idle nodes to quantify static power consumption.
  5. Analysis Pipeline

    • Align timestamps across CPU and GPU logs.
    • Apply statistical outlier filtering (±2σ).
    • Compare synthetic vs. GROMACS energy profiles and compute efficiency ratios (a sketch of this pipeline follows below).
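
For the GPU side of the instrumentation step, a minimal polling sketch using the NVML Python bindings (pynvml) is shown below. This is an illustration under assumptions, not the authors' tooling: NVML updates its power reading far less often than 1 kHz, so sustained high‑frequency traces in practice come from vendor profilers such as Nsight Systems. The loop only demonstrates how timestamped power samples can be collected and later integrated into energy.

```python
# Illustrative sketch (not the authors' code): poll GPU power via NVML and
# integrate it to energy. NVML's internal update rate is much coarser than
# 1 kHz, so treat this as a low-frequency fallback, not a replacement for
# Nsight Systems or dedicated power meters.
import time
import pynvml

def sample_gpu_power(duration_s: float, interval_s: float = 0.01):
    """Return (timestamps_s, power_W) sampled from GPU 0 for duration_s seconds."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    timestamps, power_w = [], []
    t0 = time.perf_counter()
    try:
        while (now := time.perf_counter()) - t0 < duration_s:
            mw = pynvml.nvmlDeviceGetPowerUsage(handle)  # reported in milliwatts
            timestamps.append(now - t0)
            power_w.append(mw / 1000.0)
            time.sleep(interval_s)
    finally:
        pynvml.nvmlShutdown()
    return timestamps, power_w

def integrate_energy(timestamps, power_w):
    """Trapezoidal integration of a power trace -> energy in joules."""
    energy_j = 0.0
    for i in range(1, len(timestamps)):
        dt = timestamps[i] - timestamps[i - 1]
        energy_j += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return energy_j
```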

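The analysis pipeline (step 5) can be sketched along the same lines. The CSV layout and column names below are assumptions made for illustration, not the paper's released data format; the steps mirror the list above: trim CPU and GPU traces to a common time window, drop ±2σ outliers, integrate power to energy, and derive performance per watt and joules per MD step.

```python
# Minimal sketch of the analysis pipeline, assuming power traces stored as
# CSV files with 'timestamp' (s) and 'power_w' columns -- the file layout and
# column names are illustrative, not taken from the paper's data set.
import numpy as np
import pandas as pd

def load_trace(path: str) -> pd.DataFrame:
    df = pd.read_csv(path)
    # Drop samples more than 2 standard deviations away from the mean power.
    mean, std = df["power_w"].mean(), df["power_w"].std()
    return df[(df["power_w"] - mean).abs() <= 2 * std]

def energy_joules(df: pd.DataFrame) -> float:
    # Trapezoidal integration of power over time.
    t = df["timestamp"].to_numpy()
    p = df["power_w"].to_numpy()
    return float(((p[1:] + p[:-1]) * 0.5 * np.diff(t)).sum())

def efficiency_metrics(cpu_csv: str, gpu_csv: str, flops: float, md_steps: int) -> dict:
    cpu, gpu = load_trace(cpu_csv), load_trace(gpu_csv)
    # Align both traces to the overlapping time window before comparing.
    t_start = max(cpu["timestamp"].min(), gpu["timestamp"].min())
    t_end = min(cpu["timestamp"].max(), gpu["timestamp"].max())
    cpu = cpu[cpu["timestamp"].between(t_start, t_end)]
    gpu = gpu[gpu["timestamp"].between(t_start, t_end)]
    e_total = energy_joules(cpu) + energy_joules(gpu)
    return {
        "energy_J": e_total,
        "flops_per_watt": flops / e_total,   # FLOP / J is equivalent to FLOP/s per W
        "joules_per_step": e_total / md_steps,
    }
```
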
Results & Findings

| Metric | Synthetic Benchmarks | GROMACS (GPU‑offloaded) |
|---|---|---|
| Peak Power (CPU) | ~210 W per socket | ~190 W per socket (lower due to GPU offload) |
| Peak Power (GPU) | N/A | ~250 W (A100) / ~180 W (A40) |
| Performance‑per‑Watt | 2.8 GFLOP/s per W (ideal) | 1.9 GFLOP/s per W (real) |
| Energy per MD step | N/A | 0.45 J (A100) vs. 0.58 J (A40) |
| Measurement variance | ±1 % (stable) | ±5 % (high variance due to asynchronous GPU kernels) |

  • Synthetic benchmarks dramatically over‑estimate efficiency because they keep CPUs and GPUs fully busy, whereas GROMACS exhibits irregular compute/communication phases.
  • Power sampling granularity matters: 1 kHz sampling captures GPU power spikes that 10 Hz sampling misses entirely, leading to errors of up to 10 % in the energy totals (illustrated by the sketch after this list).
  • MPI barrier placement can inflate idle power; removing unnecessary synchronizations reduced measured energy by ~3 % without affecting runtime.
  • Static power drift (thermal throttling, background OS activity) contributed up to 8 % of total energy on long‑running runs, highlighting the need for baseline correction.
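
The sampling‑granularity point is easy to illustrate with made‑up data. The trace below is synthetic (5 ms, 300 W bursts on an 80 W baseline, offset so that a 10 Hz grid never hits them) and is not taken from the paper's measurements; it merely shows how coarse sampling can miss short spikes and underestimate the energy total.

```python
# Illustrative only: show how undersampling a spiky power trace biases the
# integrated energy. The trace is synthetic, not measured data from the paper.
import numpy as np

def trapz_energy(power_w: np.ndarray, t_s: np.ndarray) -> float:
    """Trapezoidal integration of power [W] over time [s] -> energy [J]."""
    return float(((power_w[1:] + power_w[:-1]) * 0.5 * np.diff(t_s)).sum())

t = np.arange(0.0, 10.0, 0.001)        # 10 s sampled at 1 kHz
power = np.full_like(t, 80.0)          # 80 W baseline
ms = (t * 1000).astype(int)
spikes = (ms - 20) % 250 < 5           # 5 ms bursts every 250 ms, offset by 20 ms
power[spikes] = 300.0                  # short 300 W bursts

e_1khz = trapz_energy(power, t)                  # reference at full resolution
e_10hz = trapz_energy(power[::100], t[::100])    # naive 10 Hz subsample misses every burst

print(f"1 kHz: {e_1khz:7.1f} J")
print(f"10 Hz: {e_10hz:7.1f} J ({100 * (e_10hz - e_1khz) / e_1khz:+.1f} % error)")
```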

Practical Implications

  • Benchmark Selection – Relying solely on synthetic suites for “green” claims can be misleading. Developers should complement them with domain‑specific workloads (e.g., GROMACS, LAMMPS) that stress the same CPU‑GPU interaction patterns.
  • Toolchain Awareness – Profilers must be configured for high‑frequency sampling and synchronized timestamps; otherwise, energy budgets derived from them may be off by a noticeable margin.
  • Code Optimisation – Reducing unnecessary MPI barriers and overlapping communication with computation yields measurable energy savings, a low‑effort win for many MPI‑based codes.
  • Capacity Planning – System administrators can use the provided baseline correction methodology (sketched after this list) to more accurately predict power caps and cooling requirements for mixed CPU‑GPU workloads.
  • Vendor Comparisons – The side‑by‑side A40 vs. A100 results give developers concrete data to justify hardware upgrades when energy cost is a primary concern.
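
A minimal form of the baseline correction mentioned above is E_dynamic = E_measured - P_idle * t_run, with the idle power taken from dedicated baseline runs before and after the job. The function below and the numbers in the usage example are illustrative assumptions, not values from the paper, and the linear‑drift model mirrors the simplification the authors themselves flag as a limitation.

```python
# Hedged sketch of idle-power baseline correction: subtract the static power
# measured on an idle node from the energy of the actual run. Assumes the
# idle power is constant or drifts roughly linearly, which the paper notes
# is only an approximation under heavy thermal load.
def baseline_corrected_energy(e_measured_j: float,
                              p_idle_start_w: float,
                              p_idle_end_w: float,
                              t_run_s: float) -> float:
    # Linear drift model: average the idle power measured before and after the run.
    p_idle_avg = 0.5 * (p_idle_start_w + p_idle_end_w)
    return e_measured_j - p_idle_avg * t_run_s

# Made-up example: a 3600 s run measuring 1.2 MJ total, with ~95 W / ~102 W idle power.
dynamic_energy = baseline_corrected_energy(1.2e6, 95.0, 102.0, 3600.0)
print(f"Dynamic energy: {dynamic_energy / 1e3:.1f} kJ")
```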

Limitations & Future Work

  • Hardware Scope – Only Intel Ice Lake/Sapphire Rapids CPUs and Nvidia A40/A100 GPUs were examined; results may differ on AMD EPYC or newer GPU architectures.
  • Single Application – GROMACS is representative of molecular dynamics but not of all HPC domains (e.g., AI training, graph analytics). Extending the study to other codes would strengthen the conclusions.
  • Static Power Modeling – The baseline correction assumes a linear drift, which may not hold under extreme thermal conditions; more sophisticated thermal‑power models are needed.
  • Future Directions – The authors plan to (1) integrate power‑aware scheduling policies into the MPI runtime, (2) explore automated detection of measurement artefacts, and (3) release a portable “energy‑efficiency test harness” that can be plugged into CI pipelines for HPC software projects.

Authors

  • Rafael Ravedutti Lucio Machado
  • Jan Eitzinger
  • Georg Hager
  • Gerhard Wellein

Paper Information

  • arXiv ID: 2512.03697v1
  • Categories: cs.DC, cs.MS
  • Published: December 3, 2025