[Paper] Where Do the Joules Go? Diagnosing Inference Energy Consumption

Published: January 29, 2026 at 01:16 PM EST
4 min read
Source: arXiv - 2601.22076v1

Overview

The paper Where Do the Joules Go? Diagnosing Inference Energy Consumption offers the first large‑scale, systematic look at how much electricity modern generative‑AI models actually burn during inference. By measuring 46 models across seven tasks on NVIDIA H100 and B200 GPUs, the authors expose staggering energy gaps: up to 25× between different LLM tasks and more than 100× between video‑generation and image‑generation workloads. Their work goes beyond reporting the numbers; it builds a diagnostic framework that ties energy use to hidden factors like memory traffic and GPU utilization, giving developers a roadmap for energy‑aware optimization.

Key Contributions

  • Comprehensive measurement suite: 1,858 configuration points covering 46 models (LLMs, diffusion, GANs, etc.) on two state‑of‑the‑art GPUs.
  • Empirical energy taxonomy: Quantifies how task type, model size, batch size, precision, and hardware choice each affect inference energy, revealing order‑of‑magnitude differences.
  • Diagnostic framework: Introduces a layered model that maps observable metrics (time, power) to latent drivers (memory bandwidth, compute utilization, kernel efficiency).
  • Throughput‑per‑watt analysis: Extends the framework to the “performance per watt” metric that datacenter operators care about for cost and sustainability.
  • Open‑source tooling & dataset: Releases the measurement scripts and raw logs, enabling reproducibility and further community research.

Methodology

  1. Benchmark selection – The authors chose a diverse set of generative‑AI workloads (text generation, summarization, image diffusion, video synthesis, etc.) and representative model families (GPT‑style LLMs, Stable Diffusion, VQ‑GAN, etc.).
  2. Configuration sweep – For each model they varied batch size, precision (FP16/FP32/BF16), and inference mode (eager vs. compiled) to generate 1,858 distinct runs.
  3. Instrumentation – Power draw was captured via NVIDIA’s NVML API at 1 kHz resolution, while timestamps, GPU utilization, memory usage, and kernel statistics were logged with Nsight Systems.
  4. Normalization – Energy (Joules) was computed as the integral of power over the inference interval, then normalized by the number of generated tokens/images/frames to enable apples‑to‑apples comparisons.
  5. Framework construction – Using regression and correlation analysis, the authors identified latent variables (e.g., memory‑bound vs. compute‑bound phases) that best explain observed energy variance.

The approach is deliberately hardware‑agnostic: the same pipeline can be applied to any GPU that exposes power and performance counters, making the study reproducible on future accelerator generations.
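Steps 3–4 above can be sketched in a few lines. The sketch below uses synthetic samples so it is self‑contained; in a real setup the (timestamp, watts) pairs would come from polling NVML, and the trapezoidal rule is a standard stand‑in for however the authors' released tooling integrates power.

```python
# Sketch of instrumentation + normalization: integrate sampled GPU power
# into energy (Joules), then divide by generated tokens. Real samples would
# come from NVML (e.g., polling device power in a loop at ~1 kHz); the
# trace below is synthetic so the example runs anywhere.

def energy_joules(timestamps_s, power_w):
    """Trapezoidal integral of power over time: E = integral of P dt."""
    total = 0.0
    for t0, p0, t1, p1 in zip(timestamps_s, power_w,
                              timestamps_s[1:], power_w[1:]):
        total += 0.5 * (p0 + p1) * (t1 - t0)
    return total

def joules_per_token(timestamps_s, power_w, n_tokens):
    """Normalized energy, enabling apples-to-apples comparisons."""
    return energy_joules(timestamps_s, power_w) / n_tokens

# Synthetic 1 kHz-style trace: 2 seconds at a constant 400 W.
ts = [i / 1000 for i in range(2001)]
pw = [400.0] * len(ts)
print(energy_joules(ts, pw))          # ~800 J (400 W over 2 s)
print(joules_per_token(ts, pw, 256))  # ~3.125 J per token
```

The same normalization applies to images or frames; only the denominator changes per modality.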

Results & Findings

Observed energy impact, factor by factor:

  • LLM task type: up to 25× energy difference between, e.g., code generation and chat completion at the same model size.
  • Media modality: video generation consumes >100× the energy of single‑image diffusion for comparable visual quality.
  • GPU utilization: low utilization (≤30%) leads to 3–5× higher Joules per token compared to well‑packed batches.
  • Precision: switching from FP32 to BF16 cuts energy by ~30% with negligible quality loss for most tasks.
  • Batch size: increasing batch size up to the GPU's memory limit yields near‑linear energy‑efficiency gains, but oversubscribing memory causes spikes due to paging.
  • Hardware: the H100 outperforms the B200 by ~2× in throughput‑per‑watt for large LLMs, though the gap narrows for smaller diffusion models.

The diagnostic framework shows that memory bandwidth pressure is the primary driver of high energy consumption in video synthesis, while compute saturation dominates LLM token generation. Moreover, the authors demonstrate that throughput‑per‑watt can be maximized by co‑optimizing batch size, precision, and kernel fusion to keep both compute and memory pipelines busy.
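The throughput‑per‑watt metric itself is simple arithmetic: tokens per second divided by average watts gives tokens per joule (the reciprocal of Joules per token). The numbers below are illustrative placeholders, not measurements from the paper; they just show why packing the GPU well moves the metric.

```python
# Throughput-per-watt: the datacenter-facing metric the framework extends to.
# tokens/s divided by watts yields tokens per joule.

def tokens_per_joule(tokens_per_sec, avg_power_w):
    return tokens_per_sec / avg_power_w

# Two hypothetical serving configurations of the same model:
small_batch = tokens_per_joule(tokens_per_sec=1500, avg_power_w=450)  # underutilized
large_batch = tokens_per_joule(tokens_per_sec=5200, avg_power_w=620)  # well-packed

print(f"small batch: {small_batch:.2f} tok/J")
print(f"large batch: {large_batch:.2f} tok/J")
```

Note how the well‑packed configuration draws more power in absolute terms yet is far more energy‑efficient per token, which is the pattern the utilization finding above describes.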

Practical Implications

  • Model‑serving engineers can immediately apply the batch‑size‑and‑precision guidelines to cut operational costs without sacrificing quality.
  • Cloud providers gain a quantitative basis for pricing “energy‑aware” inference endpoints, potentially offering cheaper rates for workloads that stay in the compute‑bound regime.
  • Hardware architects receive concrete evidence that future GPUs should prioritize balanced memory bandwidth and on‑chip cache for video‑generation pipelines.
  • Sustainability teams can use the throughput‑per‑watt metric to benchmark datacenter upgrades and justify investments in newer accelerators.
  • Framework developers (e.g., PyTorch, TensorFlow) can integrate the authors’ profiling hooks to surface latent utilization metrics in their performance dashboards, giving developers actionable feedback during model deployment.

In short, the paper equips developers with a diagnostic checklist: measure power, monitor GPU utilization, adjust batch size/precision, and target the right hardware for the workload’s memory vs. compute profile.
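That checklist can be made concrete as a coarse classifier over observed utilization counters. The thresholds below are illustrative assumptions, not values from the paper, but the decision structure mirrors its diagnosis: memory pressure dominates video synthesis, compute saturation dominates LLM token generation.

```python
# A minimal sketch of the diagnostic checklist: given observed utilization
# fractions (0.0-1.0), flag whether a workload looks memory-bound,
# compute-bound, or simply underutilized, and suggest a first fix.
# Thresholds are hypothetical placeholders for illustration only.

def diagnose(sm_utilization, mem_bw_utilization):
    """Return (label, first optimization to try)."""
    if mem_bw_utilization > 0.8 and sm_utilization < 0.5:
        return "memory-bound", "cut memory traffic (kernel fusion, lower precision)"
    if sm_utilization > 0.8:
        return "compute-bound", "grow batch size up to the memory limit"
    return "underutilized", "pack larger batches to amortize static power draw"

print(diagnose(sm_utilization=0.35, mem_bw_utilization=0.90))  # memory-bound
print(diagnose(sm_utilization=0.92, mem_bw_utilization=0.40))  # compute-bound
print(diagnose(sm_utilization=0.25, mem_bw_utilization=0.30))  # underutilized
```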

Limitations & Future Work

  • Hardware scope: The study is limited to NVIDIA H100 and B200 GPUs; results may differ on AMD or specialized ASICs.
  • Inference‑only focus: Training energy dynamics are not explored, though many of the same latent factors likely apply.
  • Static workloads: Real‑world serving stacks often involve request multiplexing and dynamic batching, which could introduce additional variability not captured in the controlled experiments.
  • Model diversity: While 46 models are extensive, emerging multimodal transformers and retrieval‑augmented generation were not included.

The authors suggest extending the framework to heterogeneous clusters, incorporating dynamic workload scheduling, and exploring energy‑aware compiler optimizations as promising directions for follow‑up research.

Authors

  • Jae-Won Chung
  • Ruofan Wu
  • Jeff J. Ma
  • Mosharaf Chowdhury

Paper Information

  • arXiv ID: 2601.22076v1
  • Categories: cs.LG, cs.DC
  • Published: January 29, 2026