[Paper] Where Do the Joules Go? Diagnosing Inference Energy Consumption

Published: January 29, 2026 at 01:16 PM EST
4 min read
Source: arXiv - 2601.22076v1

Overview

The paper Where Do the Joules Go? Diagnosing Inference Energy Consumption offers the first large‑scale, systematic look at how much electricity modern generative‑AI models actually burn during inference. By measuring 46 models across seven tasks on NVIDIA H100 and B200 GPUs, the authors expose staggering energy gaps: up to 25× between different LLM tasks and more than 100× between video‑generation and image‑generation workloads. Their work goes beyond reporting the numbers; it builds a diagnostic framework that ties energy use to hidden factors like memory traffic and GPU utilization, giving developers a roadmap for energy‑aware optimization.

Key Contributions

  • Comprehensive measurement suite: 1,858 configuration points covering 46 models (LLMs, diffusion, GANs, etc.) on two state‑of‑the‑art GPUs.
  • Empirical energy taxonomy: Quantifies how task type, model size, batch size, precision, and hardware choice each affect inference energy, revealing order‑of‑magnitude differences.
  • Diagnostic framework: Introduces a layered model that maps observable metrics (time, power) to latent drivers (memory bandwidth, compute utilization, kernel efficiency).
  • Throughput‑per‑watt analysis: Extends the framework to the “performance per watt” metric that datacenter operators care about for cost and sustainability.
  • Open‑source tooling & dataset: Releases the measurement scripts and raw logs, enabling reproducibility and further community research.

Methodology

  1. Benchmark selection – The authors chose a diverse set of generative‑AI workloads (text generation, summarization, image diffusion, video synthesis, etc.) and representative model families (GPT‑style LLMs, Stable Diffusion, VQ‑GAN, etc.).
  2. Configuration sweep – For each model they varied batch size, precision (FP16/FP32/BF16), and inference mode (eager vs. compiled) to generate 1,858 distinct runs.
  3. Instrumentation – Power draw was captured via NVIDIA’s NVML API at 1 kHz resolution, while timestamps, GPU utilization, memory usage, and kernel statistics were logged with Nsight Systems.
  4. Normalization – Energy (Joules) was computed as the integral of power over the inference interval, then normalized by the number of generated tokens/images/frames to enable apples‑to‑apples comparisons.
  5. Framework construction – Using regression and correlation analysis, the authors identified latent variables (e.g., memory‑bound vs. compute‑bound phases) that best explain observed energy variance.

The approach is deliberately hardware‑agnostic: the same pipeline can be applied to any GPU that exposes power and performance counters, making the study reproducible on future accelerator generations.
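Steps 3–4 above can be sketched in a few lines. The sketch below uses synthetic samples so it is self‑contained; in a real setup the (timestamp, watts) pairs would come from polling NVML, and the trapezoidal rule is a standard stand‑in for however the authors' released tooling integrates power.

```python
# Sketch of instrumentation + normalization: integrate sampled GPU power
# into energy (Joules), then divide by generated tokens. Real samples would
# come from NVML (e.g., polling device power in a loop at ~1 kHz); the
# trace below is synthetic so the example runs anywhere.

def energy_joules(timestamps_s, power_w):
    """Trapezoidal integral of power over time: E = integral of P dt."""
    total = 0.0
    for t0, p0, t1, p1 in zip(timestamps_s, power_w,
                              timestamps_s[1:], power_w[1:]):
        total += 0.5 * (p0 + p1) * (t1 - t0)
    return total

def joules_per_token(timestamps_s, power_w, n_tokens):
    """Normalized energy, enabling apples-to-apples comparisons."""
    return energy_joules(timestamps_s, power_w) / n_tokens

# Synthetic 1 kHz-style trace: 2 seconds at a constant 400 W.
ts = [i / 1000 for i in range(2001)]
pw = [400.0] * len(ts)
print(energy_joules(ts, pw))          # ~800 J (400 W over 2 s)
print(joules_per_token(ts, pw, 256))  # ~3.125 J per token
```

The same normalization applies to images or frames; only the denominator changes per modality.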

Results & Findings

Observed energy impact, factor by factor:

  • LLM task type: up to 25× energy difference between, e.g., code generation and chat completion at the same model size.
  • Media modality: video generation consumes >100× the energy of single‑image diffusion for comparable visual quality.
  • GPU utilization: low utilization (≤30%) leads to 3–5× higher Joules per token compared to well‑packed batches.
  • Precision: switching from FP32 to BF16 cuts energy by ~30% with negligible quality loss for most tasks.
  • Batch size: increasing batch size up to the GPU's memory limit yields near‑linear energy‑efficiency gains, but oversubscribing memory causes spikes due to paging.
  • Hardware: the H100 outperforms the B200 by ~2× in throughput‑per‑watt for large LLMs, though the gap narrows for smaller diffusion models.

The diagnostic framework shows that memory bandwidth pressure is the primary driver of high energy consumption in video synthesis, while compute saturation dominates LLM token generation. Moreover, the authors demonstrate that throughput‑per‑watt can be maximized by co‑optimizing batch size, precision, and kernel fusion to keep both compute and memory pipelines busy.
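The throughput‑per‑watt metric itself is simple arithmetic: tokens per second divided by average watts gives tokens per joule (the reciprocal of Joules per token). The numbers below are illustrative placeholders, not measurements from the paper; they just show why packing the GPU well moves the metric.

```python
# Throughput-per-watt: the datacenter-facing metric the framework extends to.
# tokens/s divided by watts yields tokens per joule.

def tokens_per_joule(tokens_per_sec, avg_power_w):
    return tokens_per_sec / avg_power_w

# Two hypothetical serving configurations of the same model:
small_batch = tokens_per_joule(tokens_per_sec=1500, avg_power_w=450)  # underutilized
large_batch = tokens_per_joule(tokens_per_sec=5200, avg_power_w=620)  # well-packed

print(f"small batch: {small_batch:.2f} tok/J")
print(f"large batch: {large_batch:.2f} tok/J")
```

Note how the well‑packed configuration draws more power in absolute terms yet is far more energy‑efficient per token, which is the pattern the utilization finding above describes.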

Practical Implications

  • Model‑serving engineers can immediately apply the batch‑size‑and‑precision guidelines to cut operational costs without sacrificing quality.
  • Cloud providers gain a quantitative basis for pricing “energy‑aware” inference endpoints, potentially offering cheaper rates for workloads that stay in the compute‑bound regime.
  • Hardware architects receive concrete evidence that future GPUs should prioritize balanced memory bandwidth and on‑chip cache for video‑generation pipelines.
  • Sustainability teams can use the throughput‑per‑watt metric to benchmark datacenter upgrades and justify investments in newer accelerators.
  • Framework developers (e.g., PyTorch, TensorFlow) can integrate the authors’ profiling hooks to surface latent utilization metrics in their performance dashboards, giving developers actionable feedback during model deployment.

In short, the paper equips developers with a diagnostic checklist: measure power, monitor GPU utilization, adjust batch size/precision, and target the right hardware for the workload’s memory vs. compute profile.
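That checklist can be made concrete as a coarse classifier over observed utilization counters. The thresholds below are illustrative assumptions, not values from the paper, but the decision structure mirrors its diagnosis: memory pressure dominates video synthesis, compute saturation dominates LLM token generation.

```python
# A minimal sketch of the diagnostic checklist: given observed utilization
# fractions (0.0-1.0), flag whether a workload looks memory-bound,
# compute-bound, or simply underutilized, and suggest a first fix.
# Thresholds are hypothetical placeholders for illustration only.

def diagnose(sm_utilization, mem_bw_utilization):
    """Return (label, first optimization to try)."""
    if mem_bw_utilization > 0.8 and sm_utilization < 0.5:
        return "memory-bound", "cut memory traffic (kernel fusion, lower precision)"
    if sm_utilization > 0.8:
        return "compute-bound", "grow batch size up to the memory limit"
    return "underutilized", "pack larger batches to amortize static power draw"

print(diagnose(sm_utilization=0.35, mem_bw_utilization=0.90))  # memory-bound
print(diagnose(sm_utilization=0.92, mem_bw_utilization=0.40))  # compute-bound
print(diagnose(sm_utilization=0.25, mem_bw_utilization=0.30))  # underutilized
```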

Limitations & Future Work

  • Hardware scope: The study is limited to NVIDIA H100 and B200 GPUs; results may differ on AMD or specialized ASICs.
  • Inference‑only focus: Training energy dynamics are not explored, though many of the same latent factors likely apply.
  • Static workloads: Real‑world serving stacks often involve request multiplexing and dynamic batching, which could introduce additional variability not captured in the controlled experiments.
  • Model diversity: While 46 models are extensive, emerging multimodal transformers and retrieval‑augmented generation were not included.

The authors suggest extending the framework to heterogeneous clusters, incorporating dynamic workload scheduling, and exploring energy‑aware compiler optimizations as promising directions for follow‑up research.

Authors

  • Jae-Won Chung
  • Ruofan Wu
  • Jeff J. Ma
  • Mosharaf Chowdhury

Paper Information

  • arXiv ID: 2601.22076v1
  • Categories: cs.LG, cs.DC
  • Published: January 29, 2026