[Paper] TokenPowerBench: Benchmarking the Power Consumption of LLM Inference
Source: arXiv - 2512.03024v1
Overview
Large language models (LLMs) are now answering billions of queries daily, and the bulk of their energy use comes from inference, not training. The paper introduces TokenPowerBench, the first open‑source benchmark that lets engineers measure and analyze the power consumption of LLM inference down to the joule‑per‑token level, without needing expensive hardware meters.
Key Contributions
- Declarative benchmark configuration – a simple YAML/JSON interface to pick the model, prompt set, batch size, quantization, and inference engine (a config sketch follows this list).
- Unified power‑measurement layer – captures GPU, node, and whole‑system power using only software‑accessible counters (e.g., NVML/nvidia-smi, Intel RAPL), eliminating the need for external meters.
- Phase‑aligned metrics pipeline – splits energy accounting into prefill (context loading) and decode (token generation) for every request, yielding “joules per token” and “joules per prefill token”.
- Extensive evaluation – applied to Llama, Falcon, Qwen, and Mistral families ranging from 1 B to 405 B parameters, covering a variety of batch sizes, context lengths, parallelism strategies, and quantization schemes.
- Open‑source release – the full benchmark suite, data collection scripts, and analysis notebooks are publicly available to foster reproducible power‑efficiency research.
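For concreteness, the sketch below shows the kind of declarative run specification described above, written as a Python dict and dumped to YAML with PyYAML. The field names are illustrative assumptions, not TokenPowerBench's actual configuration schema.

```python
# Hypothetical run specification: field names are assumptions for illustration,
# not TokenPowerBench's actual configuration schema.
import yaml  # PyYAML

run_config = {
    "model": "meta-llama/Llama-2-7b-hf",   # model checkpoint
    "engine": "vllm",                      # inference backend
    "prompts": {
        "dataset": "sharegpt",             # prompt set
        "max_context_tokens": 512,         # cap on prompt length
    },
    "batch_size": 32,
    "quantization": "fp16",                # e.g., fp16 / bf16 / int8
    "power_sampling_interval_ms": 10,      # polling period for power counters
}

# Serialize to YAML so the run definition can be versioned alongside the code.
print(yaml.safe_dump(run_config, sort_keys=False))
```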
Methodology
- Configuration – Users write a short declarative file specifying the model checkpoint, the set of prompts (including length distribution), batch size, and the inference backend (e.g., HuggingFace Transformers, vLLM, TensorRT‑LLM).
- Instrumentation – During a run, TokenPowerBench polls power‑reading APIs (nvidia-smi/NVML for the GPU, Intel RAPL for the CPU, OS‑level counters for the whole node) at a configurable interval (default 10 ms); see the sketch after this list.
- Phase tagging – The benchmark inserts lightweight hooks into the inference loop to mark the start/end of the prefill and decode phases for each request.
- Energy attribution – Collected power samples are integrated over time and then allocated proportionally to the active phase(s), producing per‑token energy numbers.
- Analysis – A post‑processing script aggregates results across runs, normalizes by token count, and visualizes how batch size, context length, quantization (e.g., INT8, FP16), and parallelism (tensor‑ vs pipeline‑parallel) affect energy efficiency.
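The measurement and attribution steps can be pictured with a minimal sketch, assuming an NVIDIA GPU and the pynvml bindings. The phase timestamps are assumed to come from hooks like those described above; this is not the paper's actual implementation.

```python
# Minimal sketch of software-based power sampling and phase-aligned energy
# attribution. Assumes an NVIDIA GPU and the pynvml bindings; not the paper's
# implementation.
import time
import pynvml

def sample_gpu_power(duration_s, interval_s=0.01, gpu_index=0):
    """Poll GPU power draw and return a list of (timestamp_s, watts) samples.

    In practice this would run in a background thread alongside the
    inference loop rather than for a fixed duration.
    """
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
    samples = []
    end = time.time() + duration_s
    try:
        while time.time() < end:
            watts = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # mW -> W
            samples.append((time.time(), watts))
            time.sleep(interval_s)
    finally:
        pynvml.nvmlShutdown()
    return samples

def energy_joules(samples, t_start, t_end):
    """Integrate power over [t_start, t_end] with a simple rectangle rule."""
    total = 0.0
    for (t0, w0), (t1, _) in zip(samples, samples[1:]):
        lo, hi = max(t0, t_start), min(t1, t_end)
        if hi > lo:
            total += w0 * (hi - lo)  # watts * seconds = joules
    return total

def per_token_energy(samples, prefill_span, decode_span,
                     prompt_tokens, output_tokens):
    """Attribute integrated energy to the prefill and decode phases.

    prefill_span / decode_span are (start, end) timestamps recorded by
    hooks around the inference loop (assumed here).
    """
    j_prefill = energy_joules(samples, *prefill_span)
    j_decode = energy_joules(samples, *decode_span)
    return {
        "joules_per_prefill_token": j_prefill / max(prompt_tokens, 1),
        "joules_per_decode_token": j_decode / max(output_tokens, 1),
    }
```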
The whole pipeline runs on a single node or a multi‑node cluster, and because it relies only on standard system interfaces, it can be dropped into existing CI/CD pipelines for continuous power‑efficiency monitoring.
Results & Findings
| Model | Batch size | Context length (tokens) | Quantization | Decode energy (J / token) |
|---|---|---|---|---|
| Llama‑2‑7B | 1 | 512 | FP16 | 0.12 |
| Llama‑2‑7B | 32 | 512 | FP16 | 0.045 |
| Falcon‑40B | 8 | 1024 | INT8 | 0.09 |
| Mistral‑7B‑v0.1 | 16 | 2048 | FP16 | 0.07 |
| Llama‑3‑405B | 1 | 2048 | BF16 | 0.31 |
Key takeaways
- Batching wins – Scaling batch size from 1 to 32 cuts per‑token energy by ~60 % because GPU utilization rises sharply.
- Context length matters – Prefill energy grows roughly linearly with context size, while per‑token decode energy stays roughly flat.
- Quantization pays off – INT8 quantization reduces decode energy by ~25 % with negligible quality loss for many workloads.
- Parallelism trade‑offs – Tensor‑parallelism improves throughput but can increase total node‑level power; the benchmark quantifies the net joules‑per‑token impact.
- Frontier models are still expensive – The 405 B Llama‑3 model consumes >0.3 J per token, highlighting the need for aggressive quantization or specialized hardware for cost‑effective deployment.
Practical Implications
- Cost forecasting – Operators can plug TokenPowerBench into their deployment pipelines to predict electricity bills (e.g., $/M tokens) and compare cloud‑provider pricing models; a worked example follows this list.
- Sustainability reporting – The per‑token energy numbers enable precise carbon‑footprint calculations for LLM services, supporting ESG compliance.
- Hardware selection – By running the same benchmark on different GPUs (A100 vs H100 vs consumer‑grade RTX) developers can make data‑driven choices about hardware upgrades.
- Optimization loops – Teams can automatically test the impact of new quantization tricks, kernel libraries, or inference engines, closing the gap between research prototypes and production‑grade efficiency.
- Service‑level agreements (SLAs) – Energy‑aware metrics can be added to SLAs (e.g., “≤ 0.08 J per token for 99 % of requests”), giving customers transparency on operational sustainability.
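As a back‑of‑envelope illustration of the cost and carbon points above, the snippet below converts a per‑token energy figure from the results table into kWh, dollars, and CO2 for one million generated tokens. The electricity price, grid carbon intensity, and PUE are assumed values, not numbers from the paper.

```python
# Back-of-envelope conversion from per-token energy to cost and carbon.
# 0.045 J/token is the batched Llama-2-7B decode figure from the results
# table; price, grid carbon intensity, and PUE are assumed values.
JOULES_PER_TOKEN = 0.045        # from the results table (batch size 32, FP16)
TOKENS = 1_000_000              # one million generated tokens
PUE = 1.2                       # assumed datacenter power usage effectiveness
PRICE_PER_KWH = 0.10            # assumed electricity price, USD/kWh
KG_CO2_PER_KWH = 0.4            # assumed grid carbon intensity

kwh = JOULES_PER_TOKEN * TOKENS * PUE / 3.6e6   # 1 kWh = 3.6e6 J
print(f"Energy per 1M tokens : {kwh:.3f} kWh")
print(f"Electricity cost     : ${kwh * PRICE_PER_KWH:.4f}")
print(f"CO2 footprint        : {kwh * KG_CO2_PER_KWH:.3f} kg")
```

Note that this counts only the measured accelerator energy; whole‑facility figures would also include cooling and other overheads the paper lists as limitations.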
Limitations & Future Work
- Hardware dependency – The current power‑reading approach works best on NVIDIA GPUs and Intel CPUs; AMD or ARM platforms need additional adapters.
- Granularity of phase tagging – Very short prompts (< 10 tokens) can cause timing noise, making per‑token attribution less stable.
- Model‑specific overheads – The benchmark does not yet capture memory‑controller power or cooling system variations that can dominate in large‑scale clusters.
- Future directions – Extending support to edge‑device inference, integrating with emerging low‑power accelerators (e.g., Intel's Habana Gaudi), and adding automated "energy‑budget" tuning loops that adjust batch size or quantization on the fly.
Authors
- Chenxu Niu
- Wei Zhang
- Jie Li
- Yongjian Zhao
- Tongyang Wang
- Xi Wang
- Yong Chen
Paper Information
- arXiv ID: 2512.03024v1
- Categories: cs.LG, cs.AI, cs.CY, cs.DC
- Published: December 2, 2025