[Paper] TokenPowerBench: Benchmarking the Power Consumption of LLM Inference
Source: arXiv - 2512.03024v1
Overview
Large language models (LLMs) are now answering billions of queries daily, and the bulk of their energy use comes from inference, not training. The paper introduces TokenPowerBench, the first open‑source benchmark that lets engineers measure and analyze the power consumption of LLM inference down to the joule‑per‑token level, without needing expensive hardware meters.
Key Contributions
- Declarative benchmark configuration – a simple YAML/JSON interface to pick the model, prompt set, batch size, quantization, and inference engine (a config sketch follows this list).
- Unified power‑measurement layer – captures GPU, node, and whole‑system power using only software‑accessible counters (e.g., NVML/nvidia-smi, Intel RAPL), eliminating the need for external meters.
- Phase‑aligned metrics pipeline – splits energy accounting into prefill (context loading) and decode (token generation) for every request, yielding “joules per token” and “joules per prefill token”.
- Extensive evaluation – applied to Llama, Falcon, Qwen, and Mistral families ranging from 1 B to 405 B parameters, covering a variety of batch sizes, context lengths, parallelism strategies, and quantization schemes.
- Open‑source release – the full benchmark suite, data collection scripts, and analysis notebooks are publicly available to foster reproducible power‑efficiency research.
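For concreteness, the sketch below shows the kind of declarative run specification described above, written as a Python dict and dumped to YAML with PyYAML. The field names are illustrative assumptions, not TokenPowerBench's actual configuration schema.

```python
# Hypothetical run specification: field names are assumptions for illustration,
# not TokenPowerBench's actual configuration schema.
import yaml  # PyYAML

run_config = {
    "model": "meta-llama/Llama-2-7b-hf",   # model checkpoint
    "engine": "vllm",                      # inference backend
    "prompts": {
        "dataset": "sharegpt",             # prompt set
        "max_context_tokens": 512,         # cap on prompt length
    },
    "batch_size": 32,
    "quantization": "fp16",                # e.g., fp16 / bf16 / int8
    "power_sampling_interval_ms": 10,      # polling period for power counters
}

# Serialize to YAML so the run definition can be versioned alongside the code.
print(yaml.safe_dump(run_config, sort_keys=False))
```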
Methodology
- Configuration – Users write a short declarative file specifying the model checkpoint, the set of prompts (including length distribution), batch size, and the inference backend (e.g., HuggingFace Transformers, vLLM, TensorRT‑LLM).
- Instrumentation – During a run, TokenPowerBench polls power‑reading APIs (nvidia-smi/NVML for the GPU, Intel RAPL for the CPU, OS‑level counters for the whole node) at a configurable interval (default 10 ms); see the sketch after this list.
- Phase tagging – The benchmark inserts lightweight hooks into the inference loop to mark the start/end of the prefill and decode phases for each request.
- Energy attribution – Collected power samples are integrated over time and then allocated proportionally to the active phase(s), producing per‑token energy numbers.
- Analysis – A post‑processing script aggregates results across runs, normalizes by token count, and visualizes how batch size, context length, quantization (e.g., INT8, FP16), and parallelism (tensor‑ vs pipeline‑parallel) affect energy efficiency.
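The measurement and attribution steps can be pictured with a minimal sketch, assuming an NVIDIA GPU and the pynvml bindings. The phase timestamps are assumed to come from hooks like those described above; this is not the paper's actual implementation.

```python
# Minimal sketch of software-based power sampling and phase-aligned energy
# attribution. Assumes an NVIDIA GPU and the pynvml bindings; not the paper's
# implementation.
import time
import pynvml

def sample_gpu_power(duration_s, interval_s=0.01, gpu_index=0):
    """Poll GPU power draw and return a list of (timestamp_s, watts) samples.

    In practice this would run in a background thread alongside the
    inference loop rather than for a fixed duration.
    """
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
    samples = []
    end = time.time() + duration_s
    try:
        while time.time() < end:
            watts = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # mW -> W
            samples.append((time.time(), watts))
            time.sleep(interval_s)
    finally:
        pynvml.nvmlShutdown()
    return samples

def energy_joules(samples, t_start, t_end):
    """Integrate power over [t_start, t_end] with a simple rectangle rule."""
    total = 0.0
    for (t0, w0), (t1, _) in zip(samples, samples[1:]):
        lo, hi = max(t0, t_start), min(t1, t_end)
        if hi > lo:
            total += w0 * (hi - lo)  # watts * seconds = joules
    return total

def per_token_energy(samples, prefill_span, decode_span,
                     prompt_tokens, output_tokens):
    """Attribute integrated energy to the prefill and decode phases.

    prefill_span / decode_span are (start, end) timestamps recorded by
    hooks around the inference loop (assumed here).
    """
    j_prefill = energy_joules(samples, *prefill_span)
    j_decode = energy_joules(samples, *decode_span)
    return {
        "joules_per_prefill_token": j_prefill / max(prompt_tokens, 1),
        "joules_per_decode_token": j_decode / max(output_tokens, 1),
    }
```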
The whole pipeline runs on a single node or a multi‑node cluster, and because it relies only on standard system interfaces, it can be dropped into existing CI/CD pipelines for continuous power‑efficiency monitoring.
Results & Findings
| Model | Batch size | Context length (tokens) | Quantization | Decode energy (J / token) |
|---|---|---|---|---|
| Llama‑2‑7B | 1 | 512 | FP16 | 0.12 |
| Llama‑2‑7B | 32 | 512 | FP16 | 0.045 |
| Falcon‑40B | 8 | 1024 | INT8 | 0.09 |
| Mistral‑7B‑v0.1 | 16 | 2048 | FP16 | 0.07 |
| Llama‑3‑405B | 1 | 2048 | BF16 | 0.31 |
Key takeaways
- Batching wins – Scaling batch size from 1 to 32 cuts per‑token energy by ~60 % because GPU utilization rises sharply.
- Context length matters – Prefill energy grows roughly linearly with context size, while per‑token decode energy stays roughly flat.
- Quantization pays off – INT8 quantization reduces decode energy by ~25 % with negligible quality loss for many workloads.
- Parallelism trade‑offs – Tensor‑parallelism improves throughput but can increase total node‑level power; the benchmark quantifies the net joules‑per‑token impact.
- Frontier models are still expensive – The 405 B Llama‑3 model consumes >0.3 J per token, highlighting the need for aggressive quantization or specialized hardware for cost‑effective deployment.
Practical Implications
- Cost forecasting – Operators can plug TokenPowerBench into their deployment pipelines to predict electricity bills (e.g., $/M tokens) and compare cloud‑provider pricing models; a worked example follows this list.
- Sustainability reporting – The per‑token energy numbers enable precise carbon‑footprint calculations for LLM services, supporting ESG compliance.
- Hardware selection – By running the same benchmark on different GPUs (A100 vs H100 vs consumer‑grade RTX) developers can make data‑driven choices about hardware upgrades.
- Optimization loops – Teams can automatically test the impact of new quantization tricks, kernel libraries, or inference engines, closing the gap between research prototypes and production‑grade efficiency.
- Service‑level agreements (SLAs) – Energy‑aware metrics can be added to SLAs (e.g., “≤ 0.08 J per token for 99 % of requests”), giving customers transparency on operational sustainability.
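As a back‑of‑envelope illustration of the cost and carbon points above, the snippet below converts a per‑token energy figure from the results table into kWh, dollars, and CO2 for one million generated tokens. The electricity price, grid carbon intensity, and PUE are assumed values, not numbers from the paper.

```python
# Back-of-envelope conversion from per-token energy to cost and carbon.
# 0.045 J/token is the batched Llama-2-7B decode figure from the results
# table; price, grid carbon intensity, and PUE are assumed values.
JOULES_PER_TOKEN = 0.045        # from the results table (batch size 32, FP16)
TOKENS = 1_000_000              # one million generated tokens
PUE = 1.2                       # assumed datacenter power usage effectiveness
PRICE_PER_KWH = 0.10            # assumed electricity price, USD/kWh
KG_CO2_PER_KWH = 0.4            # assumed grid carbon intensity

kwh = JOULES_PER_TOKEN * TOKENS * PUE / 3.6e6   # 1 kWh = 3.6e6 J
print(f"Energy per 1M tokens : {kwh:.3f} kWh")
print(f"Electricity cost     : ${kwh * PRICE_PER_KWH:.4f}")
print(f"CO2 footprint        : {kwh * KG_CO2_PER_KWH:.3f} kg")
```

Note that this counts only the measured accelerator energy; whole‑facility figures would also include cooling and other overheads the paper lists as limitations.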
Limitations & Future Work
- Hardware dependency – The current power‑reading approach works best on NVIDIA GPUs and Intel CPUs; AMD or ARM platforms need additional adapters.
- Granularity of phase tagging – Very short prompts (< 10 tokens) can cause timing noise, making per‑token attribution less stable.
- Model‑specific overheads – The benchmark does not yet capture memory‑controller power or cooling system variations that can dominate in large‑scale clusters.
- Future directions – Extending support to edge‑device inference, integrating with emerging low‑power accelerators (e.g., Intel's Habana Gaudi), and adding automated "energy‑budget" tuning loops that adjust batch size or quantization on the fly.
Authors
- Chenxu Niu
- Wei Zhang
- Jie Li
- Yongjian Zhao
- Tongyang Wang
- Xi Wang
- Yong Chen
Paper Information
- arXiv ID: 2512.03024v1
- Categories: cs.LG, cs.AI, cs.CY, cs.DC
- Published: December 2, 2025