[Paper] Towards Green AI: Decoding the Energy of LLM Inference in Software Development
Source: arXiv - 2602.05712v1
Overview
Large language models (LLMs) are now the engine behind many AI‑assisted developer tools—code completion, bug‑fix suggestions, automated testing, and more. But every token they generate costs energy, and at scale this adds up to a non‑trivial carbon footprint. The paper Towards Green AI: Decoding the Energy of LLM Inference in Software Development dissects where that energy goes during inference and proposes a lightweight fix that can slash consumption by up to 89 % without hurting code‑generation quality.
Key Contributions
- Phase‑level energy profiling – separates the prefill (input encoding) and decoding (token generation) stages for a fine‑grained view of power draw.
- Empirical study on 10 transformer models – six 6‑7 B‑parameter and four 3‑4 B‑parameter models evaluated on code‑centric benchmarks (HumanEval for generation, LongBench for understanding).
- Discovery of “babbling” behavior – three models produce unnecessary filler tokens, inflating decoding energy.
- Babbling‑suppression technique – a simple post‑processing filter that trims superfluous output, delivering 44‑89 % energy savings while preserving generation accuracy.
- Quantified prefill‑decoding interaction – shows that higher prefill costs amplify per‑token decoding energy by 1.3 %–51.8 % depending on the model.
Methodology
- Model selection – The authors chose ten open‑source transformer LLMs ranging from 3 B to 7 B parameters, covering both decoder‑only and encoder‑decoder architectures commonly used in code‑related AI tools.
- Benchmarking –
- HumanEval: a suite of Python programming problems that measures a model’s ability to generate correct, runnable code.
- LongBench: a set of longer‑context code‑understanding tasks (e.g., code summarization, bug detection).
- Energy measurement – Inference runs were executed on identical hardware (NVIDIA A100 GPUs) while power draw was logged with a high‑resolution power meter. Energy was logged separately for:
- Prefill – processing the prompt and building KV‑cache.
- Decoding – generating each output token using the cached state.
- Babbling detection – Output streams were examined for low‑information “filler” tokens (e.g., repetitive comments, stray whitespace). A heuristic based on token entropy and length flagged babbling instances.
- Suppression strategy – When babbling was detected, the decoder was instructed to stop early or to prune low‑confidence tokens, effectively shortening the decoding phase.
All steps were scripted so that the pipeline can be reproduced on other models or hardware setups.
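The paper does not publish the exact detection heuristic, but the entropy-and-length idea described above can be sketched in a few lines. The function names, window size, and entropy floor below are illustrative assumptions, not the authors' implementation; `step_fn` stands in for whatever decoder produces the next token.

```python
import math
from collections import Counter

def window_entropy(tokens):
    """Shannon entropy (bits) of the token distribution in a window."""
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def is_babbling(tokens, window=32, entropy_floor=1.5, min_len=16):
    """Flag low-information output: a long, near-repetitive tail.

    window, entropy_floor, and min_len are illustrative values,
    not the thresholds used in the paper.
    """
    if len(tokens) < min_len:
        return False
    return window_entropy(tokens[-window:]) < entropy_floor

def generate_with_early_stop(step_fn, max_tokens=256):
    """Decode token by token, stopping once babbling is detected.

    step_fn() returns the next token string (stand-in for a real decoder).
    """
    out = []
    for _ in range(max_tokens):
        out.append(step_fn())
        if is_babbling(out):
            break
    return out
```

A model that keeps emitting the same filler token trips the entropy floor after a short window and decoding halts early, which is exactly where the reported decoding-energy savings would come from; varied, informative output keeps the window entropy high and is never truncated.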
Results & Findings
| Metric | 6‑7 B Models | 3‑4 B Models |
|---|---|---|
| Prefill energy share | 15‑30 % of total inference energy | 10‑25 % |
| Decoding energy per token | 0.45 J/token (baseline) | 0.30 J/token (baseline) |
| Prefill‑decoding amplification | +1.3 % to +51.8 % per‑token cost | +3.2 % to +38.4 % |
| Babbling prevalence | 3 / 6 models exhibited babbling | 0 / 4 models |
| Energy saved by babbling suppression | 44 %‑89 % reduction in decoding energy | 48 %‑85 % (where applicable) |
| Impact on generation accuracy | No statistically significant drop (HumanEval pass@1 unchanged) | No statistically significant drop |
Key takeaways
- Decoding dominates the energy budget (≈70‑85 % of total).
- A “heavy” prefill stage can make each subsequent token more expensive, likely because larger KV‑cache look‑ups stress memory bandwidth.
- Babbling is not a rare edge case; when present, it inflates decoding time and power dramatically.
- Simple early‑stop or token‑pruning heuristics can eliminate most of the waste without harming the functional output.
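The takeaways above can be made concrete with back-of-envelope arithmetic. The sketch below combines the reported per-token decode cost, the prefill-decoding amplification range, and the savings from trimming a babbled tail; the specific numbers in the usage example are illustrative, taken from the results table, not a re-measurement.

```python
def inference_energy(prefill_j, n_tokens, per_token_j, amplification=0.0):
    """Total inference energy: prefill plus amplified per-token decoding.

    amplification models the prefill-decoding coupling the paper reports
    (a heavier prefill raises per-token cost by 1.3%-51.8%).
    """
    decode_j = n_tokens * per_token_j * (1.0 + amplification)
    return prefill_j + decode_j

def savings_from_suppression(n_tokens, per_token_j, wasted_fraction):
    """Energy recovered by cutting a wasted (babbled) tail of the output."""
    return n_tokens * wasted_fraction * per_token_j

# Illustrative 6-7 B-model request: 20 J prefill, 200 output tokens at
# 0.45 J/token, with a 30% prefill-induced amplification.
total = inference_energy(20.0, 200, 0.45, amplification=0.30)  # ~137 J
# If 60% of those tokens were babble, suppression recovers ~54 J.
saved = savings_from_suppression(200, 0.45, 0.60)
```

Even in this toy calculation decoding accounts for roughly 85% of the total, consistent with the ~70-85% share reported above, which is why trimming the decode phase is the highest-leverage intervention.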
Practical Implications
| Audience | How to Apply the Findings |
|---|---|
| Tool developers (e.g., GitHub Copilot, Tabnine) | Integrate a lightweight babbling detector into the generation pipeline; stop decoding once confidence drops below a threshold. |
| Cloud AI service providers | Offer “green mode” APIs that cap prefill length or enforce KV‑cache size limits, reducing per‑request energy and cost. |
| DevOps / SRE teams | Monitor inference power per request; use the paper’s profiling methodology to set alerts for abnormal energy spikes (possible babbling). |
| Hardware architects | Prioritize memory bandwidth and cache‑friendly KV‑cache designs, as prefill‑decoding coupling suggests memory efficiency directly impacts energy per token. |
| Open‑source model maintainers | Publish model cards that include prefill/decoding energy profiles; consider training regimes that discourage repetitive filler generation. |
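The monitoring suggestion for DevOps/SRE teams could be sketched as a baseline-relative check: integrate sampled GPU power into joules per token, then flag requests that deviate sharply from recent history. The z-score threshold and sampling scheme here are hypothetical choices, not part of the paper's methodology.

```python
from statistics import mean, stdev

def energy_per_token(samples_w, interval_s, n_tokens):
    """Approximate decode energy (J/token) from power samples.

    samples_w: power readings in watts taken at a fixed interval_s
    (rectangle-rule integration of power over time gives joules).
    """
    joules = sum(samples_w) * interval_s
    return joules / max(n_tokens, 1)

def spike_alert(history_jpt, current_jpt, z_threshold=3.0):
    """Flag a request whose J/token deviates sharply from recent
    history - a possible sign of babbling. z_threshold is an
    illustrative choice, not a value from the paper."""
    if len(history_jpt) < 2:
        return False
    mu, sigma = mean(history_jpt), stdev(history_jpt)
    if sigma == 0:
        return current_jpt > mu
    return (current_jpt - mu) / sigma > z_threshold
```

In production the power samples would come from the GPU's telemetry interface rather than a list; the point of the sketch is that babbling shows up as an outlier in joules per token, so a cheap statistical alert can catch it without inspecting model output at all.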
Overall, the research gives a concrete, low‑overhead lever—babbling suppression—that can be dropped into existing inference stacks to achieve immediate sustainability gains. It also nudges the community toward more holistic energy‑aware benchmarking rather than focusing solely on latency or accuracy.
Limitations & Future Work
- Hardware scope – Experiments were confined to a single GPU generation (A100). Energy dynamics may differ on edge devices, CPUs, or upcoming accelerator architectures.
- Model diversity – Only transformer‑based LLMs in the 3‑7 B range were examined; larger models (e.g., 30 B+) or specialized code models (Codex, CodeLlama) could exhibit different prefill‑decoding relationships.
- Babbling definition – The heuristic is based on token entropy and length; more nuanced semantic analysis (e.g., detecting meaningless comments) could improve detection precision.
- User‑experience impact – While accuracy stayed stable in benchmark tests, real‑world developer workflows might be sensitive to early stopping or reduced verbosity. User studies are needed.
Future research directions suggested by the authors include extending the profiling framework to multi‑GPU and distributed inference setups, exploring training‑time interventions that reduce babbling propensity, and building standardized “green AI” benchmarks that combine energy, latency, and code‑quality metrics.
Authors
- Lola Solovyeva
- Fernando Castor
Paper Information
- arXiv ID: 2602.05712v1
- Categories: cs.SE, cs.AI
- Published: February 5, 2026