[Paper] Towards Green AI: Decoding the Energy of LLM Inference in Software Development
Source: arXiv - 2602.05712v1
Overview
Large language models (LLMs) are now the engine behind many AI‑assisted developer tools—code completion, bug‑fix suggestions, automated testing, and more. But every token they generate costs energy, and at scale this adds up to a non‑trivial carbon footprint. The paper Towards Green AI: Decoding the Energy of LLM Inference in Software Development dissects where that energy goes during inference and proposes a lightweight fix that can slash consumption by up to 89 % without hurting code‑generation quality.
Key Contributions
- Phase‑level energy profiling – separates the prefill (input encoding) and decoding (token generation) stages for a fine‑grained view of power draw.
- Empirical study on 10 transformer models – six 6‑7 B‑parameter and four 3‑4 B‑parameter models evaluated on code‑centric benchmarks (HumanEval for generation, LongBench for understanding).
- Discovery of “babbling” behavior – three models produce unnecessary filler tokens, inflating decoding energy.
- Babbling‑suppression technique – a simple post‑processing filter that trims superfluous output, delivering 44‑89 % energy savings while preserving generation accuracy.
- Quantified prefill‑decoding interaction – shows that higher prefill costs amplify per‑token decoding energy by 1.3 %–51.8 % depending on the model.
Methodology
- Model selection – The authors chose ten open‑source transformer LLMs ranging from 3 B to 7 B parameters, covering both decoder‑only and encoder‑decoder architectures commonly used in code‑related AI tools.
- Benchmarking –
- HumanEval: a suite of Python programming problems that measures a model’s ability to generate correct, runnable code.
- LongBench: a set of longer‑context code‑understanding tasks (e.g., code summarization, bug detection).
- Energy measurement – Inference runs were executed on identical hardware (NVIDIA A100 GPUs) while power draw was logged with a high‑resolution power meter. Energy was logged separately for:
- Prefill – processing the prompt and building KV‑cache.
- Decoding – generating each output token using the cached state.
- Babbling detection – Output streams were examined for low‑information “filler” tokens (e.g., repetitive comments, stray whitespace). A heuristic based on token entropy and length flagged babbling instances.
- Suppression strategy – When babbling was detected, the decoder was instructed to stop early or to prune low‑confidence tokens, effectively shortening the decoding phase.
All steps were scripted so that the pipeline can be reproduced on other models or hardware setups.
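The paper does not publish the exact detection heuristic, but the entropy-and-length idea described above can be sketched in a few lines. The function names, window size, and entropy floor below are illustrative assumptions, not the authors' implementation; `step_fn` stands in for whatever decoder produces the next token.

```python
import math
from collections import Counter

def window_entropy(tokens):
    """Shannon entropy (bits) of the token distribution in a window."""
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def is_babbling(tokens, window=32, entropy_floor=1.5, min_len=16):
    """Flag low-information output: a long, near-repetitive tail.

    window, entropy_floor, and min_len are illustrative values,
    not the thresholds used in the paper.
    """
    if len(tokens) < min_len:
        return False
    return window_entropy(tokens[-window:]) < entropy_floor

def generate_with_early_stop(step_fn, max_tokens=256):
    """Decode token by token, stopping once babbling is detected.

    step_fn() returns the next token string (stand-in for a real decoder).
    """
    out = []
    for _ in range(max_tokens):
        out.append(step_fn())
        if is_babbling(out):
            break
    return out
```

A model that keeps emitting the same filler token trips the entropy floor after a short window and decoding halts early, which is exactly where the reported decoding-energy savings would come from; varied, informative output keeps the window entropy high and is never truncated.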
Results & Findings
| Metric | 6‑7 B Models | 3‑4 B Models |
|---|---|---|
| Prefill energy share | 15‑30 % of total inference energy | 10‑25 % |
| Decoding energy per token | 0.45 J/token (baseline) | 0.30 J/token (baseline) |
| Prefill‑decoding amplification | +1.3 % to +51.8 % per‑token cost | +3.2 % to +38.4 % |
| Babbling prevalence | 3 / 6 models exhibited babbling | 0 / 4 models |
| Energy saved by babbling suppression | 44 %‑89 % reduction in decoding energy | 48 %‑85 % (where applicable) |
| Impact on generation accuracy | No statistically significant drop (HumanEval pass@1 unchanged) | No statistically significant drop |
Key takeaways
- Decoding dominates the energy budget (≈70‑85 % of total).
- A “heavy” prefill stage can make each subsequent token more expensive, likely because larger KV‑cache look‑ups stress memory bandwidth.
- Babbling is not a rare edge case; when present, it inflates decoding time and power dramatically.
- Simple early‑stop or token‑pruning heuristics can eliminate most of the waste without harming the functional output.
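The takeaways above can be made concrete with back-of-envelope arithmetic. The sketch below combines the reported per-token decode cost, the prefill-decoding amplification range, and the savings from trimming a babbled tail; the specific numbers in the usage example are illustrative, taken from the results table, not a re-measurement.

```python
def inference_energy(prefill_j, n_tokens, per_token_j, amplification=0.0):
    """Total inference energy: prefill plus amplified per-token decoding.

    amplification models the prefill-decoding coupling the paper reports
    (a heavier prefill raises per-token cost by 1.3%-51.8%).
    """
    decode_j = n_tokens * per_token_j * (1.0 + amplification)
    return prefill_j + decode_j

def savings_from_suppression(n_tokens, per_token_j, wasted_fraction):
    """Energy recovered by cutting a wasted (babbled) tail of the output."""
    return n_tokens * wasted_fraction * per_token_j

# Illustrative 6-7 B-model request: 20 J prefill, 200 output tokens at
# 0.45 J/token, with a 30% prefill-induced amplification.
total = inference_energy(20.0, 200, 0.45, amplification=0.30)  # ~137 J
# If 60% of those tokens were babble, suppression recovers ~54 J.
saved = savings_from_suppression(200, 0.45, 0.60)
```

Even in this toy calculation decoding accounts for roughly 85% of the total, consistent with the ~70-85% share reported above, which is why trimming the decode phase is the highest-leverage intervention.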
Practical Implications
| Audience | How to Apply the Findings |
|---|---|
| Tool developers (e.g., GitHub Copilot, Tabnine) | Integrate a lightweight babbling detector into the generation pipeline; stop decoding once confidence drops below a threshold. |
| Cloud AI service providers | Offer “green mode” APIs that cap prefill length or enforce KV‑cache size limits, reducing per‑request energy and cost. |
| DevOps / SRE teams | Monitor inference power per request; use the paper’s profiling methodology to set alerts for abnormal energy spikes (possible babbling). |
| Hardware architects | Prioritize memory bandwidth and cache‑friendly KV‑cache designs, as prefill‑decoding coupling suggests memory efficiency directly impacts energy per token. |
| Open‑source model maintainers | Publish model cards that include prefill/decoding energy profiles; consider training regimes that discourage repetitive filler generation. |
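The monitoring suggestion for DevOps/SRE teams could be sketched as a baseline-relative check: integrate sampled GPU power into joules per token, then flag requests that deviate sharply from recent history. The z-score threshold and sampling scheme here are hypothetical choices, not part of the paper's methodology.

```python
from statistics import mean, stdev

def energy_per_token(samples_w, interval_s, n_tokens):
    """Approximate decode energy (J/token) from power samples.

    samples_w: power readings in watts taken at a fixed interval_s
    (rectangle-rule integration of power over time gives joules).
    """
    joules = sum(samples_w) * interval_s
    return joules / max(n_tokens, 1)

def spike_alert(history_jpt, current_jpt, z_threshold=3.0):
    """Flag a request whose J/token deviates sharply from recent
    history - a possible sign of babbling. z_threshold is an
    illustrative choice, not a value from the paper."""
    if len(history_jpt) < 2:
        return False
    mu, sigma = mean(history_jpt), stdev(history_jpt)
    if sigma == 0:
        return current_jpt > mu
    return (current_jpt - mu) / sigma > z_threshold
```

In production the power samples would come from the GPU's telemetry interface rather than a list; the point of the sketch is that babbling shows up as an outlier in joules per token, so a cheap statistical alert can catch it without inspecting model output at all.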
Overall, the research gives a concrete, low‑overhead lever—babbling suppression—that can be dropped into existing inference stacks to achieve immediate sustainability gains. It also nudges the community toward more holistic energy‑aware benchmarking rather than focusing solely on latency or accuracy.
Limitations & Future Work
- Hardware scope – Experiments were confined to a single GPU generation (A100). Energy dynamics may differ on edge devices, CPUs, or upcoming accelerator architectures.
- Model diversity – Only transformer‑based LLMs in the 3‑7 B range were examined; larger models (e.g., 30 B+) or specialized code models (Codex, CodeLlama) could exhibit different prefill‑decoding relationships.
- Babbling definition – The heuristic is based on token entropy and length; more nuanced semantic analysis (e.g., detecting meaningless comments) could improve detection precision.
- User‑experience impact – While accuracy stayed stable in benchmark tests, real‑world developer workflows might be sensitive to early stopping or reduced verbosity. User studies are needed.
Future research directions suggested by the authors include extending the profiling framework to multi‑GPU and distributed inference setups, exploring training‑time interventions that reduce babbling propensity, and building standardized “green AI” benchmarks that combine energy, latency, and code‑quality metrics.
Authors
- Lola Solovyeva
- Fernando Castor
Paper Information
- arXiv ID: 2602.05712v1
- Categories: cs.SE, cs.AI
- Published: February 5, 2026