[Paper] Towards Green AI: Decoding the Energy of LLM Inference in Software Development

Published: February 5, 2026 at 09:38 AM EST
4 min read
Source: arXiv

Overview

Large language models (LLMs) are now the engine behind many AI‑assisted developer tools—code completion, bug‑fix suggestions, automated testing, and more. But every token they generate costs energy, and at scale this adds up to a non‑trivial carbon footprint. The paper Towards Green AI: Decoding the Energy of LLM Inference in Software Development dissects where that energy goes during inference and proposes a lightweight fix that can slash consumption by up to 89 % without hurting code‑generation quality.

Key Contributions

  • Phase‑level energy profiling – separates the prefill (input encoding) and decoding (token generation) stages for a fine‑grained view of power draw.
  • Empirical study on 10 transformer models – six 6‑7 B‑parameter and four 3‑4 B‑parameter models evaluated on code‑centric benchmarks (HumanEval for generation, LongBench for understanding).
  • Discovery of “babbling” behavior – three models produce unnecessary filler tokens, inflating decoding energy.
  • Babbling‑suppression technique – a simple post‑processing filter that trims superfluous output, delivering 44‑89 % energy savings while preserving generation accuracy.
  • Quantified prefill‑decoding interaction – shows that higher prefill costs amplify per‑token decoding energy by 1.3 %–51.8 % depending on the model.

Methodology

  1. Model selection – The authors chose ten open‑source transformer LLMs ranging from 3 B to 7 B parameters, covering both decoder‑only and encoder‑decoder architectures commonly used in code‑related AI tools.
  2. Benchmarking
    • HumanEval: a suite of Python programming problems that measures a model’s ability to generate correct, runnable code.
    • LongBench: a set of longer‑context code‑understanding tasks (e.g., code summarization, bug detection).
  3. Energy measurement – Inference runs were executed on identical hardware (NVIDIA A100 GPUs) while power draw was logged with a high‑resolution power meter. Energy was logged separately for:
    • Prefill – processing the prompt and building KV‑cache.
    • Decoding – generating each output token using the cached state.
  4. Babbling detection – Output streams were examined for low‑information “filler” tokens (e.g., repetitive comments, stray whitespace). A heuristic based on token entropy and length flagged babbling instances.
  5. Suppression strategy – When babbling was detected, the decoder was instructed to stop early or to prune low‑confidence tokens, effectively shortening the decoding phase.
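The entropy‑and‑length heuristic from step 4 can be sketched in a few lines of Python. The specific thresholds below (2.0 bits, 50 tokens) are illustrative assumptions, not values taken from the paper:

```python
import math
from collections import Counter

def token_entropy(tokens):
    """Shannon entropy (bits) of the empirical token distribution."""
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def looks_like_babbling(tokens, entropy_threshold=2.0, min_length=50):
    """Flag long, low-entropy outputs: many tokens drawn from only a few
    distinct values is the signature of repetitive filler."""
    if len(tokens) < min_length:
        return False
    return token_entropy(tokens) < entropy_threshold

# A highly repetitive stream is flagged; a diverse one of the same length is not.
repetitive = ["#", "comment"] * 40          # 80 tokens, 2 distinct -> 1.0 bit
diverse = [f"tok{i}" for i in range(80)]    # 80 distinct -> ~6.3 bits
print(looks_like_babbling(repetitive))  # True
print(looks_like_babbling(diverse))     # False
```

A production detector would run incrementally over the decode stream rather than on the finished output, but the same entropy signal applies.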

All steps were scripted so that the pipeline can be reproduced on other models or hardware setups.
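Step 3's per‑phase accounting amounts to integrating sampled power over time and splitting at the prefill/decode boundary. A minimal sketch, assuming timestamped power readings and a known boundary time (the sample values below are made up for illustration):

```python
def phase_energy(samples, boundary_t):
    """Integrate power (W) over time (s) with the trapezoidal rule,
    splitting the total into prefill (t <= boundary) and decoding energy.

    samples: list of (timestamp_s, power_w) tuples, sorted by time.
    Returns (prefill_joules, decode_joules).
    """
    prefill = decode = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        e = 0.5 * (p0 + p1) * (t1 - t0)  # trapezoid area = joules
        if t1 <= boundary_t:
            prefill += e
        else:
            decode += e
    return prefill, decode

# 300 W constant draw for 2 s, with prefill ending at 0.5 s:
samples = [(0.0, 300.0), (0.5, 300.0), (1.0, 300.0), (2.0, 300.0)]
pre, dec = phase_energy(samples, boundary_t=0.5)
print(pre, dec)  # 150.0 450.0
```

The paper uses a dedicated high‑resolution power meter; the same integration applies to any sampled power source.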

Results & Findings

| Metric | 6‑7 B Models | 3‑4 B Models |
| --- | --- | --- |
| Prefill energy share | 15‑30 % of total inference energy | 10‑25 % |
| Decoding energy per token | 0.45 J/token (baseline) | 0.30 J/token (baseline) |
| Prefill‑decoding amplification | +1.3 % to +51.8 % per‑token cost | +3.2 % to +38.4 % |
| Babbling prevalence | 3 / 6 models exhibited babbling | 0 / 4 models |
| Energy saved by babbling suppression | 44‑89 % reduction in decoding energy | 48‑85 % (where applicable) |
| Impact on generation accuracy | No statistically significant drop (HumanEval pass@1 unchanged) | Same |
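Plugging in the table's headline figures makes the stakes concrete (the 500‑token completion length is an illustrative assumption, not a figure from the paper):

```python
# Worked example using the 6-7 B headline figures above.
baseline_j_per_token = 0.45   # decoding baseline, J/token
worst_amplification = 0.518   # heavy prefill adds up to 51.8 % per token

amplified = baseline_j_per_token * (1 + worst_amplification)
print(round(amplified, 3))    # 0.683 J/token in the worst case

# For an assumed 500-token completion, the reported 44-89 % decoding
# savings from babbling suppression translate to:
decode_energy = 500 * baseline_j_per_token          # ~225 J
low, high = decode_energy * 0.44, decode_energy * 0.89
print(round(low, 1), round(high, 2))                # 99.0 200.25
```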

Key takeaways

  • Decoding dominates the energy budget (≈70‑85 % of total).
  • A “heavy” prefill stage can make each subsequent token more expensive, likely because larger KV‑cache look‑ups stress memory bandwidth.
  • Babbling is not a rare edge case; when present, it inflates decoding time and power dramatically.
  • Simple early‑stop or token‑pruning heuristics can eliminate most of the waste without harming the functional output.
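One illustrative instantiation of such an early‑stop heuristic, assuming a repeated‑n‑gram check (the paper's criterion combines entropy and confidence; the n‑gram variant and its thresholds here are a simplification):

```python
def should_stop(tokens, ngram=4, max_repeats=3):
    """Early-stop criterion: halt decoding once the trailing n-gram has
    already appeared max_repeats times in the output -- a cheap proxy
    for the repetitive filler that drives babbling energy waste."""
    if len(tokens) < ngram:
        return False
    tail = tuple(tokens[-ngram:])
    count = sum(
        1
        for i in range(len(tokens) - ngram + 1)
        if tuple(tokens[i:i + ngram]) == tail
    )
    return count >= max_repeats

# A stream that has looped the same 4-gram three times triggers the stop;
# a non-repeating stream does not.
print(should_stop(["a", "b", "c", "d"] * 3))   # True
print(should_stop(["x", "y", "z", "w"]))       # False
```

In a real decoding loop this check would run after each generated token, so the generator stops as soon as the third repetition completes rather than after the full budget is spent.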

Practical Implications

| Audience | How to Apply the Findings |
| --- | --- |
| Tool developers (e.g., GitHub Copilot, Tabnine) | Integrate a lightweight babbling detector into the generation pipeline; stop decoding once confidence drops below a threshold. |
| Cloud AI service providers | Offer "green mode" APIs that cap prefill length or enforce KV‑cache size limits, reducing per‑request energy and cost. |
| DevOps / SRE teams | Monitor inference power per request; use the paper's profiling methodology to set alerts for abnormal energy spikes (possible babbling). |
| Hardware architects | Prioritize memory bandwidth and cache‑friendly KV‑cache designs, as prefill‑decoding coupling suggests memory efficiency directly impacts energy per token. |
| Open‑source model maintainers | Publish model cards that include prefill/decoding energy profiles; consider training regimes that discourage repetitive filler generation. |

Overall, the research gives a concrete, low‑overhead lever—babbling suppression—that can be dropped into existing inference stacks to achieve immediate sustainability gains. It also nudges the community toward more holistic energy‑aware benchmarking rather than focusing solely on latency or accuracy.

Limitations & Future Work

  • Hardware scope – Experiments were confined to a single GPU generation (A100). Energy dynamics may differ on edge devices, CPUs, or upcoming accelerator architectures.
  • Model diversity – Only transformer‑based LLMs in the 3‑7 B range were examined; larger models (e.g., 30 B+) or specialized code models (Codex, CodeLlama) could exhibit different prefill‑decoding relationships.
  • Babbling definition – The heuristic is based on token entropy and length; more nuanced semantic analysis (e.g., detecting meaningless comments) could improve detection precision.
  • User‑experience impact – While accuracy stayed stable in benchmark tests, real‑world developer workflows might be sensitive to early stopping or reduced verbosity. User studies are needed.

Future research directions suggested by the authors include extending the profiling framework to multi‑GPU and distributed inference setups, exploring training‑time interventions that reduce babbling propensity, and building standardized “green AI” benchmarks that combine energy, latency, and code‑quality metrics.

Authors

  • Lola Solovyeva
  • Fernando Castor

Paper Information

  • arXiv ID: 2602.05712v1
  • Categories: cs.SE, cs.AI
  • Published: February 5, 2026