[Paper] DeCEAT: Decoding Carbon Emissions for AI-driven Software Testing

Published: February 20, 2026 at 12:54 AM EST
4 min read
Source: arXiv - 2602.18012v1

Overview

The paper introduces DeCEAT, a framework that measures the carbon footprint and performance of small language models (SLMs) when they’re used to generate automated software tests. By combining energy‑tracking tools with classic test‑quality metrics, the authors show that sustainability isn’t a single number – it depends on model choice, prompt design, and the trade‑offs developers are willing to make.

Key Contributions

  • First systematic sustainability audit for SLM‑driven test generation – moves the focus beyond the usual large‑model analyses.
  • DeCEAT framework that couples CodeCarbon (energy & CO₂ tracking) with HumanEval‑based test coverage, all under reproducible hardware and runtime settings.
  • Prompt‑variant study using the Anthropic template to demonstrate how subtle changes in prompt wording affect both emissions and test quality.
  • Empirical trade‑off matrix that categorizes models by (i) low‑energy/fast, (ii) high‑stability, and (iii) high‑accuracy under carbon constraints.
  • Open‑source tooling and a reproducible benchmark suite, enabling other teams to plug in their own models or prompts.

Methodology

  1. Benchmark selection – The authors use the HumanEval suite (a collection of 164 Python coding problems) as a realistic proxy for test‑generation workloads.
  2. Model roster – Several publicly available SLMs (e.g., GPT‑2‑small, LLaMA‑7B, Anthropic‑Claude‑mini) are evaluated under identical hardware (single‑GPU, fixed batch size).
  3. Prompt engineering – Two prompt families are built from the Anthropic “template” style: a baseline prompt and three adaptive variants that tweak temperature, few‑shot examples, and instruction phrasing.
  4. Energy tracking – CodeCarbon runs alongside each inference session, logging power draw, runtime, and estimated CO₂ emissions (based on regional electricity mix).
  5. Quality assessment – Generated unit tests are run against the hidden solutions; coverage, pass‑rate, and flakiness are recorded.
  6. Time‑aware analysis – Results are plotted on a three‑axis chart (energy, latency, coverage) to surface multidimensional sustainability profiles.
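The emission estimate in step 4 boils down to simple arithmetic: energy in kWh (power × time) multiplied by the carbon intensity of the local grid. A minimal sketch of that calculation (the 475 gCO₂/kWh default is an illustrative global-average figure, not a value from the paper):

```python
def estimate_co2_grams(avg_power_watts, runtime_s, grid_gco2_per_kwh=475):
    """Estimate CO2 emissions the way energy trackers such as CodeCarbon do:
    convert average power draw and runtime into energy (kWh), then scale by
    the grid's carbon intensity (gCO2 per kWh)."""
    energy_kwh = avg_power_watts * (runtime_s / 3600) / 1000
    return energy_kwh * grid_gco2_per_kwh

# A GPU averaging 200 W for one hour on an average grid:
print(estimate_co2_grams(200, 3600))  # 95.0 grams of CO2
```

This also makes the "carbon factor granularity" limitation concrete: the same workload on a low-carbon grid (say, 50 gCO₂/kWh) emits roughly a tenth as much.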

Results & Findings

Model / Prompt            Avg. Energy (Wh)   Avg. Latency (s)   Test Coverage ↑   Stability (flaky % ↓)
GPT‑2‑small (baseline)    0.42               12.3               71%               8%
GPT‑2‑small (adaptive)    0.31               9.8                68%               5%
LLaMA‑7B (baseline)       0.78               21.5               84%               6%
LLaMA‑7B (adaptive)       0.65               19.2               82%               4%
Claude‑mini (baseline)    0.55               14.0               77%               7%
Claude‑mini (adaptive)    0.44               11.5               79%               5%
  • Energy vs. Accuracy trade‑off: Larger SLMs (LLaMA‑7B) achieve higher coverage but consume ~2× the energy of the smallest model.
  • Prompt impact: Adaptive prompts consistently shave 10‑15 % off both energy and latency while keeping coverage within a few points of the baseline.
  • Stability gains: Prompt tweaks that add clearer instruction boundaries reduce flaky test generation by up to 3 percentage points.
  • Multidimensional sustainability: No single model dominates across all axes; developers must prioritize based on project constraints (e.g., CI budget vs. test thoroughness).
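The "no single model dominates" finding can be made concrete with a small selection helper: given an energy budget per generation run, pick the eligible model with the best coverage. This is a hypothetical sketch (the helper is not from the paper), using the adaptive-prompt figures from the results table:

```python
# Adaptive-prompt results from the paper's table (energy in Wh, coverage as a fraction)
results = {
    "GPT-2-small (adaptive)": {"energy_wh": 0.31, "coverage": 0.68},
    "LLaMA-7B (adaptive)":    {"energy_wh": 0.65, "coverage": 0.82},
    "Claude-mini (adaptive)": {"energy_wh": 0.44, "coverage": 0.79},
}

def pick_model(results, energy_budget_wh):
    """Return the highest-coverage model within the energy budget, or None."""
    eligible = {m: r for m, r in results.items()
                if r["energy_wh"] <= energy_budget_wh}
    if not eligible:
        return None
    return max(eligible, key=lambda m: eligible[m]["coverage"])

print(pick_model(results, 0.5))  # Claude-mini (adaptive)
print(pick_model(results, 1.0))  # LLaMA-7B (adaptive)
```

Tightening the budget changes the winner, which is exactly the multidimensional trade-off the authors describe.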

Practical Implications

  • CI/CD cost budgeting: Teams can now estimate the carbon cost of adding AI‑generated tests to their pipelines and choose a model/prompt combo that fits a given carbon budget.
  • Prompt‑first optimization: Instead of swapping out models, a quick prompt rewrite can deliver measurable energy savings—useful for rapid prototyping.
  • Green‑by‑design testing tools: The DeCEAT framework can be integrated into test‑generation libraries (e.g., pytest-gen, codex-cli) to expose real‑time emission stats to developers.
  • Policy & reporting: Organizations seeking ESG compliance can leverage the provided metrics to report AI‑induced emissions for software quality activities.
  • Model selection guidance: For projects where test coverage is critical (e.g., safety‑critical code), a larger SLM may be justified; for routine codebases, a small SLM with adaptive prompts offers a low‑carbon alternative.
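For the CI/CD budgeting point, a back-of-the-envelope monthly estimate is enough to start: energy per generation run × runs per day × grid intensity. A minimal sketch (the 475 gCO₂/kWh grid factor and the run counts are illustrative assumptions, not figures from the paper):

```python
def ci_carbon_per_month(energy_per_run_wh, runs_per_day,
                        grid_gco2_per_kwh=475, days=30):
    """Rough monthly CO2 cost (grams) of AI test generation in a CI pipeline."""
    energy_kwh = energy_per_run_wh * runs_per_day * days / 1000
    return energy_kwh * grid_gco2_per_kwh

# 100 CI runs/day with LLaMA-7B (adaptive) at 0.65 Wh per run:
print(ci_carbon_per_month(0.65, 100))  # 926.25 grams of CO2 per month
```

Swapping in the 0.31 Wh figure for GPT‑2‑small (adaptive) more than halves the estimate, which is the kind of comparison a carbon budget would drive.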

Limitations & Future Work

  • Hardware specificity: Experiments were run on a single GPU type; results may shift on CPUs, TPUs, or newer accelerator generations.
  • Scope of benchmarks: HumanEval covers Python functions; other languages or larger codebases could exhibit different energy‑quality dynamics.
  • Carbon factor granularity: Emission estimates rely on average regional electricity mixes; real‑world data centers may have greener or dirtier grids.
  • Prompt space exploration: Only a handful of adaptive prompts were tested; automated prompt‑search (e.g., reinforcement learning) could uncover even better sustainability profiles.
  • Model ecosystem: The study focuses on a limited set of open SLMs; extending to proprietary models (e.g., OpenAI’s Codex) would broaden applicability.

DeCEAT opens the conversation about the hidden environmental cost of AI‑assisted testing and gives developers a concrete toolbox to make greener, data‑driven decisions.

Authors

  • Pragati Kumari
  • Novarun Deb

Paper Information

  • arXiv ID: 2602.18012v1
  • Categories: cs.SE
  • Published: February 20, 2026