[Paper] SWEnergy: An Empirical Study on Energy Efficiency in Agentic Issue Resolution Frameworks with SLMs

Published: December 10, 2025 at 06:28 AM EST
3 min read
Source: arXiv - 2512.09543v1

Overview

The paper SWEnergy investigates how well current autonomous‑agent frameworks for software‑issue resolution work when they are forced to use small language models (SLMs) instead of the massive, proprietary LLMs they were built for. By measuring energy use, runtime, token consumption, and memory on a standard benchmark, the authors reveal that many of these frameworks waste a lot of compute without actually solving problems.

Key Contributions

  • Empirical comparison of four popular agentic frameworks (SWE‑Agent, OpenHands, Mini SWE Agent, AutoCodeRover) when run with two SLMs (Gemma‑3 4B and Qwen‑3 1.7B).
  • Energy‑efficiency profiling on fixed hardware (energy, duration, token count, memory) across 150 runs per configuration.
  • Identification of the primary bottleneck: framework architecture drives energy consumption far more than the underlying model size.
  • Evidence of “wasted reasoning” – most energy is spent in unproductive loops, leading to near‑zero task‑completion rates.
  • Guidelines for low‑energy designs, suggesting a shift from passive orchestration to active management of SLM weaknesses.

Methodology

  1. Benchmark selection – The authors used the SWE‑bench Verified Mini suite, a curated set of realistic software‑bug‑fix and code‑generation tasks.
  2. Framework & model matrix – Each of the four frameworks was paired with each of the two SLMs, yielding eight configurations.
  3. Controlled environment – All experiments ran on identical hardware (CPU‑only, fixed RAM) to isolate software‑level differences.
  4. Instrumentation – Energy draw was captured via a power meter, while runtime, token usage, and memory footprints were logged automatically (a minimal profiling sketch follows this list).
  5. Repetition – 150 independent runs per configuration provided statistical power and mitigated stochastic variance.
  6. Success metric – A task was considered solved if the generated patch passed all verification tests in the benchmark.
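
As a rough illustration of the instrumentation in step 4, the sketch below wraps a single agent run and logs wall-clock time, CPU package energy, and peak memory on Linux. It substitutes the kernel's RAPL counter for the external power meter used in the paper, and the `run_agent.py` driver command and its flags are hypothetical placeholders, not any framework's real CLI.

```python
# Minimal per-run profiler sketch (not the paper's harness). It reads the Linux
# RAPL counter before and after one agent run; the study itself used an external
# power meter. Assumes /sys/class/powercap/intel-rapl:0/energy_uj is readable.
import json
import resource
import subprocess
import time

RAPL_PATH = "/sys/class/powercap/intel-rapl:0/energy_uj"  # cumulative microjoules

def read_energy_uj() -> int:
    with open(RAPL_PATH) as f:
        return int(f.read().strip())

def profile_run(cmd: list[str], log_path: str) -> dict:
    e0, t0 = read_energy_uj(), time.perf_counter()
    proc = subprocess.run(cmd, capture_output=True, text=True)
    t1, e1 = time.perf_counter(), read_energy_uj()
    stats = {
        "cmd": cmd,
        "duration_s": round(t1 - t0, 2),
        "energy_j": (e1 - e0) / 1e6,  # ignores counter wrap-around for brevity
        "peak_rss_mb": resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss / 1024,
        "exit_code": proc.returncode,
        "solved": proc.returncode == 0,  # stand-in for "patch passed all tests"
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(stats) + "\n")
    return stats

# 150 repetitions of one framework/model pairing, mirroring the study's design;
# run_agent.py and its flags are hypothetical placeholders for a real agent driver.
for i in range(150):
    profile_run(["python", "run_agent.py", "--model", "qwen3-1.7b", "--task", str(i)],
                "runs_qwen3.jsonl")
```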

Results & Findings

| Framework (SLM) | Avg. Energy (× baseline) | Success Rate | Main Observation |
| --- | --- | --- | --- |
| AutoCodeRover (Gemma‑3) | 9.4× | ≈0 % | Highest energy waste; many idle reasoning cycles. |
| SWE‑Agent (Qwen‑3) | 6.2× | ≈0 % | Energy dominated by repeated prompting. |
| Mini SWE Agent (Gemma‑3) | 4.8× | ≈0 % | Slightly better, but still inefficient. |
| OpenHands (Gemma‑3) | 1.0× (baseline) | ≈0 % | Lowest energy; still fails to solve tasks. |
  • Energy vs. Architecture: The same SLM consumed up to 9.4× more energy depending solely on the surrounding framework.
  • Success near zero: Regardless of energy spent, every configuration solved almost none of the tasks, confirming that SLM reasoning capacity, not just orchestration, limits success.
  • Token & Memory: Higher‑energy frameworks also generated more tokens and used more memory, reinforcing the “busy‑work” pattern.

Practical Implications

  • Don’t assume plug‑and‑play: Swapping a powerful LLM for an SLM in existing agentic pipelines can dramatically increase power bills while delivering no functional gain.
  • Framework choice matters: For edge devices or on‑premise CI/CD bots where energy is at a premium, lightweight orchestrators like OpenHands (or custom minimal loops) are preferable.
  • Design for SLM limits: Architects should embed active error detection, early termination, and fallback strategies (e.g., hybrid LLM calls) to avoid endless reasoning loops; a minimal sketch follows this list.
  • Cost‑aware CI: Teams can use the paper’s profiling methodology to benchmark their own agents, ensuring that any energy savings from smaller models aren’t offset by bloated orchestration.
  • Potential for hybrid solutions: A small model could handle cheap, repetitive tasks (e.g., linting, template generation) while a larger model is invoked only when the SLM signals uncertainty.
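
To make the "design for SLM limits" and hybrid-model points concrete, here is a minimal sketch of an agent loop with a hard step budget, crude repetition detection, and a one-time escalation to a larger fallback model. The `generate` callable, model names, and the `"submit_patch"` action are hypothetical placeholders, not part of any of the evaluated frameworks.

```python
# Sketch of "active management" of an SLM inside an agent loop: cap the step
# budget, treat repeated actions as a sign of an unproductive loop, and escalate
# once to a larger model instead of burning energy indefinitely. All names here
# (generate, model ids, "submit_patch") are illustrative placeholders.
from collections import Counter
from typing import Callable

MAX_STEPS = 20      # hard budget: terminate early rather than loop forever
REPEAT_LIMIT = 3    # the same action this many times => assume the SLM is stuck

def resolve_issue(task: str,
                  generate: Callable[[str, str, list[str]], str],
                  slm: str = "gemma-3-4b",
                  fallback: str = "larger-remote-model") -> list[str] | None:
    history: list[str] = []
    action_counts: Counter[str] = Counter()
    model = slm
    for _ in range(MAX_STEPS):
        action = generate(model, task, history)     # one reasoning/act step
        if action == "submit_patch":                # agent believes it is done
            history.append(action)
            return history
        action_counts[action] += 1
        if action_counts[action] >= REPEAT_LIMIT and model == slm:
            model = fallback                        # escalate once, then continue
            action_counts.clear()
            continue
        history.append(action)
    return None  # early termination: give up instead of wasting more energy
```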

Limitations & Future Work

  • Hardware scope: Experiments were limited to CPU‑only machines; GPU‑accelerated SLMs might exhibit different energy profiles.
  • Benchmark diversity: Only the SWE‑bench Verified Mini suite was used; broader software‑engineering tasks (e.g., documentation, design) remain untested.
  • Model selection: The study focused on two SLMs; other open‑source models (e.g., Llama‑3, Mistral‑7B) could behave differently.
  • Framework evolution: All four frameworks were evaluated in their current releases; future versions may incorporate SLM‑aware optimizations.

The authors suggest exploring adaptive orchestration, i.e., frameworks that monitor SLM confidence and dynamically switch to more capable models or terminate early, as a path from the observed energy waste toward tractable, low‑power issue resolution.
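
One possible confidence signal for such adaptive orchestration is the mean token log-probability of the SLM's latest response, which many serving stacks can return. The threshold below is an arbitrary illustrative value, not something the paper specifies.

```python
# Sketch of a confidence gate for adaptive orchestration: escalate or stop when
# the SLM's geometric-mean token probability falls below a (hypothetical) cut-off.
import math

CONFIDENCE_THRESHOLD = 0.6  # illustrative threshold on mean token probability

def should_escalate(token_logprobs: list[float]) -> bool:
    """True if the SLM's last response looks too uncertain to keep iterating."""
    if not token_logprobs:
        return True
    mean_prob = math.exp(sum(token_logprobs) / len(token_logprobs))
    return mean_prob < CONFIDENCE_THRESHOLD

print(should_escalate([-0.1, -0.2, -0.05]))  # confident response -> False
print(should_escalate([-1.5, -2.0, -1.8]))   # uncertain response -> True
```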

Authors

  • Arihant Tripathy
  • Ch Pavan Harshit
  • Karthik Vaidhyanathan

Paper Information

  • arXiv ID: 2512.09543v1
  • Categories: cs.SE, cs.AI
  • Published: December 10, 2025