[Paper] EET: Experience-Driven Early Termination for Cost-Efficient Software Engineering Agents
Source: arXiv - 2601.05777v1
Overview
Large‑language‑model (LLM)‑powered software‑engineering (SE) agents are becoming everyday tools for developers—automatically generating patches, triaging bugs, and suggesting refactorings. However, each API call can cost a few cents, and a single issue often requires many back‑and‑forth iterations, inflating the total expense. The paper “EET: Experience‑Driven Early Termination for Cost‑Efficient Software Engineering Agents” proposes a lightweight, data‑driven technique that cuts those costs dramatically while keeping the agents’ success rate virtually unchanged.
Key Contributions
- Experience‑driven early termination (EET): A framework that mines structured “experience” from previously solved issues and uses it to decide, mid‑generation, whether continuing an iteration is unlikely to improve the outcome.
- Cross‑agent applicability: Demonstrated on three distinct SE agents (a Codex‑based agent, a GPT‑4‑based agent, and a fine‑tuned CodeGen model) without requiring model retraining.
- Significant cost savings: Achieves a 19 %–55 % reduction in total monetary cost (average 32 %) on the SWE‑bench Verified benchmark, with at most a 0.2 % drop in issue‑resolution rate.
- Token‑level efficiency: Cuts API calls by 21 %, input tokens by 30 %, and output tokens by 25 % on average, directly translating to lower cloud‑provider bills.
- Open‑source release: All code, prompts, and the curated experience dataset are publicly available, enabling immediate adoption and further research.
Methodology
1. Collecting Experience:
- For each resolved issue, the system records a structured log: the sequence of prompts, model outputs, and the final verdict (whether the patch fixed the bug).
- These logs are abstracted into experience tuples (e.g., "patch‑generation step X with token count Y rarely leads to success in language Z"); a minimal sketch of such a tuple follows below.
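The summary does not fix an exact schema for these tuples, so the following is a minimal illustrative sketch; the field names (`step_kind`, `tokens_so_far`, `language`, `succeeded`) are assumptions for the example, not the authors' released format.

```python
# Hypothetical shape of an abstracted experience tuple; field names are
# illustrative assumptions, not the schema released with the paper.
from dataclasses import dataclass

@dataclass(frozen=True)
class ExperienceTuple:
    step_kind: str      # e.g. "patch_generation" or "test_run"
    step_index: int     # position of the step within the session
    tokens_so_far: int  # cumulative tokens spent when the step ran
    language: str       # language of the target repository
    succeeded: bool     # final verdict: did the session's patch fix the bug?

# One tuple per recorded step, extracted from a resolved-issue log:
example = ExperienceTuple("patch_generation", 7, 42_000, "python", False)
```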
2. Learning Early‑Termination Rules:
- A lightweight classifier (e.g., a decision tree) is trained on the experience tuples to predict the utility of continuing an iteration.
- The classifier operates on inexpensive features such as the number of generated tokens so far, similarity to previously successful patches, and confidence scores from the LLM; a training sketch follows below.
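As a concrete illustration of this step, the sketch below trains a decision tree with scikit-learn on the three features named above; the toy rows and labels are assumptions for the example, not the paper's released code or dataset.

```python
# Minimal sketch: train a decision tree to predict whether continuing an
# iteration is worthwhile. Feature columns mirror the ones named above;
# the toy training rows are illustrative assumptions.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Each row: [tokens generated so far, similarity to past successful patches,
#            LLM confidence score]
X = np.array([
    [1_200, 0.82, 0.91],   # early, promising step -> continuing paid off
    [9_500, 0.11, 0.35],   # long, unproductive loop -> continuing did not help
    [3_400, 0.67, 0.74],
    [12_000, 0.05, 0.22],
])
y = np.array([1, 0, 1, 0])  # 1 = continuing improved the outcome

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X, y)

# At runtime the agent asks for the probability that continuing helps:
p_continue = clf.predict_proba([[8_000, 0.15, 0.40]])[0, 1]
```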
3. Runtime Integration:
- During a new issue‑resolution session, the agent queries the classifier after each generation step.
- If the classifier signals a low probability of improvement, the session is terminated early, and the best‑so‑far patch is either accepted or discarded based on a simple quality check (see the control‑flow sketch below).
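Putting the pieces together, the runtime guard can be read as a small control-flow sketch. Everything below is an assumed reconstruction from the description above: `agent.step`, the returned features, `quality_check`, and the `THRESHOLD` value are all hypothetical, not names from the paper.

```python
# Assumed EET runtime guard; helper names and the threshold are hypothetical.
THRESHOLD = 0.2  # below this predicted probability of improvement, stop early

def resolve_issue(agent, issue, clf, max_steps=30):
    """Run the agent loop, consulting the classifier after every step."""
    best_patch = None
    for _ in range(max_steps):
        patch, features = agent.step(issue)  # one generation step + cheap features
        if patch is not None:
            best_patch = patch               # remember the best-so-far patch
        p_improve = clf.predict_proba([features])[0, 1]
        if p_improve < THRESHOLD:            # classifier signals low utility
            break                            # terminate the session early
    # Accept or discard the best-so-far patch via a simple quality check.
    return best_patch if best_patch is not None and quality_check(best_patch) else None

def quality_check(patch) -> bool:
    # Placeholder gate, e.g. "the patch applies cleanly and local tests pass".
    return bool(patch)
```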
4. Evaluation Setup:
- Experiments run on the SWE‑bench Verified suite (a collection of real‑world GitHub issues with ground‑truth patches).
- Three representative agents are evaluated: a baseline Codex‑style model, a GPT‑4‑style model, and a fine‑tuned CodeGen model.
- Metrics include total monetary cost (derived from token usage, as sketched below), resolution rate, number of API calls, and token counts.
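For concreteness, deriving the monetary-cost metric from token counts can look like the sketch below; the per-million-token prices are placeholder assumptions, not figures from the paper or any specific provider.

```python
# Cost derived from token usage; the prices are placeholder assumptions.
PRICE_IN_PER_M = 2.50    # $ per million input tokens (assumed)
PRICE_OUT_PER_M = 10.00  # $ per million output tokens (assumed)

def session_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens / 1e6) * PRICE_IN_PER_M \
         + (output_tokens / 1e6) * PRICE_OUT_PER_M

# Example: a session that consumed 400k input and 60k output tokens.
print(f"${session_cost(400_000, 60_000):.2f}")  # -> $1.60
```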
Results & Findings
| Metric | Baseline | EET‑enabled | Improvement |
|---|---|---|---|
| Total cost | 1.00× | 0.68× (average) | ‑32 % (range 19‑55 %) |
| Resolution rate | 71.3 % | 71.1 % | ‑0.2 % (negligible) |
| Early‑termination hits | — | 11 % of issues | — |
| API calls | 100 % | 79 % | ‑21 % |
| Input tokens | 100 % | 70 % | ‑30 % |
| Output tokens | 100 % | 75 % | ‑25 % |
Key Takeaways
- Cost savings are achieved by stopping unproductive loops early, not by compromising the quality of the final patch.
- The approach works uniformly across different LLM back‑ends, indicating that the experience‑driven signals are model‑agnostic.
- Even though early termination only triggers for roughly one‑tenth of the issues, the cumulative token reduction is substantial because those cases tend to be the most token‑heavy.
Practical Implications
- For DevOps and CI pipelines: Integrating EET can shrink the bill for automated code‑review bots or “AI‑pair‑programmer” services, making large‑scale roll‑outs financially viable.
- For SaaS providers of AI‑assisted debugging: Offering a cost‑transparent tier (e.g., "pay‑per‑issue") becomes easier when the backend can reliably deliver roughly 30 % lower token consumption per request.
- For open‑source contributors: The released experience dataset can be reused to bootstrap early‑termination heuristics for new languages or domain‑specific tools (e.g., security‑focused patch generation).
- For developers: Faster turnaround times—fewer API round‑trips mean lower latency, which translates to a smoother interactive experience when using AI‑driven IDE extensions.
Limitations & Future Work
- Experience bias: EET relies on historical issue logs; if the training data is skewed toward certain bug patterns, the classifier may prematurely abort novel but solvable cases.
- Granularity of early‑termination signals: The current rule set uses relatively simple features; richer semantic embeddings could capture subtler cues.
- Scalability to massive codebases: The study focuses on single‑file patches; extending the approach to multi‑module refactorings may require more sophisticated termination criteria.
- Future directions: The authors suggest (1) incorporating reinforcement‑learning loops to continuously refine termination policies, (2) exploring cross‑project transfer learning for experience sharing, and (3) evaluating EET in real‑time developer workflows beyond benchmark suites.
Authors
- Yaoqi Guo
- Ying Xiao
- Jie M. Zhang
- Mark Harman
- Yiling Lou
- Yang Liu
- Zhenpeng Chen
Paper Information
- arXiv ID: 2601.05777v1
- Categories: cs.SE
- Published: January 9, 2026