[Paper] EET: Experience-Driven Early Termination for Cost-Efficient Software Engineering Agents
Source: arXiv - 2601.05777v1
Overview
Large‑language‑model (LLM)‑powered software‑engineering (SE) agents are becoming everyday tools for developers—automatically generating patches, triaging bugs, and suggesting refactorings. However, each API call can cost a few cents, and a single issue often requires many back‑and‑forth iterations, inflating the total expense. The paper “EET: Experience‑Driven Early Termination for Cost‑Efficient Software Engineering Agents” proposes a lightweight, data‑driven technique that cuts those costs dramatically while keeping the agents’ success rate virtually unchanged.
Key Contributions
- Experience‑driven early termination (EET): A framework that mines structured “experience” from previously solved issues and uses it to decide, mid‑generation, whether continuing an iteration is unlikely to improve the outcome.
- Cross‑agent applicability: Demonstrated on three distinct SE agents (a Codex‑based agent, a GPT‑4‑based agent, and a fine‑tuned CodeGen model) without requiring model retraining.
- Significant cost savings: Achieves a 19 %–55 % reduction in total monetary cost (average 32 %) on the SWE‑bench Verified benchmark, with at most a 0.2 % drop in issue‑resolution rate.
- Token‑level efficiency: Cuts API calls by 21 %, input tokens by 30 %, and output tokens by 25 % on average, directly translating to lower cloud‑provider bills.
- Open‑source release: All code, prompts, and the curated experience dataset are publicly available, enabling immediate adoption and further research.
Methodology
1. Collecting Experience:
- For each resolved issue, the system records a structured log: the sequence of prompts, model outputs, and the final verdict (whether the patch fixed the bug).
- These logs are abstracted into experience tuples (e.g., "patch‑generation step X with token count Y rarely leads to success in language Z"); a minimal sketch of such a tuple follows below.
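The summary does not fix an exact schema for these tuples, so the following is a minimal illustrative sketch; the field names (`step_kind`, `tokens_so_far`, `language`, `succeeded`) are assumptions for the example, not the authors' released format.

```python
# Hypothetical shape of an abstracted experience tuple; field names are
# illustrative assumptions, not the schema released with the paper.
from dataclasses import dataclass

@dataclass(frozen=True)
class ExperienceTuple:
    step_kind: str      # e.g. "patch_generation" or "test_run"
    step_index: int     # position of the step within the session
    tokens_so_far: int  # cumulative tokens spent when the step ran
    language: str       # language of the target repository
    succeeded: bool     # final verdict: did the session's patch fix the bug?

# One tuple per recorded step, extracted from a resolved-issue log:
example = ExperienceTuple("patch_generation", 7, 42_000, "python", False)
```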
2. Learning Early‑Termination Rules:
- A lightweight classifier (e.g., a decision tree) is trained on the experience tuples to predict the utility of continuing an iteration.
- The classifier operates on inexpensive features such as the number of generated tokens so far, similarity to previously successful patches, and confidence scores from the LLM; a training sketch follows below.
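As a concrete illustration of this step, the sketch below trains a decision tree with scikit-learn on the three features named above; the toy rows and labels are assumptions for the example, not the paper's released code or dataset.

```python
# Minimal sketch: train a decision tree to predict whether continuing an
# iteration is worthwhile. Feature columns mirror the ones named above;
# the toy training rows are illustrative assumptions.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Each row: [tokens generated so far, similarity to past successful patches,
#            LLM confidence score]
X = np.array([
    [1_200, 0.82, 0.91],   # early, promising step -> continuing paid off
    [9_500, 0.11, 0.35],   # long, unproductive loop -> continuing did not help
    [3_400, 0.67, 0.74],
    [12_000, 0.05, 0.22],
])
y = np.array([1, 0, 1, 0])  # 1 = continuing improved the outcome

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X, y)

# At runtime the agent asks for the probability that continuing helps:
p_continue = clf.predict_proba([[8_000, 0.15, 0.40]])[0, 1]
```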
3. Runtime Integration:
- During a new issue‑resolution session, the agent queries the classifier after each generation step.
- If the classifier signals a low probability of improvement, the session is terminated early, and the best‑so‑far patch is either accepted or discarded based on a simple quality check (see the control‑flow sketch below).
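Putting the pieces together, the runtime guard can be read as a small control-flow sketch. Everything below is an assumed reconstruction from the description above: `agent.step`, the returned features, `quality_check`, and the `THRESHOLD` value are all hypothetical, not names from the paper.

```python
# Assumed EET runtime guard; helper names and the threshold are hypothetical.
THRESHOLD = 0.2  # below this predicted probability of improvement, stop early

def resolve_issue(agent, issue, clf, max_steps=30):
    """Run the agent loop, consulting the classifier after every step."""
    best_patch = None
    for _ in range(max_steps):
        patch, features = agent.step(issue)  # one generation step + cheap features
        if patch is not None:
            best_patch = patch               # remember the best-so-far patch
        p_improve = clf.predict_proba([features])[0, 1]
        if p_improve < THRESHOLD:            # classifier signals low utility
            break                            # terminate the session early
    # Accept or discard the best-so-far patch via a simple quality check.
    return best_patch if best_patch is not None and quality_check(best_patch) else None

def quality_check(patch) -> bool:
    # Placeholder gate, e.g. "the patch applies cleanly and local tests pass".
    return bool(patch)
```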
4. Evaluation Setup:
- Experiments run on the SWE‑bench Verified suite (a collection of real‑world GitHub issues with ground‑truth patches).
- Three representative agents are evaluated: a baseline Codex‑style model, a GPT‑4‑style model, and a fine‑tuned CodeGen model.
- Metrics include total monetary cost (derived from token usage, as sketched below), resolution rate, number of API calls, and token counts.
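For concreteness, deriving the monetary-cost metric from token counts can look like the sketch below; the per-million-token prices are placeholder assumptions, not figures from the paper or any specific provider.

```python
# Cost derived from token usage; the prices are placeholder assumptions.
PRICE_IN_PER_M = 2.50    # $ per million input tokens (assumed)
PRICE_OUT_PER_M = 10.00  # $ per million output tokens (assumed)

def session_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens / 1e6) * PRICE_IN_PER_M \
         + (output_tokens / 1e6) * PRICE_OUT_PER_M

# Example: a session that consumed 400k input and 60k output tokens.
print(f"${session_cost(400_000, 60_000):.2f}")  # -> $1.60
```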
Results & Findings
| Metric | Baseline | EET‑enabled | Improvement |
|---|---|---|---|
| Total cost | 1.00× | 0.68× (average) | ‑32 % (range 19‑55 %) |
| Resolution rate | 71.3 % | 71.1 % | ‑0.2 % (negligible) |
| Early‑termination hits | — | 11 % of issues | — |
| API calls | 100 % | 79 % | ‑21 % |
| Input tokens | 100 % | 70 % | ‑30 % |
| Output tokens | 100 % | 75 % | ‑25 % |
Key Takeaways
- Cost savings are achieved by stopping unproductive loops early, not by compromising the quality of the final patch.
- The approach works uniformly across different LLM back‑ends, indicating that the experience‑driven signals are model‑agnostic.
- Even though early termination only triggers for roughly one‑tenth of the issues, the cumulative token reduction is substantial because those cases tend to be the most token‑heavy.
Practical Implications
- For DevOps and CI pipelines: Integrating EET can shrink the bill for automated code‑review bots or “AI‑pair‑programmer” services, making large‑scale roll‑outs financially viable.
- For SaaS providers of AI‑assisted debugging: Offering a cost‑transparent tier (e.g., "pay‑per‑issue") becomes easier when the backend can reliably deliver roughly 30 % lower token consumption per request.
- For open‑source contributors: The released experience dataset can be reused to bootstrap early‑termination heuristics for new languages or domain‑specific tools (e.g., security‑focused patch generation).
- For developers: Faster turnaround times—fewer API round‑trips mean lower latency, which translates to a smoother interactive experience when using AI‑driven IDE extensions.
Limitations & Future Work
- Experience bias: EET relies on historical issue logs; if the training data is skewed toward certain bug patterns, the classifier may prematurely abort novel but solvable cases.
- Granularity of early‑termination signals: The current rule set uses relatively simple features; richer semantic embeddings could capture subtler cues.
- Scalability to massive codebases: The study focuses on single‑file patches; extending the approach to multi‑module refactorings may require more sophisticated termination criteria.
- Future directions: The authors suggest (1) incorporating reinforcement‑learning loops to continuously refine termination policies, (2) exploring cross‑project transfer learning for experience sharing, and (3) evaluating EET in real‑time developer workflows beyond benchmark suites.
Authors
- Yaoqi Guo
- Ying Xiao
- Jie M. Zhang
- Mark Harman
- Yiling Lou
- Yang Liu
- Zhenpeng Chen
Paper Information
- arXiv ID: 2601.05777v1
- Categories: cs.SE
- Published: January 9, 2026