[Paper] Coding Agents Don't Know When to Act
Source: arXiv - 2605.07769v1
Overview
Large language model (LLM)‑based coding agents are being rolled out to automatically triage bug reports and generate patches. But what happens when a reported issue has already been fixed? This paper introduces FixedBench, a benchmark that measures whether current agents can recognise “no‑action‑needed” situations and abstain from making unnecessary changes—a crucial ability to avoid needless technical debt.
Key Contributions
- FixedBench dataset: 200 real‑world bug‑report tasks manually verified to require no code change.
- Systematic evaluation of five recent LLMs (e.g., GPT‑4, Claude‑2, CodeLlama‑34B) across four agent harnesses (ReAct, Reflexion, Auto‑GPT, and a custom “patch‑only” harness).
- Quantitative evidence of an “action bias”: state‑of‑the‑art models propose spurious patches in 35‑65 % of cases.
- Analysis of instruction engineering: prompting agents to first reproduce the bug reduces false patches but creates a new failure mode, partial‑abstention, in which the agent declines to patch a bug that is only partially fixed and still needs a small change.
- Insight into training objectives: the tendency to act stems from implicit reward structures that favour any output over explicit inaction.
Methodology
- Task collection – The authors mined open‑source repositories for bug reports that were later closed as “already fixed” or “duplicate”. Each report was manually inspected to confirm that the current codebase already satisfied the reported behavior.
- Benchmark construction – For each of the 200 reports, they packaged the bug description, the relevant code snippet, and a “no‑change” ground truth.
- Agent harnesses – Four orchestration frameworks (ReAct, Reflexion, Auto‑GPT, and a custom “patch‑only” harness) were used to run each of the five LLMs.
- Prompt variants – Baseline prompts asked the model to generate a patch directly. A second variant added an explicit “reproduce the issue first” instruction.
- Evaluation metrics:
  - Inaction rate: proportion of runs where the agent correctly outputs “no change”.
  - False‑patch rate: proportion of runs where the agent proposes any code modification (excluding harmless test or doc updates).
  - Partial‑abstention: cases where the agent refuses to patch even though a minor change would still be needed (observed with the reproduce‑first prompt).
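The first two metrics can be sketched in a few lines of Python. This is a minimal illustration, not the authors' evaluation code: the `RunResult` schema, the path-based `is_harmless` heuristic, and the sample runs are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class RunResult:
    """Outcome of one agent run on a no-action-needed task (hypothetical schema)."""
    task_id: str
    changed_files: list[str] = field(default_factory=list)


def is_harmless(path: str) -> bool:
    """Assumed heuristic: test and documentation files do not count as patches."""
    lowered = path.lower()
    return lowered.startswith(("tests/", "docs/")) or lowered.endswith((".md", ".rst"))


def is_false_patch(run: RunResult) -> bool:
    """A run is a false patch if it modifies at least one non-harmless file."""
    return any(not is_harmless(p) for p in run.changed_files)


def inaction_rate(runs: list[RunResult]) -> float:
    """Share of runs in which the agent changed nothing at all."""
    return sum(1 for r in runs if not r.changed_files) / len(runs)


def false_patch_rate(runs: list[RunResult]) -> float:
    """Share of runs in which the agent modified non-test, non-doc files."""
    return sum(1 for r in runs if is_false_patch(r)) / len(runs)


# Made-up runs, for illustration only.
runs = [
    RunResult("task-1"),                                     # correct abstention
    RunResult("task-2", ["src/parser.py"]),                  # false patch
    RunResult("task-3", ["docs/usage.md"]),                  # harmless doc update
    RunResult("task-4", ["src/io.py", "tests/test_io.py"]),  # false patch
]
print(inaction_rate(runs))     # 0.25
print(false_patch_rate(runs))  # 0.5
```

Under these assumed definitions the two rates need not sum to one: a run that only touches documentation is neither a correct “no change” nor a false patch.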
Results & Findings
| Model | Harness | False‑patch rate (baseline) | False‑patch rate (reproduce‑first) |
|---|---|---|---|
| GPT‑4 | ReAct | 38 % | 22 % |
| Claude‑2 | Auto‑GPT | 45 % | 28 % |
| CodeLlama‑34B | Reflexion | 52 % | 31 % |
| … | … | … | … |
- Even the best‑performing configuration under the baseline prompt still generated an unnecessary patch in ~35 % of the “no‑action” cases.
- Adding a “reproduce first” instruction cuts the false‑patch rate by roughly 10‑15 %, but introduces partial‑abstention in ~12 % of tasks where a small fix would still be appropriate.
- Across all models, the dominant failure mode is action bias: the agent prefers to output something rather than explicitly stating “nothing to do”.
- Qualitative analysis shows agents often suggest cosmetic changes (e.g., formatting, comment updates) that technically modify the repository and could trigger CI pipelines or code‑review churn.
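A reduction in a rate can be read either as an absolute drop in percentage points or as a relative change against the baseline. The short sketch below computes both for the three table rows listed above; only the numbers come from the table, and the `reduction` helper is a name introduced here for illustration.

```python
# (model, harness) -> (baseline %, reproduce-first %), copied from the table rows shown.
rows = {
    ("GPT-4", "ReAct"): (38, 22),
    ("Claude-2", "Auto-GPT"): (45, 28),
    ("CodeLlama-34B", "Reflexion"): (52, 31),
}


def reduction(baseline: float, reproduce_first: float) -> tuple[float, float]:
    """Return (absolute drop in percentage points, relative drop in percent)."""
    absolute = baseline - reproduce_first
    relative = 100 * absolute / baseline
    return absolute, relative


for (model, harness), (base, repro) in rows.items():
    absolute, relative = reduction(base, repro)
    print(f"{model} / {harness}: {absolute:.0f} points absolute, {relative:.0f}% relative")

# GPT-4 / ReAct: 16 points absolute, 42% relative
# Claude-2 / Auto-GPT: 17 points absolute, 38% relative
# CodeLlama-34B / Reflexion: 21 points absolute, 40% relative
```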
Practical Implications
- CI/CD pipelines: Automated agents that indiscriminately submit patches can flood pull‑request queues, increase review load, and inflate technical debt.
- Developer trust: Frequent false positives erode confidence in AI‑assisted debugging tools, leading teams to disable or heavily gate them.
- Tool design: Harnesses should incorporate an explicit “no‑op” path (e.g., a “return‑no‑change” token) and treat it as a success criterion rather than a failure.
- Prompt engineering: Simple “reproduce‑first” instructions help but are not a silver bullet; developers may need to supply higher‑level policies (e.g., “only submit a patch if the test suite fails after reproduction”).
- Model training: Future LLM fine‑tuning should include negative examples where the correct answer is inaction, possibly via reinforcement learning with a reward that penalises unnecessary edits.
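One way to picture the explicit “no‑op” path combined with a “patch only if the reproduction fails” policy is the sketch below. It is an assumption-laden illustration rather than any harness's actual API: `Outcome`, `AgentDecision`, and `decide` are hypothetical names, and the reproduction result is taken as a precomputed boolean.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Outcome(Enum):
    PATCH = "patch"
    NO_CHANGE = "no_change"  # explicit no-op, meant to be scored as a success
    ESCALATE = "escalate"    # bug reproduces but the agent has no patch to offer


@dataclass
class AgentDecision:
    outcome: Outcome
    diff: Optional[str] = None
    reason: str = ""


def decide(reproduction_failed: bool, proposed_diff: Optional[str]) -> AgentDecision:
    """Gate patch submission on a failing reproduction (hypothetical policy).

    reproduction_failed is True when the reproduction test fails on the
    current codebase, i.e. the reported bug is still present.
    """
    if not reproduction_failed:
        # The issue no longer reproduces: drop any proposed diff and say so.
        return AgentDecision(Outcome.NO_CHANGE, reason="issue does not reproduce")
    if proposed_diff is None or not proposed_diff.strip():
        return AgentDecision(Outcome.ESCALATE, reason="issue reproduces, no patch produced")
    return AgentDecision(Outcome.PATCH, diff=proposed_diff, reason="issue reproduces")


decision = decide(reproduction_failed=False, proposed_diff="--- a/x.py\n+++ b/x.py\n")
print(decision.outcome.value, "-", decision.reason)  # no_change - issue does not reproduce
```

A gate like this inherits the partial‑abstention risk noted earlier: if the reproduction passes while a small fix is still needed, the policy returns `NO_CHANGE` anyway.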
Limitations & Future Work
- Dataset scope – FixedBench focuses on single‑file bug reports from open‑source projects; multi‑module or performance‑related bugs may behave differently.
- Agent diversity – Only five LLMs and four harnesses were tested; newer models or custom fine‑tuned agents could exhibit different biases.
- Evaluation granularity – The benchmark treats any code change (including harmless formatting) as a failure, which may overstate impact for some workflows.
- Future directions suggested by the authors include expanding FixedBench to cover larger codebases, integrating reinforcement learning from human feedback (RLHF) that rewards correct abstention, and designing harnesses that check a confidence threshold before committing a patch.
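The last two directions can be sketched as a reward that treats correct abstention as a positive outcome, plus a simple confidence gate. Everything here is illustrative: the function names, the reward values, the default threshold, and the idea that the agent exposes a scalar confidence are assumptions, not details from the paper.

```python
def abstention_aware_reward(
    fix_needed: bool,
    agent_patched: bool,
    patch_correct: bool = False,
    unnecessary_edit_penalty: float = 1.0,
) -> float:
    """Toy reward: correct inaction scores as well as a correct patch."""
    if not fix_needed:
        # No-action-needed task: reward abstention, penalise any edit.
        return -unnecessary_edit_penalty if agent_patched else 1.0
    if not agent_patched:
        # A fix was needed but the agent abstained (over-abstention).
        return -1.0
    return 1.0 if patch_correct else -1.0


def should_commit(confidence: float, threshold: float = 0.8) -> bool:
    """Commit a patch only when the (assumed) confidence clears the threshold."""
    return confidence >= threshold


print(abstention_aware_reward(fix_needed=False, agent_patched=False))  # 1.0
print(abstention_aware_reward(fix_needed=False, agent_patched=True))   # -1.0
print(should_commit(0.65))                                             # False
```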
Authors
- Thibaud Gloaguen
- Niels Mündler
- Mark Müller
- Veselin Raychev
- Martin Vechev
Paper Information
- arXiv ID: 2605.07769v1
- Categories: cs.SE
- Published: May 8, 2026