[Paper] Coding Agents Don't Know When to Act

Published: May 8, 2026 at 10:10 AM EDT

Source: arXiv - 2605.07769v1

Overview

Large language model (LLM)‑based coding agents are being rolled out to automatically triage bug reports and generate patches. But what happens when a reported issue has already been fixed? This paper introduces FixedBench, a benchmark that measures whether current agents can recognise “no‑action‑needed” situations and abstain from making unnecessary changes, a crucial ability for avoiding needless technical debt.

Key Contributions

FixedBench dataset: 200 real‑world bug‑report tasks manually verified to require no code change.
Systematic evaluation of five recent LLM coding agents (e.g., GPT‑4, Claude, CodeLlama) across four popular agent harnesses (Auto‑GPT, ReAct, etc.).
Quantitative evidence of an “action bias”: state‑of‑the‑art models propose spurious patches in 35‑65 % of cases.
Analysis of instruction engineering: prompting agents to first reproduce the bug reduces false patches but creates a new failure mode (over‑abstention when the bug is partially fixed).
Insight into training objectives: the tendency to act stems from implicit reward structures that favour any output over explicit inaction.

Methodology

Task collection – The authors mined open‑source repositories for bug reports that were later closed as “already fixed” or “duplicate”. Each report was manually inspected to confirm that the current codebase already exhibited the behavior the reporter expected, so no code change was required.
Benchmark construction – For each of the 200 reports, they packaged the bug description, the relevant code snippet, and a “no‑change” ground truth.
Agent harnesses – Four orchestration frameworks (ReAct, Reflexion, Auto‑GPT, and a custom “patch‑only” harness) were used to run each of the five LLMs.
Prompt variants – Baseline prompts asked the model to generate a patch directly. A second variant added an explicit “reproduce the issue first” instruction.
Evaluation metrics –
- Inaction rate: proportion of runs where the agent correctly outputs “no change”.
- False‑patch rate: proportion of runs where the agent proposes any code modification (excluding harmless test or doc updates).
- Partial‑abstention: cases where the agent refuses to patch even though a minor change would still be needed (observed with the reproduce‑first prompt).
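
The first two metrics above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' scoring code: the `Outcome` and `Run` types and both function names are hypothetical, and it assumes that test‑only or doc‑only edits are simply left out of the false‑patch numerator, which is one reading of “excluding harmless test or doc updates”.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    """Hypothetical labels for what an agent run produced."""
    NO_CHANGE = "no_change"          # agent explicitly reported nothing to do
    CODE_PATCH = "code_patch"        # agent modified source code
    TEST_OR_DOC_ONLY = "test_doc"    # agent touched only tests or docs


@dataclass(frozen=True)
class Run:
    task_id: str
    outcome: Outcome


def inaction_rate(runs: list[Run]) -> float:
    """Fraction of runs in which the agent output "no change"."""
    if not runs:
        raise ValueError("no runs to score")
    return sum(r.outcome is Outcome.NO_CHANGE for r in runs) / len(runs)


def false_patch_rate(runs: list[Run]) -> float:
    """Fraction of runs with a code modification.

    Assumption: test-only and doc-only edits are not counted as false patches.
    """
    if not runs:
        raise ValueError("no runs to score")
    return sum(r.outcome is Outcome.CODE_PATCH for r in runs) / len(runs)


# Made-up example data, not results from the paper.
example_runs = [
    Run("task-1", Outcome.NO_CHANGE),
    Run("task-2", Outcome.NO_CHANGE),
    Run("task-3", Outcome.CODE_PATCH),
    Run("task-4", Outcome.TEST_OR_DOC_ONLY),
]
print(inaction_rate(example_runs))     # 0.5
print(false_patch_rate(example_runs))  # 0.25
```

Under this reading the two rates need not sum to one, because a run that only touches tests or docs counts toward neither.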

Results & Findings

Model	Harness	False‑patch rate (baseline)	False‑patch rate (reproduce‑first)
GPT‑4	ReAct	38 %	22 %
Claude‑2	Auto‑GPT	45 %	28 %
CodeLlama‑34B	Reflexion	52 %	31 %
…	…	…	…

Even the best‑performing configuration still generated an unnecessary patch in ~35 % of the “no‑action” cases.
Adding a “reproduce first” instruction cuts the false‑patch rate by roughly 10‑15 %, but introduces partial‑abstention in ~12 % of tasks where a small fix would still be appropriate.
Across all models, the dominant failure mode is action bias: the agent prefers to output something rather than explicitly stating “nothing to do”.
Qualitative analysis shows agents often suggest cosmetic changes (e.g., formatting, comment updates) that technically modify the repository and could trigger CI pipelines or code‑review churn.
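
To make the idea of a cosmetic change concrete, here is a minimal sketch of one way a harness could flag edits that alter a file's text but not its parsed structure. It is not taken from the paper: the function name `is_cosmetic_python_edit` is hypothetical, it only handles Python sources, and it relies on the standard‑library `ast` module, whose parser discards comments and whitespace.

```python
import ast


def is_cosmetic_python_edit(before: str, after: str) -> bool:
    """Return True if the text changed but the syntax tree did not.

    The parser drops comments and whitespace, so edits confined to them
    leave the AST dump identical. Docstring edits do change the tree and
    are therefore reported as non-cosmetic by this check.
    """
    if before == after:
        return False  # nothing changed at all
    try:
        # ast.dump omits line/column attributes by default, so pure
        # reformatting yields the same string.
        return ast.dump(ast.parse(before)) == ast.dump(ast.parse(after))
    except SyntaxError:
        return False  # unparsable code is treated as a real change


print(is_cosmetic_python_edit("x = 1\n", "x  =  1  # set x\n"))  # True
print(is_cosmetic_python_edit("x = 1\n", "x = 2\n"))             # False
```

A check like this only identifies the cosmetic edit; whether to count it as a failure, as FixedBench does, or to silently drop it is a separate policy decision.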

Practical Implications

CI/CD pipelines: Automated agents that indiscriminately submit patches can flood pull‑request queues, increase review load, and inflate technical debt.
Developer trust: Frequent false positives erode confidence in AI‑assisted debugging tools, leading teams to disable or heavily gate them.
Tool design: Harnesses should incorporate an explicit “no‑op” path (e.g., a “return‑no‑change” token) and treat it as a success criterion rather than a failure.
Prompt engineering: Simple “reproduce‑first” instructions help but are not a silver bullet; developers may need to supply higher‑level policies (e.g., “only submit a patch if the test suite fails after reproduction”).
Model training: Future LLM fine‑tuning should include negative examples where the correct answer is inaction, possibly via reinforcement learning with a reward that penalises unnecessary edits.
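
The tool‑design and prompt‑policy points above can be combined into a small sketch: a result type with an explicit no‑op path, a gate that only requests a patch when the reported failure still reproduces, and a scorer that counts a correct no‑op as a success. All names here (`AgentResult`, `decide`, `is_success`) are hypothetical, and the `reproduce` and `propose_patch` callables stand in for whatever the real harness does. The scorer also ignores whether a proposed patch is actually correct.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class AgentResult:
    """Hypothetical harness result; patch is None on the explicit no-op path."""
    patch: Optional[str]
    reason: str


def decide(
    reproduce: Callable[[], bool],
    propose_patch: Callable[[], str],
) -> AgentResult:
    """Only ask for a patch if the reproduction step still fails."""
    issue_reproduces = reproduce()  # True means the reported failure still occurs
    if not issue_reproduces:
        return AgentResult(patch=None, reason="issue does not reproduce")
    return AgentResult(patch=propose_patch(), reason="issue reproduced")


def is_success(result: AgentResult, change_needed: bool) -> bool:
    """Score a no-op as success when no change was needed, and vice versa."""
    return (result.patch is None) == (not change_needed)


# Already-fixed issue: the reproduction passes, so no patch is requested.
result = decide(reproduce=lambda: False, propose_patch=lambda: "diff --git ...")
print(result.patch is None)                    # True
print(is_success(result, change_needed=False)) # True
```

Because the gate rests entirely on the reproduction step, it inherits the partial‑abstention risk noted in the findings: if the reproduction misses a remaining defect, the agent will abstain when a small fix was still needed.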

Limitations & Future Work

Dataset scope – FixedBench focuses on single‑file bug reports from open‑source projects; multi‑module or performance‑related bugs may behave differently.
Agent diversity – Only five LLMs and four harnesses were tested; newer models or custom fine‑tuned agents could exhibit different biases.
Evaluation granularity – The benchmark treats any code change (including harmless formatting) as a failure, which may overstate impact for some workflows.
Future directions suggested by the authors include: expanding FixedBench to cover larger codebases, integrating reinforcement learning from human feedback (RLHF) that rewards correct abstention, and designing harnesses that check a confidence threshold before committing a patch.

Authors

Thibaud Gloaguen
Niels Mündler
Mark Müller
Veselin Raychev
Martin Vechev

Paper Information

arXiv ID: 2605.07769v1
Categories: cs.SE
Published: May 8, 2026
