[Paper] To What Extent Does Agent-generated Code Require Maintenance? An Empirical Study

Published: May 7, 2026 at 11:52 AM EDT

Source: arXiv - 2605.06464v1

Overview

The paper “To What Extent Does Agent‑generated Code Require Maintenance? An Empirical Study” investigates a practical question that’s buzzing in every dev shop that has tried LLM‑powered coding assistants: once an AI writes a file, how much work does that file actually need later? By mining thousands of pull requests from real‑world open‑source projects, the authors compare the upkeep of AI‑generated code with that of human‑written code, shedding light on the hidden maintenance costs of today’s autonomous coding agents.

Key Contributions

Large‑scale empirical dataset: Analyzed >1,000 AI‑generated files and ~3,200 change events across 100 popular GitHub repositories (the AIDev dataset).
Maintenance frequency insight: AI‑generated files are updated about a third as often as human‑authored files (~0.3 vs. ~0.9 updates per file per month), and each update touches only ~2 % of the file’s lines.
Modification type breakdown: The dominant change to AI code is feature extension (adding new functionality), whereas human‑written code is mainly bug‑fix oriented.
Human involvement quantification: Over 90 % of the maintenance work on AI‑generated files is performed by human developers, not the agents themselves.
Open‑source reproducibility: The authors release their data extraction scripts and annotated dataset for the community to build upon.

Methodology

Data collection – The researchers built the AIDev dataset by scanning GitHub for pull requests that explicitly label AI assistance (e.g., “generated by ChatGPT”, “Copilot suggestion”). They then paired each AI‑generated file with its human‑authored counterpart in the same repository.
Change extraction – For every file, they tracked the full commit history for one year after the initial AI‑generated commit, extracting line‑level diffs and classifying each change.
Classification of edits – Using a lightweight taxonomy (feature extension, bug fix, refactor, documentation, style, etc.), two annotators manually labeled a random sample of changes; a trained classifier then applied the labels to the full set.
Statistical analysis – They compared maintenance frequency (updates per month), size of change (percentage of lines touched), and actor (human vs. AI) across the two groups, employing non‑parametric tests to account for skewed distributions.
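
The change‑size metric and the non‑parametric comparison described above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the function names `fraction_lines_touched` and `mann_whitney_u` are hypothetical, the exact definition of "lines touched" is an assumption, and the summary does not say which non‑parametric test the paper uses (a Mann‑Whitney U statistic is assumed here as a common choice for skewed data). The sample values are invented for the example.

```python
import difflib


def fraction_lines_touched(old_text: str, new_text: str) -> float:
    """Lines touched by an edit, as a fraction of the old file's length.

    Each non-matching region counts the larger of its old-side and
    new-side line counts, so pure insertions still register as a change.
    (Assumed definition; the paper may measure this differently.)
    """
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    touched = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            touched += max(i2 - i1, j2 - j1)
    # Guard against division by zero for an empty original file.
    return touched / max(len(old_lines), 1)


def mann_whitney_u(xs, ys):
    """U statistic for xs versus ys: pairs where x > y, ties count 0.5."""
    u = 0.0
    for x in xs:
        for y in ys:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u


# Invented example data: updates per month for a few files in each group.
ai_updates = [0.2, 0.3, 0.4]
human_updates = [0.8, 0.9, 1.0]

print(fraction_lines_touched("a\nb\nc\nd", "a\nB\nc\nd"))
print(mann_whitney_u(ai_updates, human_updates))
```

A U value near 0 (or near `len(xs) * len(ys)`) means the two samples barely overlap. In practice a library routine such as `scipy.stats.mannwhitneyu` would also supply a p‑value.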

Results & Findings

Aspect	AI‑generated code	Human‑authored code
Update frequency	~0.3 updates/file/month	~0.9 updates/file/month
Proportion of file changed per update	~2 % of lines	~7 % of lines
Most common edit type	Feature extensions (≈55 %)	Bug fixes (≈48 %)
Who does the maintenance?	Humans perform ≈94 % of edits; agents <6 %	Humans perform ≈98 % (baseline)
Time to first maintenance	Median 45 days after creation	Median 18 days after creation

Interpretation: AI‑generated files tend to sit idle longer and receive only small, additive tweaks. The agents rarely come back to “clean up” or fix bugs; developers are still the primary caretakers.
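
To put the table's rates side by side, the short sketch below converts them into per‑year figures and ratios. The inputs are the approximate values from the table above; the helper names are only for illustration, and the "yearly churn" product assumes, as a simplification not made in the paper summary, that every update touches the average share of lines.

```python
# Approximate figures taken from the results table above.
ai = {"updates_per_month": 0.3, "share_of_lines_per_update": 0.02}
human = {"updates_per_month": 0.9, "share_of_lines_per_update": 0.07}


def yearly_updates(group):
    """Expected number of updates per file over twelve months."""
    return group["updates_per_month"] * 12


def yearly_churn(group):
    """Rough share of a file's lines rewritten per year.

    Simplifying assumption: each update touches the average share of lines.
    """
    return yearly_updates(group) * group["share_of_lines_per_update"]


update_ratio = yearly_updates(human) / yearly_updates(ai)
churn_ratio = yearly_churn(human) / yearly_churn(ai)

print(f"AI: {yearly_updates(ai):.1f} updates/year, churn {yearly_churn(ai):.1%}")
print(f"Human: {yearly_updates(human):.1f} updates/year, churn {yearly_churn(human):.1%}")
print(f"Human/AI update ratio: {update_ratio:.1f}, churn ratio: {churn_ratio:.1f}")
```

Under these assumptions, a human‑authored file sees about three times as many updates per year and roughly ten times the line churn of an AI‑generated one.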

Practical Implications

Tooling strategy: Teams should treat AI‑generated snippets as starting points rather than finished components, and should expect to allocate human time for later feature integration and bug resolution.
Code review focus: Since AI code is less likely to be bug‑fixed later, reviewers must be extra vigilant for hidden defects at the moment of acceptance.
Maintenance budgeting: Project managers can factor in a lower ongoing maintenance load for AI‑generated modules, but must budget for the initial integration effort and future feature extensions.
Agent design: Developers of coding assistants might prioritize generating more robust, testable code (e.g., by including unit tests) to reduce the need for later bug fixes that currently fall on humans.
Policy & compliance: The fact that AI code rarely gets revisited could raise concerns for security‑critical systems; organizations may need policies mandating periodic audits of AI‑authored assets.
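
A periodic audit policy like the one suggested above could start from something as simple as flagging AI‑authored files that nobody has touched for a while. The sketch below is one hypothetical way to do that: the function name `files_due_for_audit`, the 180‑day default, and the example paths are all assumptions for illustration, not recommendations from the paper.

```python
from datetime import date, timedelta


def files_due_for_audit(last_touched, today, max_age_days=180):
    """Return paths whose last modification is older than the cutoff.

    last_touched maps a file path to the date of its most recent change.
    The 180-day default is an arbitrary placeholder threshold.
    """
    cutoff = today - timedelta(days=max_age_days)
    return sorted(path for path, touched in last_touched.items() if touched < cutoff)


# Hypothetical inventory of AI-authored files and their last change dates.
inventory = {
    "src/parser.py": date(2026, 1, 1),
    "src/cache.py": date(2026, 5, 1),
}

print(files_due_for_audit(inventory, today=date(2026, 8, 1)))
```

The last‑change dates would have to come from the repository history, for example from `git log -1 --format=%cI -- <path>` for each file.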

Limitations & Future Work

Dataset bias: The study only covers popular open‑source projects that label AI usage, possibly overlooking private or unlabeled AI contributions.
Short observation window: One‑year tracking may miss long‑term maintenance patterns that emerge after a project matures.
Granularity of edit taxonomy: Some nuanced changes (e.g., performance tuning) may be mis‑classified under broader categories.
Future directions: Possible next steps include extending the analysis to enterprise codebases, exploring the impact of different LLMs, and investigating how automated test generation influences downstream maintenance.

Authors

Shota Sawada
Tatsuya Shirai
Yutaro Kashiwa
Ken’ichi Yamaguchi
Hiroshi Iwata
Hajimu Iida

Paper Information

arXiv ID: 2605.06464v1
Categories: cs.SE
Published: May 7, 2026
