[Paper] To What Extent Does Agent-generated Code Require Maintenance? An Empirical Study
Source: arXiv - 2605.06464v1
Overview
The paper “To What Extent Does Agent‑generated Code Require Maintenance? An Empirical Study” investigates a practical question for any team that has adopted LLM‑powered coding assistants: once an AI agent writes a file, how much work does that file need later? By mining thousands of pull requests from real‑world open‑source projects, the authors compare the upkeep of AI‑generated code with that of human‑written code, shedding light on the maintenance costs of today’s autonomous coding agents.
Key Contributions
- Large‑scale empirical dataset: Analyzed >1,000 AI‑generated files and ~3,200 change events across 100 popular GitHub repositories (the AIDev dataset).
- Maintenance frequency insight: AI‑generated files are updated about a third as often as human‑authored files (~0.3 vs. ~0.9 updates per file per month), and each update touches only about 2 % of the file’s lines.
- Modification type breakdown: The dominant change to AI code is feature extension (adding new functionality), whereas human‑written code is mainly bug‑fix oriented.
- Human involvement quantification: Over 90 % of the maintenance work on AI‑generated files is performed by human developers, not the agents themselves.
- Open‑source reproducibility: The authors release their data extraction scripts and annotated dataset for the community to build upon.
Methodology
- Data collection – The researchers built the AIDev dataset by scanning GitHub for pull requests that explicitly label AI assistance (e.g., “generated by ChatGPT”, “Copilot suggestion”). They then paired each AI‑generated file with a human‑authored counterpart from the same repository.
- Change extraction – For every file, they tracked the full commit history for one year after the initial AI‑generated commit, extracting line‑level diffs and classifying each change.
- Classification of edits – Using a lightweight taxonomy (feature extension, bug fix, refactor, documentation, style, etc.), two annotators manually labeled a random sample of changes; a trained classifier then applied the labels to the full set.
- Statistical analysis – They compared maintenance frequency (updates per month), size of change (percentage of lines touched), and actor (human vs. AI) across the two groups, employing non‑parametric tests to account for skewed distributions.
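The three comparison metrics and a non‑parametric comparison can be sketched as follows. This is a minimal illustration, not the authors' scripts: the `ChangeEvent` record, its field names, the toy data, and the choice of the Mann‑Whitney U statistic are all assumptions, since the summary does not name the specific test the paper uses.

```python
from dataclasses import dataclass
from statistics import median

WINDOW_MONTHS = 12  # one-year tracking window described in the methodology


@dataclass(frozen=True)
class ChangeEvent:
    """Hypothetical record for one change to one tracked file."""
    path: str
    origin: str         # "ai" or "human": who wrote the file initially
    editor: str         # "ai" or "human": who made this change
    lines_touched: int
    file_lines: int


def pct_touched(events, origin):
    """Percentage of the file's lines touched, one value per change event."""
    return [100.0 * e.lines_touched / e.file_lines
            for e in events if e.origin == origin]


def group_metrics(events, origin, n_files):
    """Frequency, change size, and actor share for one group of files.

    n_files is passed in explicitly because files with no changes at all
    never show up in the event list but still count in the denominator.
    """
    group = [e for e in events if e.origin == origin]
    return {
        "updates_per_file_month": len(group) / (n_files * WINDOW_MONTHS),
        "median_pct_lines_touched": median(pct_touched(events, origin)),
        "human_edit_share": sum(e.editor == "human" for e in group) / len(group),
    }


def mann_whitney_u(xs, ys):
    """U statistic for xs: number of (x, y) pairs with x > y, ties count 0.5."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0
               for x in xs for y in ys)


# Toy data, invented purely for illustration.
events = [
    ChangeEvent("a.py", "ai", "human", 2, 100),
    ChangeEvent("a.py", "ai", "ai", 1, 100),
    ChangeEvent("b.py", "ai", "human", 3, 100),
    ChangeEvent("c.py", "human", "human", 7, 100),
    ChangeEvent("c.py", "human", "human", 10, 100),
    ChangeEvent("d.py", "human", "human", 5, 50),
    ChangeEvent("d.py", "human", "human", 4, 100),
]

ai_metrics = group_metrics(events, "ai", n_files=2)
human_metrics = group_metrics(events, "human", n_files=2)
u_ai = mann_whitney_u(pct_touched(events, "ai"), pct_touched(events, "human"))

print(ai_metrics)
print(human_metrics)
print("U (AI vs. human change size):", u_ai)
```

A rank‑based statistic such as U depends only on the ordering of the values, which is why it suits skewed distributions like change sizes. Turning U into a p‑value would need a statistics library or the exact distribution, which this sketch leaves out.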
Results & Findings
| Aspect | AI‑generated code | Human‑authored code |
|---|---|---|
| Update frequency | ~0.3 updates/file/month | ~0.9 updates/file/month |
| Proportion of file changed per update | ~2 % of lines | ~7 % of lines |
| Most common edit type | Feature extensions (≈55 %) | Bug fixes (≈48 %) |
| Who does the maintenance? | Humans perform ≈94 % of edits; agents <6 % | Humans perform ≈98 % (baseline) |
| Time to first maintenance | Median 45 days after creation | Median 18 days after creation |
Interpretation: AI‑generated files tend to sit idle longer and receive only small, additive tweaks. The agents rarely come back to “clean up” or fix bugs; developers are still the primary caretakers.
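To put the table in budgeting terms, the two headline rates can be combined in a quick back‑of‑envelope calculation. The figures are the approximate values from the table; multiplying update frequency by the share of lines changed is a simplifying assumption made here (it treats the per‑update share as a typical value for every update) and is not a number reported by the paper.

```python
# Approximate figures copied from the results table above.
ai = {"updates_per_month": 0.3, "share_changed": 0.02}
human = {"updates_per_month": 0.9, "share_changed": 0.07}


def yearly_updates(group):
    """Expected number of updates per file over twelve months."""
    return group["updates_per_month"] * 12


def monthly_churn(group):
    """Rough share of a file's lines rewritten per month (assumed product)."""
    return group["updates_per_month"] * group["share_changed"]


ai_yearly = yearly_updates(ai)
human_yearly = yearly_updates(human)
churn_ratio = monthly_churn(human) / monthly_churn(ai)

print(f"updates per file per year: AI {ai_yearly:.1f}, human {human_yearly:.1f}")
print(f"monthly churn: AI {monthly_churn(ai):.2%}, human {monthly_churn(human):.2%}")
print(f"human-to-AI churn ratio: {churn_ratio:.1f}")
```

Under these assumptions an AI‑generated file sees roughly 3.6 updates a year against 10.8 for a human‑authored one, and about a tenth of the line churn. The estimate says nothing about how much effort each individual update takes.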
Practical Implications
- Tooling strategy: Teams should treat AI‑generated snippets as starting points rather than finished components. Expect to allocate human time for later feature integration and bug‑resolution.
- Code review focus: Because AI‑generated code is less likely to receive bug fixes later, reviewers need to be especially vigilant for hidden defects at the moment of acceptance.
- Maintenance budgeting: Project managers can factor in a lower ongoing maintenance load for AI‑generated modules, but must budget for the initial integration effort and future feature extensions.
- Agent design: Developers of coding assistants might prioritize generating more robust, testable code (e.g., include unit tests) to reduce the need for later bug‑fixes that currently fall on humans.
- Policy & compliance: The fact that AI code rarely gets revisited could raise concerns for security‑critical systems; organizations may need policies mandating periodic audits of AI‑authored assets.
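A periodic audit policy of the kind suggested in the last point could start from a simple staleness check. The sketch below is hypothetical: the `overdue_for_audit` helper, the 180‑day threshold, and the file paths are illustrative assumptions, and a real setup would read last‑modified dates from version‑control history rather than from a hand‑written mapping.

```python
from datetime import date, timedelta


def overdue_for_audit(last_touched, today, max_age_days=180):
    """Return AI-authored paths whose last change is older than the threshold.

    last_touched maps a file path to the date of its most recent commit.
    The 180-day default is an arbitrary example, not a recommendation
    taken from the paper.
    """
    cutoff = today - timedelta(days=max_age_days)
    return sorted(path for path, changed in last_touched.items()
                  if changed < cutoff)


# Invented example data: AI-authored files and their last commit dates.
ai_files = {
    "src/parser.py": date(2025, 1, 10),
    "src/cache.py": date(2025, 9, 2),
    "src/auth.py": date(2024, 11, 30),
}

stale = overdue_for_audit(ai_files, today=date(2025, 10, 1))
print(stale)
```

Files that turn up in the list are candidates for a manual review, which fits the finding that humans, not agents, carry out almost all follow‑up maintenance.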
Limitations & Future Work
- Dataset bias: The study only covers popular open‑source projects that label AI usage, possibly overlooking private or unlabeled AI contributions.
- Short observation window: One‑year tracking may miss long‑term maintenance patterns that emerge after a project matures.
- Granularity of edit taxonomy: Some nuanced changes (e.g., performance tuning) may be mis‑classified under broader categories.
- Future directions: Extending the analysis to enterprise codebases, exploring the impact of different LLMs, and investigating how automated test generation influences downstream maintenance.
Authors
- Shota Sawada
- Tatsuya Shirai
- Yutaro Kashiwa
- Ken’ichi Yamaguchi
- Hiroshi Iwata
- Hajimu Iida
Paper Information
- arXiv ID: 2605.06464v1
- Categories: cs.SE
- Published: May 7, 2026