[Paper] How Do Agents Perform Code Optimization? An Empirical Study

Published: December 25, 2025
4 min read
Source: arXiv - 2512.21757v1

Overview

Performance optimization is a perennial pain point for developers, and the rise of AI coding assistants promises to ease the burden. This paper delivers the first large‑scale, data‑driven comparison of how AI agents and human engineers tackle real‑world performance‑boosting pull requests (PRs). By mining 324 AI‑generated and 83 human‑written PRs from the AIDev dataset, the authors shed light on adoption patterns, code quality, optimization tactics, and validation practices—offering a reality check on the current state of “agentic” code optimization.

Key Contributions

  • Empirical benchmark of AI‑generated vs. human‑authored performance‑optimization PRs across 407 real‑world commits.
  • Quantitative analysis of adoption rates, maintainability metrics, and the prevalence of different optimization patterns (e.g., algorithmic swaps, data‑structure changes, caching).
  • Validation gap discovery: AI PRs include explicit performance tests in only 45.7% of cases, versus 63.6% for humans (statistically significant, p = 0.007).
  • Pattern similarity finding: despite the validation gap, AI agents largely mimic the same optimization idioms that human developers use.
  • Actionable discussion of current limitations and research directions for more reliable, self‑validating AI code optimizers.

Methodology

  1. Dataset construction – The authors leveraged the publicly available AIDev repository, extracting PRs labeled as “performance” and separating them by author type (AI agent vs. human).
  2. Manual labeling & verification – Each PR was inspected to confirm that the change was genuinely performance‑focused and to record the validation approach (benchmark, profiling, or none).
  3. Metric extraction – For every PR, they measured:
    • Adoption: whether the PR was merged.
    • Maintainability: cyclomatic complexity, lines added/removed, and code churn.
    • Optimization patterns: categorized into algorithmic, data‑structure, caching, parallelism, etc.
  4. Statistical analysis – Chi‑square and Mann‑Whitney U tests evaluated differences between AI and human groups, with a significance threshold of p < 0.05.

The pipeline is deliberately lightweight so that developers can reproduce or extend the study on their own codebases.
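To make the statistical step concrete, here is a minimal sketch of how the chi‑square and Mann‑Whitney U comparisons could be reproduced with SciPy. The counts and complexity deltas below are hypothetical placeholders rather than the paper's raw data; substitute the values mined from your own PR set.

```python
# Minimal sketch of the study's statistical comparison (step 4), not the
# authors' code. All numbers below are hypothetical placeholders.
from scipy.stats import chi2_contingency, mannwhitneyu

# Contingency table: rows = author type, columns = (explicit validation, none).
validation_counts = [
    [148, 176],  # AI-generated PRs (placeholder split)
    [53, 30],    # Human-authored PRs (placeholder split)
]
chi2, p_val, dof, _expected = chi2_contingency(validation_counts)
print(f"Validation gap: chi2 = {chi2:.2f}, p = {p_val:.3f}")

# Per-PR cyclomatic-complexity deltas (placeholder samples).
ai_deltas = [0.0, 1.0, 2.0, 0.5, 1.5, 1.0]
human_deltas = [0.0, 0.5, 1.0, 0.5, 0.0, 0.5]
u_stat, p_val = mannwhitneyu(ai_deltas, human_deltas, alternative="two-sided")
print(f"Complexity difference: U = {u_stat:.1f}, p = {p_val:.3f}")
```

A result with p < 0.05 would be flagged as significant, mirroring the threshold used in the paper.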

Results & Findings

| Aspect | AI-generated PRs | Human-authored PRs | Key Insight |
|---|---|---|---|
| Merge rate | 71% | 78% | Humans still enjoy a modest edge in acceptance. |
| Explicit performance validation | 45.7% | 63.6% | AI agents often skip benchmarks or profiling, raising reliability concerns. |
| Maintainability (avg. cyclomatic-complexity change) | +0.8 | +0.5 | AI changes are slightly more complex, but not dramatically so. |
| Dominant optimization patterns | Algorithmic swap (34%), caching (22%), data-structure change (18%) | Same top three patterns, with similar relative frequencies | AI agents have learned the "right" idioms from existing code. |
| Common pitfalls | Over-caching leading to memory bloat, missing edge-case handling | Rarely observed | Highlights a need for better holistic testing. |

Overall, AI agents can produce performance‑improving commits that look syntactically and stylistically similar to human work, yet they fall short on rigorous validation and occasionally introduce subtle regressions.
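To make the pattern categories concrete, the snippets below illustrate the three dominant idioms from the table (data-structure change, caching, algorithmic swap) in generic Python. They are illustrative examples only, not code drawn from the studied PRs.

```python
# Illustrative examples of the three dominant optimization patterns; these are
# generic idioms, not code taken from the AIDev dataset.
from functools import lru_cache

# 1. Data-structure change: O(n) list membership test -> O(1) set lookup.
ALLOWED_IDS = {101, 202, 303}  # previously a list
def is_allowed(user_id: int) -> bool:
    return user_id in ALLOWED_IDS

# 2. Caching: memoize a pure, frequently repeated computation.
@lru_cache(maxsize=None)
def normalize(key: str) -> str:
    return key.strip().lower()  # stand-in for a costly transformation

# 3. Algorithmic swap: linear de-duplication instead of a nested-loop scan.
def dedupe(items: list) -> list:
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result
```

The "over-caching" pitfall noted in the table corresponds roughly to applying the second idiom with an unbounded cache on hot paths, trading memory for speed.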

Practical Implications

  • Tooling integration – Development teams can safely experiment with AI‑driven suggestions for low‑risk optimizations, but should enforce a mandatory benchmark step (e.g., CI‑based micro‑benchmarks) before merging.
  • CI/CD pipelines – Adding automated performance regression tests can bridge the validation gap identified in the study, turning AI PRs into production‑ready changes; a minimal regression‑gate sketch follows this list.
  • Developer workflow – Engineers can treat AI agents as “pair programmers” that propose candidate optimizations; the human reviewer’s role shifts toward confirming empirical gains rather than discovering the optimization itself.
  • Cost‑benefit – Since AI PRs have comparable merge rates and use familiar patterns, organizations may achieve faster turnaround on performance tickets, freeing senior engineers to focus on architectural work.
  • Education & onboarding – New hires can learn common optimization idioms by reviewing AI‑generated PRs, which act as a curated repository of best‑practice patterns.
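As a concrete example of the mandatory benchmark step mentioned above, the following sketch shows a CI gate that fails the build when the optimized code path runs measurably slower than a stored baseline. The file name, the workload, and the 10% tolerance are hypothetical choices, not tooling described in the paper.

```python
# Sketch of a CI performance gate: fail if the optimized code path regresses
# against a stored baseline. Names, paths, and thresholds are hypothetical.
import json
import sys
import timeit
from pathlib import Path

BASELINE_FILE = Path("perf_baseline.json")  # hypothetical baseline location
TOLERANCE = 1.10                            # allow up to a 10% slowdown

def workload() -> None:
    # Replace with the code path touched by the PR under review.
    sorted(range(10_000), reverse=True)

def measure(repeats: int = 5, number: int = 100) -> float:
    # Best-of-N wall-clock time to damp scheduler noise.
    return min(timeit.repeat(workload, repeat=repeats, number=number))

def main() -> int:
    current = measure()
    if not BASELINE_FILE.exists():
        BASELINE_FILE.write_text(json.dumps({"seconds": current}))
        print(f"Recorded new baseline: {current:.4f}s")
        return 0
    baseline = json.loads(BASELINE_FILE.read_text())["seconds"]
    if current > baseline * TOLERANCE:
        print(f"Performance regression: {current:.4f}s vs baseline {baseline:.4f}s")
        return 1
    print(f"OK: {current:.4f}s (baseline {baseline:.4f}s)")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```

Run as a CI step after the functional test suite; a non-zero exit code blocks the merge until the regression is explained or the baseline is deliberately updated.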

Limitations & Future Work

  • Dataset bias – The AIDev corpus leans toward open‑source projects with active AI experimentation; results may differ in enterprise or legacy codebases.
  • Agent diversity – The study aggregates multiple AI agents under a single “AI” label, obscuring performance differences between, e.g., Codex‑based vs. GPT‑4‑based assistants.
  • Validation granularity – The binary “explicit validation” metric does not capture the quality or thoroughness of the benchmarks used.
  • Future directions suggested by the authors include: building agents that automatically generate and run performance tests, expanding the study to cover memory and energy optimizations, and exploring reinforcement‑learning loops where agents learn from failed PRs.

Authors

  • Huiyun Peng
  • Antonio Zhong
  • Ricardo Andrés Calvo Méndez
  • Kelechi G. Kalu
  • James C. Davis

Paper Information

  • arXiv ID: 2512.21757v1
  • Categories: cs.SE, cs.AI
  • Published: December 25, 2025