[Paper] How Do Agents Perform Code Optimization? An Empirical Study

Published: December 25, 2025
4 min read
Source: arXiv - 2512.21757v1

Overview

Performance optimization is a perennial pain point for developers, and the rise of AI coding assistants promises to ease the burden. This paper delivers the first large‑scale, data‑driven comparison of how AI agents and human engineers tackle real‑world performance‑boosting pull requests (PRs). By mining 324 AI‑generated and 83 human‑written PRs from the AIDev dataset, the authors shed light on adoption patterns, code quality, optimization tactics, and validation practices—offering a reality check on the current state of “agentic” code optimization.

Key Contributions

  • Empirical benchmark of AI‑generated vs. human‑authored performance‑optimization PRs across 407 real‑world commits.
  • Quantitative analysis of adoption rates, maintainability metrics, and the prevalence of different optimization patterns (e.g., algorithmic swaps, data‑structure changes, caching).
  • Validation gap discovery: AI PRs include explicit performance tests in only 45.7% of cases, versus 63.6% for humans (statistically significant, p = 0.007).
  • Pattern similarity finding: despite the validation gap, AI agents largely mimic the same optimization idioms that human developers use.
  • Actionable discussion of current limitations and research directions for more reliable, self‑validating AI code optimizers.

Methodology

  1. Dataset construction – The authors leveraged the publicly available AIDev repository, extracting PRs labeled as “performance” and separating them by author type (AI agent vs. human).
  2. Manual labeling & verification – Each PR was inspected to confirm that the change was genuinely performance‑focused and to record the validation approach (benchmark, profiling, or none).
  3. Metric extraction – For every PR, they measured:
    • Adoption: whether the PR was merged.
    • Maintainability: cyclomatic complexity, lines added/removed, and code churn.
    • Optimization patterns: categorized into algorithmic, data‑structure, caching, parallelism, etc.
  4. Statistical analysis – Chi‑square and Mann‑Whitney U tests evaluated differences between AI and human groups, with a significance threshold of p < 0.05.

The pipeline is deliberately lightweight so that developers can reproduce or extend the study on their own codebases.
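To make the statistical step concrete, here is a minimal sketch of how the chi‑square and Mann‑Whitney U comparisons could be reproduced with SciPy. The counts and complexity deltas below are hypothetical placeholders rather than the paper's raw data; substitute the values mined from your own PR set.

```python
# Minimal sketch of the study's statistical comparison (step 4), not the
# authors' code. All numbers below are hypothetical placeholders.
from scipy.stats import chi2_contingency, mannwhitneyu

# Contingency table: rows = author type, columns = (explicit validation, none).
validation_counts = [
    [148, 176],  # AI-generated PRs (placeholder split)
    [53, 30],    # Human-authored PRs (placeholder split)
]
chi2, p_val, dof, _expected = chi2_contingency(validation_counts)
print(f"Validation gap: chi2 = {chi2:.2f}, p = {p_val:.3f}")

# Per-PR cyclomatic-complexity deltas (placeholder samples).
ai_deltas = [0.0, 1.0, 2.0, 0.5, 1.5, 1.0]
human_deltas = [0.0, 0.5, 1.0, 0.5, 0.0, 0.5]
u_stat, p_val = mannwhitneyu(ai_deltas, human_deltas, alternative="two-sided")
print(f"Complexity difference: U = {u_stat:.1f}, p = {p_val:.3f}")
```

A result with p < 0.05 would be flagged as significant, mirroring the threshold used in the paper.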

Results & Findings

| Aspect | AI-generated PRs | Human-authored PRs | Key Insight |
|---|---|---|---|
| Merge rate | 71% | 78% | Humans still enjoy a modest edge in acceptance. |
| Explicit performance validation | 45.7% | 63.6% | AI agents often skip benchmarks or profiling, raising reliability concerns. |
| Maintainability (avg. cyclomatic-complexity change) | +0.8 | +0.5 | AI changes are slightly more complex, but not dramatically so. |
| Dominant optimization patterns | Algorithmic swap (34%), caching (22%), data-structure change (18%) | Same top three patterns, with similar relative frequencies | AI agents have learned the "right" idioms from existing code. |
| Common pitfalls | Over-caching leading to memory bloat, missing edge-case handling | Rarely observed | Highlights a need for better holistic testing. |

Overall, AI agents can produce performance‑improving commits that look syntactically and stylistically similar to human work, yet they fall short on rigorous validation and occasionally introduce subtle regressions.
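To make the pattern categories concrete, the snippets below illustrate the three dominant idioms from the table (data-structure change, caching, algorithmic swap) in generic Python. They are illustrative examples only, not code drawn from the studied PRs.

```python
# Illustrative examples of the three dominant optimization patterns; these are
# generic idioms, not code taken from the AIDev dataset.
from functools import lru_cache

# 1. Data-structure change: O(n) list membership test -> O(1) set lookup.
ALLOWED_IDS = {101, 202, 303}  # previously a list
def is_allowed(user_id: int) -> bool:
    return user_id in ALLOWED_IDS

# 2. Caching: memoize a pure, frequently repeated computation.
@lru_cache(maxsize=None)
def normalize(key: str) -> str:
    return key.strip().lower()  # stand-in for a costly transformation

# 3. Algorithmic swap: linear de-duplication instead of a nested-loop scan.
def dedupe(items: list) -> list:
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result
```

The "over-caching" pitfall noted in the table corresponds roughly to applying the second idiom with an unbounded cache on hot paths, trading memory for speed.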

Practical Implications

  • Tooling integration – Development teams can safely experiment with AI‑driven suggestions for low‑risk optimizations, but should enforce a mandatory benchmark step (e.g., CI‑based micro‑benchmarks) before merging.
  • CI/CD pipelines – Adding automated performance regression tests can bridge the validation gap identified in the study, turning AI PRs into production‑ready changes; a minimal regression‑gate sketch follows this list.
  • Developer workflow – Engineers can treat AI agents as “pair programmers” that propose candidate optimizations; the human reviewer’s role shifts toward confirming empirical gains rather than discovering the optimization itself.
  • Cost‑benefit – Since AI PRs have comparable merge rates and use familiar patterns, organizations may achieve faster turnaround on performance tickets, freeing senior engineers to focus on architectural work.
  • Education & onboarding – New hires can learn common optimization idioms by reviewing AI‑generated PRs, which act as a curated repository of best‑practice patterns.
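As a concrete example of the mandatory benchmark step mentioned above, the following sketch shows a CI gate that fails the build when the optimized code path runs measurably slower than a stored baseline. The file name, the workload, and the 10% tolerance are hypothetical choices, not tooling described in the paper.

```python
# Sketch of a CI performance gate: fail if the optimized code path regresses
# against a stored baseline. Names, paths, and thresholds are hypothetical.
import json
import sys
import timeit
from pathlib import Path

BASELINE_FILE = Path("perf_baseline.json")  # hypothetical baseline location
TOLERANCE = 1.10                            # allow up to a 10% slowdown

def workload() -> None:
    # Replace with the code path touched by the PR under review.
    sorted(range(10_000), reverse=True)

def measure(repeats: int = 5, number: int = 100) -> float:
    # Best-of-N wall-clock time to damp scheduler noise.
    return min(timeit.repeat(workload, repeat=repeats, number=number))

def main() -> int:
    current = measure()
    if not BASELINE_FILE.exists():
        BASELINE_FILE.write_text(json.dumps({"seconds": current}))
        print(f"Recorded new baseline: {current:.4f}s")
        return 0
    baseline = json.loads(BASELINE_FILE.read_text())["seconds"]
    if current > baseline * TOLERANCE:
        print(f"Performance regression: {current:.4f}s vs baseline {baseline:.4f}s")
        return 1
    print(f"OK: {current:.4f}s (baseline {baseline:.4f}s)")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```

Run as a CI step after the functional test suite; a non-zero exit code blocks the merge until the regression is explained or the baseline is deliberately updated.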

Limitations & Future Work

  • Dataset bias – The AIDev corpus leans toward open‑source projects with active AI experimentation; results may differ in enterprise or legacy codebases.
  • Agent diversity – The study aggregates multiple AI agents under a single “AI” label, obscuring performance differences between, e.g., Codex‑based vs. GPT‑4‑based assistants.
  • Validation granularity – The binary “explicit validation” metric does not capture the quality or thoroughness of the benchmarks used.
  • Future directions suggested by the authors include: building agents that automatically generate and run performance tests, expanding the study to cover memory and energy optimizations, and exploring reinforcement‑learning loops where agents learn from failed PRs.

Authors

  • Huiyun Peng
  • Antonio Zhong
  • Ricardo Andrés Calvo Méndez
  • Kelechi G. Kalu
  • James C. Davis

Paper Information

  • arXiv ID: 2512.21757v1
  • Categories: cs.SE, cs.AI
  • Published: December 25, 2025