[Paper] How Do Agentic AI Systems Address Performance Optimizations? A BERTopic-Based Analysis of Pull Requests

Published: December 31, 2025 at 12:06 AM EST
4 min read
Source: arXiv - 2512.24630v1

Overview

The paper investigates how AI‑driven coding assistants (e.g., GitHub Copilot, ChatGPT‑based bots) actually handle performance‑related changes in real pull requests (PRs). By mining thousands of PRs authored by AI agents and applying topic modeling, the authors reveal the kinds of optimizations AI suggests, where they appear in the software stack, and how they affect the PR review process.

Key Contributions

  • Empirical dataset of AI‑generated performance PRs – collected and filtered a large corpus of pull requests created by LLM‑powered agents.
  • LLM‑assisted detection pipeline – used a few‑shot prompting strategy to automatically label PRs as “performance‑related” with high precision.
  • BERTopic‑based taxonomy – uncovered 52 fine‑grained performance topics, organized into 10 high‑level categories (e.g., algorithmic improvements, memory usage, I/O tuning).
  • Quantitative link to review outcomes – demonstrated that certain optimization types lead to higher acceptance rates and shorter review cycles, while others stall.
  • Lifecycle insight – showed that AI agents concentrate performance work during initial development rather than ongoing maintenance.

Methodology

  1. Data collection – scraped PRs from popular open‑source repositories whose authorship is explicitly attributed to an AI bot (e.g., github-actions[bot], copilot[bot]).
  2. Performance‑PR identification – crafted a few‑shot prompt for a state‑of‑the‑art LLM to classify PR titles, descriptions, and diff comments as performance‑focused. The model’s predictions were then manually verified on a random sample to ensure quality (a classification sketch follows this list).
  3. Topic modeling with BERTopic – fed the textual content (titles, bodies, review comments) of the filtered PRs into BERTopic, which combines transformer embeddings with clustering to surface coherent topics. The resulting 52 topics were manually grouped into 10 broader categories (see the BERTopic sketch below).
  4. Statistical analysis – correlated each topic/category with PR acceptance (merged vs. closed) and review time (submission to merge/close) using logistic regression and survival analysis, controlling for repo size, language, and contributor experience (a modeling sketch follows the code below).
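
As a sketch of step 2, the snippet below shows how such a few‑shot classifier could be wired up, assuming an OpenAI‑style chat‑completions client; the prompt wording, model name, and label set are illustrative placeholders, not the authors’ exact pipeline.

```python
# Hypothetical few-shot classifier for "performance-related" PRs.
# Assumes the openai Python client; model name and prompt are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

FEW_SHOT = """You label GitHub pull requests as PERFORMANCE or OTHER.

Example 1
Title: Replace O(n^2) duplicate check with a set lookup
Label: PERFORMANCE

Example 2
Title: Fix typo in README
Label: OTHER
"""

def classify_pr(title: str, body: str) -> str:
    """Return 'PERFORMANCE' or 'OTHER' for a single PR."""
    prompt = f"{FEW_SHOT}\nNow label this PR.\nTitle: {title}\nBody: {body}\nLabel:"
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; the paper's model is not specified here
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

# Predictions on a filtered sample would then be spot-checked by hand,
# mirroring the manual verification step described above.
label = classify_pr("Cache compiled regexes in the hot path",
                    "Avoids recompiling the pattern on every call.")
```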
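
Step 3 follows the standard BERTopic fit‑and‑inspect flow; the sketch below uses the library’s public API, with placeholder documents and parameter values rather than the paper’s configuration.

```python
# Minimal BERTopic pass over performance-PR texts (titles + bodies + review comments).
# min_topic_size and the embedding model are placeholders, not the paper's settings.
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

docs = [  # one concatenated text per performance-related PR
    "Speed up JSON parsing by streaming the decoder instead of loading the full payload",
    "Reduce memory usage in the image cache by evicting stale thumbnails",
]

embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
topic_model = BERTopic(embedding_model=embedding_model, min_topic_size=15)

topics, probs = topic_model.fit_transform(docs)  # one topic id per document (-1 = outlier)

# Inspect the discovered topics; in the paper, 52 such topics were then
# manually grouped into 10 higher-level categories.
print(topic_model.get_topic_info().head(10))
print(topic_model.get_topic(0))  # top keywords for topic 0
```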
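
For step 4, a hedged sketch of how the acceptance and review‑time models could be set up with statsmodels and lifelines; the column names and covariates are hypothetical stand‑ins for the paper’s actual variables.

```python
# Hypothetical outcome models: logistic regression for merge vs. close,
# Cox proportional hazards for time-to-merge. Column names are illustrative.
import pandas as pd
import statsmodels.formula.api as smf
from lifelines import CoxPHFitter

df = pd.read_csv("performance_prs.csv")
# Assumed columns:
#   merged (0/1), review_days (float), category (one of the 10 high-level categories),
#   repo_size, language, contributor_experience

# Acceptance: does the optimization category predict merging, controlling for confounders?
logit = smf.logit(
    "merged ~ C(category) + repo_size + C(language) + contributor_experience",
    data=df,
).fit()
print(logit.summary())

# Review time: survival model with merge as the event of interest.
cph = CoxPHFitter()
cph.fit(
    df[["review_days", "merged", "repo_size", "contributor_experience"]],
    duration_col="review_days",
    event_col="merged",
)
cph.print_summary()
```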

Results & Findings

  • Diverse optimization layers – AI agents propose changes across the stack: algorithmic refactors (28 % of PRs), data‑structure swaps (15 %), caching strategies (12 %), async/I/O adjustments (10 %), and low‑level memory or compiler flags (5 %).
  • Impact on acceptance – PRs that address algorithmic inefficiencies have the highest merge rate (≈ 73 %) and the shortest median review time (1.8 days). In contrast, memory‑management tweaks are merged only 41 % of the time and linger for ~4.2 days.
  • Development vs. maintenance – 68 % of AI‑generated performance PRs appear within the first 30 % of a repository’s commit history (i.e., early development). Only 12 % surface in long‑running maintenance cycles.
  • Reviewer sentiment – Human reviewers often request additional benchmarks for caching and async changes, suggesting a trust gap for optimizations that are less “obviously correct.”

Practical Implications

  • Tool builders – The taxonomy can guide LLM fine‑tuning: prioritize algorithmic and I/O patterns where AI already shows high acceptance, and invest in better justification (e.g., auto‑generated benchmarks) for memory‑heavy tweaks.
  • DevOps pipelines – Integrate automated performance regression tests triggered by AI‑generated PRs; the study shows that a lack of benchmark evidence is a primary cause of delayed reviews (a minimal regression‑check sketch follows this list).
  • Project maintainers – Expect AI agents to be most helpful early in a project’s lifecycle; schedule dedicated “AI‑optimization sprints” when onboarding new codebases.
  • Developer education – Understanding which optimization categories AI excels at can help developers write clearer prompts (e.g., “suggest a faster sorting algorithm”) and review AI suggestions more efficiently.
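
As a rough illustration of the regression‑test idea above (not something the paper provides), a CI job triggered on AI‑authored PRs could run a small timing gate and fail when a hot path regresses past a tolerance; the imported function, baseline file, and threshold below are all placeholders.

```python
# Toy performance regression gate for CI, run against an AI-authored PR branch.
# The benchmarked function, baseline file, and 20% tolerance are all placeholders.
import json
import timeit
from pathlib import Path

from mypackage import hot_path  # hypothetical function the AI-generated PR touches

BASELINE_FILE = Path("perf_baseline.json")
TOLERANCE = 1.20  # fail if more than 20% slower than the recorded baseline


def measure(repeats: int = 5, number: int = 1000) -> float:
    """Best-of-N wall-clock time for the hot path, in seconds."""
    return min(timeit.repeat(lambda: hot_path(), repeat=repeats, number=number))


if __name__ == "__main__":
    current = measure()
    if BASELINE_FILE.exists():
        baseline = json.loads(BASELINE_FILE.read_text())["seconds"]
        if current > baseline * TOLERANCE:
            raise SystemExit(f"Perf regression: {current:.4f}s vs baseline {baseline:.4f}s")
    # On the main branch, refresh the baseline so future PRs compare against it.
    BASELINE_FILE.write_text(json.dumps({"seconds": current}))
```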

Limitations & Future Work

  • Bot attribution bias – The dataset only includes PRs that explicitly label an AI bot as the author, potentially missing hybrid human‑AI contributions.
  • Language & ecosystem focus – The majority of PRs come from JavaScript/TypeScript and Python projects; results may differ for systems languages like Rust or Go.
  • Textual analysis only – The study relies on textual cues and does not execute the proposed changes; future work could incorporate runtime profiling to validate actual performance gains.
  • User intent – The LLM classifier may mislabel PRs that mention “performance” in a non‑technical context; refining prompts and expanding the training set could improve precision.

Overall, the paper provides a data‑driven lens on how current agentic AI systems tackle performance, offering actionable insights for both tool developers and software teams looking to harness AI for faster, leaner code.

Authors

  • Md Nahidul Islam Opu
  • Shahidul Islam
  • Muhammad Asaduzzaman
  • Shaiful Chowdhury

Paper Information

  • arXiv ID: 2512.24630v1
  • Categories: cs.SE
  • Published: December 31, 2025