[Paper] Comparing AI Coding Agents: A Task-Stratified Analysis of Pull Request Acceptance

Published: February 9, 2026 at 12:14 PM EST
4 min read
Source: arXiv - 2602.08915v1

Overview

A new empirical study puts the most‑used AI coding assistants—OpenAI Codex, GitHub Copilot, Devin, Cursor, and Claude Code—under the microscope by examining 7,156 pull requests (PRs) from the AIDev dataset. By stratifying PRs by task type (e.g., documentation, new feature, bug‑fix) and tracking acceptance rates over 32 weeks, the authors reveal which agents truly help developers get code merged and where their strengths and weaknesses lie.

Key Contributions

  • Large‑scale, task‑stratified comparison of five AI coding agents on real‑world PRs (7,156 PRs, 9 task categories).
  • Temporal trend analysis showing that only Devin improves its acceptance rate over time (+0.77 % per week).
  • Quantification of task impact: documentation PRs enjoy an 82.1 % acceptance rate versus 66.1 % for new‑feature PRs—a 16‑point gap that dwarfs most inter‑agent differences.
  • Statistical validation using stratified Chi‑square tests that confirm Codex’s overall advantage and highlight task‑specific leaders (Claude Code for documentation & features, Cursor for bug‑fixes).
  • Open dataset and analysis scripts released for reproducibility and further community research.

Methodology

  1. Data collection – The authors mined the publicly available AIDev dataset, extracting every PR generated with the help of one of the five agents. Each PR was labeled with its originating agent and its primary task type (documentation, feature, fix, refactor, test, etc.).
  2. Task stratification – PRs were grouped into nine mutually exclusive categories based on the change description and code diff, allowing an “apples‑to‑apples” comparison across agents.
  3. Acceptance metric – A PR is considered accepted if it is merged within 30 days of opening. Acceptance rates were computed per agent, per task, and per week.
  4. Temporal analysis – Weekly acceptance rates were plotted and a linear regression was fitted to capture trends.
  5. Statistical testing – For each task category, a Chi‑square test of independence (with Bonferroni correction) examined whether acceptance rates differed significantly among agents.
  6. Reproducibility – All preprocessing scripts, statistical code, and the curated subset of the AIDev dataset are hosted on GitHub under an open‑source license.
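The acceptance metric and weekly-trend steps above can be sketched in a few lines of standard-library Python. The PR records below are invented toy data (the paper's dataset is much larger), and `ols_slope` is a plain least-squares stand-in for the linear regression the authors fit:

```python
from datetime import date

# Hypothetical toy PRs: (agent, task, opened, merged_or_None)
prs = [
    ("devin", "fix", date(2025, 1, 6), date(2025, 1, 10)),
    ("devin", "doc", date(2025, 1, 7), date(2025, 2, 20)),  # merged, but after 30 days
    ("codex", "doc", date(2025, 1, 8), date(2025, 1, 9)),
    ("codex", "feature", date(2025, 1, 9), None),           # never merged
]

def accepted(opened, merged, window_days=30):
    """Paper's acceptance metric: merged within 30 days of opening."""
    return merged is not None and (merged - opened).days <= window_days

def acceptance_rate(rows):
    """Fraction of PRs accepted under the 30-day metric."""
    return sum(accepted(o, m) for _, _, o, m in rows) / len(rows)

def ols_slope(weeks, rates):
    """Least-squares slope of acceptance rate vs. week index
    (the per-week change the temporal analysis reports)."""
    n = len(weeks)
    mx, my = sum(weeks) / n, sum(rates) / n
    num = sum((x - mx) * (y - my) for x, y in zip(weeks, rates))
    den = sum((x - mx) ** 2 for x in weeks)
    return num / den

print(acceptance_rate(prs))  # 0.5 on this toy sample
```

Per-agent and per-task rates follow by filtering `prs` before calling `acceptance_rate`; a positive `ols_slope` over the 32 weekly rates corresponds to the upward trend reported for Devin.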

Results & Findings

| Task Category | Best-performing Agent | Acceptance Rate |
| --- | --- | --- |
| Documentation | Claude Code | 92.3 % |
| New Feature | Claude Code | 72.6 % |
| Bug Fix | Cursor | 80.4 % |
| Overall (all tasks) | OpenAI Codex | 59.6 %–88.6 % (high across the board) |

  • Temporal trends: Devin is the only agent with a statistically significant upward slope (+0.77 % per week). All others hover around a flat line, indicating stable but not improving performance.
  • Task dominance: The type of work being done explains far more variance in acceptance than the choice of agent. Documentation PRs are 16 percentage points more likely to be merged than new‑feature PRs, regardless of the assistant used.
  • Statistical significance: Stratified Chi‑square tests (p < 0.01 after correction) show Codex outperforms peers in five of the nine categories, while Claude Code and Cursor hold significant advantages in their respective niches.
  • No universal winner: No single assistant dominates across all tasks; developers may need to switch tools depending on the work they are tackling.
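The stratified significance testing can be illustrated with a small standard-library sketch. The 2×5 contingency table below is invented for illustration (it is not the paper's data), and the closed-form p-value shortcut only holds for four degrees of freedom:

```python
import math

def chi2_independence(table):
    """Pearson chi-square statistic and degrees of freedom for an
    r x c contingency table (rows: outcomes, columns: agents)."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    total = sum(row_tot)
    stat = sum(
        (obs - row_tot[i] * col_tot[j] / total) ** 2
        / (row_tot[i] * col_tot[j] / total)
        for i, row in enumerate(table)
        for j, obs in enumerate(row)
    )
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

# Hypothetical accepted vs. rejected PR counts for five agents
# within one task category (illustrative counts only).
accepted_counts = [82, 70, 75, 60, 66]
rejected_counts = [18, 30, 25, 40, 34]
stat, dof = chi2_independence([accepted_counts, rejected_counts])

# With dof == 4 the chi-square survival function has a closed form,
# so no SciPy dependency is needed for this sketch:
p = math.exp(-stat / 2) * (1 + stat / 2)

# Bonferroni correction across the paper's nine task-stratified tests:
significant = p < 0.05 / 9
print(f"chi2={stat:.2f}, dof={dof}, p={p:.4f}, significant={significant}")
```

In practice one would reach for a library routine such as `scipy.stats.chi2_contingency`, which also returns the expected-count table; the manual version above just makes the arithmetic behind the stratified tests explicit.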

Practical Implications

  • Tool selection by task – Teams can adopt a “best‑of‑breed” strategy: use Claude Code for documentation and feature scaffolding, switch to Cursor when tackling bug‑fixes, and keep Codex as a reliable fallback for mixed workloads.
  • Continuous monitoring – Since acceptance rates evolve (e.g., Devin’s upward trend), organizations should periodically re‑evaluate their AI assistants rather than assuming static performance.
  • Process optimization – Knowing that documentation changes are far more likely to be merged, developers can prioritize AI‑generated docs to accelerate review cycles and reduce technical debt.
  • Integration pipelines – CI/CD setups can be enriched with agent‑specific linting or quality gates; for instance, enforce higher test coverage when a fix is generated by Cursor, which already shows strong acceptance in that domain.
  • Cost‑benefit analysis – By mapping acceptance probability to developer time saved, product managers can quantify ROI for each assistant per task type, guiding subscription or licensing decisions.
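The task-based routing suggested above is simple enough to encode directly. The mapping below follows the task-specific leaders the study identifies (Claude Code for documentation and features, Cursor for bug fixes, Codex as the overall fallback); the function name and task labels are illustrative, not from the paper:

```python
# Hypothetical agent-routing table based on the study's per-task leaders.
BEST_BY_TASK = {
    "documentation": "Claude Code",
    "feature": "Claude Code",
    "fix": "Cursor",
}

def pick_agent(task_type, fallback="OpenAI Codex"):
    """Route a task to its best-performing agent, falling back to the
    study's strongest overall performer for unlisted task types."""
    return BEST_BY_TASK.get(task_type, fallback)

print(pick_agent("fix"))       # Cursor
print(pick_agent("refactor"))  # OpenAI Codex (fallback for mixed workloads)
```

Teams monitoring their own acceptance rates (as the paper recommends) could periodically regenerate this table from their merge history instead of hard-coding it.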

Limitations & Future Work

  • Dataset bias – The AIDev dataset primarily contains open‑source projects; results may differ in proprietary or highly regulated codebases.
  • Binary acceptance metric – Merging within 30 days does not capture post‑merge quality issues (e.g., regressions) that could affect long‑term productivity.
  • Agent versioning – The study treats each assistant as a static entity, yet many providers roll out frequent model updates that could shift performance.
  • Future directions – Extending the analysis to include code quality metrics (cyclomatic complexity, test coverage), exploring multi‑agent collaboration (e.g., chaining Codex and Cursor), and conducting user‑experience surveys to complement the quantitative acceptance data.

Authors

  • Giovanni Pinna
  • Jingzhi Gong
  • David Williams
  • Federica Sarro

Paper Information

  • arXiv ID: 2602.08915v1
  • Categories: cs.SE
  • Published: February 9, 2026