[Paper] Early-Stage Prediction of Review Effort in AI-Generated Pull Requests

Published: January 2, 2026 at 12:18 PM EST
4 min read

Source: arXiv - 2601.00753v1

Overview

The paper investigates an emerging problem: AI agents are moving from simple code‑completion helpers to autonomous contributors that open pull requests (PRs) on their own. By analyzing more than 33,000 AI‑generated PRs, the authors ask: can we predict, at the moment a PR is created, whether it will require substantial human review effort? Their answer is a high‑accuracy “circuit‑breaker” model that flags the most costly PRs using only static code‑structure signals.

Key Contributions

  • Empirical discovery of two distinct PR regimes for AI agents: (1) instant‑merge PRs (≈28 % of all PRs) and (2) iterative, “ghosted” PRs that stall and demand heavy review.
  • Large‑scale dataset: 33,707 agent‑authored PRs from 2,807 open‑source repositories (AIDev dataset).
  • Circuit Breaker triage model: a lightweight LightGBM classifier that predicts the top‑20 % most review‑intensive PRs at creation time using only static structural features (e.g., number of files changed, diff size, language composition); a minimal feature‑extraction sketch follows this list.
  • Performance results: AUC = 0.957 on a temporal hold‑out split; intercepts 69 % of total review effort while consuming only 20 % of the review budget.
  • Insight on feature importance: semantic text features (TF‑IDF, CodeBERT embeddings) add virtually no predictive power compared to structural metrics, overturning the assumption that “what the AI says” matters most.
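
The structural signals named above are cheap to compute from nothing more than a PR's file list and diff statistics, which is what makes creation‑time prediction possible. As a rough illustration (the function name, inputs, and exact feature definitions below are assumptions, not the paper's code), such a feature extractor might look like this:

```python
import os
from collections import Counter

def structural_features(changed_files, lines_added, lines_deleted):
    """Hypothetical sketch: static structural features for one PR.

    changed_files: list of file paths touched by the PR.
    The feature set mirrors the signals mentioned in the paper (files changed,
    diff size, test/production mix, language composition); the exact
    definitions here are illustrative assumptions.
    """
    extensions = [os.path.splitext(path)[1].lower() for path in changed_files]
    language_counts = Counter(ext for ext in extensions if ext)

    n_files = len(changed_files)
    n_test_files = sum("test" in path.lower() for path in changed_files)

    return {
        "num_files": n_files,
        "diff_size": lines_added + lines_deleted,
        "test_file_ratio": n_test_files / n_files if n_files else 0.0,
        "language_count": len(language_counts),
    }

# Example with made-up values for a three-file PR:
print(structural_features(
    changed_files=["src/app.py", "src/utils.py", "tests/test_app.py"],
    lines_added=120,
    lines_deleted=35,
))
```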

Methodology

  1. Data collection – Extracted all PRs authored by AI agents (identified via the author_association field and known bot accounts) from the AIDev dataset. Each PR was enriched with static metadata (files touched, lines added/deleted, language mix) and dynamic review metrics (time to first comment, number of review rounds, total reviewer time).
  2. Labeling effort – Review effort was quantified by aggregating reviewer time and comment count. PRs were ranked, and the top 20 % were labeled “high‑effort.”
  3. Feature engineering – Built two feature families:
    • Structural: diff size, number of files, proportion of test vs. production code, language diversity, presence of large binary files, etc.
    • Semantic: TF‑IDF vectors of PR titles/descriptions and CodeBERT embeddings of changed code snippets.
  4. Model training – Used LightGBM (gradient‑boosted trees) with a temporal split (train on older PRs, test on newer ones) to simulate real‑world deployment. Hyperparameters were tuned via Bayesian optimization. (A combined training‑and‑evaluation sketch follows this list.)
  5. Evaluation – Primary metric: Area Under the ROC Curve (AUC). Secondary metrics: precision@20 % budget, recall of total review effort captured, and feature importance analysis.
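
Steps 2–5 can be compressed into a short end‑to‑end sketch. Everything below is illustrative: the input file, column names, 80/20 temporal split point, and LightGBM parameters are assumptions (the paper additionally tuned hyperparameters via Bayesian optimization), but the shape of the pipeline follows the description above.

```python
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.metrics import roc_auc_score

# Assumed input: one row per PR with static structural features, a raw
# review-effort measure, and a creation timestamp (column names are illustrative).
prs = pd.read_csv("agent_prs.csv", parse_dates=["created_at"])

feature_cols = ["num_files", "diff_size", "test_file_ratio", "language_count"]

# Step 2: label the top 20% most review-intensive PRs as "high effort".
threshold = prs["review_effort"].quantile(0.80)
prs["high_effort"] = (prs["review_effort"] >= threshold).astype(int)

# Step 4: temporal split -- train on older PRs, test on newer ones.
prs = prs.sort_values("created_at")
cut = int(len(prs) * 0.8)
train, test = prs.iloc[:cut], prs.iloc[cut:]

model = lgb.LGBMClassifier(n_estimators=500, learning_rate=0.05)
model.fit(train[feature_cols], train["high_effort"])

# Step 5: evaluate with AUC plus budget-style metrics.
scores = model.predict_proba(test[feature_cols])[:, 1]
auc = roc_auc_score(test["high_effort"], scores)

budget = int(len(test) * 0.20)                      # flag the top-20% highest scores
flagged = test.iloc[np.argsort(-scores)[:budget]]
precision_at_budget = flagged["high_effort"].mean()
effort_recall = flagged["review_effort"].sum() / test["review_effort"].sum()

print(f"AUC={auc:.3f}  P@20%={precision_at_budget:.2f}  effort captured={effort_recall:.0%}")
```

The two budget‑style metrics at the end correspond to the paper's precision @ 20 % budget and the share of total review effort captured by the flagged PRs.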

Results & Findings

  • AUC (temporal split): 0.957
  • Precision @ 20 % budget: 0.71
  • Recall of total review effort captured: 69 %
  • Top‑5 features by importance: diff size, number of files, proportion of test files, language count, presence of generated files
  • Contribution of semantic features: < 2 % improvement over the structural baseline
  • Two‑regime behavior: 28.3 % of PRs merged instantly (≤ 1 min), indicating successful narrow‑automation tasks. The remaining PRs often entered “ghosting” loops where the AI stopped responding, forcing reviewers to intervene heavily.
  • Structural dominance: Simple metrics about what the AI touched (size, breadth, file types) were far more predictive than any analysis of the PR’s textual description or code semantics.
  • Zero‑latency governance: Deployed as a pre‑merge gate, the circuit‑breaker model can flag or reject likely high‑effort PRs the moment they are opened, letting teams allocate reviewer time more efficiently (a short calibration sketch follows).
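
To make the gate “zero‑latency”, the only remaining step after training is to convert the fixed review budget into a concrete score cut‑off and persist it next to the model. A minimal continuation of the training sketch above (the artifact name and 20 % budget are assumptions):

```python
import joblib
import numpy as np

# Continuing from the training sketch in the Methodology section:
# `model` is the fitted LGBMClassifier and `scores` are its held-out predictions.
gate_threshold = float(np.quantile(scores, 0.80))  # flag the top 20% of scores

# Persist both artifacts so a CI job can apply the gate at PR-creation time.
joblib.dump({"model": model, "threshold": gate_threshold}, "circuit_breaker.joblib")
print(f"Gate threshold for a 20% review budget: {gate_threshold:.3f}")
```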

Practical Implications

  • Automated triage pipelines – Teams can integrate the LightGBM model into CI/CD to automatically label or block AI‑generated PRs that are likely to be review‑heavy, reducing noise in the review queue (see the CI sketch after this list).
  • Resource budgeting – By allocating a fixed “review budget” (e.g., 20 % of reviewer capacity) to flagged PRs, organizations can capture the majority of review effort while keeping the rest of the workflow lightweight.
  • Design of AI agents – Since structural impact drives effort, developers of AI code‑generation tools should prioritize generating smaller, more focused diffs and avoid touching many unrelated files.
  • Policy & governance – The “circuit breaker” concept provides a concrete governance mechanism for human‑AI collaboration, enabling zero‑latency enforcement of quality gates without manual oversight.
  • Tooling extensions – IDE plugins or GitHub Apps could surface the model’s confidence score directly on PR creation, giving reviewers early visibility into potential workload.
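
As one hypothetical integration of the triage idea above, a CI check could load the persisted artifacts, score the structural features of the PR that triggered the run, and fail the check (or apply a label) when the score crosses the calibrated cut‑off. The artifact name, feature values, and exit‑code convention below are assumptions for illustration, not the paper's tooling.

```python
import sys
import joblib
import pandas as pd

# Hypothetical CI step: load the persisted circuit-breaker artifacts and score
# the structural features of the PR that triggered this pipeline run.
artifacts = joblib.load("circuit_breaker.joblib")
model, threshold = artifacts["model"], artifacts["threshold"]

# In a real pipeline these values would be computed from the PR's diff
# (see the feature-extraction sketch above); hard-coded here for illustration.
pr_features = pd.DataFrame([{
    "num_files": 14,
    "diff_size": 2300,
    "test_file_ratio": 0.07,
    "language_count": 3,
}])

score = model.predict_proba(pr_features)[0, 1]
if score >= threshold:
    print(f"Circuit breaker tripped (score={score:.2f}): holding PR for human triage")
    sys.exit(1)   # fail the check so the PR is routed to the reserved review budget
print(f"Score {score:.2f} below threshold; PR proceeds through the normal queue")
```

Failing the check keeps the PR visible while routing it to the smaller pool of reviewer time reserved for the flagged 20 % budget.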

Limitations & Future Work

  • Dataset bias – The study focuses on open‑source repositories and a specific set of AI agents; results may differ for private codebases or newer generation models.
  • Feature scope – Only static structural features were considered; future work could explore dynamic runtime metrics (e.g., test failures) to refine predictions.
  • Model interpretability – While feature importance is reported, deeper causal analysis (e.g., why certain file types trigger higher effort) remains open.
  • Human factors – The impact of reviewer expertise, team size, and cultural practices on effort was not modeled; incorporating these could improve real‑world applicability.
  • Adaptive agents – Investigating how agents could self‑regulate (e.g., automatically split large PRs) based on the model’s feedback is a promising direction.

Authors

  • Dao Sy Duy Minh
  • Huynh Trung Kiet
  • Tran Chi Nguyen
  • Nguyen Lam Phu Quy
  • Phu Hoa Pham
  • Nguyen Dinh Ha Duong
  • Truong Bao Tran

Paper Information

  • arXiv ID: 2601.00753v1
  • Categories: cs.SE
  • Published: January 2, 2026