[Paper] Early-Stage Prediction of Review Effort in AI-Generated Pull Requests

Published: January 2, 2026 at 12:18 PM EST
4 min read

Source: arXiv - 2601.00753v1

Overview

The paper investigates an emerging problem: AI agents are moving from simple code‑completion helpers to autonomous contributors that open pull requests (PRs) on their own. By analyzing more than 33,000 AI‑generated PRs, the authors ask: can we predict, at the moment a PR is created, whether it will require substantial human review effort? Their answer is a high‑accuracy “circuit‑breaker” model that flags the most costly PRs using only static code‑structure signals.

Key Contributions

  • Empirical discovery of two distinct PR regimes for AI agents: (1) instant‑merge PRs (≈28 % of all PRs) and (2) iterative, “ghosted” PRs that stall and demand heavy review.
  • Large‑scale dataset: 33,707 agent‑authored PRs from 2,807 open‑source repositories (AIDev dataset).
  • Circuit Breaker triage model: a lightweight LightGBM classifier that predicts the top‑20 % most review‑intensive PRs at creation time using only static structural features (e.g., number of files changed, diff size, language composition); a minimal feature‑extraction sketch follows this list.
  • Performance results: AUC = 0.957 on a temporal hold‑out split; intercepts 69 % of total review effort while consuming only 20 % of the review budget.
  • Insight on feature importance: semantic text features (TF‑IDF, CodeBERT embeddings) add virtually no predictive power compared to structural metrics, overturning the assumption that “what the AI says” matters most.
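
The structural signals named above are cheap to compute from nothing more than a PR's file list and diff statistics, which is what makes creation‑time prediction possible. As a rough illustration (the function name, inputs, and exact feature definitions below are assumptions, not the paper's code), such a feature extractor might look like this:

```python
import os
from collections import Counter

def structural_features(changed_files, lines_added, lines_deleted):
    """Hypothetical sketch: static structural features for one PR.

    changed_files: list of file paths touched by the PR.
    The feature set mirrors the signals mentioned in the paper (files changed,
    diff size, test/production mix, language composition); the exact
    definitions here are illustrative assumptions.
    """
    extensions = [os.path.splitext(path)[1].lower() for path in changed_files]
    language_counts = Counter(ext for ext in extensions if ext)

    n_files = len(changed_files)
    n_test_files = sum("test" in path.lower() for path in changed_files)

    return {
        "num_files": n_files,
        "diff_size": lines_added + lines_deleted,
        "test_file_ratio": n_test_files / n_files if n_files else 0.0,
        "language_count": len(language_counts),
    }

# Example with made-up values for a three-file PR:
print(structural_features(
    changed_files=["src/app.py", "src/utils.py", "tests/test_app.py"],
    lines_added=120,
    lines_deleted=35,
))
```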

Methodology

  1. Data collection – Extracted all PRs authored by AI agents (identified via the author_association field and known bot accounts) from the AIDev dataset. Each PR was enriched with static metadata (files touched, lines added/deleted, language mix) and dynamic review metrics (time to first comment, number of review rounds, total reviewer time).
  2. Labeling effort – Review effort was quantified by aggregating reviewer time and comment count. PRs were ranked, and the top 20 % were labeled “high‑effort.”
  3. Feature engineering – Built two feature families:
    • Structural: diff size, number of files, proportion of test vs. production code, language diversity, presence of large binary files, etc.
    • Semantic: TF‑IDF vectors of PR titles/descriptions and CodeBERT embeddings of changed code snippets.
  4. Model training – Used LightGBM (gradient‑boosted trees) with a temporal split (train on older PRs, test on newer ones) to simulate real‑world deployment. Hyperparameters were tuned via Bayesian optimization. (A combined training‑and‑evaluation sketch follows this list.)
  5. Evaluation – Primary metric: Area Under the ROC Curve (AUC). Secondary metrics: precision@20 % budget, recall of total review effort captured, and feature importance analysis.
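
Steps 2–5 can be compressed into a short end‑to‑end sketch. Everything below is illustrative: the input file, column names, 80/20 temporal split point, and LightGBM parameters are assumptions (the paper additionally tuned hyperparameters via Bayesian optimization), but the shape of the pipeline follows the description above.

```python
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.metrics import roc_auc_score

# Assumed input: one row per PR with static structural features, a raw
# review-effort measure, and a creation timestamp (column names are illustrative).
prs = pd.read_csv("agent_prs.csv", parse_dates=["created_at"])

feature_cols = ["num_files", "diff_size", "test_file_ratio", "language_count"]

# Step 2: label the top 20% most review-intensive PRs as "high effort".
threshold = prs["review_effort"].quantile(0.80)
prs["high_effort"] = (prs["review_effort"] >= threshold).astype(int)

# Step 4: temporal split -- train on older PRs, test on newer ones.
prs = prs.sort_values("created_at")
cut = int(len(prs) * 0.8)
train, test = prs.iloc[:cut], prs.iloc[cut:]

model = lgb.LGBMClassifier(n_estimators=500, learning_rate=0.05)
model.fit(train[feature_cols], train["high_effort"])

# Step 5: evaluate with AUC plus budget-style metrics.
scores = model.predict_proba(test[feature_cols])[:, 1]
auc = roc_auc_score(test["high_effort"], scores)

budget = int(len(test) * 0.20)                      # flag the top-20% highest scores
flagged = test.iloc[np.argsort(-scores)[:budget]]
precision_at_budget = flagged["high_effort"].mean()
effort_recall = flagged["review_effort"].sum() / test["review_effort"].sum()

print(f"AUC={auc:.3f}  P@20%={precision_at_budget:.2f}  effort captured={effort_recall:.0%}")
```

The two budget‑style metrics at the end correspond to the paper's precision @ 20 % budget and the share of total review effort captured by the flagged PRs.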

Results & Findings

  • AUC (temporal split): 0.957
  • Precision @ 20 % budget: 0.71
  • Recall of total review effort captured: 69 %
  • Top‑5 features by importance: diff size, number of files, proportion of test files, language count, presence of generated files
  • Contribution of semantic features: < 2 % improvement over the structural baseline
  • Two‑regime behavior: 28.3 % of PRs merged instantly (≤ 1 min), indicating successful narrow‑automation tasks. The remaining PRs often entered “ghosting” loops where the AI stopped responding, forcing reviewers to intervene heavily.
  • Structural dominance: Simple metrics about what the AI touched (size, breadth, file types) were far more predictive than any analysis of the PR’s textual description or code semantics.
  • Zero‑latency governance: Deployed as a pre‑merge gate, the circuit‑breaker model can flag or reject likely high‑effort PRs the moment they are opened, letting teams allocate reviewer time more efficiently (a short calibration sketch follows).
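
To make the gate “zero‑latency”, the only remaining step after training is to convert the fixed review budget into a concrete score cut‑off and persist it next to the model. A minimal continuation of the training sketch above (the artifact name and 20 % budget are assumptions):

```python
import joblib
import numpy as np

# Continuing from the training sketch in the Methodology section:
# `model` is the fitted LGBMClassifier and `scores` are its held-out predictions.
gate_threshold = float(np.quantile(scores, 0.80))  # flag the top 20% of scores

# Persist both artifacts so a CI job can apply the gate at PR-creation time.
joblib.dump({"model": model, "threshold": gate_threshold}, "circuit_breaker.joblib")
print(f"Gate threshold for a 20% review budget: {gate_threshold:.3f}")
```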

Practical Implications

  • Automated triage pipelines – Teams can integrate the LightGBM model into CI/CD to automatically label or block AI‑generated PRs that are likely to be review‑heavy, reducing noise in the review queue (see the CI sketch after this list).
  • Resource budgeting – By allocating a fixed “review budget” (e.g., 20 % of reviewer capacity) to flagged PRs, organizations can capture the majority of review effort while keeping the rest of the workflow lightweight.
  • Design of AI agents – Since structural impact drives effort, developers of AI code‑generation tools should prioritize generating smaller, more focused diffs and avoid touching many unrelated files.
  • Policy & governance – The “circuit breaker” concept provides a concrete governance mechanism for human‑AI collaboration, enabling zero‑latency enforcement of quality gates without manual oversight.
  • Tooling extensions – IDE plugins or GitHub Apps could surface the model’s confidence score directly on PR creation, giving reviewers early visibility into potential workload.
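
As one hypothetical integration of the triage idea above, a CI check could load the persisted artifacts, score the structural features of the PR that triggered the run, and fail the check (or apply a label) when the score crosses the calibrated cut‑off. The artifact name, feature values, and exit‑code convention below are assumptions for illustration, not the paper's tooling.

```python
import sys
import joblib
import pandas as pd

# Hypothetical CI step: load the persisted circuit-breaker artifacts and score
# the structural features of the PR that triggered this pipeline run.
artifacts = joblib.load("circuit_breaker.joblib")
model, threshold = artifacts["model"], artifacts["threshold"]

# In a real pipeline these values would be computed from the PR's diff
# (see the feature-extraction sketch above); hard-coded here for illustration.
pr_features = pd.DataFrame([{
    "num_files": 14,
    "diff_size": 2300,
    "test_file_ratio": 0.07,
    "language_count": 3,
}])

score = model.predict_proba(pr_features)[0, 1]
if score >= threshold:
    print(f"Circuit breaker tripped (score={score:.2f}): holding PR for human triage")
    sys.exit(1)   # fail the check so the PR is routed to the reserved review budget
print(f"Score {score:.2f} below threshold; PR proceeds through the normal queue")
```

Failing the check keeps the PR visible while routing it to the smaller pool of reviewer time reserved for the flagged 20 % budget.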

Limitations & Future Work

  • Dataset bias – The study focuses on open‑source repositories and a specific set of AI agents; results may differ for private codebases or newer generation models.
  • Feature scope – Only static structural features were considered; future work could explore dynamic runtime metrics (e.g., test failures) to refine predictions.
  • Model interpretability – While feature importance is reported, deeper causal analysis (e.g., why certain file types trigger higher effort) remains open.
  • Human factors – The impact of reviewer expertise, team size, and cultural practices on effort was not modeled; incorporating these could improve real‑world applicability.
  • Adaptive agents – Investigating how agents could self‑regulate (e.g., automatically split large PRs) based on the model’s feedback is a promising direction.

Authors

  • Dao Sy Duy Minh
  • Huynh Trung Kiet
  • Tran Chi Nguyen
  • Nguyen Lam Phu Quy
  • Phu Hoa Pham
  • Nguyen Dinh Ha Duong
  • Truong Bao Tran

Paper Information

  • arXiv ID: 2601.00753v1
  • Categories: cs.SE
  • Published: January 2, 2026