[Paper] Early-Stage Prediction of Review Effort in AI-Generated Pull Requests
Source: arXiv - 2601.00753v1
Overview
The paper investigates a new problem that emerges as AI agents move from simple code‑completion helpers to autonomous contributors that open pull requests (PRs) on their own. Analyzing more than 33,000 AI‑generated PRs, the authors ask: can we predict, at the moment a PR is created, whether it will require substantial human review effort? Their answer is a high‑accuracy “circuit‑breaker” model that flags the most costly PRs using only static code‑structure signals.
Key Contributions
- Empirical discovery of two distinct PR regimes for AI agents: (1) instant‑merge PRs (≈28 % of all PRs) and (2) iterative, “ghosted” PRs that stall and demand heavy review.
- Large‑scale dataset: 33,707 agent‑authored PRs from 2,807 open‑source repositories (AIDev dataset).
- Circuit Breaker triage model: a lightweight LightGBM classifier that predicts the top‑20 % most review‑intensive PRs at creation time using only static structural features (e.g., number of files changed, diff size, language composition).
- Performance results: AUC = 0.957 on a temporal hold‑out split; the model intercepts 69 % of total review effort while consuming only 20 % of the review budget.
- Insight on feature importance: semantic text features (TF‑IDF, CodeBERT embeddings) add virtually no predictive power compared to structural metrics, overturning the assumption that “what the AI says” matters most.
Methodology
- Data collection – Extracted all PRs authored by AI agents (identified via the author_association field and known bot accounts) from the AIDev dataset. Each PR was enriched with static metadata (files touched, lines added/deleted, language mix) and dynamic review metrics (time to first comment, number of review rounds, total reviewer time).
- Labeling effort – Review effort was quantified by aggregating reviewer time and comment count. PRs were ranked by this score, and the top 20 % were labeled “high‑effort” (a labeling sketch follows this list).
- Feature engineering – Built two feature families:
- Structural: diff size, number of files, proportion of test vs. production code, language diversity, presence of large binary files, etc.
- Semantic: TF‑IDF vectors of PR titles/descriptions and CodeBERT embeddings of changed code snippets.
- Model training – Used LightGBM (gradient‑boosted trees) with a temporal split (train on older PRs, test on newer ones) to simulate real‑world deployment. Hyperparameters were tuned via Bayesian optimization (a training and evaluation sketch also follows this list).
- Evaluation – Primary metric: Area Under the ROC Curve (AUC). Secondary metrics: precision@20 % budget, recall of total review effort captured, and feature importance analysis.
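The paper describes the labeling and feature‑engineering steps only at a high level, so the following is a minimal sketch under stated assumptions: a pandas DataFrame named prs with hypothetical column names (reviewer_minutes, review_comments, lines_added, lines_deleted, files_changed, test_files_changed, languages), a rank‑based combination of reviewer time and comment count as the effort score, and a top‑20 % cutoff for the high‑effort label.

```python
import pandas as pd

def label_high_effort(prs: pd.DataFrame, budget: float = 0.20) -> pd.DataFrame:
    """Label the top `budget` fraction of PRs by review effort as high-effort (1)."""
    # The paper aggregates reviewer time and comment count; a simple
    # rank-percentile sum is assumed here purely for illustration.
    effort = (
        prs["reviewer_minutes"].rank(pct=True)
        + prs["review_comments"].rank(pct=True)
    )
    out = prs.copy()
    out["effort_score"] = effort
    out["high_effort"] = (effort >= effort.quantile(1.0 - budget)).astype(int)
    return out

def structural_features(prs: pd.DataFrame) -> pd.DataFrame:
    """Assemble static, creation-time features (no review outcomes needed)."""
    feats = pd.DataFrame(index=prs.index)
    feats["diff_size"] = prs["lines_added"] + prs["lines_deleted"]
    feats["files_changed"] = prs["files_changed"]
    # Share of the change that touches test code.
    feats["test_ratio"] = prs["test_files_changed"] / prs["files_changed"].clip(lower=1)
    # Assumes languages are stored as a comma-separated string per PR.
    feats["language_count"] = prs["languages"].str.split(",").str.len()
    return feats
```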
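A complementary sketch of training and evaluation under the same assumptions: a LightGBM classifier fit on older PRs and scored on newer ones, reporting AUC, precision at a 20 % review budget, and the share of total review effort captured. Hyperparameter values are illustrative (the paper’s Bayesian tuning is omitted for brevity), and prs/feats refer to the labeled DataFrame and structural features from the labeling sketch in this section.

```python
import lightgbm as lgb
import numpy as np
from sklearn.metrics import roc_auc_score

def temporal_split(created_at, test_frac=0.2):
    """Boolean masks: train on the oldest PRs, test on the newest."""
    cutoff = created_at.quantile(1.0 - test_frac)
    test = created_at > cutoff
    return ~test, test

def evaluate_at_budget(y_true, effort, scores, budget=0.20):
    """Precision and share of total review effort captured when only the
    top `budget` fraction of PRs (ranked by predicted score) is flagged."""
    k = max(1, int(len(scores) * budget))
    flagged = np.argsort(scores)[::-1][:k]
    precision = y_true[flagged].mean()
    effort_captured = effort[flagged].sum() / effort.sum()
    return precision, effort_captured

train, test = temporal_split(prs["created_at"])
model = lgb.LGBMClassifier(n_estimators=500, learning_rate=0.05)  # illustrative hyperparameters
model.fit(feats[train], prs.loc[train, "high_effort"])

scores = model.predict_proba(feats[test])[:, 1]
auc = roc_auc_score(prs.loc[test, "high_effort"], scores)
prec, captured = evaluate_at_budget(
    prs.loc[test, "high_effort"].to_numpy(),
    prs.loc[test, "effort_score"].to_numpy(),
    scores,
)
print(f"AUC={auc:.3f}  precision@20%={prec:.2f}  effort captured={captured:.0%}")

# Structural dominance can be inspected directly from the fitted model:
for name, importance in zip(feats.columns, model.feature_importances_):
    print(name, importance)
```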
Results & Findings
| Metric | Value |
|---|---|
| AUC (temporal split) | 0.957 |
| Precision @ 20 % budget | 0.71 |
| Share of total review effort captured | 69 % |
| Feature impact (top 5) | Diff size, number of files, proportion of test files, language count, presence of generated files |
| Semantic features contribution | < 2 % improvement over structural baseline |
- Two‑regime behavior: 28.3 % of PRs merged instantly (≤ 1 min), indicating successful narrow‑automation tasks. The remaining PRs often entered “ghosting” loops where the AI stopped responding, forcing reviewers to intervene heavily.
- Structural dominance: Simple metrics about what the AI touched (size, breadth, file types) were far more predictive than any analysis of the PR’s textual description or code semantics.
- Zero‑latency governance: Deploying the circuit‑breaker model as a pre‑merge gate can automatically reject or flag high‑effort PRs, allowing teams to allocate reviewer time more efficiently.
Practical Implications
- Automated triage pipelines – Teams can integrate the LightGBM model into CI/CD to automatically label or block AI‑generated PRs that are likely to be review‑heavy, reducing noise in the review queue (a minimal CI gate sketch follows this list).
- Resource budgeting – By allocating a fixed “review budget” (e.g., 20 % of reviewer capacity) to flagged PRs, organizations can capture the majority of review effort while keeping the rest of the workflow lightweight.
- Design of AI agents – Since structural impact drives effort, developers of AI code‑generation tools should prioritize generating smaller, more focused diffs and avoid touching many unrelated files.
- Policy & governance – The “circuit breaker” concept provides a concrete governance mechanism for human‑AI collaboration, enabling zero‑latency enforcement of quality gates without manual oversight.
- Tooling extensions – IDE plugins or GitHub Apps could surface the model’s confidence score directly on PR creation, giving reviewers early visibility into potential workload.
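As a rough illustration of the triage and gating ideas above, the script below scores a single PR at creation time with a previously trained model and fails the CI check when the predicted effort is high. The model artifact path, threshold, feature order, and command‑line interface are assumptions made for this sketch, not details from the paper; the warning line uses GitHub Actions annotation syntax as one possible way to surface the score to reviewers.

```python
import sys
import joblib
import numpy as np

FLAG_THRESHOLD = 0.80  # illustrative cutoff, tuned to the team's review budget

def triage(lines_added: int, lines_deleted: int, files_changed: int,
           test_files_changed: int, language_count: int) -> int:
    """Score one PR at creation time and return non-zero if it looks review-heavy."""
    model = joblib.load("circuit_breaker.joblib")  # hypothetical saved classifier
    features = np.array([[
        lines_added + lines_deleted,                 # diff size
        files_changed,                               # breadth of change
        test_files_changed / max(files_changed, 1),  # test ratio
        language_count,                              # language diversity
    ]])
    score = model.predict_proba(features)[0, 1]
    print(f"predicted review-effort score: {score:.2f}")
    if score >= FLAG_THRESHOLD:
        # Failing the check acts as the circuit breaker: the PR is flagged
        # (or blocked) before any reviewer time is spent.
        print("::warning::High predicted review effort; routing to the flagged queue.")
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(triage(*map(int, sys.argv[1:6])))
```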
Limitations & Future Work
- Dataset bias – The study focuses on open‑source repositories and a specific set of AI agents; results may differ for private codebases or newer generation models.
- Feature scope – Only static structural features were considered; future work could explore dynamic runtime metrics (e.g., test failures) to refine predictions.
- Model interpretability – While feature importance is reported, deeper causal analysis (e.g., why certain file types trigger higher effort) remains open.
- Human factors – The impact of reviewer expertise, team size, and cultural practices on effort was not modeled; incorporating these could improve real‑world applicability.
- Adaptive agents – Investigating how agents could self‑regulate (e.g., automatically split large PRs) based on the model’s feedback is a promising direction.
Authors
- Dao Sy Duy Minh
- Huynh Trung Kiet
- Tran Chi Nguyen
- Nguyen Lam Phu Quy
- Phu Hoa Pham
- Nguyen Dinh Ha Duong
- Truong Bao Tran
Paper Information
- arXiv ID: 2601.00753v1
- Categories: cs.SE
- Published: January 2, 2026