[Paper] Studying Quality Improvements Recommended via Manual and Automated Code Review

Published: February 12, 2026 at 08:23 AM EST
4 min read
Source: arXiv


Overview

The paper investigates how well a state‑of‑the‑art large language model (ChatGPT‑4) can emulate human code‑review feedback. By comparing 739 manually written reviewer comments on 240 pull requests (PRs) with the suggestions generated automatically by ChatGPT, the authors assess whether AI can replace—or at least supplement—human reviewers in spotting quality‑related issues.

Key Contributions

  • Empirical comparison of human‑written code‑review comments and AI‑generated suggestions on a sizable real‑world dataset (240 PRs, 739 human comments).
  • Taxonomy of quality‑improvement types (e.g., naming, readability, API misuse) derived from manual inspection of human comments.
  • Quantitative analysis showing that ChatGPT produces ~2.4× more comments than humans but only captures ~10 % of the issues raised by human reviewers.
  • Qualitative insight that ~40 % of the extra AI‑generated comments are genuinely useful, highlighting a complementary relationship.
  • Guidelines for practitioners on how to integrate LLM‑based review tools into existing development workflows without expecting them to replace human judgment.

Methodology

  1. Data collection – The authors mined 240 merged PRs from popular open‑source repositories and extracted 739 comments authored by human reviewers that explicitly suggested code changes.
  2. Manual labeling – Each comment was examined and classified into a predefined set of quality‑improvement categories (e.g., naming, refactoring, documentation).
  3. LLM review generation – For every PR, the same code diff was fed to ChatGPT‑4 with a prompt asking it to perform a code review and list improvement suggestions.
  4. Comparison framework – The AI‑generated suggestions were matched against the human‑labeled issues using lexical similarity and manual verification to determine overlap, novelty, and relevance.
  5. Statistical analysis – Metrics such as recall (issues found by AI vs. human), precision (useful AI comments / total AI comments), and comment density were computed.
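Steps 4 and 5 above could be sketched roughly as follows. This is an illustrative assumption, not the authors' implementation: the token-set Jaccard measure, the 0.5 similarity threshold, and the greedy one-to-one matching are all placeholders for whatever lexical-similarity scheme and manual verification the paper actually used.

```python
def jaccard(a: str, b: str) -> float:
    """Lexical similarity between two comments as token-set overlap."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def match_comments(human, ai, threshold=0.5):
    """Greedily pair each human comment with at most one AI suggestion
    whose similarity exceeds the threshold; return matched AI indices."""
    matched = set()
    for h in human:
        for i, a in enumerate(ai):
            if i not in matched and jaccard(h, a) >= threshold:
                matched.add(i)
                break
    return matched


def review_metrics(human, ai, useful_ai_count, threshold=0.5):
    """Recall: human-raised issues also found by the AI.
    Precision: useful AI comments over all AI comments."""
    matched = match_comments(human, ai, threshold)
    recall = len(matched) / len(human) if human else 0.0
    precision = useful_ai_count / len(ai) if ai else 0.0
    return {"recall": recall, "precision": precision}
```

In the study itself, overlap was additionally confirmed by manual verification; a purely lexical threshold like this would only serve as a first-pass filter.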

Results & Findings

| Metric | Human Review | ChatGPT‑4 Review |
| --- | --- | --- |
| Avg. comments per PR | 3.1 | 7.5 |
| Overlap (issues found by both) | | 10 % of human‑identified issues |
| Unique, useful AI comments | | ~40 % of AI‑only comments |
| Redundant / low‑value AI comments | | ~60 % (style nitpicks, trivial suggestions) |
  • Higher volume, lower overlap: ChatGPT tends to be more talkative, flagging many superficial or already‑acceptable patterns, but it misses the majority of the nuanced problems that humans catch.
  • Complementarity: Roughly 60 % of the AI‑only suggestions are either duplicates of human feedback or irrelevant, yet the remaining ~40 % provide fresh, actionable insights that humans did not raise.
  • No time‑saving shortcut: Because humans still need to perform the primary review and later validate AI‑generated comments, the overall review time does not shrink.
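The "~2.4× more comments" figure from the Key Contributions follows directly from the per‑PR averages reported above:

```python
# Avg. comments per PR, as reported in the results table.
human_avg = 3.1
chatgpt_avg = 7.5

# ChatGPT's comment volume relative to human reviewers.
ratio = chatgpt_avg / human_avg
print(round(ratio, 1))  # prints 2.4
```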

Practical Implications

  • Augmented review pipelines: Teams can run an LLM‑based reviewer as a “second pair of eyes” after the human review, catching low‑hanging quality issues that might have been overlooked.
  • Focused triage: Since only ~40 % of AI comments are useful, tooling should incorporate confidence scoring or post‑processing filters to surface the most promising suggestions.
  • Training and onboarding: New contributors could benefit from AI‑generated feedback as a learning aid, provided they are guided to distinguish high‑value advice from noise.
  • Policy design: Organizations should treat AI code‑review outputs as advisory, not authoritative, and retain mandatory human sign‑off for critical changes.
  • Tool integration: Plug‑ins for GitHub, GitLab, or Azure DevOps can automatically post ChatGPT comments on PRs, but UI/UX should allow reviewers to quickly dismiss low‑value remarks to avoid review fatigue.
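The triage idea above could look something like the following minimal sketch. The `AiComment` type and its `score` field are hypothetical: they assume some post‑processing classifier has already assigned each AI suggestion a relevance score, which the paper motivates but does not itself provide.

```python
from dataclasses import dataclass


@dataclass
class AiComment:
    text: str
    score: float  # hypothetical relevance score from a post-processing filter


def triage(comments, threshold=0.7, limit=5):
    """Surface only the most promising AI suggestions, highest score
    first, to reduce review fatigue from low-value remarks."""
    kept = [c for c in comments if c.score >= threshold]
    kept.sort(key=lambda c: c.score, reverse=True)
    return kept[:limit]
```

A CI integration would then post only the triaged comments to the PR, while letting reviewers dismiss the rest in bulk.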

Limitations & Future Work

  • Model specificity: The study only evaluates ChatGPT‑4; other LLMs or fine‑tuned models might behave differently.
  • Domain bias: The PRs come from open‑source projects with certain languages and coding styles; results may not generalize to proprietary or highly specialized codebases.
  • Prompt engineering: A single prompt was used; exploring richer prompts or multi‑turn interactions could improve AI recall.
  • Human reviewer variability: The analysis treats all human comments equally, but reviewer expertise and thoroughness vary, which could affect the baseline.

Future research directions include testing fine‑tuned LLMs on domain‑specific corpora, developing automated relevance‑filtering mechanisms for AI comments, and conducting longitudinal studies to measure how AI‑augmented reviews affect defect density and developer productivity over time.

Authors

  • Giuseppe Crupi
  • Rosalia Tufano
  • Gabriele Bavota

Paper Information

  • arXiv ID: 2602.11925v1
  • Categories: cs.SE
  • Published: February 12, 2026