[Paper] Understanding the Limits of Automated Evaluation for Code Review Bots in Practice

Published: April 27, 2026 at 10:25 AM EDT
5 min read
Source: arXiv - 2604.24525v1

Overview

Automated code‑review (ACR) bots powered by large language models (LLMs) are becoming a staple in modern CI pipelines, but measuring how helpful their comments really are remains an open problem. This paper investigates whether we can reliably automate the evaluation of such bot‑generated comments using LLM‑based judges, and it uncovers why developer actions (e.g., “fixed” vs. “won’t‑fix”) are not a perfect proxy for comment quality.

Key Contributions

  • Real‑world dataset: Collected 2,604 PR comments generated by an LLM‑driven ACR bot at a large industrial partner (Beko), each manually labeled by engineers as fixed or won’t‑fix.
  • Two automated evaluation pipelines:
    1. G‑Eval – a prompt‑based binary/Likert assessment framework.
    2. LLM‑as‑Judge – a chain‑of‑thought pipeline that lets an LLM act as an adjudicator.
  • Model coverage: Experiments run on Gemini‑2.5‑pro, GPT‑4.1‑mini, and GPT‑5.2, testing both binary decisions and a 0‑4 Likert scale.
  • Empirical alignment analysis: Quantified agreement between automated judgments and human labels, finding only moderate alignment (scores of roughly 0.44–0.62 across models and prompt formats).
  • Qualitative insight: Interviews with a software‑engineering director show that “fixed/won’t‑fix” decisions are heavily influenced by workflow pressure, prioritization, and organizational policies—not just comment correctness.

Methodology

  1. Data collection: Extracted every bot comment from PRs over several months, then asked the original developers to mark each as fixed (the suggestion was applied) or won’t‑fix (ignored or deemed irrelevant).
  2. Prompt design: For each comment, two prompt styles were crafted:
    • Binary – “Is this comment actionable? Answer Yes/No.”
    • Likert (0‑4) – “Rate the usefulness of this comment from 0 (useless) to 4 (very helpful).”
  3. Evaluation pipelines:
    • G‑Eval feeds the prompt directly to the target LLM and reads the raw answer.
    • LLM‑as‑Judge adds a reasoning step: the LLM first explains why the comment may or may not be useful, then produces the final decision (a minimal sketch of both pipelines appears after this list).
  4. Agreement measurement: Compared the LLM output (after mapping Likert scores to binary where needed) to the human labels using simple accuracy and Cohen’s κ to capture chance‑adjusted agreement (a worked example follows this list).
  5. Qualitative follow‑up: Conducted a semi‑structured interview with a senior engineering director to interpret the quantitative findings in the context of real development workflows.
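
To make the two pipelines concrete, the Python sketch below shows one way the binary/Likert prompts, the direct G‑Eval call, and the reasoning-first LLM‑as‑Judge call could be wired together. The prompt wording, the call_llm wrapper, and the function names are assumptions for illustration, not the paper's exact prompts or implementation.

```python
# Minimal sketch of the two evaluation pipelines described above (steps 2-3).
# The prompt wording and the `call_llm` helper are illustrative assumptions,
# not the paper's exact prompts or implementation.

def call_llm(prompt: str) -> str:
    """Placeholder for whatever chat-completion client the team uses."""
    raise NotImplementedError("wire up an LLM client here")

BINARY_PROMPT = (
    "You are assessing an automated code-review comment.\n"
    "Comment: {comment}\n"
    "Is this comment actionable? Answer Yes or No."
)

LIKERT_PROMPT = (
    "You are assessing an automated code-review comment.\n"
    "Comment: {comment}\n"
    "Rate its usefulness from 0 (useless) to 4 (very helpful). "
    "Answer with a single digit."
)

def g_eval(comment: str, likert: bool = False) -> str:
    """G-Eval style: send the prompt directly and read the raw answer."""
    template = LIKERT_PROMPT if likert else BINARY_PROMPT
    return call_llm(template.format(comment=comment)).strip()

def llm_as_judge(comment: str, likert: bool = False) -> str:
    """LLM-as-Judge style: elicit reasoning first, then ask for the final decision."""
    reasoning = call_llm(
        "Explain step by step why the following code-review comment may or may "
        f"not be useful to the developer:\n{comment}"
    )
    template = LIKERT_PROMPT if likert else BINARY_PROMPT
    final_prompt = (
        f"{template.format(comment=comment)}\n\n"
        f"Reasoning:\n{reasoning}\n\nFinal answer:"
    )
    return call_llm(final_prompt).strip()
```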
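
For the agreement step, a small example of mapping Likert ratings to binary and computing accuracy and Cohen's κ with scikit-learn might look like the following; the threshold and the toy labels are illustrative assumptions, not values from the paper.

```python
# Agreement measurement sketch: map Likert ratings to binary with an assumed
# threshold, then compare against human fixed/won't-fix labels.
from sklearn.metrics import accuracy_score, cohen_kappa_score

def likert_to_binary(score: int, threshold: int = 2) -> int:
    """Assumed mapping: ratings >= threshold count as 'useful' (1), else 0."""
    return int(score >= threshold)

human_labels = [1, 0, 1, 1, 0]   # 1 = fixed, 0 = won't-fix (toy data)
judge_likert = [3, 1, 4, 1, 3]   # 0-4 ratings from the LLM judge (toy data)

judge_binary = [likert_to_binary(s) for s in judge_likert]

print("accuracy:", accuracy_score(human_labels, judge_binary))
print("Cohen's kappa:", cohen_kappa_score(human_labels, judge_binary))
```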

Results & Findings

| Model | Prompt type | Agreement (κ) | Accuracy |
|---|---|---|---|
| Gemini‑2.5‑pro | Binary | 0.44 | 0.58 |
| Gemini‑2.5‑pro | Likert | 0.51 | 0.62 |
| GPT‑4.1‑mini | Binary | 0.48 | 0.60 |
| GPT‑4.1‑mini | Likert | 0.55 | 0.64 |
| GPT‑5.2 | Binary | 0.46 | 0.59 |
| GPT‑5.2 | Likert | 0.57 | 0.66 |

Take‑away: Even the strongest LLM (GPT‑5.2) only modestly aligns with engineers’ “fixed/won’t‑fix” signals. The Likert formulation consistently outperforms the strict binary one, suggesting that a graded usefulness rating captures more nuance.

Qualitative interviews revealed that developers often ignore a bot comment not because it’s wrong, but because:

  • The change would require a large refactor that conflicts with a sprint deadline.
  • The team has an agreed‑upon style rule that makes the suggestion redundant.
  • Organizational policies (e.g., security review gates) override the bot’s recommendation.

Thus, the “ground truth” derived from developer actions is contaminated by contextual constraints, making it a shaky basis for fully automated evaluation.

Practical Implications

  • Tooling teams should treat automated evaluation as a support metric, not a definitive quality score. Use LLM‑based judges to flag potentially low‑value comments for human review rather than to auto‑accept/reject them (see the triage sketch after this list).
  • Adopt graded (Likert‑style) feedback in CI dashboards; it yields richer signals for bot improvement pipelines.
  • Integrate contextual metadata (e.g., sprint deadline, component ownership, code‑ownership rules) into the evaluation loop to better explain why a comment was ignored.
  • Continuous learning loops: Feed the disagreement cases back into the bot’s training data, focusing on scenarios where workflow constraints dominate.
  • Developer experience: Present bot suggestions with an “impact estimate” (e.g., “would require 3 files to change”) so engineers can make informed decisions without feeling the bot is blindly demanding changes.
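
As a concrete illustration of these recommendations, the hypothetical triage rule below routes comments based on the judge's rating and contextual metadata; the field names, thresholds, and routing labels are assumptions, not part of the paper.

```python
# Hypothetical triage rule combining the judge's rating with contextual
# metadata: flag likely low-value comments for human review instead of
# auto-rejecting them. Field names and thresholds are assumptions.
from dataclasses import dataclass

@dataclass
class BotCommentRecord:
    comment_id: str
    judge_likert: int           # 0-4 usefulness rating from the LLM judge
    near_sprint_deadline: bool  # contextual metadata (assumed field)
    blocked_by_policy: bool     # e.g., a security review gate (assumed field)

def triage(record: BotCommentRecord) -> str:
    """Return a routing decision; a human always makes the final call."""
    if record.judge_likert <= 1:
        return "flag-for-human-review"    # likely low value, but do not auto-reject
    if record.near_sprint_deadline or record.blocked_by_policy:
        return "defer-with-context-note"  # keep the comment and record why it was skipped
    return "surface-normally"

print(triage(BotCommentRecord("pr42-c7", judge_likert=1,
                              near_sprint_deadline=False, blocked_by_policy=False)))
```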

Limitations & Future Work

  • Dataset scope: All data come from a single organization (Beko), which may limit generalizability to open‑source or other enterprise settings.
  • Label noise: “Fixed/won’t‑fix” tags are themselves noisy proxies for comment quality; future work could collect richer annotations (e.g., multi‑dimensional usefulness scores).
  • Model diversity: Only three LLMs were tested; newer multimodal or instruction‑tuned models might behave differently.
  • Dynamic context: The study treats each comment statically; incorporating PR timeline, reviewer comments, and CI status could improve evaluation fidelity.
  • User‑centric studies: Follow‑up experiments with developers interacting with an LLM‑as‑Judge system would clarify how automated feedback influences real‑world review workflows.

Authors

  • Veli Karakaya
  • Utku Boran Torun
  • Baykal Mehmet Uçar
  • Eray Tüzün

Paper Information

  • arXiv ID: 2604.24525v1
  • Categories: cs.SE, cs.AI
  • Published: April 27, 2026
  • PDF: Download PDF