[Paper] CodeFuse-CommitEval: Towards Benchmarking LLM's Power on Commit Message and Code Change Inconsistency Detection
Source: arXiv - 2511.19875v1
Overview
The paper introduces CODEFUSE‑COMMITEVAL, the first benchmark that measures how well large language models (LLMs) can spot mismatches between a commit message and the actual code changes it describes—a problem known as message‑code inconsistency (MCI). By providing a curated, mutation‑based dataset and a thorough evaluation of several open‑source LLMs, the authors lay the groundwork for building tools that automatically flag misleading or low‑quality commit messages.
Key Contributions
- A dedicated MCI benchmark built on the high‑quality ApacheCM dataset, covering seven systematically generated inconsistency types.
- A two‑step validation pipeline that ensures both positive (inconsistent) and negative (consistent) samples are reliable.
- Comprehensive evaluation of six state‑of‑the‑art open‑source LLMs under four prompting regimes: vanilla, few‑shot, chain‑of‑thought (CoT), and extended context.
- Empirical insights on how model size, prompting style, and inconsistency type affect detection performance and token consumption.
- Open‑source release of the dataset, evaluation scripts, and detailed analysis to foster reproducible research.
Methodology
Data Construction
- Start with ApacheCM, a collection of real, well‑documented commits from Apache projects.
- Apply rule‑guided mutations (e.g., swapping component names, altering file paths, changing operation verbs, or misrepresenting the overall purpose) to turn a consistent commit into an inconsistent one; a minimal mutation sketch follows this list.
- Generate seven inconsistency categories ranging from low‑level syntactic mismatches to high‑level semantic “purpose” errors.
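As a rough illustration of what rule‑guided mutation can look like, the Python sketch below flips an operation verb or swaps a file path in a commit message. The verb table, helper names, and example commit are hypothetical placeholders, not the authors' actual mutation rules.

```python
# Hypothetical sketch of rule-guided mutations; NOT the benchmark's exact rules.
import random
import re

# Toy vocabulary of operation verbs; the real benchmark derives rules from ApacheCM commits.
VERB_SWAPS = {"add": "remove", "remove": "add", "fix": "break", "enable": "disable"}

def mutate_operation_verb(message: str) -> str | None:
    """Flip one operation verb so the message contradicts the diff; None if no verb found."""
    for verb, opposite in VERB_SWAPS.items():
        if re.search(rf"\b{verb}\b", message, flags=re.IGNORECASE):
            return re.sub(rf"\b{verb}\b", opposite, message, count=1, flags=re.IGNORECASE)
    return None

def mutate_file_path(message: str, paths_in_diff: list[str], unrelated_paths: list[str]) -> str | None:
    """Replace a path mentioned in the message with one the diff never touches."""
    for path in paths_in_diff:
        if path in message and unrelated_paths:
            return message.replace(path, random.choice(unrelated_paths), 1)
    return None

# Example: a consistent commit becomes an "operation mismatch" positive sample.
original = "Add retry logic to HttpClient and fix flaky timeout test"
print(mutate_operation_verb(original))
# -> "remove retry logic to HttpClient and fix flaky timeout test"
```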
Validation
- Positive validation: Human reviewers confirm that mutated messages truly contradict the diff.
- Negative validation: A separate check ensures the original (unaltered) commits remain consistent.
Model Evaluation
- Six open‑source LLMs (e.g., Llama‑2‑7B, Mistral‑7B, and GPT‑OSS‑20B) are tested in a binary classification setting (consistent vs. inconsistent).
- Four prompting strategies are compared: Vanilla (a single instruction), Few‑shot (a few labeled examples in the prompt), Chain‑of‑Thought (step‑by‑step reasoning), and Extended Context (including surrounding commits).
- Metrics: Recall, Precision, Specificity, and token usage (a minimal evaluation sketch follows this list).
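The Python sketch below illustrates the binary‑classification setup and the three quality metrics. The prompt wording, the `ask_llm` client, and the one‑word answer parsing are assumptions for illustration, not the paper's exact harness.

```python
# Minimal evaluation sketch, assuming a generic chat-completion callable `ask_llm`
# and a one-word answer format; the authors' actual prompts and parsing may differ.
from dataclasses import dataclass

@dataclass
class Sample:
    message: str   # commit message (possibly mutated)
    diff: str      # unified diff of the code change
    label: bool    # True = inconsistent (positive), False = consistent (negative)

VANILLA_PROMPT = (
    "You are reviewing a commit. Decide whether the commit message is consistent "
    "with the code change.\n\nCommit message:\n{message}\n\nDiff:\n{diff}\n\n"
    "Answer with exactly one word: INCONSISTENT or CONSISTENT."
)

def predict(ask_llm, sample: Sample) -> bool:
    """Return True if the model flags the message/diff pair as inconsistent."""
    reply = ask_llm(VANILLA_PROMPT.format(message=sample.message, diff=sample.diff))
    return "INCONSISTENT" in reply.upper()

def evaluate(ask_llm, samples: list[Sample]) -> dict[str, float]:
    """Compute recall, precision, and specificity over a labeled sample set."""
    tp = fp = tn = fn = 0
    for s in samples:
        pred = predict(ask_llm, s)
        if pred and s.label:
            tp += 1
        elif pred and not s.label:
            fp += 1
        elif not pred and s.label:
            fn += 1
        else:
            tn += 1
    return {
        "recall": tp / (tp + fn) if tp + fn else 0.0,        # coverage of true inconsistencies
        "precision": tp / (tp + fp) if tp + fp else 0.0,     # reliability of inconsistency flags
        "specificity": tn / (tn + fp) if tn + fp else 0.0,   # ability to confirm consistent commits
    }
```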
Results & Findings
| Prompting | Recall | Precision | Specificity | Avg. Tokens |
|---|---|---|---|---|
| Vanilla | 84.1% | 78.3% | 61.2% | 1.2k |
| Few‑shot | 86.7% | 79.5% | 62.0% | 0.9k |
| CoT | 82.3% | 82.9% | 66.5% | 1.5k |
| Ext. ctx | 85.0% | 78.0% | 64.0% | 1.8k |
- Overall detection: Models are better at flagging inconsistent commits (high recall) than confirming consistency (lower specificity).
- Best performer: gpt-oss-20B achieves the highest balanced scores but consumes over 2× the tokens of smaller models.
- Prompt effects:
- Few‑shot improves accuracy and cuts token count but introduces a small rise in universally wrong predictions.
- Chain‑of‑Thought boosts precision and specificity (useful when false positives are costly), at the cost of lower recall and higher token usage.
- Adding adjacent commit context helps larger models but adds noise for the smaller ones.
- Inconsistency type: Component, file‑path, and operation mismatches are detected reliably (>90% recall). “Purpose”‑level mismatches (semantic intent) remain the hardest, with recall dropping below 70% and token usage spiking.
Practical Implications
- Automated Code Review: Integrating an MCI detector can surface misleading commit messages early, reducing reviewer fatigue and preventing downstream bugs.
- CI/CD Gatekeeping: Teams can enforce a “commit‑message quality gate” that blocks merges when the model flags high‑confidence inconsistencies.
- Dataset Curation: Researchers building mining‑software‑repositories datasets can filter out noisy commits automatically, improving downstream tasks like bug prediction or change impact analysis.
- Security Audits: Because silent security patches are often hidden behind vague commit messages, an MCI detector can act as a first‑line alert for potentially stealthy changes.
- Tooling Ecosystem: The benchmark and code are ready to be wrapped into VS Code extensions, Git hooks, or cloud‑based LLM services, giving developers a plug‑and‑play quality check; a minimal Git‑hook sketch follows this list.
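As a concrete example of the Git‑hook integration mentioned above, here is a minimal `commit-msg` hook sketch in Python. The `check_inconsistency` detector is a hypothetical placeholder for any MCI model or service, and the 0.9 threshold is an arbitrary choice; neither is part of the released benchmark code.

```python
#!/usr/bin/env python3
# Hypothetical commit-msg Git hook: run an MCI check before accepting a commit.
import subprocess
import sys

def staged_diff() -> str:
    """Collect the staged diff that the commit message is supposed to describe."""
    return subprocess.run(
        ["git", "diff", "--cached"], capture_output=True, text=True, check=True
    ).stdout

def check_inconsistency(message: str, diff: str) -> float:
    """Placeholder: return the detector's confidence that message and diff disagree."""
    raise NotImplementedError("plug in an MCI detector (local model or API) here")

def main() -> int:
    msg_file = sys.argv[1]                      # Git passes the message file path
    with open(msg_file, encoding="utf-8") as f:
        message = f.read()
    score = check_inconsistency(message, staged_diff())
    if score > 0.9:                             # block only on high-confidence flags
        print(f"Commit message looks inconsistent with the staged changes "
              f"(confidence {score:.2f}); aborting commit.", file=sys.stderr)
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())
```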
Limitations & Future Work
- Scope of Mutations: The benchmark relies on rule‑based mutations; real‑world inconsistencies may involve more nuanced linguistic drift that the current dataset does not capture.
- Model Size Bias: Larger models consistently outperform smaller ones, suggesting that practical deployment may need cost‑aware trade‑offs or model distillation.
- Context Window: Extended context helps only up to a point; future work should explore hierarchical or retrieval‑augmented approaches to provide truly relevant history without overwhelming the model.
- Multi‑language Support: The current dataset is limited to Java‑centric Apache projects; expanding to other languages and ecosystems (e.g., JavaScript, Python) will test generalizability.
- Human‑in‑the‑Loop: Combining model predictions with lightweight human verification could further reduce false positives, an avenue the authors plan to explore.
Authors
- Qingyu Zhang
- Puzhuo Liu
- Peng Di
- Chenxiong Qian
Paper Information
- arXiv ID: 2511.19875v1
- Categories: cs.SE, cs.AI
- Published: November 25, 2025