[Paper] CodeFuse-CommitEval: Towards Benchmarking LLM's Power on Commit Message and Code Change Inconsistency Detection
Source: arXiv - 2511.19875v1
Overview
The paper introduces CODEFUSE‑COMMITEVAL, the first benchmark that measures how well large language models (LLMs) can spot mismatches between a commit message and the actual code changes it describes—a problem known as message‑code inconsistency (MCI). By providing a curated, mutation‑based dataset and a thorough evaluation of several open‑source LLMs, the authors lay the groundwork for building tools that automatically flag misleading or low‑quality commit messages.
Key Contributions
- A dedicated MCI benchmark built on the high‑quality ApacheCM dataset, covering seven systematically generated inconsistency types.
- A two‑step validation pipeline that ensures both positive (inconsistent) and negative (consistent) samples are reliable.
- Comprehensive evaluation of six state‑of‑the‑art open‑source LLMs under four prompting regimes: vanilla, few‑shot, chain‑of‑thought (CoT), and extended context.
- Empirical insights on how model size, prompting style, and inconsistency type affect detection performance and token consumption.
- Open‑source release of the dataset, evaluation scripts, and detailed analysis to foster reproducible research.
Methodology
Data Construction
- Start with ApacheCM, a collection of real, well‑documented commits from Apache projects.
- Apply rule‑guided mutations (e.g., swapping component names, altering file paths, changing operation verbs, or misrepresenting the overall purpose) to turn a consistent commit into an inconsistent one; a minimal mutation sketch follows this list.
- Generate seven inconsistency categories ranging from low‑level syntactic mismatches to high‑level semantic “purpose” errors.
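As a rough illustration of what rule‑guided mutation can look like, the Python sketch below flips an operation verb or swaps a file path in a commit message. The verb table, helper names, and example commit are hypothetical placeholders, not the authors' actual mutation rules.

```python
# Hypothetical sketch of rule-guided mutations; NOT the benchmark's exact rules.
import random
import re

# Toy vocabulary of operation verbs; the real benchmark derives rules from ApacheCM commits.
VERB_SWAPS = {"add": "remove", "remove": "add", "fix": "break", "enable": "disable"}

def mutate_operation_verb(message: str) -> str | None:
    """Flip one operation verb so the message contradicts the diff; None if no verb found."""
    for verb, opposite in VERB_SWAPS.items():
        if re.search(rf"\b{verb}\b", message, flags=re.IGNORECASE):
            return re.sub(rf"\b{verb}\b", opposite, message, count=1, flags=re.IGNORECASE)
    return None

def mutate_file_path(message: str, paths_in_diff: list[str], unrelated_paths: list[str]) -> str | None:
    """Replace a path mentioned in the message with one the diff never touches."""
    for path in paths_in_diff:
        if path in message and unrelated_paths:
            return message.replace(path, random.choice(unrelated_paths), 1)
    return None

# Example: a consistent commit becomes an "operation mismatch" positive sample.
original = "Add retry logic to HttpClient and fix flaky timeout test"
print(mutate_operation_verb(original))
# -> "remove retry logic to HttpClient and fix flaky timeout test"
```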
Validation
- Positive validation: Human reviewers confirm that mutated messages truly contradict the diff.
- Negative validation: A separate check ensures the original (unaltered) commits remain consistent.
Model Evaluation
- Six open‑source LLMs (e.g., Llama‑2‑7B, Mistral‑7B, and GPT‑OSS‑20B) are tested in a binary classification setting (consistent vs. inconsistent).
- Four prompting strategies are compared: Vanilla (a single instruction), Few‑shot (a few labeled examples in the prompt), Chain‑of‑Thought (step‑by‑step reasoning), and Extended Context (including surrounding commits).
- Metrics: Recall, Precision, Specificity, and token usage (a minimal evaluation sketch follows this list).
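The Python sketch below illustrates the binary‑classification setup and the three quality metrics. The prompt wording, the `ask_llm` client, and the one‑word answer parsing are assumptions for illustration, not the paper's exact harness.

```python
# Minimal evaluation sketch, assuming a generic chat-completion callable `ask_llm`
# and a one-word answer format; the authors' actual prompts and parsing may differ.
from dataclasses import dataclass

@dataclass
class Sample:
    message: str   # commit message (possibly mutated)
    diff: str      # unified diff of the code change
    label: bool    # True = inconsistent (positive), False = consistent (negative)

VANILLA_PROMPT = (
    "You are reviewing a commit. Decide whether the commit message is consistent "
    "with the code change.\n\nCommit message:\n{message}\n\nDiff:\n{diff}\n\n"
    "Answer with exactly one word: INCONSISTENT or CONSISTENT."
)

def predict(ask_llm, sample: Sample) -> bool:
    """Return True if the model flags the message/diff pair as inconsistent."""
    reply = ask_llm(VANILLA_PROMPT.format(message=sample.message, diff=sample.diff))
    return "INCONSISTENT" in reply.upper()

def evaluate(ask_llm, samples: list[Sample]) -> dict[str, float]:
    """Compute recall, precision, and specificity over a labeled sample set."""
    tp = fp = tn = fn = 0
    for s in samples:
        pred = predict(ask_llm, s)
        if pred and s.label:
            tp += 1
        elif pred and not s.label:
            fp += 1
        elif not pred and s.label:
            fn += 1
        else:
            tn += 1
    return {
        "recall": tp / (tp + fn) if tp + fn else 0.0,        # coverage of true inconsistencies
        "precision": tp / (tp + fp) if tp + fp else 0.0,     # reliability of inconsistency flags
        "specificity": tn / (tn + fp) if tn + fp else 0.0,   # ability to confirm consistent commits
    }
```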
Results & Findings
| Prompting | Recall | Precision | Specificity | Avg. Tokens |
|---|---|---|---|---|
| Vanilla | 84.1% | 78.3% | 61.2% | 1.2k |
| Few‑shot | 86.7% | 79.5% | 62.0% | 0.9k |
| CoT | 82.3% | 82.9% | 66.5% | 1.5k |
| Ext. ctx | 85.0% | 78.0% | 64.0% | 1.8k |
- Overall detection: Models are better at flagging inconsistent commits (high recall) than confirming consistency (lower specificity).
- Best performer: gpt-oss-20B achieves the highest balanced scores but consumes over 2× the tokens of smaller models.
- Prompt effects:
- Few‑shot improves accuracy and cuts token count but introduces a small rise in universally wrong predictions.
- Chain‑of‑Thought boosts precision and specificity (useful when false positives are costly), at the cost of lower recall and higher token usage.
- Adding adjacent commit context helps larger models but adds noise for the smaller ones.
- Inconsistency type: Component, file‑path, and operation mismatches are detected reliably (>90% recall). “Purpose”‑level mismatches (semantic intent) remain the hardest, with recall dropping below 70% and token usage spiking.
Practical Implications
- Automated Code Review: Integrating an MCI detector can surface misleading commit messages early, reducing reviewer fatigue and preventing downstream bugs.
- CI/CD Gatekeeping: Teams can enforce a “commit‑message quality gate” that blocks merges when the model flags high‑confidence inconsistencies.
- Dataset Curation: Researchers building mining‑software‑repositories datasets can filter out noisy commits automatically, improving downstream tasks like bug prediction or change impact analysis.
- Security Audits: Because silent security patches are often hidden behind vague commit messages, an MCI detector can act as a first‑line alert for potentially stealthy changes.
- Tooling Ecosystem: The benchmark and code are ready to be wrapped into VS Code extensions, Git hooks, or cloud‑based LLM services, giving developers a plug‑and‑play quality check; a minimal Git‑hook sketch follows this list.
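As a concrete example of the Git‑hook integration mentioned above, here is a minimal `commit-msg` hook sketch in Python. The `check_inconsistency` detector is a hypothetical placeholder for any MCI model or service, and the 0.9 threshold is an arbitrary choice; neither is part of the released benchmark code.

```python
#!/usr/bin/env python3
# Hypothetical commit-msg Git hook: run an MCI check before accepting a commit.
import subprocess
import sys

def staged_diff() -> str:
    """Collect the staged diff that the commit message is supposed to describe."""
    return subprocess.run(
        ["git", "diff", "--cached"], capture_output=True, text=True, check=True
    ).stdout

def check_inconsistency(message: str, diff: str) -> float:
    """Placeholder: return the detector's confidence that message and diff disagree."""
    raise NotImplementedError("plug in an MCI detector (local model or API) here")

def main() -> int:
    msg_file = sys.argv[1]                      # Git passes the message file path
    with open(msg_file, encoding="utf-8") as f:
        message = f.read()
    score = check_inconsistency(message, staged_diff())
    if score > 0.9:                             # block only on high-confidence flags
        print(f"Commit message looks inconsistent with the staged changes "
              f"(confidence {score:.2f}); aborting commit.", file=sys.stderr)
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main())
```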
Limitations & Future Work
- Scope of Mutations: The benchmark relies on rule‑based mutations; real‑world inconsistencies may involve more nuanced linguistic drift that the current dataset does not capture.
- Model Size Bias: Larger models consistently outperform smaller ones, suggesting that practical deployment may need cost‑aware trade‑offs or model distillation.
- Context Window: Extended context helps only up to a point; future work should explore hierarchical or retrieval‑augmented approaches to provide truly relevant history without overwhelming the model.
- Multi‑language Support: The current dataset is limited to Java‑centric Apache projects; expanding to other languages and ecosystems (e.g., JavaScript, Python) will test generalizability.
- Human‑in‑the‑Loop: Combining model predictions with lightweight human verification could further reduce false positives, an avenue the authors plan to explore.
Authors
- Qingyu Zhang
- Puzhuo Liu
- Peng Di
- Chenxiong Qian
Paper Information
- arXiv ID: 2511.19875v1
- Categories: cs.SE, cs.AI
- Published: November 25, 2025