[Paper] CodeFuse-CommitEval: Towards Benchmarking LLM's Power on Commit Message and Code Change Inconsistency Detection

Published: November 24, 2025 at 10:33 PM EST
4 min read
Source: arXiv - 2511.19875v1

Overview

The paper introduces CODEFUSE‑COMMITEVAL, the first benchmark that measures how well large language models (LLMs) can spot mismatches between a commit message and the actual code changes it describes—a problem known as message‑code inconsistency (MCI). By providing a curated, mutation‑based dataset and a thorough evaluation of several open‑source LLMs, the authors lay the groundwork for building tools that automatically flag misleading or low‑quality commit messages.

Key Contributions

  • A dedicated MCI benchmark built on the high‑quality ApacheCM dataset, covering seven systematically generated inconsistency types.
  • A two‑step validation pipeline that ensures both positive (inconsistent) and negative (consistent) samples are reliable.
  • Comprehensive evaluation of six state‑of‑the‑art open‑source LLMs under four prompting regimes: vanilla, few‑shot, chain‑of‑thought (CoT), and extended context.
  • Empirical insights on how model size, prompting style, and inconsistency type affect detection performance and token consumption.
  • Open‑source release of the dataset, evaluation scripts, and detailed analysis to foster reproducible research.

Methodology

  1. Data Construction

    • Start with ApacheCM, a collection of real, well‑documented commits from Apache projects.
    • Apply rule‑guided mutations (e.g., swapping component names, altering file paths, changing operation verbs, or misrepresenting the overall purpose) to turn a consistent commit into an inconsistent one (a minimal mutation sketch follows this list).
    • Generate seven inconsistency categories ranging from low‑level syntactic mismatches to high‑level semantic “purpose” errors.
  2. Validation

    • Positive validation: Human reviewers confirm that mutated messages truly contradict the diff.
    • Negative validation: A separate check ensures the original (unaltered) commits remain consistent.
  3. Model Evaluation

    • Six open‑source LLMs (e.g., Llama‑2‑7B, Mistral‑7B, GPT‑OSS‑20B) are tested in a binary classification setting (consistent vs. inconsistent).
    • Four prompting strategies are compared:
      Vanilla (a single instruction), Few‑shot (a few examples in the prompt), Chain‑of‑Thought (step‑by‑step reasoning), and Extended Context (surrounding commits included in the prompt).
    • Metrics: Recall, Precision, Specificity, and token usage (a minimal sketch of the classification setup and metric formulas follows this list).
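
On the data‑construction side, here is a minimal sketch of what a single rule‑guided mutation might look like, using an operation‑verb swap. The verb table and helper below are illustrative assumptions, not the authors' actual mutation rules, which cover seven inconsistency types.

```python
import re

# Hypothetical verb-swap table; the paper's actual rules are richer and also
# cover component names, file paths, and purpose-level rewrites.
VERB_SWAPS = {
    "add": "remove",
    "remove": "add",
    "fix": "introduce",
    "enable": "disable",
    "disable": "enable",
}


def mutate_operation_verb(message: str) -> str | None:
    """Turn a consistent commit message into an inconsistent one by swapping
    the first recognised operation verb for its opposite. Returns None when
    no rule applies, in which case the sample would simply be skipped."""
    for verb, opposite in VERB_SWAPS.items():
        pattern = rf"\b{verb}\b"
        if re.search(pattern, message, flags=re.IGNORECASE):
            return re.sub(pattern, opposite, message, count=1, flags=re.IGNORECASE)
    return None


# e.g. "fix NPE in FileUtils.readLines" -> "introduce NPE in FileUtils.readLines"
print(mutate_operation_verb("fix NPE in FileUtils.readLines"))
```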
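
To ground the evaluation protocol, the sketch below shows the binary‑classification setup and the three reported metrics, treating "inconsistent" as the positive class. The prompt wording and the `ask_model` callable are placeholders rather than the paper's exact prompts or harness.

```python
VANILLA_PROMPT = """You are reviewing a Git commit.

Commit message:
{message}

Code diff:
{diff}

Does the message accurately describe the change? Answer with exactly one
word: CONSISTENT or INCONSISTENT."""


def classify(ask_model, message: str, diff: str) -> bool:
    """Return True when the model flags the pair as inconsistent.
    `ask_model` is any callable mapping a prompt string to a reply string."""
    reply = ask_model(VANILLA_PROMPT.format(message=message, diff=diff))
    return reply.strip().upper().startswith("INCONSISTENT")


def metrics(preds: list[bool], labels: list[bool]) -> dict[str, float]:
    """Recall, precision, and specificity with 'inconsistent' as positive."""
    tp = sum(p and l for p, l in zip(preds, labels))
    fp = sum(p and not l for p, l in zip(preds, labels))
    fn = sum(not p and l for p, l in zip(preds, labels))
    tn = sum(not p and not l for p, l in zip(preds, labels))
    return {
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
    }
```

A stubbed model such as `lambda prompt: "CONSISTENT"` is enough to exercise this loop end to end before plugging in a real LLM client.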

Results & Findings

| Prompting | Recall | Precision | Specificity | Avg. Tokens |
|-----------|--------|-----------|-------------|-------------|
| Vanilla   | 84.1%  | 78.3%     | 61.2%       | 1.2k        |
| Few‑shot  | 86.7%  | 79.5%     | 62.0%       | 0.9k        |
| CoT       | 82.3%  | 82.9%     | 66.5%       | 1.5k        |
| Ext. ctx  | 85.0%  | 78.0%     | 64.0%       | 1.8k        |
  • Overall detection: Models are better at flagging inconsistent commits (high recall) than confirming consistency (lower specificity).
  • Best performer: gpt-oss-20B achieves the highest balanced scores but consumes >2× the tokens of smaller models.
  • Prompt effects:
    • Few‑shot improves accuracy and cuts token count, but slightly increases the number of universally wrong predictions.
    • Chain‑of‑Thought boosts precision and specificity, useful when false positives are costly, at the expense of recall and higher token usage.
    • Adding adjacent commit context helps larger models but adds noise for the smaller ones.
  • Inconsistency type: Component, file‑path, and operation mismatches are detected reliably (>90% recall). “Purpose”‑level mismatches (semantic intent) remain the hardest, with recall dropping below 70% and token usage spiking.

Practical Implications

  • Automated Code Review: Integrating an MCI detector can surface misleading commit messages early, reducing reviewer fatigue and preventing downstream bugs.
  • CI/CD Gatekeeping: Teams can enforce a “commit‑message quality gate” that blocks merges when the model flags high‑confidence inconsistencies (a hook sketch follows this list).
  • Dataset Curation: Researchers building mining‑software‑repositories datasets can filter out noisy commits automatically, improving downstream tasks like bug prediction or change impact analysis.
  • Security Audits: Since silent security patches often hide behind vague commit messages, an MCI detector can act as a first‑line alert for potentially stealthy changes.
  • Tooling Ecosystem: The benchmark and code are ready to be wrapped into VS Code extensions, Git hooks, or cloud‑based LLM services, giving developers a plug‑and‑play quality check.
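
To make the gatekeeping idea concrete, the sketch below wires a hypothetical detector into a Git commit-msg hook. The `detect_inconsistency` stub and the 0.9 confidence threshold are assumptions for illustration, not part of the released tooling.

```python
#!/usr/bin/env python3
"""Sketch of a commit-message quality gate for a Git commit-msg hook."""
import subprocess
import sys


def detect_inconsistency(message: str, diff: str) -> tuple[bool, float]:
    """Placeholder: call an MCI detection model here and return
    (is_inconsistent, confidence). Stubbed so the sketch runs standalone."""
    return False, 0.0


def staged_diff() -> str:
    """Diff of the changes about to be committed."""
    return subprocess.run(
        ["git", "diff", "--cached"], capture_output=True, text=True, check=True
    ).stdout


def main(msg_file: str) -> int:
    with open(msg_file, encoding="utf-8") as f:
        message = f.read()

    is_inconsistent, confidence = detect_inconsistency(message, staged_diff())
    if is_inconsistent and confidence >= 0.9:  # arbitrary illustrative threshold
        print("Commit message looks inconsistent with the staged changes.")
        print("Revise the message, or bypass the check with --no-verify.")
        return 1
    return 0


if __name__ == "__main__":
    # Git passes the path of the commit-message file as the first argument.
    sys.exit(main(sys.argv[1]))
```

Copying this file to `.git/hooks/commit-msg` and marking it executable is enough to try the flow locally; a CI variant would run the same check server-side before allowing a merge.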

Limitations & Future Work

  • Scope of Mutations: The benchmark relies on rule‑based mutations; real‑world inconsistencies may involve more nuanced linguistic drift that the current dataset does not capture.
  • Model Size Bias: Larger models consistently outperform smaller ones, suggesting that practical deployment may need cost‑aware trade‑offs or model distillation.
  • Context Window: Extended context helps only up to a point; future work should explore hierarchical or retrieval‑augmented approaches to provide truly relevant history without overwhelming the model.
  • Multi‑language Support: The current dataset is limited to Java‑centric Apache projects; expanding to other languages and ecosystems (e.g., JavaScript, Python) will test generalizability.
  • Human‑in‑the‑Loop: Combining model predictions with lightweight human verification could further reduce false positives, an avenue the authors plan to explore.

Authors

  • Qingyu Zhang
  • Puzhuo Liu
  • Peng Di
  • Chenxiong Qian

Paper Information

  • arXiv ID: 2511.19875v1
  • Categories: cs.SE, cs.AI
  • Published: November 25, 2025