[Paper] On the Impact of Code Comments for Automated Bug-Fixing: An Empirical Study
Source: arXiv - 2601.23059v1
Overview
The paper investigates a surprisingly simple question: Do code comments help large language models (LLMs) fix bugs automatically? While most automated bug‑fixing pipelines strip comments before training, the authors hypothesize that comments carry design intent and implementation clues that could boost a model’s ability to generate correct patches. By systematically varying the presence of comments during both training and inference, they provide the first large‑scale empirical evidence that comments can dramatically improve LLM‑based bug fixing.
Key Contributions
- Empirical comparison of four comment‑handling regimes (train‑with/without comments × infer‑with/without comments) on two popular LLM families for bug fixing.
- Creation of a comment‑augmented dataset using an LLM to generate realistic comments for methods that originally lacked them, enabling a fair evaluation across all regimes.
- Quantitative results showing a nearly threefold increase in bug‑fixing accuracy (12.4 % → 36.2 % top‑1 exact match) when comments are present at both training and inference time.
- Interpretability analysis pinpointing that implementation‑detail comments (e.g., “loops over sorted list”) are the most beneficial for the model’s reasoning.
- Practical guidance that retaining comments does not hurt performance on comment‑free code, debunking the long‑standing “strip‑comments” default.
Methodology
- Dataset preparation – The authors start from a state‑of‑the‑art Java bug‑fixing benchmark (e.g., Defects4J) and identify methods lacking comments. They prompt a strong LLM (GPT‑4‑style) to synthesize natural‑language comments that describe the method’s purpose and key steps.
- Model families – Two widely used LLM families for code (a transformer‑based code model and a decoder‑only LLM) are fine‑tuned on the bug‑fixing task.
- Four experimental conditions
- Train‑no‑comments / Infer‑no‑comments (baseline).
- Train‑no‑comments / Infer‑with‑comments (only inference sees comments).
- Train‑with‑comments / Infer‑no‑comments (trained on comments but receives none at test time).
- Train‑with‑comments / Infer‑with‑comments (comments everywhere).
- Evaluation – For each condition, the model generates a patched version of the buggy method. Two metrics are reported separately: exact match (the patch is identical to the human‑written fix) and functional correctness (the patched method passes the original test suite).
- Interpretability – Attention maps and gradient‑based saliency are inspected to see which parts of the input (code vs. comment) the model relies on when producing a fix.
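The experimental design above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: the comment‑stripping regex, the condition table, and the `evaluate` helper are assumptions about how such a pipeline could be wired up.

```python
import re

def strip_comments(java_src: str) -> str:
    """Remove // line comments and /* ... */ block comments (naive sketch;
    does not handle comment markers inside string literals)."""
    no_block = re.sub(r"/\*.*?\*/", "", java_src, flags=re.DOTALL)
    return re.sub(r"//[^\n]*", "", no_block)

def build_input(buggy_method: str, with_comments: bool) -> str:
    """Prepare a model input for one of the four regimes."""
    return buggy_method if with_comments else strip_comments(buggy_method)

# (train_with_comments, infer_with_comments) for the four conditions
CONDITIONS = [
    (False, False),  # baseline: comments stripped everywhere
    (False, True),   # only inference sees comments
    (True,  False),  # trained on comments, none at test time
    (True,  True),   # comments everywhere
]

def evaluate(patch: str, reference_fix: str, tests_pass: bool) -> dict:
    """Report the two metrics separately, as in the results table."""
    return {
        "exact_match": patch.strip() == reference_fix.strip(),
        "functional": tests_pass,
    }
```

Keeping the two metrics separate matters: a patch can pass the test suite without matching the human fix token for token, which is why functional correctness is slightly higher than exact match in every condition.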
Results & Findings
| Condition | Top‑1 Exact‑Match Accuracy | Functional Correctness |
|---|---|---|
| Train‑no‑comments / Infer‑no‑comments | 12.4 % | 15.1 % |
| Train‑no‑comments / Infer‑with‑comments | 21.8 % | 24.5 % |
| Train‑with‑comments / Infer‑no‑comments | 13.0 % | 15.8 % |
| Train‑with‑comments / Infer‑with‑comments | 36.2 % | 38.9 % |
- Comments at inference time alone already give a ~1.8× boost over the baseline.
- Training with comments does not penalize performance on comment‑free inputs; the model gracefully falls back to code‑only reasoning.
- Implementation‑detail comments (e.g., “uses binary search to locate the target”) contribute the most to the observed gains, while high‑level documentation adds marginal benefit.
- Attention visualizations show the model frequently shifts focus to comment tokens when deciding how to modify a specific line, confirming that the LLM is actually reading the comments.
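The reported multipliers follow directly from the exact‑match column of the table; a quick arithmetic check:

```python
# Sanity check of the multipliers quoted in the findings, using the
# top-1 exact-match column of the results table above.
baseline   = 12.4  # train-no / infer-no
infer_only = 21.8  # train-no / infer-with
both       = 36.2  # train-with / infer-with

print(round(infer_only / baseline, 2))  # 1.76 -> the "~1.8x" boost
print(round(both / baseline, 2))        # 2.92 -> close to 3x
```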
Practical Implications
- Keep comments in your training pipelines. Modern LLM‑based bug‑fixing tools (e.g., GitHub Copilot, Tabnine) can be fine‑tuned on comment‑rich corpora without fearing a performance hit on comment‑poor projects.
- Encourage developers to write implementation comments. Even short inline notes that explain non‑obvious logic can dramatically improve automated repair outcomes.
- Tooling can auto‑generate missing comments (as the authors did) to retrofit legacy codebases, providing a low‑cost way to reap the benefits without manual effort.
- Hybrid pipelines that conditionally feed comments to the model at inference time (e.g., only when they exist) can be deployed with negligible overhead.
- Better debugging assistants. IDE extensions could surface suggested comment edits alongside fix suggestions, turning the comment‑generation step into a collaborative activity between developers and the AI.
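A conditional pipeline like the one described above is cheap to implement. The sketch below is hypothetical (the `synthesize` hook and function names are illustrative, not from the paper): feed comments through when the buggy method already has them, optionally retrofit generated ones when it does not, and otherwise fall back to code‑only input, which the study found costs nothing.

```python
import re

# Detects the start of a // line comment or /* block comment (naive:
# ignores comment markers inside string literals).
COMMENT_RE = re.compile(r"//|/\*")

def has_comments(java_src: str) -> bool:
    return bool(COMMENT_RE.search(java_src))

def prepare_repair_input(buggy_method: str, synthesize=None) -> str:
    """Hybrid policy: keep comments when present; optionally retrofit
    them (e.g., via an LLM comment generator) when absent."""
    if has_comments(buggy_method):
        return buggy_method              # comment-rich: pass through as-is
    if synthesize is not None:
        return synthesize(buggy_method)  # retrofit generated comments
    return buggy_method                  # code-only fallback is harmless
```

The `synthesize` callback is where the authors' comment‑generation step (or any IDE‑side comment suggester) would plug in.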
Limitations & Future Work
- Synthetic comments: While LLM‑generated comments are realistic, they may not capture the full diversity of human‑written documentation, possibly inflating the observed gains.
- Language scope: The study focuses on Java; results may differ for dynamically typed languages or those with less conventional commenting practices.
- Bug types: The benchmark is dominated by certain categories (e.g., off‑by‑one, null‑pointer). It remains open how comments affect more complex logical or concurrency bugs.
- Model size: Only two model families were examined; scaling to larger or smaller models could change the relative importance of comments.
- Future directions proposed include (1) evaluating on multi‑language corpora, (2) exploring comment‑aware prompting strategies for zero‑shot repair, and (3) integrating comment quality metrics to prioritize high‑impact documentation.
Authors
- Antonio Vitale
- Emanuela Guglielmi
- Simone Scalabrino
- Rocco Oliveto
Paper Information
- arXiv ID: 2601.23059v1
- Categories: cs.SE, cs.AI, cs.LG
- Published: January 30, 2026