[Paper] Summary-Mediated Repair: Can LLMs use code summarisation as a tool for program repair?

Published: November 24, 2025 (12:33 AM EST)
4 min read

Source: arXiv - 2511.18782v1

Overview

Large Language Models (LLMs) can generate impressive code, yet they still overlook subtle bugs in their own output. This paper introduces summary‑mediated repair – a prompt‑only technique that asks the LLM to first write a natural‑language summary of the buggy function and then uses that summary to guide a repair step. The authors show that, across several production‑grade LLMs, this extra “thinking‑out‑loud” stage can modestly boost the success rate of automatic program repair.

Key Contributions

  • Novel pipeline: Proposes a two‑step “summarise‑then‑repair” prompt that treats code summarisation as an explicit diagnostic artifact.
  • Empirical evaluation: Benchmarks the approach on two widely used function‑level datasets (HumanEvalPack and MBPP) with eight different LLMs, ranging from open‑source to commercial models.
  • Summary style analysis: Compares several prompting styles (plain, error‑aware, intent‑focused) and finds that error‑aware diagnostic summaries consistently give the biggest lift.
  • Quantitative gains: Demonstrates repair rates of up to 65 % on previously unseen bugs, with an average improvement of roughly 5 percentage points over a direct‑repair baseline.
  • Practical insight: Shows that summaries are cheap, human‑readable diagnostics that can be dropped into existing repair workflows without extra training or model changes.

Methodology

  1. Dataset preparation – The authors start from two function‑level benchmark suites (HumanEvalPack and MBPP) that contain correct reference implementations and a set of injected bugs.
  2. Prompt design – For each buggy function they issue a first prompt asking the LLM to produce a natural‑language summary. Three summary styles are explored:
    • Plain: “Summarize what this function does.”
    • Intent‑focused: Emphasizes the high‑level goal of the code.
    • Error‑aware diagnostic: Explicitly asks the model to point out any mismatches between the code and the intended behavior.
  3. Repair step – The generated summary (plus the original buggy code) is fed back to the same LLM with a second prompt that asks for a corrected implementation.
  4. Baseline – A direct‑repair baseline skips the summarisation step and asks the LLM to fix the bug in one shot.
  5. Evaluation metrics – Repairs are judged against the unit tests associated with each benchmark; success is counted only when the repaired code passes all tests (a minimal sketch of this pipeline follows the list).
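The paper describes the pipeline purely in terms of prompts, so the following Python sketch is only an illustration of the two‑step loop under stated assumptions: the `call_llm` helper, the `run_unit_tests` check, and the exact prompt wording are placeholders, not the authors' implementation.

```python
# Minimal sketch of summary-mediated repair (illustrative; not the paper's code).
# `call_llm` stands in for any chat-completion API; `run_unit_tests` for the
# benchmark's test harness. Both are hypothetical placeholders.

from typing import Callable

# The three summary styles explored in the paper (wording paraphrased here).
SUMMARY_PROMPTS = {
    "plain": "Summarize what this function does:\n\n{code}",
    "intent": "Describe the high-level goal this function is trying to achieve:\n\n{code}",
    "error_aware": (
        "Summarize what this function does, and point out any places where the "
        "implementation seems to mismatch the intended behavior:\n\n{code}"
    ),
}

REPAIR_PROMPT = (
    "The following function is buggy.\n\n{code}\n\n"
    "A diagnostic summary of the function:\n{summary}\n\n"
    "Return a corrected implementation of the function."
)

DIRECT_REPAIR_PROMPT = "The following function is buggy. Fix it:\n\n{code}"


def summary_mediated_repair(
    buggy_code: str,
    call_llm: Callable[[str], str],
    style: str = "error_aware",
) -> str:
    """Two-step summarise-then-repair: summary first, then summary-guided fix."""
    summary = call_llm(SUMMARY_PROMPTS[style].format(code=buggy_code))
    return call_llm(REPAIR_PROMPT.format(code=buggy_code, summary=summary))


def direct_repair(buggy_code: str, call_llm: Callable[[str], str]) -> str:
    """Baseline: ask for the fix in one shot, with no intermediate summary."""
    return call_llm(DIRECT_REPAIR_PROMPT.format(code=buggy_code))


def is_repair_successful(candidate: str, run_unit_tests: Callable[[str], bool]) -> bool:
    """A repair counts as successful only if all benchmark unit tests pass."""
    return run_unit_tests(candidate)
```

Framed this way, the only difference from the baseline is the extra call that produces the summary, which is why the technique requires no retraining or access beyond ordinary prompting.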

Results & Findings

Model (sample)    Direct Repair Success    Summary‑Mediated (Error‑Aware)
GPT‑4             48 %                     53 % (+5 pp)
Claude‑2          42 %                     46 % (+4 pp)
Llama‑2‑70B       31 %                     35 % (+4 pp)
  • Consistent uplift – Across all eight LLMs, the error‑aware diagnostic summary yields the largest improvement, typically 3–6 percentage points.
  • Upper bound – The best‑performing model (GPT‑4) repairs up to 65 % of the buggy functions when the summary step is used, compared to 60 % without it.
  • Model dependence – Smaller or less‑capable models see smaller gains, indicating that the benefit hinges on the model’s ability to reason about the summary.
  • Modest overall impact – While the gains are statistically significant, they are not transformative; the pipeline is a helpful “nudge” rather than a silver bullet.

Practical Implications

  • Human‑readable diagnostics – The generated summary can be shown to developers as a quick sanity check, surfacing likely intent mismatches before a repair is attempted.
  • Plug‑and‑play integration – Because the approach relies only on prompt engineering, it can be added to existing LLM‑based code assistants (e.g., GitHub Copilot, Tabnine) without retraining or model changes.
  • Cost‑effective debugging – Summaries are cheap to generate (a single additional model call) and can reduce the number of repair attempts needed, saving API credits in production pipelines.
  • Better CI/CD automation – In continuous‑integration scenarios, a summary step could flag suspicious functions early, allowing automated bots to request a repair only when the diagnostic indicates a concrete defect (see the sketch after this list).
  • Educational tooling – For learning platforms, showing a summary before a fix can teach novices how to think about high‑level intent versus low‑level implementation details.
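As one hedged illustration of the CI/CD idea above, a bot could run the diagnostic summary first and request a repair only when the summary signals a concrete defect. The `looks_defective` keyword heuristic and all function names below are hypothetical, not part of the paper.

```python
# Illustrative CI gate (assumed design, not from the paper): generate an
# error-aware summary first and attempt a repair only if it flags a defect.

from typing import Callable, Optional

DEFECT_MARKERS = ("mismatch", "bug", "incorrect", "does not", "off-by-one")


def looks_defective(summary: str) -> bool:
    """Crude keyword heuristic standing in for a real defect classifier."""
    lowered = summary.lower()
    return any(marker in lowered for marker in DEFECT_MARKERS)


def ci_review_function(
    code: str,
    summarize: Callable[[str], str],
    repair: Callable[[str, str], str],
) -> Optional[str]:
    """Return a proposed fix when the summary suggests a defect, else None."""
    summary = summarize(code)          # cheap diagnostic pass
    if not looks_defective(summary):
        return None                    # skip the repair call, saving API credits
    return repair(code, summary)       # summary-guided repair request
```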

Limitations & Future Work

  • Modest improvement ceiling – The approach adds only a few percentage points of repair success; it does not solve the fundamental issue of LLMs missing low‑level bugs.
  • Model‑sensitivity – Gains diminish on smaller or less‑capable models, suggesting the technique may not be universally applicable.
  • Prompt brittleness – The quality of the diagnostic summary depends heavily on prompt phrasing; more systematic prompt‑search or few‑shot examples could be explored.
  • Scalability to larger codebases – Experiments are limited to single‑function snippets; extending the pipeline to multi‑file or class‑level repairs remains an open challenge.
  • User studies – The paper does not evaluate how developers actually interact with the generated summaries; future work could measure real‑world productivity gains.

Bottom line: Summary‑mediated repair offers a low‑cost, human‑friendly layer that can nudge LLM‑based code assistants toward better fixes, but it should be viewed as a complementary diagnostic tool rather than a complete solution to program‑repair reliability.

Authors

  • Lukas Twist

Paper Information

  • arXiv ID: 2511.18782v1
  • Categories: cs.SE
  • Published: November 24, 2025