[Paper] Summary-Mediated Repair: Can LLMs use code summarisation as a tool for program repair?

Published: November 24, 2025 (12:33 AM EST)
4 min read

Source: arXiv - 2511.18782v1

Overview

Large Language Models (LLMs) can generate impressive code, yet they still overlook subtle bugs in their own output. This paper introduces summary‑mediated repair – a prompt‑only technique that asks the LLM to first write a natural‑language summary of the buggy function and then uses that summary to guide a repair step. The authors show that, across several production‑grade LLMs, this extra “thinking‑out‑loud” stage can modestly boost the success rate of automatic program repair.

Key Contributions

  • Novel pipeline: Proposes a two‑step “summarise‑then‑repair” prompt that treats code summarisation as an explicit diagnostic artifact.
  • Empirical evaluation: Benchmarks the approach on two widely used function‑level datasets (HumanEvalPack and MBPP) with eight different LLMs, ranging from open‑source to commercial models.
  • Summary style analysis: Compares several prompting styles (plain, error‑aware, intent‑focused) and finds that error‑aware diagnostic summaries consistently give the biggest lift.
  • Quantitative gains: Demonstrates repair rates of up to 65 % on previously unseen bugs, with an average improvement of roughly 5 percentage points over a direct‑repair baseline.
  • Practical insight: Shows that summaries are cheap, human‑readable diagnostics that can be dropped into existing repair workflows without extra training or model changes.

Methodology

  1. Dataset preparation – The authors start from two function‑level benchmark suites (HumanEvalPack and MBPP) that contain correct reference implementations and a set of injected bugs.
  2. Prompt design – For each buggy function they issue a first prompt asking the LLM to produce a natural‑language summary. Three summary styles are explored:
    • Plain: “Summarize what this function does.”
    • Intent‑focused: Emphasizes the high‑level goal of the code.
    • Error‑aware diagnostic: Explicitly asks the model to point out any mismatches between the code and the intended behavior.
  3. Repair step – The generated summary (plus the original buggy code) is fed back to the same LLM with a second prompt that asks for a corrected implementation.
  4. Baseline – A direct‑repair baseline skips the summarisation step and asks the LLM to fix the bug in one shot.
  5. Evaluation metrics – Repairs are judged against the unit tests associated with each benchmark; success is counted only when the repaired code passes all tests (a minimal sketch of this pipeline follows the list).
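The paper describes the pipeline purely in terms of prompts, so the following Python sketch is only an illustration of the two‑step loop under stated assumptions: the `call_llm` helper, the `run_unit_tests` check, and the exact prompt wording are placeholders, not the authors' implementation.

```python
# Minimal sketch of summary-mediated repair (illustrative; not the paper's code).
# `call_llm` stands in for any chat-completion API; `run_unit_tests` for the
# benchmark's test harness. Both are hypothetical placeholders.

from typing import Callable

# The three summary styles explored in the paper (wording paraphrased here).
SUMMARY_PROMPTS = {
    "plain": "Summarize what this function does:\n\n{code}",
    "intent": "Describe the high-level goal this function is trying to achieve:\n\n{code}",
    "error_aware": (
        "Summarize what this function does, and point out any places where the "
        "implementation seems to mismatch the intended behavior:\n\n{code}"
    ),
}

REPAIR_PROMPT = (
    "The following function is buggy.\n\n{code}\n\n"
    "A diagnostic summary of the function:\n{summary}\n\n"
    "Return a corrected implementation of the function."
)

DIRECT_REPAIR_PROMPT = "The following function is buggy. Fix it:\n\n{code}"


def summary_mediated_repair(
    buggy_code: str,
    call_llm: Callable[[str], str],
    style: str = "error_aware",
) -> str:
    """Two-step summarise-then-repair: summary first, then summary-guided fix."""
    summary = call_llm(SUMMARY_PROMPTS[style].format(code=buggy_code))
    return call_llm(REPAIR_PROMPT.format(code=buggy_code, summary=summary))


def direct_repair(buggy_code: str, call_llm: Callable[[str], str]) -> str:
    """Baseline: ask for the fix in one shot, with no intermediate summary."""
    return call_llm(DIRECT_REPAIR_PROMPT.format(code=buggy_code))


def is_repair_successful(candidate: str, run_unit_tests: Callable[[str], bool]) -> bool:
    """A repair counts as successful only if all benchmark unit tests pass."""
    return run_unit_tests(candidate)
```

Framed this way, the only difference from the baseline is the extra call that produces the summary, which is why the technique requires no retraining or access beyond ordinary prompting.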

Results & Findings

Model (sample)    Direct Repair Success    Summary‑Mediated (Error‑Aware)
GPT‑4             48 %                     53 % (+5 pp)
Claude‑2          42 %                     46 % (+4 pp)
Llama‑2‑70B       31 %                     35 % (+4 pp)
  • Consistent uplift – Across all eight LLMs, the error‑aware diagnostic summary yields the largest improvement, typically 3–6 percentage points.
  • Upper bound – The best‑performing model (GPT‑4) repairs up to 65 % of the buggy functions when the summary step is used, compared to 60 % without it.
  • Model dependence – Smaller or less‑capable models see smaller gains, indicating that the benefit hinges on the model’s ability to reason about the summary.
  • Modest overall impact – While the gains are statistically significant, they are not transformative; the pipeline is a helpful “nudge” rather than a silver bullet.

Practical Implications

  • Human‑readable diagnostics – The generated summary can be shown to developers as a quick sanity check, surfacing likely intent mismatches before a repair is attempted.
  • Plug‑and‑play integration – Because the approach relies only on prompt engineering, it can be added to existing LLM‑based code assistants (e.g., GitHub Copilot, Tabnine) without retraining or model changes.
  • Cost‑effective debugging – Summaries are cheap to generate (a single additional model call) and can reduce the number of repair attempts needed, saving API credits in production pipelines.
  • Better CI/CD automation – In continuous‑integration scenarios, a summary step could flag suspicious functions early, allowing automated bots to request a repair only when the diagnostic indicates a concrete defect (see the sketch after this list).
  • Educational tooling – For learning platforms, showing a summary before a fix can teach novices how to think about high‑level intent versus low‑level implementation details.
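As one hedged illustration of the CI/CD idea above, a bot could run the diagnostic summary first and request a repair only when the summary signals a concrete defect. The `looks_defective` keyword heuristic and all function names below are hypothetical, not part of the paper.

```python
# Illustrative CI gate (assumed design, not from the paper): generate an
# error-aware summary first and attempt a repair only if it flags a defect.

from typing import Callable, Optional

DEFECT_MARKERS = ("mismatch", "bug", "incorrect", "does not", "off-by-one")


def looks_defective(summary: str) -> bool:
    """Crude keyword heuristic standing in for a real defect classifier."""
    lowered = summary.lower()
    return any(marker in lowered for marker in DEFECT_MARKERS)


def ci_review_function(
    code: str,
    summarize: Callable[[str], str],
    repair: Callable[[str, str], str],
) -> Optional[str]:
    """Return a proposed fix when the summary suggests a defect, else None."""
    summary = summarize(code)          # cheap diagnostic pass
    if not looks_defective(summary):
        return None                    # skip the repair call, saving API credits
    return repair(code, summary)       # summary-guided repair request
```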

Limitations & Future Work

  • Modest improvement ceiling – The approach adds only a few percentage points of repair success; it does not solve the fundamental issue of LLMs missing low‑level bugs.
  • Model‑sensitivity – Gains diminish on smaller or less‑capable models, suggesting the technique may not be universally applicable.
  • Prompt brittleness – The quality of the diagnostic summary depends heavily on prompt phrasing; more systematic prompt‑search or few‑shot examples could be explored.
  • Scalability to larger codebases – Experiments are limited to single‑function snippets; extending the pipeline to multi‑file or class‑level repairs remains an open challenge.
  • User studies – The paper does not evaluate how developers actually interact with the generated summaries; future work could measure real‑world productivity gains.

Bottom line: Summary‑mediated repair offers a low‑cost, human‑friendly layer that can nudge LLM‑based code assistants toward better fixes, but it should be viewed as a complementary diagnostic tool rather than a complete solution to program‑repair reliability.

Authors

  • Lukas Twist

Paper Information

  • arXiv ID: 2511.18782v1
  • Categories: cs.SE
  • Published: November 24, 2025