[Paper] DynaFix: Iterative Automated Program Repair Driven by Execution-Level Dynamic Information

Published: December 31, 2025 at 12:13 AM EST
Source: arXiv - 2512.24635v1

Overview

The paper introduces DynaFix, a new automated program repair (APR) technique that feeds execution‑level runtime data back into large language models (LLMs) during the patch‑generation loop. By mimicking how developers debug—examining variable values, control‑flow paths, and call stacks after each failed attempt—DynaFix achieves a measurable boost in both repair success rate and efficiency on the widely‑used Defects4J benchmark.

Key Contributions

  • Iterative dynamic feedback loop: Captures fine‑grained runtime information after every patch attempt and injects it as structured prompts for the LLM.
  • Fine‑grained execution representation: Transforms variable states, control‑flow traces, and call‑stack snapshots into a prompt format that LLMs can reason over.
  • Empirical gains: Repairs 186 single‑function bugs (≈10 % improvement over the strongest baselines) and fixes 38 bugs that prior APR tools could not.
  • Search‑space reduction: Limits the number of repair attempts to ≤ 35 per bug and cuts the candidate‑patch space by ~70 % compared with existing iterative APR frameworks.

Methodology

  1. Initial Test Run – Execute the buggy program on its test suite; failing test cases trigger the first data collection.
  2. Dynamic Information Extraction – A lightweight instrumentation layer records:
    • Current values of all in‑scope variables
    • The exact control‑flow path taken (e.g., which branches were hit)
    • The call stack at the point of failure
  3. Prompt Construction – Serialize the collected data into a concise, human‑readable “debug report” appended to the LLM’s repair prompt (e.g., “The variable count was -1 at line 42; the program took the else branch of if (count > 0) …”).
  4. LLM Patch Generation – A code‑capable LLM (e.g., GPT‑4‑code) produces one or more candidate patches guided by the debug report.
  5. Validation & Iteration – Compile and re‑run the candidate patch against the test suite. If it still fails, repeat steps 2‑4 with new runtime data from the latest execution.
  6. Termination – Stop when a patch passes all tests or a pre‑defined attempt limit (35) is reached. (The loop in steps 2-5 is sketched below.)
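To make the loop concrete, here is a minimal Python sketch of steps 2-5. The paper targets Java and does not publish this code; `run_tests.sh`, `collect_trace`, `llm_generate_patch`, and `apply_patch` are hypothetical stand-ins for the test harness, instrumentation layer, LLM backend, and patch applier.

```python
import subprocess

MAX_ATTEMPTS = 35  # per-bug attempt budget reported in the paper

def run_tests(project_dir):
    """Run the project's test suite; return (passed, combined log)."""
    result = subprocess.run(["./run_tests.sh"], cwd=project_dir,
                            capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr

def build_debug_report(trace):
    """Serialize runtime data (assumed dict of variable values, branch path,
    and call stack at the failure point) into a readable prompt section."""
    lines = [f"Runtime observations at line {trace['line']}:"]
    for name, value in trace["variables"].items():
        lines.append(f"- variable {name} = {value!r}")
    lines.append(f"- control-flow path: {' -> '.join(trace['branches'])}")
    lines.append(f"- call stack: {' <- '.join(trace['call_stack'])}")
    return "\n".join(lines)

def repair(project_dir, buggy_source, collect_trace, llm_generate_patch, apply_patch):
    """Iterative repair: test, trace the failure, prompt the LLM, apply, repeat."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        passed, _log = run_tests(project_dir)
        if passed:
            return True                                 # patch accepted by the test suite
        trace = collect_trace(project_dir)              # step 2: dynamic information extraction
        report = build_debug_report(trace)              # step 3: prompt construction
        prompt = (f"This function still fails its tests (attempt {attempt}):\n"
                  f"{buggy_source}\n\n{report}\n"
                  "Propose a corrected version of the function.")
        patch = llm_generate_patch(prompt)              # step 4: LLM patch generation
        buggy_source = apply_patch(project_dir, patch)  # step 5: re-validated on the next pass
    return False                                        # attempt budget exhausted
```

Everything DynaFix-specific lives in `collect_trace` and `build_debug_report`; the rest is a generic test-patch loop.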

The approach is model‑agnostic: any LLM that can understand the structured prompt can be swapped in, making DynaFix a plug‑and‑play layer on top of existing LLM‑based APR pipelines.
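In practice, "plug‑and‑play" likely amounts to a single callable contract for the backend. A minimal sketch, assuming nothing beyond the standard library (`PatchGenerator` and `echo_backend` are illustrative names, not from the paper):

```python
from typing import Protocol

class PatchGenerator(Protocol):
    """Anything that maps a repair prompt to candidate patch source text."""
    def __call__(self, prompt: str) -> str: ...

def echo_backend(prompt: str) -> str:
    """Stand-in backend for exercising the pipeline without any LLM calls."""
    return "// no-op candidate patch\n"

# A hosted API wrapper or a locally served code model satisfies the same
# contract, so either can be passed straight into the repair loop above:
#   repair(project_dir, source, collect_trace, echo_backend, apply_patch)
```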

Results & Findings

| Metric | DynaFix | Best Prior LLM‑APR |
|---|---|---|
| Bugs repaired (Defects4J v1.2 + v2.0) | 186 | 169 |
| New bugs repaired (not fixed by any baseline) | 38 | 0 |
| Avg. attempts per bug (successful cases) | ≤ 35 | 55-80 |
| Search‑space reduction | ~70 % | n/a |
| Runtime overhead (instrumentation + prompt generation) | < 2 s per iteration (negligible vs. LLM inference) | n/a |

The authors report that the dynamic prompts dramatically improve the LLM’s “understanding” of why a patch failed, leading to more targeted edits rather than blind trial‑and‑error. Even for complex bugs requiring multiple code changes, DynaFix converges within a handful of iterations.

Practical Implications

  • Faster CI/CD fixes – Integrating DynaFix into a continuous‑integration pipeline could automatically generate high‑quality patches after a failing build, reducing mean‑time‑to‑repair.
  • Better debugging assistants – IDE plugins can expose the same execution‑level prompts to developers, turning LLM suggestions into interactive, step‑wise debugging hints.
  • Lower cost of LLM inference – By pruning the search space early, fewer LLM calls are needed, translating into tangible cost savings for cloud‑based inference services.
  • Language‑agnostic extension – While evaluated on Java, the instrumentation concept works for any language with a runtime tracer (e.g., Python’s sys.settrace, sketched below, or .NET profilers), opening the door to cross‑language APR tools.
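For instance, a few lines of sys.settrace are enough to capture the kind of variable‑level snapshots DynaFix feeds back to the model. This is an illustrative sketch, not the paper’s (Java‑based) instrumentation; buggy_average is a made-up example.

```python
import sys

def make_tracer(snapshots):
    """Record the local variables of every executed line of traced user code."""
    def tracer(frame, event, arg):
        if event == "line":
            snapshots.append({
                "function": frame.f_code.co_name,
                "line": frame.f_lineno,
                "locals": dict(frame.f_locals),
            })
        return tracer
    return tracer

def buggy_average(values):
    total = 0
    for v in values:
        total += v
    return total / (len(values) - 1)   # bug: should divide by len(values)

snapshots = []
sys.settrace(make_tracer(snapshots))
try:
    buggy_average([2, 4, 6])
finally:
    sys.settrace(None)

# The final snapshot exposes total == 12 just before the faulty division,
# which is exactly the evidence a debug-report prompt would surface.
print(snapshots[-1])
```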

Limitations & Future Work

  • Instrumentation overhead may be non‑trivial for large, performance‑critical applications; the authors suggest selective tracing as a mitigation (one way to approximate this is sketched after this list).
  • The current evaluation focuses on single‑function bugs; scaling to multi‑module or system‑wide defects remains an open challenge.
  • DynaFix relies on a test suite that encodes the expected behavior as its correctness oracle; future work could explore weaker oracles (e.g., metamorphic relations) to broaden applicability.
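On the first limitation, selective tracing could be as simple as refusing to descend into frames the failing test does not implicate. A sketch building on the tracer above, where SUSPECT_FUNCTIONS is an illustrative name rather than anything from the paper:

```python
SUSPECT_FUNCTIONS = {"buggy_average"}   # e.g., functions on the failing stack trace

def make_selective_tracer(snapshots):
    """Like make_tracer, but skips line-level tracing outside the suspect set,
    keeping instrumentation overhead low on large codebases."""
    def tracer(frame, event, arg):
        if frame.f_code.co_name not in SUSPECT_FUNCTIONS:
            return None                 # do not trace this frame at all
        if event == "line":
            snapshots.append({
                "function": frame.f_code.co_name,
                "line": frame.f_lineno,
                "locals": dict(frame.f_locals),
            })
        return tracer
    return tracer
```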

Authors

  • Zhili Huang
  • Ling Xu
  • Chao Liu
  • Weifeng Sun
  • Xu Zhang
  • Yan Lei
  • Meng Yan
  • Hongyu Zhang

Paper Information

  • arXiv ID: 2512.24635v1
  • Categories: cs.SE, cs.AI
  • Published: December 31, 2025