[Paper] Well Begun is Half Done: Location-Aware and Trace-Guided Iterative Automated Vulnerability Repair
Source: arXiv - 2512.20203v1
Overview
This paper presents \sysname, an automated vulnerability-repair system that builds on large language models (LLMs) but goes a step further: it tells the model where to patch first and rates the quality of each generated patch during an iterative repair loop. By combining location awareness with a lightweight quality-assessment metric, the authors achieve markedly higher success rates on real-world C/C++ vulnerabilities than prior neural-machine-translation (NMT), program-analysis, or LLM-only approaches.
Key Contributions
- Location‑aware patch guidance – a lightweight analysis that ranks code locations needing repair, steering the LLM to edit the most promising spots first.
- Trace‑guided iterative repair – an automated loop that scores every candidate patch that still fails the tests along two dimensions (new-vulnerability introduction and taint-statement coverage) and carries the best-scoring candidate into the next iteration.
- Two‑dimensional patch quality metric – combines security safety (no new bugs) with taint coverage to approximate how “complete” a fix is without full manual review.
- Empirical validation on VulnLoc+ – a curated dataset of 40 real C/C++ vulnerabilities with their Proof‑of‑Vulnerability (PoV) exploits; \sysname produces 27 plausible patches and repairs 8–13 more bugs than the strongest baselines.
- Open‑source prototype – the authors release the implementation and the extended dataset, enabling reproducibility and further research.
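The location-ranking contribution can be illustrated with a small sketch: statements that sit closer to the exploit sink on the taint-propagation graph get a higher repair priority. The graph representation, the BFS distance, and the `1/(1+d)` scoring are illustrative assumptions, not the paper's actual analysis.

```python
from collections import deque

def rank_repair_locations(taint_edges, sink):
    """Rank statements by how directly they feed the taint flow
    that reaches the exploit sink (hypothetical scoring scheme)."""
    # Reverse the taint-propagation edges so we can walk back from the sink.
    reverse = {}
    for src, dst in taint_edges:
        reverse.setdefault(dst, []).append(src)

    # BFS backwards from the sink: distance = number of taint hops.
    dist = {sink: 0}
    queue = deque([sink])
    while queue:
        node = queue.popleft()
        for pred in reverse.get(node, []):
            if pred not in dist:
                dist[pred] = dist[node] + 1
                queue.append(pred)

    # Closer to the sink -> higher repair priority.
    scores = {stmt: 1.0 / (1 + d) for stmt, d in dist.items()}
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Toy taint chain: untrusted read flows through a length parse into memcpy.
edges = [("read_input", "parse_len"), ("parse_len", "memcpy_call")]
ranking = rank_repair_locations(edges, "memcpy_call")
# The sink itself ranks first, then its immediate taint predecessors.
```

Only statements reachable backwards from the sink receive a score, which matches the paper's intuition that edits should target the taint flow that actually leads to the exploit.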
Methodology
- Pre‑processing & Location Ranking
- Static analysis extracts taint‑propagation graphs from the vulnerable program and its PoV.
- Each statement receives a repair priority score based on how directly it participates in the taint flow that leads to the exploit.
- LLM Prompt Construction
- The top‑ranked location(s) are embedded into the prompt as explicit “edit‑here” markers, while the rest of the source code is provided for context.
- A state‑of‑the‑art code‑oriented LLM (e.g., GPT‑4‑code) generates a candidate patch.
- Iterative Evaluation Loop
- The candidate is compiled and run against the test suite (including the PoV).
- Two quality signals are computed:
- Safety – does the patch introduce any new compiler warnings, undefined‑behavior warnings, or new failing tests?
- Taint Coverage – what fraction of the original taint paths are now blocked?
- The patch with the highest combined score becomes the “seed” for the next iteration; the process repeats until the test suite passes or a max‑iteration budget is hit.
- Final Selection
- The best‑scoring patch that makes all tests pass is reported as the plausible fix; manual verification determines whether it is correct (i.e., truly removes the vulnerability without side effects).
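The evaluation loop above can be sketched as follows. The candidate representation, the `generate`/`evaluate` interfaces, and the way the two quality signals are combined (safety as a hard gate, then the taint-coverage fraction) are assumptions for illustration; the paper's exact scoring may differ.

```python
def patch_score(introduces_new_issue, blocked_paths, total_paths):
    """Two-dimensional quality signal: safety gate plus taint coverage."""
    if introduces_new_issue:              # new warnings or newly failing tests
        return 0.0
    return blocked_paths / total_paths    # fraction of taint paths now blocked

def iterate_repair(seed, generate, evaluate, max_iters=5):
    """Keep the best-scoring candidate as the seed for the next round
    until a candidate passes all tests or the iteration budget is hit."""
    best = seed
    for _ in range(max_iters):
        scored = []
        for cand in generate(best):
            passes, new_issue, blocked, total = evaluate(cand)
            if passes:
                return cand               # plausible patch: report it
            scored.append((patch_score(new_issue, blocked, total), cand))
        best = max(scored, key=lambda sc: sc[0])[1]
    return None

# Toy demo: candidates are strings; "p12" is the one that passes the suite.
def generate(seed):
    return [seed + "1", seed + "2"]

def evaluate(cand):
    if cand == "p12":
        return True, False, 3, 3          # passes all tests, including the PoV
    return False, cand.endswith("11"), 1, 3

result = iterate_repair("p", generate, evaluate)
```

In the toy run the first round's survivors tie on taint coverage, the best one seeds round two, and the unsafe candidate (`"p11"`) is zeroed out by the safety gate before `"p12"` passes.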
Results & Findings
| Metric | \sysname | NMT‑based | Program‑analysis | Prior LLM |
|---|---|---|---|---|
| Plausible patches (out of 40) | 27 | 15–19 | 12–16 | 14–18 |
| Correct patches (fully fixed) | 13 | 5–7 | 4–6 | 6–8 |
| Avg. iterations per bug | 3.2 | 5.6 | 7.1 | 5.9 |
| New‑vulnerability introductions | < 2 % | 8 % | 12 % | 9 % |
Key takeaways
- Location awareness cuts down the number of wasted LLM generations, leading to fewer iterations and higher success.
- The two‑dimensional quality score reliably filters out patches that would otherwise pass the test suite but re‑introduce security issues.
- Even with a modest dataset (40 bugs), the gains are statistically significant, suggesting the approach may generalize to larger corpora.
Practical Implications
- Developer tooling – IDE plugins could embed \sysname’s location‑ranking engine to suggest where a developer should focus when fixing a reported CVE, reducing guesswork.
- CI/CD pipelines – an automated “repair‑as‑you‑test” stage could attempt a quick fix for newly discovered static‑analysis warnings, automatically submitting a pull request if a high‑quality patch is found.
- Bug‑bounty platforms – the system can generate plausible patches for disclosed PoVs, accelerating the verification loop between researchers and vendors.
- Security‑oriented code review – the taint‑coverage metric offers a lightweight sanity check that can be added to existing static‑analysis suites without heavy formal verification.
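A sanity check of the kind described above might compute taint coverage simply as the fraction of the originally exploitable taint paths a patch blocks. This helper and its set-of-path-identifiers interface are hypothetical, not the paper's code.

```python
def taint_coverage(original_paths, still_tainted):
    """Fraction of the original taint paths that the patch blocks.

    Both arguments are sets of taint-path identifiers: the paths found
    in the unpatched program, and those still live after patching.
    """
    if not original_paths:
        return 1.0                         # nothing to block
    blocked = original_paths - still_tainted
    return len(blocked) / len(original_paths)

# Example: the patch blocks two of the three original taint paths.
cov = taint_coverage({"p1", "p2", "p3"}, {"p3"})
```

Because it only needs the before/after path sets from an existing taint analysis, such a check can bolt onto a static-analysis suite without any formal-verification machinery.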
Overall, \sysname demonstrates that guiding LLMs with domain‑specific signals (location + quality feedback) turns a “blind” generation process into a focused, semi‑automated debugging assistant, a pattern that can be replicated for other bug classes (e.g., memory leaks, concurrency bugs).
Limitations & Future Work
- Dataset size – evaluation is limited to 40 C/C++ vulnerabilities; broader benchmarks (e.g., Juliet, Defects4J) are needed to confirm generality.
- Language dependence – the current implementation relies on taint analysis specific to C/C++; adapting the location‑ranking to managed languages (Java, Python) will require different static analyses.
- LLM cost – iterative prompting can be expensive; future work could explore caching, few‑shot fine‑tuning, or smaller specialist models to reduce inference overhead.
- Patch correctness verification – the study still depends on manual validation for “correct” patches; integrating formal verification or symbolic execution could automate this step.
By addressing these points, the community can move toward fully autonomous, production‑grade vulnerability repair pipelines.
Authors
- Zhenlei Ye
- Xiaobing Sun
- Sicong Cao
- Lili Bo
- Bin Li
Paper Information
- arXiv ID: 2512.20203v1
- Categories: cs.SE
- Published: December 23, 2025