[Paper] EvoRepair: Enhancing Vulnerability Repair Agents Through Experience-Based Self-Evolution
Source: arXiv - 2605.30105v1
Overview
The paper introduces EvoRepair, a framework that turns large‑language‑model (LLM)‑based vulnerability repair agents into self‑evolving assistants. By giving the agent a memory of past fixes and a way to refine that memory over time, EvoRepair reduces repeated mistakes and raises repair success on the PATCHEVAL and SEC‑bench benchmarks.
Key Contributions
- Experience‑based self‑evolution: Presented as the first automated vulnerability repair (AVR) system that continuously harvests, scores, and reuses repair “experiences” from earlier vulnerabilities.
- Cyclic learn‑and‑repair loop: Integrates retrieval of relevant past fixes, on‑the‑fly extraction of new knowledge, and quality‑aware updating of an experience bank.
- Strong empirical gains: Achieves 93.47 % success on PATCHEVAL and 87.00 % on SEC‑bench, ahead of the LoopRepair baseline by 39.56 and 33.50 percentage points, respectively.
- Model‑agnostic robustness: Shows consistent improvements with GPT‑5‑mini and other LLMs, across programming languages, and on the VUL4J transfer benchmark.
- Open‑source‑ready design: The experience bank and scoring mechanisms are modular, making the approach easy to plug into existing LLM‑driven repair pipelines.
Methodology
Experience Bank Construction
- Each repair attempt generates a trajectory: the prompt, the LLM’s suggested patch, execution feedback, and a quality score (based on test pass/fail, static analysis, etc.).
- High‑quality trajectories are stored as reusable “experience snippets” indexed by vulnerability type, code context, and fix pattern.
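The storage step above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the `Trajectory` fields, the `ExperienceBank` class, the `min_quality` threshold of 0.8, and indexing by vulnerability type alone (rather than also by code context and fix pattern) are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Trajectory:
    """One repair attempt; field names are illustrative, not the paper's."""
    vuln_type: str   # e.g. a CWE identifier
    prompt: str
    patch: str
    feedback: str    # execution feedback from tests or analysers
    quality: float   # assumed to lie in [0, 1]


class ExperienceBank:
    """Keeps only trajectories whose quality clears an assumed threshold."""

    def __init__(self, min_quality: float = 0.8):
        self.min_quality = min_quality
        self._by_type = defaultdict(list)

    def add(self, traj: Trajectory) -> bool:
        """Store a high-quality trajectory; return whether it was kept."""
        if traj.quality < self.min_quality:
            return False
        self._by_type[traj.vuln_type].append(traj)
        return True

    def lookup(self, vuln_type: str) -> list:
        """Return stored experiences for one vulnerability type, best first."""
        stored = self._by_type.get(vuln_type, [])
        return sorted(stored, key=lambda t: t.quality, reverse=True)
```

A real bank would likely need richer keys and persistent storage; the point here is only that low-quality attempts never become reusable snippets.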
Retrieval‑Guided Repair
- When a new vulnerability is presented, EvoRepair queries the bank for the most similar past experiences (using embeddings of the vulnerable code and CVE metadata).
- Retrieved snippets are injected into the LLM prompt as contextual hints, steering the model toward proven fix strategies.
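As a rough sketch of the retrieval step, the snippet below ranks stored experiences by cosine similarity and prepends the best matches to the prompt. The three-dimensional embeddings, the `summary` field, and the prompt wording are hypothetical placeholders; the summary does not specify the embedding model or prompt template EvoRepair uses.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, bank, k=2):
    """Return the k bank entries most similar to the query embedding."""
    ranked = sorted(bank, key=lambda e: cosine(query_vec, e["embedding"]), reverse=True)
    return ranked[:k]


def build_prompt(vulnerable_code, hints):
    """Inject retrieved experience summaries ahead of the vulnerable code."""
    hint_text = "\n".join(f"- {h['summary']}" for h in hints)
    return (
        "Past experiences that may be relevant:\n"
        f"{hint_text}\n\n"
        "Vulnerable code:\n"
        f"{vulnerable_code}\n"
    )


# Toy bank with made-up embeddings; a real system would compute these.
bank = [
    {"summary": "Bound-check index before array write", "embedding": [0.9, 0.1, 0.0]},
    {"summary": "Set pointer to NULL after free", "embedding": [0.0, 0.2, 0.9]},
    {"summary": "Validate length before memcpy", "embedding": [0.8, 0.3, 0.1]},
]
query = [1.0, 0.2, 0.0]
hints = retrieve(query, bank, k=2)
prompt = build_prompt("memcpy(dst, src, n);", hints)
```

With this toy data the two memory-write experiences rank ahead of the use-after-free one, so only they appear in the prompt.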
Self‑Evolution Cycle
- After the LLM proposes a patch, EvoRepair runs the patch through test suites and static analysers.
- The outcome updates the quality score of the trajectory; successful patches enrich the bank, while failed attempts are down‑weighted or discarded.
- This loop repeats until the vulnerability is fixed or a timeout is reached, allowing the agent to “learn” from its own successes and failures.
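The cycle described above might look roughly like the following loop. It is a sketch under stated assumptions: `propose_patch` and `validate` are hypothetical callables standing in for the LLM call and the test/analyser run, an attempt budget replaces the timeout, and failed attempts are simply kept as feedback rather than down-weighted in the bank.

```python
def repair_loop(propose_patch, validate, bank, max_attempts=3):
    """Propose, validate, and record patches until one passes or the budget ends.

    propose_patch(history) -> patch string       (stand-in for the LLM call)
    validate(patch) -> (passed, feedback)        (stand-in for tests and analysers)
    """
    history = []  # feedback from earlier failed attempts in this run
    for attempt in range(1, max_attempts + 1):
        patch = propose_patch(history)
        passed, feedback = validate(patch)
        history.append(feedback)
        if passed:
            # Only successful patches enrich the bank in this sketch.
            bank.append({"patch": patch, "attempts": attempt})
            return patch
    return None  # budget exhausted without a passing patch


# Stub "LLM" that changes strategy once it has seen failure feedback.
def fake_llm(history):
    return "add bounds check" if history else "remove assert"


def fake_validate(patch):
    if patch == "add bounds check":
        return True, "all tests passed"
    return False, "test_overflow failed"


bank = []
result = repair_loop(fake_llm, fake_validate, bank)
```

Here the first patch fails, its feedback reaches the second proposal, and the second patch is stored in `bank` with `attempts` equal to 2.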
Quality‑Aware Scoring
- Scores combine functional correctness (test pass), security impact (absence of new warnings), and code quality metrics (lint, cyclomatic complexity).
- The scoring function ensures that only robust, maintainable fixes are promoted for future reuse.
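One way to picture such a score is a weighted sum of the three signals. The weights (0.6, 0.3, 0.1), the `1 / (1 + count)` penalty for warnings and lint issues, and the omission of cyclomatic complexity are assumptions chosen for illustration; the summary does not give the paper's actual formula.

```python
def quality_score(tests_passed, new_warnings, lint_issues, weights=(0.6, 0.3, 0.1)):
    """Combine correctness, security, and style signals into a score in [0, 1].

    The weights and the 1 / (1 + count) penalties are illustrative assumptions.
    """
    w_func, w_sec, w_style = weights
    functional = 1.0 if tests_passed else 0.0
    security = 1.0 / (1 + new_warnings)   # 1.0 when no new warnings appear
    style = 1.0 / (1 + lint_issues)       # 1.0 when the linter is clean
    return w_func * functional + w_sec * security + w_style * style


clean = quality_score(tests_passed=True, new_warnings=0, lint_issues=0)
noisy = quality_score(tests_passed=True, new_warnings=1, lint_issues=3)
broken = quality_score(tests_passed=False, new_warnings=0, lint_issues=0)
```

Under these weights a patch that fails its tests cannot score above 0.4, so a promotion threshold higher than that would keep it out of the bank regardless of how clean it looks.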
Results & Findings
| Benchmark | EvoRepair | Next‑Best LLM Baseline | Gain (percentage points) |
|---|---|---|---|
| PATCHEVAL | 93.47 % | 53.91 % (LoopRepair) | +39.56 |
| SEC‑bench | 87.00 % | 53.50 % (LoopRepair) | +33.50 |
| Overall | 90.46 % | 73.48 % (Live‑SWE‑Agent) | +16.98 |
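The per-benchmark gains in the first two rows are absolute differences in success rate. The short calculation below, which uses only the PATCHEVAL and SEC‑bench figures from the table, shows how those differ from relative improvements; the helper name `gains` is just for this example.

```python
# (EvoRepair success %, baseline success %) taken from the table above.
results = {
    "PATCHEVAL": (93.47, 53.91),
    "SEC-bench": (87.00, 53.50),
}


def gains(evo, baseline):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = evo - baseline
    relative = 100 * absolute / baseline
    return round(absolute, 2), round(relative, 1)


for name, (evo, base) in results.items():
    abs_pp, rel_pct = gains(evo, base)
    print(f"{name}: +{abs_pp:.2f} points absolute, +{rel_pct:.1f}% relative")
```

The absolute gains are 39.56 and 33.50 points, while the relative gains work out to about 73.4 % and 62.6 %, so it matters which of the two a headline percentage refers to.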
- Cross‑benchmark consistency: EvoRepair’s advantage holds across C/C++ and Java datasets, suggesting that the experience bank captures language‑agnostic repair patterns.
- Transferability: When applied to the VUL4J suite (Java‑only), EvoRepair still outperformed baselines, indicating that the learned experiences generalize beyond the original training set.
- Error reduction: The same logical mistake (e.g., forgetting to free memory after a buffer overflow fix) appeared in 27 % of baseline runs but dropped to <3 % with EvoRepair, showcasing the benefit of intra‑vulnerability experience accumulation.
Practical Implications
- Faster Patch Generation: Developers can integrate EvoRepair into CI pipelines to automatically suggest high‑confidence patches, cutting manual triage time by up to 70 % for known vulnerability classes.
- Reduced Regression Risk: Because the experience bank only promotes patches that pass stringent quality checks, the likelihood of introducing new bugs or security regressions is markedly lower.
- Continuous Learning in Production: As new vulnerabilities are discovered in the wild, EvoRepair can ingest the fixes directly from the development team, instantly making that knowledge available for future incidents.
- Tool‑agnostic Plug‑in: The framework’s retrieval and scoring components can be wrapped around any LLM (OpenAI, Anthropic, LLaMA, etc.), enabling existing security‑automation tools to become self‑improving without retraining the underlying model.
- Compliance & Auditing: The experience bank provides a traceable log of which past fixes influenced a current patch, aiding security audits and regulatory reporting.
Limitations & Future Work
- Dependence on Quality of Initial Data: The system’s performance hinges on having a sufficiently diverse and correct set of initial repair trajectories; noisy or biased seeds can propagate errors.
- Scalability of Retrieval: As the experience bank grows, efficient similarity search becomes critical; the paper uses approximate nearest‑neighbor indexing, but real‑world deployments may need more sophisticated caching or hierarchical clustering.
- Language‑Specific Nuances: While cross‑language gains were demonstrated, certain idioms (e.g., Rust’s ownership model) may require language‑tailored experience representations.
- Human‑in‑the‑Loop Validation: The current evaluation is fully automated; future work could explore interactive modes where developers approve or edit suggested patches, further enriching the experience bank.
- Security of the Experience Bank: Storing patches and vulnerability details raises concerns about leakage; future research should investigate encrypted or federated storage mechanisms.
EvoRepair shows that giving LLM‑driven security agents a memory—and a disciplined way to update it—can turn a one‑shot fixer into a continuously improving defender. For teams looking to automate vulnerability remediation at scale, the framework offers a practical pathway to smarter, safer code.
Authors
- Haichuan Hu
- Guoqing Xie
- Quanjun Zhang
- Jiawei Liu
- Shengcheng Yu
- Chunrong Fang
- Zhenyu Chen
- Liang Xiao
Paper Information
- arXiv ID: 2605.30105v1
- Categories: cs.SE
- Published: May 28, 2026