[Paper] EvoRepair: Enhancing Vulnerability Repair Agents Through Experience-Based Self-Evolution

Published: May 28, 2026 at 11:46 AM EDT

Source: arXiv - 2605.30105v1

Overview

The paper introduces EvoRepair, a framework that turns large‑language‑model (LLM)‑based vulnerability repair agents into self‑evolving assistants. By giving the agent a memory of past fixes and a way to refine that memory over time, EvoRepair reduces repeated mistakes and raises repair success on the PATCHEVAL and SEC‑bench benchmarks.

Key Contributions

Experience‑based self‑evolution: Presented as the first automated vulnerability repair (AVR) system that continuously harvests, scores, and reuses repair “experiences” from earlier repair attempts.
Cyclic learn‑and‑repair loop: Integrates retrieval of relevant past fixes, on‑the‑fly extraction of new knowledge, and quality‑aware updating of an experience bank.
Strong empirical gains: Achieves 93.47 % success on PATCHEVAL and 87.00 % on SEC‑bench, outperforming the LoopRepair baseline by 39.56 and 33.50 percentage points, respectively.
Model‑agnostic robustness: Demonstrates consistent improvements with GPT‑5‑mini, other LLMs, different programming languages, and the VUL4J transfer benchmark.
Open‑source‑ready design: The experience bank and scoring mechanisms are modular, making the approach easy to plug into existing LLM‑driven repair pipelines.

Methodology

Experience Bank Construction
- Each repair attempt generates a trajectory: the prompt, the LLM’s suggested patch, execution feedback, and a quality score (based on test pass/fail, static analysis, etc.).
- High‑quality trajectories are stored as reusable “experience snippets” indexed by vulnerability type, code context, and fix pattern.
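The two steps above can be pictured as a small data structure. This is a minimal sketch, not the paper's implementation: the class names, the field layout, and the `min_quality` admission threshold of 0.8 are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    """One repair attempt: prompt, proposed patch, feedback, and a quality score."""
    prompt: str
    patch: str
    feedback: str
    quality: float  # assumed to lie in [0, 1]


@dataclass
class Experience:
    """A reusable snippet distilled from a high-quality trajectory."""
    vuln_type: str     # e.g. a CWE identifier
    code_context: str
    fix_pattern: str
    quality: float


@dataclass
class ExperienceBank:
    min_quality: float = 0.8  # hypothetical admission threshold
    by_type: dict = field(default_factory=lambda: defaultdict(list))

    def add(self, traj, vuln_type, code_context, fix_pattern):
        """Store the trajectory as an experience only if it scores high enough."""
        if traj.quality < self.min_quality:
            return False
        self.by_type[vuln_type].append(
            Experience(vuln_type, code_context, fix_pattern, traj.quality)
        )
        return True

    def lookup(self, vuln_type):
        """Return stored experiences for a vulnerability type, best first."""
        found = self.by_type.get(vuln_type, [])
        return sorted(found, key=lambda e: e.quality, reverse=True)


bank = ExperienceBank()
good = Trajectory("fix overflow", "add bounds check", "all tests pass", 0.95)
bad = Trajectory("fix overflow", "remove the call", "2 tests fail", 0.30)
bank.add(good, "CWE-787", "memcpy into fixed buffer", "bounds check before copy")
bank.add(bad, "CWE-787", "memcpy into fixed buffer", "delete call site")
```

Indexing here is by vulnerability type only; indexing by code context and fix pattern, as the text describes, would need additional keys or an embedding index.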
Retrieval‑Guided Repair
- When a new vulnerability is presented, EvoRepair queries the bank for the most similar past experiences (using embeddings of the vulnerable code and CVE metadata).
- Retrieved snippets are injected into the LLM prompt as contextual hints, steering the model toward proven fix strategies.
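A rough sketch of the retrieve-then-prompt step follows. It assumes embeddings are already available as plain lists of floats and ranks them by cosine similarity; the embedding model, the prompt wording, and the function names are illustrative assumptions, not details taken from the paper.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, bank, k=2):
    """bank is a list of (embedding, snippet_text); return the k most similar snippets."""
    ranked = sorted(bank, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return [text for _, text in ranked[:k]]


def build_prompt(vulnerable_code, hints):
    """Place retrieved snippets ahead of the vulnerable code as contextual hints."""
    hint_block = "\n".join(f"- {h}" for h in hints)
    return (
        "Past fixes that may be relevant:\n"
        f"{hint_block}\n\n"
        "Vulnerable code:\n"
        f"{vulnerable_code}\n\n"
        "Propose a patch."
    )


# Toy two-dimensional embeddings, chosen by hand for the example.
toy_bank = [
    ([1.0, 0.0], "check length before memcpy"),
    ([0.0, 1.0], "escape SQL parameters"),
    ([0.9, 0.1], "validate index bounds"),
]
hints = retrieve([1.0, 0.05], toy_bank, k=2)
prompt = build_prompt("memcpy(dst, src, n);", hints)
```

A linear scan like this is fine for a small bank; a larger one would call for an approximate nearest-neighbor index.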
Self‑Evolution Cycle
- After the LLM proposes a patch, EvoRepair runs the patch through test suites and static analysers.
- The outcome updates the quality score of the trajectory; successful patches enrich the bank, while failed attempts are down‑weighted or discarded.
- This loop repeats until the vulnerability is fixed or a timeout is reached, allowing the agent to “learn” from its own successes and failures.
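The cycle can be sketched as a bounded loop. In this illustration the LLM call and the validation step are passed in as plain functions, an attempt budget stands in for the timeout, and failed attempts are kept at a low weight rather than discarded; all of these are simplifying assumptions.

```python
def repair_loop(vulnerability, propose_patch, validate, bank, max_attempts=3):
    """Try patches until one validates or the attempt budget runs out.

    propose_patch(vulnerability, hints) -> patch text (stand-in for the LLM call)
    validate(patch) -> (passed, feedback) (stand-in for tests and static analysis)
    bank is a list of dicts and is updated in place.
    """
    hints = [entry["lesson"] for entry in bank if entry["weight"] > 0.5]
    for attempt in range(1, max_attempts + 1):
        patch = propose_patch(vulnerability, hints)
        passed, feedback = validate(patch)
        if passed:
            # Successful patches enrich the bank at full weight.
            bank.append({"lesson": f"worked: {patch}", "weight": 1.0})
            return patch, attempt
        # Failed attempts are down-weighted and fed back as things to avoid.
        bank.append({"lesson": f"failed ({feedback}): {patch}", "weight": 0.1})
        hints.append(f"avoid: {patch} ({feedback})")
    return None, max_attempts


# Toy stand-ins: the "model" only proposes a real fix after seeing a failure.
def toy_propose(vulnerability, hints):
    if any(h.startswith("avoid") for h in hints):
        return "add bounds check"
    return "no-op"


def toy_validate(patch):
    ok = patch == "add bounds check"
    return ok, "tests pass" if ok else "overflow test still fails"


toy_bank = []
patch, attempts = repair_loop("CVE placeholder", toy_propose, toy_validate, toy_bank)
```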
Quality‑Aware Scoring
- Scores combine functional correctness (test pass), security impact (absence of new warnings), and code quality metrics (lint, cyclomatic complexity).
- The scoring function ensures that only robust, maintainable fixes are promoted for future reuse.
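One way to combine the three signals is a weighted sum, sketched below. The weights (0.6, 0.3, 0.1), the complexity ceiling of 10, and the way lint issues are discounted are hypothetical choices for illustration; the summary does not state the paper's actual formula.

```python
from dataclasses import dataclass


@dataclass
class Checks:
    tests_passed: int
    tests_total: int
    new_warnings: int           # static-analysis warnings introduced by the patch
    lint_issues: int
    cyclomatic_complexity: int


def quality_score(c, w_func=0.6, w_sec=0.3, w_quality=0.1):
    """Weighted sum of functional, security, and code-quality signals, each in [0, 1]."""
    functional = c.tests_passed / c.tests_total if c.tests_total else 0.0
    security = 1.0 if c.new_warnings == 0 else 0.0
    lint = 1.0 / (1.0 + c.lint_issues)
    complexity = min(1.0, 10.0 / max(c.cyclomatic_complexity, 1))
    code_quality = 0.5 * lint + 0.5 * complexity
    return w_func * functional + w_sec * security + w_quality * code_quality


clean = quality_score(Checks(10, 10, 0, 0, 5))
risky = quality_score(Checks(10, 10, 2, 3, 25))
```

With these weights, a patch that passes every test but introduces new warnings scores well below a clean one, so a promotion threshold can keep it out of the bank.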

Results & Findings

Benchmark	EvoRepair	Strongest LLM Baseline	Gain (percentage points)
PATCHEVAL	93.47 %	53.91 % (LoopRepair)	+39.56
SEC‑bench	87.00 %	53.50 % (LoopRepair)	+33.50
Overall	90.46 %	73.48 % (Live‑SWE‑Agent)	+16.98
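Each gain is a difference between two success rates, so it is best read as percentage points rather than as a relative improvement. The short calculation below reproduces the per-benchmark gains from the reported rates and also shows the corresponding relative gain; the helper name `gains` is introduced here for illustration.

```python
def gains(ours, baseline):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    points = ours - baseline
    relative = 100.0 * points / baseline
    return round(points, 2), round(relative, 2)


# Success rates (in %) as reported for EvoRepair and LoopRepair.
results = {"PATCHEVAL": (93.47, 53.91), "SEC-bench": (87.00, 53.50)}
for name, (ours, base) in results.items():
    points, relative = gains(ours, base)
    print(f"{name}: +{points} points, +{relative} % relative")
```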

Cross‑benchmark consistency: EvoRepair’s advantage holds across C/C++ and Java datasets, suggesting that the experience bank captures repair patterns that are not tied to a single language.
Transferability: When applied to the VUL4J suite (Java‑only), EvoRepair still outperformed baselines, indicating that the accumulated experiences generalize beyond the benchmarks on which they were collected.
Error reduction: The same logical mistake (e.g., forgetting to free memory after a buffer overflow fix) appeared in 27 % of baseline runs but dropped to <3 % with EvoRepair, showcasing the benefit of intra‑vulnerability experience accumulation.

Practical Implications

Faster Patch Generation: Developers can integrate EvoRepair into CI pipelines to automatically suggest high‑confidence patches, cutting manual triage time by up to 70 % for known vulnerability classes.
Reduced Regression Risk: Because the experience bank only promotes patches that pass stringent quality checks, the likelihood of introducing new bugs or security regressions is markedly lower.
Continuous Learning in Production: As new vulnerabilities are discovered in the wild, EvoRepair can ingest the fixes directly from the development team, instantly making that knowledge available for future incidents.
Tool‑agnostic Plug‑in: The framework’s retrieval and scoring components can be wrapped around any LLM (OpenAI, Anthropic, LLaMA, etc.), enabling existing security‑automation tools to become self‑improving without retraining the underlying model.
Compliance & Auditing: The experience bank provides a traceable log of which past fixes influenced a current patch, aiding security audits and regulatory reporting.

Limitations & Future Work

Dependence on Quality of Initial Data: The system’s performance hinges on having a sufficiently diverse and correct set of initial repair trajectories; noisy or biased seeds can propagate errors.
Scalability of Retrieval: As the experience bank grows, efficient similarity search becomes critical; the paper uses approximate nearest‑neighbor indexing, but real‑world deployments may need more sophisticated caching or hierarchical clustering.
Language‑Specific Nuances: While cross‑language gains were demonstrated, certain idioms (e.g., Rust’s ownership model) may require language‑tailored experience representations.
Human‑in‑the‑Loop Validation: The current evaluation is fully automated; future work could explore interactive modes where developers approve or edit suggested patches, further enriching the experience bank.
Security of the Experience Bank: Storing patches and vulnerability details raises concerns about leakage; future research should investigate encrypted or federated storage mechanisms.

EvoRepair shows that giving LLM‑driven security agents a memory—and a disciplined way to update it—can turn a one‑shot fixer into a continuously improving defender. For teams looking to automate vulnerability remediation at scale, the framework offers a practical pathway to smarter, safer code.

Authors

Haichuan Hu
Guoqing Xie
Quanjun Zhang
Jiawei Liu
Shengcheng Yu
Chunrong Fang
Zhenyu Chen
Liang Xiao

Paper Information

arXiv ID: 2605.30105v1
Categories: cs.SE
Published: May 28, 2026
