[Paper] Generating Proof-of-Vulnerability Tests to Help Enhance the Security of Complex Software
Source: arXiv - 2605.03956v1
Overview
Modern applications are built on a stack of third‑party libraries, and a vulnerability in any of those libraries can become an exploitable attack surface for the whole app. Developers often need a concrete, runnable proof‑of‑vulnerability (PoV) test to decide whether a reported library flaw actually threatens their code. The paper introduces PoVSmith, an automated pipeline that leverages large language models (LLMs) and a coding agent to generate, run, and validate PoV tests for Java applications, dramatically cutting down manual effort while improving test reliability.
Key Contributions
- Agent‑driven test synthesis: Combines call‑path analysis with LLM prompts (Codex for code generation, GPT for reasoning) to produce PoV tests automatically.
- Iterative refinement loop: Executes generated tests, feeds runtime feedback back into the LLMs, and iteratively repairs the test until it either succeeds or is deemed infeasible.
- Context‑aware quality assessment: Uses GPT to evaluate test validity by inspecting both the test code and execution logs, reducing false positives.
- Empirical evaluation on real‑world code: Tested on 33 Java app‑library pairs covering 158 distinct vulnerable entry points, and successfully generating exploitable PoV tests for 84 of the 152 entry points it identified (55%).
- Significant improvement over prior LLM baselines: Raises the share of exploitable tests from ~30 % to 55 % and cuts manual triage time by ~70 % relative to a state‑of‑the‑art LLM‑only approach.
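The generate, run, and repair cycle named in these contributions can be sketched as a small loop. This is a minimal illustration rather than the authors' implementation: `refine_pov_test`, `RunResult`, the repair budget `max_repairs`, and the three stub functions are hypothetical names, and the stubs stand in for the LLM calls and the Java build so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RunResult:
    passed: bool  # True when the test ran and demonstrated the exploit
    log: str      # compiler or runtime output, reused as repair feedback


def refine_pov_test(
    generate: Callable[[str], str],
    run: Callable[[str], RunResult],
    repair: Callable[[str, str], str],
    prompt: str,
    max_repairs: int = 5,  # assumed budget; the summary does not state one
) -> Optional[str]:
    """Return the first passing test, or None once the repair budget is spent."""
    test_source = generate(prompt)
    for attempt in range(max_repairs + 1):
        result = run(test_source)
        if result.passed:
            return test_source
        if attempt < max_repairs:
            # Feed the captured errors back to the (stubbed) repair model.
            test_source = repair(test_source, result.log)
    return None  # treated as infeasible in this sketch


# Stubs replacing the code-generation model, the test runner, and the repair model.
def stub_generate(prompt: str) -> str:
    return "test_v1"


def stub_run(test_source: str) -> RunResult:
    if test_source == "test_v1":
        return RunResult(False, "error: cannot find symbol")
    return RunResult(True, "exploit triggered")


def stub_repair(test_source: str, feedback: str) -> str:
    return "test_v2"


final_test = refine_pov_test(
    stub_generate, stub_run, stub_repair, "call path + exemplar + context"
)
```

Passing the three stages in as callables keeps the loop independent of any particular model or build tool, which is one plausible way to make such a pipeline testable offline.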
Methodology
1. Static Call‑Path Mining: PoVSmith first scans the application to locate public methods that eventually invoke a vulnerable library API, building a precise call graph.
2. Prompt Construction: For each discovered entry point, the system crafts a multi‑part prompt that includes (a) the call path, (b) a minimal exemplar test, and (c) surrounding code context. This prompt is sent to Codex, which drafts an initial PoV test.
3. Execution & Feedback: The draft test is compiled and run against the app. Any compilation errors, runtime exceptions, or missing assertions are captured.
4. Iterative Repair: The captured feedback is fed back to GPT, which rewrites the test to address the issues. Steps 3 and 4 repeat until the test either passes (demonstrating an exploit) or the system decides it cannot be made feasible.
5. Automated Assessment: A final GPT‑based evaluator reviews the test together with its execution log to label it as a valid PoV, an infeasible case, or a false positive.
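The call-path mining step can be illustrated on a toy call graph. The sketch below is an assumption-laden stand-in, not the paper's analysis: it uses a plain breadth-first search over a hand-written adjacency map, whereas a real tool would derive the graph from Java code. The function `find_call_paths` and every method name in the graph are hypothetical.

```python
from collections import deque
from typing import Dict, List, Set


def find_call_paths(
    call_graph: Dict[str, List[str]],
    public_methods: Set[str],
    vulnerable_api: str,
) -> Dict[str, List[str]]:
    """Map each public method that reaches vulnerable_api to one shortest call path."""
    paths: Dict[str, List[str]] = {}
    for entry in sorted(public_methods):
        queue = deque([[entry]])
        seen = {entry}
        while queue:
            path = queue.popleft()
            current = path[-1]
            if current == vulnerable_api:
                paths[entry] = path  # BFS guarantees this is a shortest path
                break
            for callee in call_graph.get(current, []):
                if callee not in seen:
                    seen.add(callee)
                    queue.append(path + [callee])
    return paths


# Hypothetical application: one public method reaches the vulnerable API, one does not.
toy_graph = {
    "App.handleUpload": ["App.parse"],
    "App.parse": ["Lib.unsafeDeserialize"],
    "App.health": ["App.status"],
    "App.status": [],
}
toy_public = {"App.handleUpload", "App.health"}

entry_paths = find_call_paths(toy_graph, toy_public, "Lib.unsafeDeserialize")
```

Each recovered path is the kind of information that the prompt-construction step would embed alongside the exemplar test and code context.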
Results & Findings
- Call‑Path Coverage: Out of 158 vulnerable entry points, PoVSmith correctly identified 152 (96 %).
- Test Generation Success: Produced 152 PoV tests; 84 (55 %) actually demonstrated a viable exploit against the target application.
- Quality vs. Baseline: Compared to a state‑of‑the‑art LLM‑only approach, PoVSmith reduced manual triage time by ~70 % and increased the proportion of exploitable tests from ~30 % to 55 %.
- Feedback Loop Effectiveness: The iterative repair stage fixed 68 % of initial compilation/runtime failures without human input.
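The percentages above depend on which denominator is used, so a quick arithmetic check helps. The counts come from the figures reported here; the variable names are illustrative, and reading the 55 % figure as a share of the 152 generated tests is an interpretation of the numbers rather than a statement from the paper.

```python
total_entry_points = 158   # vulnerable entry points in the benchmark
identified = 152           # entry points identified, one PoV test each
exploitable = 84           # tests that demonstrated a viable exploit

# Share of all vulnerable entry points that were identified.
identification_rate = round(100 * identified / total_entry_points)

# Share of generated tests that were exploitable (matches the reported 55 %).
success_among_generated = round(100 * exploitable / identified)

# Share of all vulnerable entry points with an exploitable test.
success_among_all = round(100 * exploitable / total_entry_points)
```

Under these counts the three values come to 96, 55, and 53, so the 55 % figure is consistent with the 152 generated tests as the base and would be 53 % if measured against all 158 entry points.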
Practical Implications
- Faster Vulnerability Triage: Security teams can automatically obtain concrete PoV tests, allowing quicker decisions on whether to patch, upgrade, or apply mitigations.
- Supply‑Chain Hardening: Organizations that rely heavily on open‑source components can integrate PoVSmith into CI/CD pipelines to continuously validate that newly discovered library CVEs are truly exploitable in their codebase.
- Developer Empowerment: By surfacing exact entry points and example exploits, developers gain clearer insight into how a library flaw propagates, making remediation more targeted and less disruptive.
- Tool Integration Potential: PoVSmith’s modular design (static analysis + LLM prompts + feedback loop) can be wrapped into existing SAST/DAST tools, IDE plugins, or automated security scanners.
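One way such an integration could look is a CI gate that fails a build only when at least one PoV test has been confirmed. The sketch below is purely hypothetical: the summary does not describe a PoVSmith command-line interface or output format, so `ci_gate`, the verdict strings, and the entry-point names are all assumptions that mirror the three assessment labels (valid PoV, infeasible, false positive).

```python
from typing import Dict, List, Tuple

# Assumed label strings; a real tool may encode its verdicts differently.
VALID_POV = "valid_pov"
INFEASIBLE = "infeasible"
FALSE_POSITIVE = "false_positive"


def ci_gate(verdicts: Dict[str, str]) -> Tuple[int, List[str]]:
    """Return (exit_code, confirmed_entry_points) for a CI job.

    The exit code is 1 when any entry point has a confirmed exploit, else 0.
    """
    confirmed = sorted(
        entry for entry, verdict in verdicts.items() if verdict == VALID_POV
    )
    return (1 if confirmed else 0), confirmed


# Hypothetical verdicts for three entry points of one application.
example_verdicts = {
    "App.handleUpload": VALID_POV,
    "App.importConfig": INFEASIBLE,
    "App.preview": FALSE_POSITIVE,
}
exit_code, confirmed_entries = ci_gate(example_verdicts)
```

Gating only on confirmed exploits reflects the triage idea described above: a reported library CVE blocks a release only when a runnable test shows it is reachable in the application.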
Limitations & Future Work
- Language & Ecosystem Scope: The study focuses on Java and a specific set of libraries; extending to other languages (e.g., Python, JavaScript) will require adapting the static analysis and prompt templates.
- LLM Dependency: Test quality hinges on the underlying LLMs (Codex, GPT); model updates or API changes could affect reproducibility.
- Complex Exploits: PoVSmith currently excels at straightforward call‑path exploits; multi‑step or environment‑dependent attacks (e.g., requiring specific configurations) remain challenging.
- Future Directions: The authors plan to (a) broaden support to additional runtimes, (b) incorporate dynamic taint analysis to enrich call‑path precision, and (c) explore fine‑tuning smaller, open‑source LLMs for cost‑effective, on‑premise deployment.
Authors
- Shravya Kanchi
- Xiaoyan Zang
- Ying Zhang
- Danfeng Yao
- Na Meng
Paper Information
- arXiv ID: 2605.03956v1
- Categories: cs.CR, cs.SE
- Published: May 5, 2026