[Paper] Is Vibe Coding Safe? Benchmarking Vulnerability of Agent-Generated Code in Real-World Tasks
Source: arXiv - 2512.03262v1
Overview
The paper “Is Vibe Coding Safe? Benchmarking Vulnerability of Agent‑Generated Code in Real‑World Tasks” investigates whether code produced by large‑language‑model (LLM) agents—often called “vibe coding”—is secure enough for production use. By creating a 200‑task benchmark drawn from real open‑source feature requests that historically led to vulnerable implementations, the authors expose a worrying gap between functional correctness and security in current coding agents.
Key Contributions
- SUSVIBES benchmark: A curated suite of 200 realistic feature‑request tasks whose original human‑written implementations introduced security vulnerabilities.
- Comprehensive evaluation of several state‑of‑the‑art coding agents (e.g., Claude 4 Sonnet, GPT‑4, CodeLlama) on SUSVIBES, measuring both functional correctness and security.
- Empirical finding that even top‑performing agents achieve high functional success (≈61% correct) but dismal security rates (≈10% secure).
- Analysis of simple mitigation attempts (e.g., adding vulnerability hints to prompts) showing they do not substantially improve security outcomes.
- Call to action for the community to treat security as a first‑class metric when developing and deploying LLM‑based coding assistants.
Methodology
- Task selection – The authors mined popular open‑source repositories, identified feature‑request issues that later required security patches, and distilled them into 200 self‑contained coding prompts.
- Agent suite – They queried multiple publicly available coding agents (Claude 4 Sonnet, GPT‑4, CodeLlama, etc.) using the same prompts, without any extra supervision or post‑processing.
- Evaluation criteria –
  - Functional correctness: Does the generated code implement the requested feature and pass the provided test suite?
  - Security: Manual review and automated static analysis (e.g., Bandit, CodeQL) to detect common vulnerabilities such as injection, insecure deserialization, and improper authentication.
- Mitigation experiments – They added “vulnerability hints” (e.g., “avoid SQL injection”) to the original prompts and re‑ran the agents to see if security improves.
The pipeline is deliberately simple so that the results reflect the out‑of‑the‑box behavior developers would experience when using these agents today.
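To make the two‑axis evaluation concrete, here is a minimal sketch of such a harness, not the authors' actual pipeline: it assumes the agent's patch is already applied in a local checkout (the task_workdir path and both helper functions are illustrative) and uses pytest for functional correctness plus Bandit for automated security findings.

```python
# Illustrative sketch only: not the authors' harness. It mirrors the two-axis
# evaluation described above, assuming the agent's patch has already been
# applied to a local checkout of the task repository ("task_workdir" is a
# hypothetical path).
import json
import subprocess
from pathlib import Path


def check_functional(repo: Path) -> bool:
    """Functional correctness: does the task's provided test suite pass?"""
    result = subprocess.run(
        ["python", "-m", "pytest", "-q"],
        cwd=repo, capture_output=True, text=True,
    )
    return result.returncode == 0


def check_security(repo: Path) -> list:
    """Security: run Bandit over the repository and return its findings."""
    result = subprocess.run(
        ["bandit", "-r", str(repo), "-f", "json", "-q"],
        capture_output=True, text=True,
    )
    report = json.loads(result.stdout or "{}")
    return report.get("results", [])


if __name__ == "__main__":
    repo = Path("task_workdir")  # hypothetical checkout with the agent's patch applied
    print("functional:", check_functional(repo))
    print("security findings:", len(check_security(repo)))
```

In a real setup the Bandit findings would be triaged and complemented by CodeQL queries or manual review, since static analysis alone misses logic‑level flaws such as missing authentication checks.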
Results & Findings
| Agent (model) | Functional Correctness | Secure Solutions |
|---|---|---|
| Claude 4 Sonnet (SWE‑Agent) | 61% | 10.5% |
| GPT‑4 | 55% | 9.2% |
| CodeLlama‑34B | 48% | 7.8% |
| … | … | … |
- Security lag: Across the board, the proportion of secure code is roughly one‑sixth of the functional success rate.
- Vulnerability patterns: The most common flaws were SQL/NoSQL injection, insecure file handling, and missing authentication checks.
- Prompt augmentation: Adding explicit security hints raised secure rates by only 1–2 percentage points, indicating that simple prompt engineering is insufficient.
- Error propagation: When agents produced insecure code, the bugs were often subtle (e.g., using `eval` on user input) and escaped basic test suites, making them hard to detect without dedicated security analysis.
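To illustrate the kind of subtle flaw described above, here is a hypothetical before/after pair, not drawn from the benchmark itself: both functions satisfy a naive functional test for "parse a user‑supplied filter expression", but only the second rejects malicious input.

```python
import ast

# Insecure pattern an agent might emit: it "works" and passes functional
# tests, but eval() will happily execute attacker-controlled code such as
# "__import__('os').system('rm -rf /')".
def parse_filter_insecure(expr: str):
    return eval(expr)

# Safer sketch: restrict input to Python literals (numbers, strings, lists,
# dicts, ...); anything else raises an error instead of executing.
def parse_filter_safe(expr: str):
    return ast.literal_eval(expr)

assert parse_filter_insecure("[1, 2, 3]") == parse_filter_safe("[1, 2, 3]")
```

A test suite that only exercises the happy path cannot tell these two apart, which is exactly why the authors argue that passing tests says little about security.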
Practical Implications
- Don’t ship LLM‑generated code blindly – Even if unit tests pass, security reviews are still mandatory.
- Integrate static analysis into the generation loop – Tools like Bandit, CodeQL, or custom linters should run automatically on every LLM output before acceptance (a gating sketch follows this list).
- Adopt a “security‑first” prompt template – Instead of a single hint, embed a checklist (e.g., “sanitize all external inputs”, “use parameterized queries”) and enforce it programmatically.
- Team workflows – Companies that already rely on vibe coding for rapid prototyping should allocate dedicated security engineers to audit generated patches, especially for services handling authentication, payments, or user‑generated content.
- Tooling opportunities – The benchmark itself (SUSVIBES) can serve as a regression suite for future LLM releases, encouraging model developers to optimize for security alongside correctness.
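As a concrete illustration of the static‑analysis and prompt‑template points above, the sketch below pairs a hypothetical "security‑first" checklist (prepended to the task prompt) with a Bandit‑based acceptance gate that rejects generated patches containing medium‑ or high‑severity findings. The names SECURITY_CHECKLIST, build_prompt, bandit_gate, and the generated/ directory are illustrative assumptions, not tooling from the paper.

```python
# Hypothetical pre-acceptance gate, not a tool from the paper: prepend a
# security checklist to the agent's prompt, then reject its output if Bandit
# reports any medium- or high-severity issue.
import json
import subprocess
from pathlib import Path

SECURITY_CHECKLIST = """\
Security requirements (must all be satisfied):
- sanitize and validate every external input
- use parameterized queries for all SQL
- never pass untrusted data to eval/exec/pickle
- enforce authentication checks on privileged operations
"""

SEVERITIES = ["LOW", "MEDIUM", "HIGH"]


def build_prompt(task_description: str) -> str:
    """Embed the checklist in the prompt instead of a single one-line hint."""
    return f"{SECURITY_CHECKLIST}\n\nTask:\n{task_description}"


def bandit_gate(path: Path, min_severity: str = "MEDIUM") -> bool:
    """Return True only if no Bandit finding meets or exceeds min_severity."""
    result = subprocess.run(
        ["bandit", "-r", str(path), "-f", "json", "-q"],
        capture_output=True, text=True,
    )
    findings = json.loads(result.stdout or "{}").get("results", [])
    blocking = [
        f for f in findings
        if SEVERITIES.index(f["issue_severity"]) >= SEVERITIES.index(min_severity)
    ]
    for f in blocking:
        print(f"{f['filename']}:{f['line_number']} {f['test_id']}: {f['issue_text']}")
    return not blocking


if __name__ == "__main__":
    prompt = build_prompt("Add CSV export for user reports")  # example task text
    # ... send `prompt` to the coding agent, write its patch into generated/ ...
    print("accepted" if bandit_gate(Path("generated")) else "rejected")
```

The checklist wording and severity threshold are deliberately simple; organization‑specific CodeQL queries or linters would slot into the same gate.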
Limitations & Future Work
- Benchmark scope – SUSVIBES focuses on feature‑request tasks from open‑source projects; enterprise‑specific domains (e.g., embedded systems, cryptographic libraries) may exhibit different vulnerability profiles.
- Static analysis reliance – While tools like Bandit catch many issues, they can miss logic‑level bugs; a manual security audit was performed on a subset only.
- Prompt diversity – The study used a single “vanilla” prompt style per task; exploring richer interaction patterns (e.g., multi‑turn clarification) could affect security outcomes.
- Model fine‑tuning – Future work could investigate training LLMs on security‑annotated code or incorporating reinforcement learning from human security feedback to improve safe generation.
Overall, the paper shines a spotlight on a blind spot in the hype around LLM‑driven coding: functional brilliance does not automatically translate to secure software. Developers and organizations should treat security as a first‑class metric when adopting vibe coding in production pipelines.
Authors
- Songwen Zhao
- Danqing Wang
- Kexun Zhang
- Jiaxuan Luo
- Zhuo Li
- Lei Li
Paper Information
- arXiv ID: 2512.03262v1
- Categories: cs.SE, cs.CL
- Published: December 2, 2025