[Paper] Is Vibe Coding Safe? Benchmarking Vulnerability of Agent-Generated Code in Real-World Tasks
Source: arXiv - 2512.03262v1
Overview
The paper “Is Vibe Coding Safe? Benchmarking Vulnerability of Agent‑Generated Code in Real‑World Tasks” investigates whether code produced by large‑language‑model (LLM) agents—often called “vibe coding”—is secure enough for production use. By creating a 200‑task benchmark drawn from real open‑source feature requests that historically led to vulnerable implementations, the authors expose a worrying gap between functional correctness and security in current coding agents.
Key Contributions
- SUSVIBES benchmark: A curated suite of 200 realistic feature‑request tasks whose original human‑written implementations introduced security vulnerabilities.
- Comprehensive evaluation of several state‑of‑the‑art coding agents (e.g., Claude 4 Sonnet, GPT‑4, CodeLlama) on SUSVIBES, measuring both functional correctness and security.
- Empirical finding that even top‑performing agents achieve high functional success (≈61% correct) but dismal security rates (≈10% secure).
- Analysis of simple mitigation attempts (e.g., adding vulnerability hints to prompts) showing they do not substantially improve security outcomes.
- Call to action for the community to treat security as a first‑class metric when developing and deploying LLM‑based coding assistants.
Methodology
- Task selection – The authors mined popular open‑source repositories, identified feature‑request issues that later required security patches, and distilled them into 200 self‑contained coding prompts.
- Agent suite – They queried multiple publicly available coding agents (Claude 4 Sonnet, GPT‑4, CodeLlama, etc.) using the same prompts, without any extra supervision or post‑processing.
- Evaluation criteria –
  - Functional correctness: Does the generated code implement the requested feature and pass the provided test suite?
  - Security: Manual review and automated static analysis (e.g., Bandit, CodeQL) to detect common vulnerabilities such as injection, insecure deserialization, and improper authentication.
- Mitigation experiments – They added “vulnerability hints” (e.g., “avoid SQL injection”) to the original prompts and re‑ran the agents to see if security improves.
The pipeline is deliberately simple so that the results reflect the out‑of‑the‑box behavior developers would experience when using these agents today.
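To make the two‑axis evaluation concrete, here is a minimal sketch of such a harness, not the authors' actual pipeline: it assumes the agent's patch is already applied in a local checkout (the task_workdir path and both helper functions are illustrative) and uses pytest for functional correctness plus Bandit for automated security findings.

```python
# Illustrative sketch only: not the authors' harness. It mirrors the two-axis
# evaluation described above, assuming the agent's patch has already been
# applied to a local checkout of the task repository ("task_workdir" is a
# hypothetical path).
import json
import subprocess
from pathlib import Path


def check_functional(repo: Path) -> bool:
    """Functional correctness: does the task's provided test suite pass?"""
    result = subprocess.run(
        ["python", "-m", "pytest", "-q"],
        cwd=repo, capture_output=True, text=True,
    )
    return result.returncode == 0


def check_security(repo: Path) -> list:
    """Security: run Bandit over the repository and return its findings."""
    result = subprocess.run(
        ["bandit", "-r", str(repo), "-f", "json", "-q"],
        capture_output=True, text=True,
    )
    report = json.loads(result.stdout or "{}")
    return report.get("results", [])


if __name__ == "__main__":
    repo = Path("task_workdir")  # hypothetical checkout with the agent's patch applied
    print("functional:", check_functional(repo))
    print("security findings:", len(check_security(repo)))
```

In a real setup the Bandit findings would be triaged and complemented by CodeQL queries or manual review, since static analysis alone misses logic‑level flaws such as missing authentication checks.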
Results & Findings
| Agent (model) | Functional Correctness | Secure Solutions |
|---|---|---|
| Claude 4 Sonnet (SWE‑Agent) | 61% | 10.5% |
| GPT‑4 | 55% | 9.2% |
| CodeLlama‑34B | 48% | 7.8% |
| … | … | … |
- Security lag: Across the board, the proportion of secure code is roughly one‑sixth of the functional success rate.
- Vulnerability patterns: The most common flaws were SQL/NoSQL injection, insecure file handling, and missing authentication checks.
- Prompt augmentation: Adding explicit security hints raised secure rates by only 1–2 percentage points, indicating that simple prompt engineering is insufficient.
- Error propagation: When agents produced insecure code, the bugs were often subtle (e.g., using `eval` on user input) and escaped basic test suites, making them hard to detect without dedicated security analysis.
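To illustrate the kind of subtle flaw described above, here is a hypothetical before/after pair, not drawn from the benchmark itself: both functions satisfy a naive functional test for "parse a user‑supplied filter expression", but only the second rejects malicious input.

```python
import ast

# Insecure pattern an agent might emit: it "works" and passes functional
# tests, but eval() will happily execute attacker-controlled code such as
# "__import__('os').system('rm -rf /')".
def parse_filter_insecure(expr: str):
    return eval(expr)

# Safer sketch: restrict input to Python literals (numbers, strings, lists,
# dicts, ...); anything else raises an error instead of executing.
def parse_filter_safe(expr: str):
    return ast.literal_eval(expr)

assert parse_filter_insecure("[1, 2, 3]") == parse_filter_safe("[1, 2, 3]")
```

A test suite that only exercises the happy path cannot tell these two apart, which is exactly why the authors argue that passing tests says little about security.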
Practical Implications
- Don’t ship LLM‑generated code blindly – Even if unit tests pass, security reviews are still mandatory.
- Integrate static analysis into the generation loop – Tools like Bandit, CodeQL, or custom linters should run automatically on every LLM output before acceptance (a gating sketch follows this list).
- Adopt a “security‑first” prompt template – Instead of a single hint, embed a checklist (e.g., “sanitize all external inputs”, “use parameterized queries”) and enforce it programmatically.
- Team workflows – Companies that already rely on vibe coding for rapid prototyping should allocate dedicated security engineers to audit generated patches, especially for services handling authentication, payments, or user‑generated content.
- Tooling opportunities – The benchmark itself (SUSVIBES) can serve as a regression suite for future LLM releases, encouraging model developers to optimize for security alongside correctness.
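As a concrete illustration of the static‑analysis and prompt‑template points above, the sketch below pairs a hypothetical "security‑first" checklist (prepended to the task prompt) with a Bandit‑based acceptance gate that rejects generated patches containing medium‑ or high‑severity findings. The names SECURITY_CHECKLIST, build_prompt, bandit_gate, and the generated/ directory are illustrative assumptions, not tooling from the paper.

```python
# Hypothetical pre-acceptance gate, not a tool from the paper: prepend a
# security checklist to the agent's prompt, then reject its output if Bandit
# reports any medium- or high-severity issue.
import json
import subprocess
from pathlib import Path

SECURITY_CHECKLIST = """\
Security requirements (must all be satisfied):
- sanitize and validate every external input
- use parameterized queries for all SQL
- never pass untrusted data to eval/exec/pickle
- enforce authentication checks on privileged operations
"""

SEVERITIES = ["LOW", "MEDIUM", "HIGH"]


def build_prompt(task_description: str) -> str:
    """Embed the checklist in the prompt instead of a single one-line hint."""
    return f"{SECURITY_CHECKLIST}\n\nTask:\n{task_description}"


def bandit_gate(path: Path, min_severity: str = "MEDIUM") -> bool:
    """Return True only if no Bandit finding meets or exceeds min_severity."""
    result = subprocess.run(
        ["bandit", "-r", str(path), "-f", "json", "-q"],
        capture_output=True, text=True,
    )
    findings = json.loads(result.stdout or "{}").get("results", [])
    blocking = [
        f for f in findings
        if SEVERITIES.index(f["issue_severity"]) >= SEVERITIES.index(min_severity)
    ]
    for f in blocking:
        print(f"{f['filename']}:{f['line_number']} {f['test_id']}: {f['issue_text']}")
    return not blocking


if __name__ == "__main__":
    prompt = build_prompt("Add CSV export for user reports")  # example task text
    # ... send `prompt` to the coding agent, write its patch into generated/ ...
    print("accepted" if bandit_gate(Path("generated")) else "rejected")
```

The checklist wording and severity threshold are deliberately simple; organization‑specific CodeQL queries or linters would slot into the same gate.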
Limitations & Future Work
- Benchmark scope – SUSVIBES focuses on feature‑request tasks from open‑source projects; enterprise‑specific domains (e.g., embedded systems, cryptographic libraries) may exhibit different vulnerability profiles.
- Static analysis reliance – While tools like Bandit catch many issues, they can miss logic‑level bugs; a manual security audit was performed on a subset only.
- Prompt diversity – The study used a single “vanilla” prompt style per task; exploring richer interaction patterns (e.g., multi‑turn clarification) could affect security outcomes.
- Model fine‑tuning – Future work could investigate training LLMs on security‑annotated code or incorporating reinforcement learning from human security feedback to improve safe generation.
Overall, the paper shines a spotlight on a blind spot in the hype around LLM‑driven coding: functional brilliance does not automatically translate to secure software. Developers and organizations should treat security as a first‑class metric when adopting vibe coding in production pipelines.
Authors
- Songwen Zhao
- Danqing Wang
- Kexun Zhang
- Jiaxuan Luo
- Zhuo Li
- Lei Li
Paper Information
- arXiv ID: 2512.03262v1
- Categories: cs.SE, cs.CL
- Published: December 2, 2025