[Paper] MOSAIC-Bench: Measuring Compositional Vulnerability Induction in Coding Agents
Source: arXiv - 2605.03952v1
Overview
The paper introduces MOSAIC‑Bench, a benchmark that measures how readily coding agents can be led into producing vulnerable code when they are given a series of seemingly harmless development tickets. By chaining innocuous requests, an attacker can steer even safety‑hardened models toward vulnerable output that per‑prompt safety checks do not catch.
Key Contributions
- A systematic benchmark of 199 three‑stage attack chains covering 10 real‑world web‑app codebases, 31 CWE vulnerability classes, and 5 programming languages.
- Deterministic exploit oracles that automatically verify whether the generated code actually introduces a security flaw.
- Dual‑axis evaluation: measures both the success rate of the attack (exploit ground‑truth) and the likelihood that a downstream code‑reviewer will approve the vulnerable change.
- Empirical study of nine production coding agents (Anthropic, OpenAI, Google, Moonshot, Zhipu, Minimax) showing 53‑86 % end‑to‑end attack success with only two refusals across all stages.
- Comparison with direct‑prompt attacks, revealing that staged ticket workflows dramatically reduce the models’ built‑in refusal or hardening behaviors (0‑20 % success for direct prompts vs. >50 % for staged).
- Mitigation experiments: reframing the reviewer as an adversarial pentester cuts evasion rates to 3‑17 %; a Gemma‑4‑E4B‑it reviewer catches 88.4 % of attacks with only 4.6 % false positives on real GitHub PRs.
Methodology
- Attack Design – Researchers crafted 199 “ticket chains,” each consisting of three development tasks that look benign (e.g., adding a feature, refactoring code) but together guide the model toward inserting a specific vulnerability.
- Code Generation – Each ticket is fed to a coding agent in sequence, mimicking a typical engineering workflow where a developer iteratively refines a pull request.
- Exploit Oracle – After the final ticket, an automated checker runs the generated code against a known exploit (e.g., SQL injection, XSS) to confirm whether the vulnerability is truly present.
- Reviewer Simulation – A separate code‑reviewer model evaluates the cumulative diff and decides whether to merge it. Two reviewer settings are tested:
  - (a) a standard “routine PR” reviewer,
  - (b) a reviewer prompted to act as a security pentester.
- Baseline Direct Prompt – For comparison, the same vulnerability goals are attempted with a single, direct prompt to the model, bypassing the ticket chain.
- Metrics –
  - Attack Success Rate (ASR) – proportion of chains that produce a working exploit.
  - Reviewer Acceptance Rate – proportion of vulnerable diffs that the reviewer approves.
  - False‑Positive Rate – how often the reviewer flags clean code as malicious (measured on 608 real GitHub PRs).
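The workflow and metrics above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the benchmark's actual harness: `ChainResult`, `run_chain`, and the toy agent, oracle, and reviewer are hypothetical names, and the cumulative diff is simplified to the final code string.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class ChainResult:
    exploit_succeeded: bool  # oracle verdict on the code after the final ticket
    reviewer_approved: bool  # reviewer verdict on the accumulated change


def run_chain(
    tickets: Sequence[str],
    agent: Callable[[str, str], str],  # (ticket, current code) -> new code
    oracle: Callable[[str], bool],     # True if the known exploit works
    reviewer: Callable[[str], bool],   # True if the change is approved
) -> ChainResult:
    code = ""
    for ticket in tickets:  # tickets are applied in order, one after another
        code = agent(ticket, code)
    return ChainResult(oracle(code), reviewer(code))


def attack_success_rate(results: Sequence[ChainResult]) -> float:
    """Proportion of chains that end in a working exploit."""
    return sum(r.exploit_succeeded for r in results) / len(results)


def reviewer_acceptance_rate(results: Sequence[ChainResult]) -> float:
    """Proportion of vulnerable changes that the reviewer approves."""
    vulnerable = [r for r in results if r.exploit_succeeded]
    if not vulnerable:
        return 0.0
    return sum(r.reviewer_approved for r in vulnerable) / len(vulnerable)


def false_positive_rate(flagged_clean: Sequence[bool]) -> float:
    """Proportion of clean changes that the reviewer flags as malicious."""
    return sum(flagged_clean) / len(flagged_clean)


# Toy stand-ins (hypothetical): the "agent" appends the ticket text, the
# "oracle" looks for a marker string, and the "reviewer" approves everything.
def toy_agent(ticket: str, code: str) -> str:
    return code + ticket + "\n"


def toy_oracle(code: str) -> bool:
    return "UNSAFE_QUERY" in code


def toy_reviewer(code: str) -> bool:
    return True


chains: List[List[str]] = [
    ["add search endpoint", "refactor query builder", "UNSAFE_QUERY passthrough"],
    ["add logging", "rename variable", "update docs"],
]
results = [run_chain(c, toy_agent, toy_oracle, toy_reviewer) for c in chains]
print(attack_success_rate(results), reviewer_acceptance_rate(results))
```

Keeping the oracle and the reviewer as separate callables mirrors the dual‑axis idea: one axis records whether the exploit works, the other whether the change would be merged.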
Results & Findings
| Agent group | End‑to‑end ASR (staged tickets) | Direct‑prompt ASR | Reviewer acceptance (standard) | Reviewer acceptance (pentester framing) |
|---|---|---|---|---|
| Anthropic, OpenAI, Google, Moonshot, Zhipu, Minimax (9 agents) | 53 % – 86 % | 0 % – 20.4 % | 25.8 % | 3 % – 17.6 % |
| Gemma‑4‑E4B‑it (reviewer) | – | – | – | Detects 88.4 % of attacks, 4.6 % FP |
- Staged tickets suppress built‑in defenses: models that usually refuse or “harden” code when asked directly (0 % – 20.4 % direct‑prompt ASR) become compliant when the malicious intent is spread across multiple innocuous requests (53 % – 86 % staged ASR).
- Reviewers are a weak gate: a standard reviewer approves roughly one‑quarter of the vulnerable changes, showing that conventional PR automation is insufficient for security.
- Pentester framing is effective but not perfect: prompting the reviewer to think like an attacker cuts acceptance from 25.8 % to between 3 % and 17.6 %, so a non‑trivial fraction of attacks still slips through.
- Context fragmentation is not the sole cause: even when the entire chain is presented in a single context, attack success remains high, indicating deeper compositional vulnerabilities.
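To give the percentages a sense of scale, the short calculation below converts them into approximate counts. It assumes that each agent attempted all 199 chains and that the 4.6 % false‑positive rate applies to all 608 clean PRs; the summary does not report these counts directly, so the outputs are rounded estimates only. `approx_count` is a hypothetical helper.

```python
TOTAL_CHAINS = 199  # attack chains in the benchmark
CLEAN_PRS = 608     # real GitHub PRs used for the false-positive measurement


def approx_count(rate_percent: float, total: int) -> int:
    """Convert a percentage into a rounded count (an estimate, not reported data)."""
    return round(rate_percent / 100 * total)


# Assumption: every agent was run on all 199 chains.
asr_low = approx_count(53, TOTAL_CHAINS)
asr_high = approx_count(86, TOTAL_CHAINS)

# Assumption: the 4.6 % rate was measured over all 608 clean PRs.
false_alarms = approx_count(4.6, CLEAN_PRS)

print(f"Staged ASR: roughly {asr_low} to {asr_high} of {TOTAL_CHAINS} chains")
print(f"False alarms: roughly {false_alarms} of {CLEAN_PRS} clean PRs")
```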
Practical Implications
- Tooling pipelines need compositional safety checks: Security audits should not only evaluate each individual request but also consider the cumulative effect of a series of seemingly harmless changes.
- Integrate adversarial reviewer modes: Deploying a “pentester” reviewer (or a dedicated static analysis step) can catch many staged attacks without overwhelming developers with false alarms.
- Model developers must train on multi‑step adversarial scenarios: Current alignment methods that focus on single‑prompt refusals are insufficient; training data should include chained tasks that gradually lead to unsafe outcomes.
- DevOps teams should treat AI‑generated PRs as high‑risk assets: Automated merging of AI‑produced code without human security review could introduce exploitable bugs at scale.
- Open‑source security tools can leverage MOSAIC‑Bench: The benchmark provides a ready‑made suite of realistic vulnerability chains for testing new defenses, linters, or model‑hardening techniques.
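The first two recommendations (review the cumulative change, and offer an adversarial reviewer mode) can be sketched as follows. This is only an illustration: the prompt wording, `build_review_prompt`, `review`, and the stub model are hypothetical and are not taken from the paper; `model` stands for any callable that maps a prompt string to a reply string.

```python
from typing import Callable, List

# Hypothetical framings; the paper's actual reviewer prompts are not reproduced here.
ROUTINE_FRAMING = (
    "You are reviewing a routine pull request. Reply APPROVE or REJECT."
)
PENTESTER_FRAMING = (
    "You are a penetration tester auditing this change for exploitable flaws. "
    "Reply APPROVE only if you find none; otherwise reply REJECT."
)


def build_review_prompt(ticket_diffs: List[str], framing: str) -> str:
    """Join the per-ticket diffs so the reviewer sees the cumulative change."""
    cumulative = "\n".join(ticket_diffs)
    return f"{framing}\n\n--- cumulative diff ---\n{cumulative}"


def review(ticket_diffs: List[str], framing: str, model: Callable[[str], str]) -> bool:
    """Return True if the reviewer model approves the cumulative change."""
    reply = model(build_review_prompt(ticket_diffs, framing))
    return reply.strip().upper().startswith("APPROVE")


# Stub model for demonstration only: it rejects whenever it is framed as a pentester.
def stub_model(prompt: str) -> str:
    return "REJECT" if "penetration tester" in prompt else "APPROVE"


diffs = ["+ add search endpoint", "+ refactor query builder", "+ pass raw input to query"]
print(review(diffs, ROUTINE_FRAMING, stub_model))
print(review(diffs, PENTESTER_FRAMING, stub_model))
```

Passing the joined diffs rather than each ticket's diff in isolation is the design choice that matters here: a per‑ticket check sees only changes that look benign on their own.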
Limitations & Future Work
- Benchmark scope: While MOSAIC‑Bench covers a diverse set of CWEs and languages, it focuses on web‑application back‑ends; other domains (e.g., embedded systems, ML pipelines) remain untested.
- Fixed oracle reliance: Exploit verification is deterministic and checks each chain against a known exploit, so it may miss subtler or context‑dependent vulnerabilities that broader dynamic analysis would surface.
- Reviewer models are static: The study evaluates a fixed set of reviewer agents; adaptive, learning‑based reviewers could behave differently.
- Human factors not explored: The impact of real developers reviewing AI‑generated tickets (e.g., fatigue, trust) is outside the current scope.
- Future directions include expanding the benchmark to additional programming paradigms, incorporating dynamic fuzzing as an oracle, and investigating training regimes that explicitly penalize compositional unsafe behavior.
Authors
- Jonathan Steinberg
- Oren Gal
Paper Information
- arXiv ID: 2605.03952v1
- Categories: cs.CR, cs.AI, cs.SE
- Published: May 5, 2026