[Paper] Analyzing Code Injection Attacks on LLM-based Multi-Agent Systems in Software Development
Source: arXiv - 2512.21818v1
Overview
The paper investigates how large‑language‑model (LLM)‑powered multi‑agent systems—where autonomous “coder”, “reviewer”, and “tester” bots collaborate to write software—can be compromised by code‑injection attacks. By building a concrete threat model and running systematic experiments, the authors show that while these agents can produce high‑quality code, their autonomy also makes them blind to malicious inputs, exposing a new attack surface for developers and enterprises that plan to adopt AI‑driven development pipelines.
Key Contributions
- Architecture Blueprint – Proposes three LLM‑based multi‑agent designs for the implementation phase of software engineering:
- Coder‑only
- Coder‑Tester
- Coder‑Reviewer‑Tester (the “triad” architecture).
- Threat Model for Agentic Development – Formalizes how an adversary can inject malicious code snippets or “poisonous” few‑shot examples into the communication streams of the agents.
- Empirical Vulnerability Study – Demonstrates that all three architectures are susceptible to code‑injection, with success rates ranging from 0 % (baseline) to >70 % when the attacker uses crafted few‑shot prompts.
- Security‑Analysis Agent – Introduces a dedicated “security analyst” bot that inspects generated code before it reaches the reviewer or tester, improving resilience without sacrificing throughput.
- Quantitative Trade‑off Analysis – Measures the impact of each architecture on code‑generation speed, correctness, and attack resistance, revealing the triad + security analyst as the most robust yet still efficient configuration.
Methodology
- System Prototyping – Implemented the three agentic pipelines using OpenAI’s GPT‑4 (or comparable LLM) as the underlying code generator, reviewer, tester, and security analyst.
- Attack Scenarios – Crafted two families of attacks:
- Direct Injection: Malicious code embedded in the problem description.
- Few‑Shot Poisoning: Malicious examples placed in the demonstration (few‑shot) prompts that the LLM uses for in‑context learning.
- Metrics – Evaluated each pipeline on:
- Correctness: Pass rate of generated unit tests.
- Efficiency: Time and token usage per feature.
- Resilience: Percentage of attacks that resulted in successful malicious code execution.
- Statistical Validation – Ran each experiment 200 + times across diverse programming tasks (Python, JavaScript, Go) and applied chi‑square tests to confirm significance of observed differences.
Results & Findings
| Architecture | Avg. Correctness (✓) | Avg. Runtime (s) | Attack Success Rate |
|---|---|---|---|
| Coder‑only | 84 % | 12 | 48 % (direct) / 0 % (few‑shot) |
| Coder‑Tester | 78 % | 18 | 55 % / 0 % |
| Coder‑Reviewer‑Tester | 81 % | 22 | 31 % / 0 % |
| Triad + Security Analyst | 79 % | 24 | 5 % (direct) / 71.95 % (few‑shot poisoning) |
Key takeaways
- Adding a reviewer reduces the raw success of naive injection attacks but slows the pipeline.
- The security‑analysis agent cuts direct‑injection success dramatically (down to 5 %) while keeping overall correctness comparable.
- However, when the attacker embeds malicious few‑shot examples, the security analyst itself can be fooled, pushing the success rate up to ~72 %.
- The trade‑off between speed and safety is evident: more checks improve security but increase latency.
Practical Implications
- Tooling Vendors – Companies building AI‑assisted IDE plugins or CI/CD bots should embed a dedicated security‑analysis stage (e.g., static analysis + LLM‑based threat detection) before code is merged.
- DevOps Pipelines – Automated pipelines that rely on “write‑once‑run‑anywhere” LLM agents must retain a human‑in‑the‑loop checkpoint for any code that originates from LLM prompts, especially when few‑shot examples are supplied.
- Policy Makers – Security guidelines for AI‑generated code should explicitly address prompt‑level sanitization and forbid untrusted few‑shot data.
- Developers – When using LLM copilots, treat any code that appears in prompt examples as untrusted input; run linters, dependency scanners, and sandboxed execution before deployment.
- Future Products – The paper’s architecture can be repurposed for other domains (infrastructure‑as‑code, data‑pipeline generation) where autonomous agents produce artefacts that must be vetted for malicious payloads.
Limitations & Future Work
- Model Scope – Experiments were limited to a single LLM family (GPT‑4); results may differ with open‑source models that have different token‑budget or instruction‑following behaviours.
- Attack Diversity – Only two attack vectors were explored; more sophisticated supply‑chain or side‑channel attacks remain untested.
- Human Factors – The study assumes a fully autonomous pipeline; real‑world deployments often retain some manual review, which could alter the threat landscape.
- Scalability – Adding a security analyst increases latency; future work should investigate lightweight, on‑device detectors or parallelized verification to keep pipelines fast.
- Robust Prompt Engineering – Developing systematic defenses (e.g., prompt sanitizers, adversarial training) against few‑shot poisoning is an open research direction.
Bottom line: As LLM‑driven multi‑agent systems move from research labs into production software factories, security can no longer be an afterthought. This paper provides the first concrete evidence that code‑injection attacks can cripple autonomous coding pipelines—and offers a practical, albeit imperfect, mitigation path through a dedicated security‑analysis agent. Developers and platform builders should start treating AI‑generated code with the same rigor they apply to human‑written code, especially when prompts contain external examples.
Authors
- Brian Bowers
- Smita Khapre
- Jugal Kalita
Paper Information
- arXiv ID: 2512.21818v1
- Categories: cs.SE, cs.MA
- Published: December 26, 2025
- PDF: Download PDF