[Paper] Detecting Data Poisoning in Code Generation LLMs via Black-Box, Vulnerability-Oriented Scanning

Published: 2 days ago (March 17, 2026 at 06:08 PM EDT)

5 min read

Source: arXiv

Source: arXiv - 2603.17174v1

Overview

Code generation models such as GitHub Copilot, CodeLlama, and StarCoder are now everyday assistants for developers. However, recent research shows that these models can be poisoned—an attacker subtly tweaks the training data so the model repeatedly emits insecure code snippets. The paper Detecting Data Poisoning in Code Generation LLMs via Black‑Box, Vulnerability‑Oriented Scanning introduces CodeScan, the first black‑box tool that can spot a compromised model by looking for recurring vulnerable code patterns, even when the same logic is expressed with different syntax.

Key Contributions

Code‑centric scanning: A detection framework that works on the structure of generated source code rather than raw token similarity, handling the many ways the same semantics can be written.
Iterative divergence analysis: Generates multiple completions from the same model using diverse clean prompts, then isolates code fragments that consistently appear across those completions.
AST‑based normalization: Converts each fragment into an abstract syntax tree (AST) to strip away superficial differences (whitespace, variable names, ordering) and focus on the underlying logic.
LLM‑driven vulnerability check: Re‑uses a (separate) LLM to evaluate whether the normalized fragment contains a known security flaw (e.g., command injection, insecure deserialization, unsafe memory handling).
Comprehensive evaluation: Tested on 108 models (3 architectures, multiple sizes) against four state‑of‑the‑art poisoning/backdoor attacks covering three real‑world vulnerability classes, achieving > 97 % detection accuracy with far fewer false positives than prior token‑level scanners.

Methodology

Prompt diversification – The scanner feeds the target model a set of clean programming prompts (e.g., “write a function to read a file”).
Multiple generations – For each prompt, it collects several completions (e.g., 5–10) using the same model but with slight variations in temperature or sampling seed.
Divergence detection – It compares the completions pairwise to locate common sub‑structures that appear in most outputs, assuming these are the parts the model is “forced” to emit (potentially the poisoned payload).
AST normalization – Each candidate sub‑structure is parsed into an AST; node types, control‑flow patterns, and API calls are extracted while renaming identifiers and discarding formatting quirks. This collapses syntactically different but semantically identical code.
Vulnerability assessment – A separate, well‑trained LLM (or a static analysis tool) is prompted with the normalized fragment to decide whether it matches a known vulnerability pattern.
Decision rule – If a vulnerable fragment is found in the recurring set, the model is flagged as compromised; otherwise it is considered clean.

All steps are black‑box: they only require the ability to query the target model for completions, no access to its weights or training data.

Results & Findings

Scenario	Attack type	Detection accuracy	False‑positive rate
Backdoor (trigger word)	Code injection	98.3 %	1.2 %
Data poisoning (label‑flipping)	Insecure deserialization	97.7 %	0.9 %
Mixed (both)	Command injection	97.9 %	1.0 %
Baseline token‑level scanner	–	71.4 %	5.8 %

Robustness across model families: CodeScan performed consistently on GPT‑style, encoder‑decoder, and decoder‑only code models, from 1 B to 13 B parameters.
Low overhead: Scanning a model with 10 prompts × 5 generations each took ~2 minutes on a single GPU, making it feasible for CI pipelines.
Resilience to obfuscation: Because the detection works on ASTs, simple renaming or reordering tricks used by attackers did not evade the scanner.

Practical Implications

CI/CD safety gate – Teams can integrate CodeScan into their model‑deployment pipeline to automatically reject any newly fine‑tuned code model that exhibits hidden malicious patterns.
Marketplace vetting – Model providers (e.g., Hugging Face, AWS Bedrock) can run CodeScan on third‑party uploads, offering customers a “poison‑free” certification badge.
Developer tooling – IDE plugins could query a model via CodeScan’s API before presenting generated snippets, warning developers if the suggestion contains a known vulnerability.
Regulatory compliance – Organizations subject to secure‑coding standards (e.g., OWASP, ISO 27034) can use the tool as evidence that their AI‑assisted code generation complies with security policies.

Overall, CodeScan shifts the defense posture from reactive (patching insecure code after it’s written) to preventive (detecting a compromised model before it ever generates code).

Limitations & Future Work

Dependency on prompt diversity: If the clean prompts do not exercise the vulnerable functionality, the recurring pattern may never surface.
Vulnerability coverage: The LLM‑based analyzer is only as good as the vulnerability taxonomy it has seen; novel or zero‑day exploits could slip through.
Scalability to massive models: While the current runtime is modest, scanning extremely large models (e.g., > 100 B parameters) may require distributed inference.
Adversarial adaptation: Attackers could design poisoning payloads that deliberately vary across generations, undermining the recurrence assumption.

Future research directions include: expanding the vulnerability database with community‑sourced CVEs, combining static analysis with dynamic sandbox execution for higher confidence, and exploring white‑box extensions that leverage model internals when available.

CodeScan demonstrates that a structural, vulnerability‑oriented lens can reliably expose poisoned code generation models without needing any privileged access—a promising step toward safer AI‑assisted software development.

Authors

Shenao Yan
Shimaa Ahmed
Shan Jin
Sunpreet S. Arora
Yiwei Cai
Yizhen Wang
Yuan Hong

Paper Information

arXiv ID: 2603.17174v1
Categories: cs.CR, cs.AI, cs.SE
Published: March 17, 2026
PDF: Download PDF

[Paper] Detecting Data Poisoning in Code Generation LLMs via Black-Box, Vulnerability-Oriented Scanning

Overview

Key Contributions

Methodology

Results & Findings

Practical Implications

Limitations & Future Work

Authors

Paper Information

Related posts

[Paper] NavTrust: Benchmarking Trustworthiness for Embodied Navigation

[Paper] FinTradeBench: A Financial Reasoning Benchmark for LLMs

[Paper] F2LLM-v2: Inclusive, Performant, and Efficient Embeddings for a Multilingual World

[Paper] Spectrally-Guided Diffusion Noise Schedules