[Paper] Skill-Inject: Measuring Agent Vulnerability to Skill File Attacks

Published: (February 23, 2026 at 01:59 PM EST)
5 min read
Source: arXiv

Source: arXiv - 2602.20156v1

Overview

The paper Skill‑Inject shines a light on a newly emerging attack surface in large‑language‑model (LLM) agents: skill files—plug‑in style pieces of code, data, or instructions that extend an agent’s capabilities. By injecting malicious content into these skill files, an attacker can hijack the agent to perform harmful actions. The authors introduce a systematic benchmark to measure how vulnerable popular LLM agents are to such “skill‑based prompt injection” attacks.

Key Contributions

  • SkillInject benchmark – a curated suite of 202 injection‑task pairs covering a spectrum from blatant malicious payloads to subtle, context‑dependent tricks hidden in otherwise benign skill definitions.
  • Comprehensive evaluation of state‑of‑the‑art LLM agents (including frontier models) on both security (avoidance of harmful instructions) and utility (correct execution of legitimate tasks).
  • Empirical evidence that up to 80 % of attacks succeed on current agents, leading to severe outcomes such as data exfiltration, destructive commands, and ransomware‑like behavior.
  • Demonstration that model scaling or naïve input filtering does not substantially mitigate the problem.
  • A concrete call for context‑aware authorization frameworks as a more promising direction for robust agent security.

Methodology

  1. Threat Model Definition – The authors assume an attacker can modify or supply a skill file that the agent will load at runtime. The skill may contain arbitrary text, code snippets, or tool‑calling instructions.
  2. Benchmark Construction
    • Task selection: 202 real‑world‑inspired tasks (e.g., “summarize a document”, “schedule a meeting”).
    • Injection design: For each task, a paired malicious payload is crafted. Some payloads are obvious (e.g., “delete all files”), while others are stealthy (e.g., embedding a harmful command inside a legitimate data‑processing routine).
    • Success criteria:
      • Security success – the agent refuses to execute the malicious instruction.
      • Utility success – the agent still completes the original legitimate task.
  3. Agent Evaluation – Multiple open‑source and commercial LLM agents (GPT‑4‑based, Claude‑based, Llama‑2‑based, etc.) are run with the skill files injected. The authors record whether the agent obeys the malicious instruction, ignores it, or partially complies.
  4. Analysis – Attack success rates are broken down by model size, prompting style, and presence of simple filters (e.g., “do not execute code” prompts).

Results & Findings

MetricBest‑case (most secure)Worst‑case (most vulnerable)
Attack success rate~30 % (some smaller models with aggressive safety prompts)≈80 % (frontier GPT‑4‑style agents)
Utility retention70–85 % of tasks still completed correctly55–70 % (many agents either abort or execute the malicious command)
Common failure modes– Ignoring tool calls that contain suspicious strings.
– Over‑reliance on keyword‑based filters.
– Blindly executing any code block inside a skill file.
– Treating skill content as “trusted” regardless of provenance.

Key takeaways

  • Even the most advanced LLM agents can be tricked into performing high‑impact malicious actions (e.g., reading arbitrary files, sending them over the network).
  • Simple mitigations like “reject any instruction containing ‘delete’” are easily bypassed by re‑phrasing or embedding the command in a benign‑looking function.
  • The problem does not disappear when moving to larger models; in fact, larger models often follow instructions more faithfully, increasing the risk.

Practical Implications

  • Supply‑chain hygiene: Organizations that ship LLM agents with third‑party skill libraries must treat those libraries as critical attack surfaces—similar to how software dependencies are vetted today.
  • Runtime authorization: Agents should enforce policy checks before executing any code or tool call that originates from a skill file, possibly requiring signed skill packages or sandboxed execution environments.
  • Developer tooling: IDE‑style linters for skill files could flag potentially dangerous patterns (e.g., unrestricted file system access, network calls).
  • Compliance & Auditing: Companies deploying LLM agents in regulated domains (finance, healthcare) will need to demonstrate that skill ingestion pipelines are secure‑by‑design, otherwise they risk liability for data breaches caused by skill injection.
  • Product design: Platform providers (OpenAI, Anthropic, etc.) may need to expose fine‑grained permission APIs (read/write, network, tool usage) that agents can query at runtime, similar to mobile app permission models.

Limitations & Future Work

  • Benchmark scope: While 202 injection‑task pairs cover a broad range, they are still a curated set; real‑world attackers may devise novel obfuscation techniques not captured here.
  • Model diversity: The study focuses on a handful of publicly known agents; closed‑source or highly customized deployments could behave differently.
  • Static analysis only: The authors evaluate agents at inference time but do not explore static verification of skill files (e.g., type‑checking, formal methods).
  • Future directions suggested include: building automated skill‑file sanitizers, designing formal authorization logics for LLM agents, and extending the benchmark to cover multi‑agent collaboration scenarios where one compromised skill could affect an entire ecosystem.

Authors

  • David Schmotz
  • Luca Beurer‑Kellner
  • Sahar Abdelnabi
  • Maksym Andriushchenko

Paper Information

  • arXiv ID: 2602.20156v1
  • Categories: cs.CR, cs.LG
  • Published: February 23, 2026
  • PDF: Download PDF
0 views
Back to Blog

Related posts

Read more »