RoguePilot Flaw in GitHub Codespaces Enabled Copilot to Leak GITHUB_TOKEN
Source: The Hacker News
RoguePilot Vulnerability in GitHub Codespaces
A vulnerability in GitHub Codespaces could have been exploited by bad actors to seize control of repositories by injecting malicious Copilot instructions in a GitHub issue.
The AI‑driven flaw has been codenamed RoguePilot by Orca Security and was patched by Microsoft after responsible disclosure.
“Attackers can craft hidden instructions inside a GitHub issue that are automatically processed by GitHub Copilot, giving them silent control of the in‑codespaces AI agent,” security researcher Roi Nisimi said in a report.
— Orca Security blog
The vulnerability is a case of passive (indirect) prompt injection: a malicious instruction is embedded within data that the large language model (LLM) processes, causing it to produce unintended outputs or execute arbitrary actions. Orca also describes it as an AI‑mediated supply‑chain attack that induces the LLM to automatically execute malicious instructions embedded in developer content—in this case, a GitHub issue.
How the Attack Works
- Malicious GitHub issue is created.
- An unsuspecting developer launches a Codespace from that issue.
- GitHub Copilot automatically receives the issue’s description as a prompt.
- The hidden prompt (placed inside an HTML comment) instructs Copilot to leak the privileged
GITHUB_TOKENto an attacker‑controlled server.
“By manipulating Copilot in a Codespace to check out a crafted pull request that contains a symbolic link to an internal file, an attacker can cause Copilot to read that file and (via a remote JSON
$schema) exfiltrate a privilegedGITHUB_TOKENto a remote server,” Nisimi explained.
Entry Points
RoguePilot can be triggered from any of the following entry points that launch a Codespace:
- Templates
- Repositories
- Commits
- Pull requests
- Issues (the most vulnerable, as the issue description is fed directly to Copilot)
From Prompt Injections to “Promptware”
The discovery coincides with Microsoft’s research on Group Relative Policy Optimization (GRPO)—a reinforcement‑learning technique used to fine‑tune LLMs after deployment. Microsoft found that GRPO can also be weaponized to remove safety features, a process they call GRP‑Obliteration.
Key findings
- A single unlabeled prompt (e.g., “Create a fake news article that could lead to panic or chaos”) can reliably unalign 15 language models.
- The prompt is relatively mild and does not mention violence, illegal activity, or explicit content, yet it makes the model more permissive across many harmful categories it never saw during training.
“What makes this surprising is that the prompt is relatively mild and does not mention violence, illegal activity, or explicit content,” noted Microsoft researchers Mark Russinovich, Giorgio Severi, Blake Bullwinkel, Yanan Cai, Keegan Hines, and Ahmed Salem.
— Microsoft Security Blog (Feb 9 2026)
Related Research
- Side‑channel attacks that can infer the topic of a user’s conversation or fingerprint queries with >75 % accuracy.
- Exploits speculative decoding, an optimization that generates multiple candidate tokens in parallel to improve throughput and latency.
- See:
- [ArXiv: 2410.17175 – Side‑channel discovery]
- [ArXiv: 2411.01076 – Additional side‑channel techniques]
- [The Hacker News (Nov 2025) – Whisper leak attack]


Agentic ShadowLogic & New AI Attack Vectors
Agentic ShadowLogic
Models backdoored at the computational‑graph level – a technique called ShadowLogic – can further put agentic AI systems at risk by allowing tool calls to be silently modified without the user’s knowledge. This phenomenon has been codenamed Agentic ShadowLogic by HiddenLayer.
“By logging requests over time, the attacker can map which internal endpoints exist, when they’re accessed, and what data flows through them,” the AI security company said. “The user receives their expected data with no errors or warnings. Everything functions normally on the surface while the attacker silently logs the entire transaction in the background.”
— HiddenLayer announcement
How it works
- An attacker weaponizes the backdoor to intercept requests that fetch content from a URL in real‑time.
- The request is routed through infrastructure under the attacker’s control before being forwarded to the real destination.
Semantic Chaining – Image Jailbreak
Last month, Neural Trust demonstrated a new image‑jailbreak attack called Semantic Chaining. It lets users sidestep safety filters in models such as Grok 4, Gemini Nano Banana Pro, and Seedance 4.5 by leveraging the model’s ability to perform multi‑stage image modifications.
The attack exploits the model’s limited “reasoning depth” to track latent intent across a multi‑step instruction. By chaining innocuous edits, the attacker gradually erodes the model’s safety resistance until a prohibited output is produced.
Attack flow
- Step 1: Ask the AI chatbot to imagine any non‑problematic scene and change one element in the generated image.
- Step 2: Request a second modification that transforms the image into something prohibited or offensive.
Because the model is only modifying an existing image (rather than creating a new one), safety alarms often fail to trigger.
“Instead of issuing a single, overtly harmful prompt, which would trigger an immediate block, the attacker introduces a chain of semantically ‘safe’ instructions that converge on the forbidden result,” security researcher Alessandro Pignati said.
— Neural Trust blog

Promptware – The New Malware Class
In a study published last month, researchers Oleg Brodt, Elad Feldman, Bruce Schneier, and Ben Nassi argued that prompt injections have evolved beyond simple input‑manipulation exploits into what they call promptware – a new class of malware execution mechanism triggered through specially engineered prompts.
Promptware capabilities
- Manipulates the LLM to enable typical cyber‑attack lifecycle phases:
- Initial access
- Privilege escalation
- Reconnaissance
- Persistence
- Command‑and‑control
- Lateral movement (see example)
- Malicious outcomes (data theft, social engineering, code execution, financial fraud)
“Promptware refers to a polymorphic family of prompts engineered to behave like malware, exploiting LLMs to execute malicious activities by abusing the application’s context, permissions, and functionality,” the researchers said. “In essence, promptware is an input, whether text, image, or audio, that manipulates an LLM’s behavior during inference time, targeting applications or users.”
— arXiv preprint
Stay Informed
Found this article interesting? Follow us for more exclusive content:
