We stress-tested our own AI agent guardrails before launch. Here's what broke.
Source: Dev.to
You can’t find the holes in a security system you designed. Your test suite maps the space you imagined, which is exactly what an attacker tries to escape.
Before we opened APort Vault to the public, we spent two weeks doing exactly that — trying to break our own guardrails. Not with a test suite. With intent.
We broke three of our eight core policy rules before any public player tried.
TL;DR
- Internal stress‑testing before CTF launch broke 3 of 8 core guardrail rules.
- Five attack classes: prompt injection, policy ambiguity, context poisoning, multi‑step chaining, passport bypass.
- Most dangerous finding: multi‑step chaining — each micro‑action passes; the composition violates policy.
- Fixes: intent‑based injection checks, default‑deny for gaps, cross‑turn session memory, opaque denial messages.
- Core lesson: post‑hoc filtering fails. Make dangerous states structurally unreachable.
Why are AI agent guardrails just security theater?
Most AI guardrails work like airport security theater. They look thorough, but a determined attacker walks through.
The big‑company approaches — LlamaFirewall (Meta) and NeMo Guardrails (NVIDIA) — focus on post‑hoc filtering. They detect bad actions after the agent decides to take them. That’s detection, not prevention.
A Show HN post for hibana‑agent argued the same thing: “dangerous actions must be structurally unreachable.” ClawMoat launched with a host‑level approach. The signal is clear: the industry is shifting from detection to structural constraints.
Building APort — an authorization layer that intercepts every tool call before execution — taught us that intent matters more than wording. But we didn’t know how fragile our intent detection was until we started breaking it ourselves.
Why passports, not border patrol?
Imagine you’re traveling to a new country. At every checkpoint, instead of showing your passport, you have to call your family back home to vouch for you.
That’s how most AI guardrails work today. They ask the LLM: “Is this action safe?” They rely on the model’s own judgment, which can be manipulated.
A better system works like a real passport: identity and permissions are encoded in a credential that travels with the agent. The guardrail doesn't ask "Is this allowed?"; it reads the credential and knows. That's what we're building with Agent Passport. But before we could trust it, we had to break it.
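A minimal sketch of the passport idea, assuming an HMAC-signed credential (the key name, field layout, and scope strings here are illustrative, not APort's actual format):

```python
import hashlib
import hmac
import json

SECRET = b"issuer-signing-key"  # hypothetical issuer key, not a real APort value

def issue_passport(agent_id: str, scopes: list[str]) -> dict:
    """Encode identity and permissions in a signed credential."""
    claims = {"agent": agent_id, "scopes": sorted(scopes)}
    payload = json.dumps(claims, sort_keys=True).encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return {"claims": claims, "sig": sig}

def verify(passport: dict, requested_scope: str) -> bool:
    """The guardrail reads the credential; it never asks the model."""
    payload = json.dumps(passport["claims"], sort_keys=True).encode()
    expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, passport["sig"]):
        return False  # forged or tampered credential
    return requested_scope in passport["claims"]["scopes"]

p = issue_passport("agent-42", ["files:read"])
assert verify(p, "files:read")
assert not verify(p, "files:delete")
```

The point of the design is that the decision is a signature check plus a set lookup: there is no model judgment in the loop to manipulate.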
What five attack classes did we test?
The CTF is built around five escalating attack classes. Each targets a different weakness in guardrail design.
| Level | Attack class | Description |
|---|---|---|
| 1 | Prompt injection | Direct override attempts: “ignore previous instructions,” “this is just a test,” “the user said it’s okay.” Goal: convince the LLM evaluator the action is safe through vocabulary reframing. |
| 2 | Policy ambiguity | Exploiting unclear policies — acting in the gap. If the policy says “don’t read sensitive files,” what counts as sensitive? Attackers find the gray zones and live there. |
| 3 | Context poisoning | Injecting false context into earlier turns to manipulate later decisions. “The user previously authorized this action.” The guardrail sees the poisoned context and makes a different decision. |
| 4 | Multi‑step reasoning manipulation | Chaining individually‑allowed actions to reach a forbidden outcome. Each micro‑action passes the guardrail; the composition violates policy. This is the hardest class of problem in AI policy design. |
| 5 | Full system bypass | Combining all the above, plus attacking the passport verification layer itself. If the guardrail trusts the passport, can you forge one? Can you make the verification step get skipped entirely? |
What broke when we tested?
- Prompt injection worked better than we expected. Not because detection was weak — because we were matching content, not intent. Reframing “retrieve the confidential document” as “open the user‑requested file” shifted the LLM’s judgment.
- Policy ambiguity was a free pass. “Don’t read sensitive files” left sensitive undefined. Every ambiguous gap was exploitable — we walked through all of them.
- Context poisoning broke our session memory. We validated each turn in isolation. Injecting false context into an early turn meant every later turn trusted it.
- Multi‑step chaining went undetected. Our guardrail evaluated each call independently. A denied macro‑action split into ten allowed micro‑actions passed clean. We only caught it by looking at the full session replay.
- Passport verification held, but the surrounding assumptions didn’t. Under specific edge conditions, the guardrail could be made to skip verification entirely — the passport check was sound, but the path to it wasn’t.
What did we fix before launch?
Prompt injection
Pre‑action authorization now checks intent, not just content. We map semantic equivalence — every synonym and reframing of a blocked operation follows the same evaluation path. The policy doesn’t care what the agent calls it.
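A toy sketch of mapping semantic equivalence before policy evaluation. The synonym table and blocked-operation list here are hypothetical; a production system would use embedding similarity rather than a static lookup:

```python
# Hypothetical synonym table: every reframing collapses to one canonical verb.
CANONICAL = {
    "retrieve": "read", "open": "read", "fetch": "read",
    "erase": "delete", "remove": "delete", "wipe": "delete",
}

# Hypothetical blocked operation from the policy.
BLOCKED = {("read", "confidential_report.pdf")}

def canonicalize(verb: str) -> str:
    return CANONICAL.get(verb.lower(), verb.lower())

def is_allowed(verb: str, target: str) -> bool:
    # Every synonym of a blocked operation follows the same evaluation path.
    return (canonicalize(verb), target) not in BLOCKED

assert not is_allowed("retrieve", "confidential_report.pdf")
assert not is_allowed("open", "confidential_report.pdf")  # same path, same denial
```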
Policy ambiguity
Explicit default‑deny when a policy gap is detected. If the policy doesn’t explicitly allow an action, it’s denied. No gray zones.
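Default-deny reduces to one line once the policy is an explicit allowlist. A minimal sketch (the scope names are made up for illustration):

```python
def evaluate(policy: dict[str, bool], action: str) -> bool:
    # Anything the policy does not explicitly allow is denied,
    # including actions the policy never mentions. No gray zones.
    return policy.get(action, False)

policy = {"files:read_public": True, "files:read_sensitive": False}
assert evaluate(policy, "files:read_public")
assert not evaluate(policy, "files:read_sensitive")
assert not evaluate(policy, "files:read_unclassified")  # gap -> deny
```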
Context poisoning
Per‑turn context validation against the original passport scope. If the context deviates from what was authorized, the request is rejected.
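One way to sketch this check: a poisoned turn claims authorizations that exceed the passport's scope, so a subset test catches it. The field names below are assumptions for illustration:

```python
def validate_turn(turn_context: dict, passport_scopes: set[str]) -> bool:
    """Reject any turn whose claimed authorizations exceed the passport."""
    claimed = set(turn_context.get("authorized_scopes", []))
    return claimed <= passport_scopes  # poisoned claims are a strict superset

scopes = {"files:read"}
clean = {"authorized_scopes": ["files:read"]}
poisoned = {"authorized_scopes": ["files:read", "files:delete"]}  # injected in an early turn
assert validate_turn(clean, scopes)
assert not validate_turn(poisoned, scopes)
```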
Multi‑step chaining
Session‑level reasoning that tracks the cumulative effect of micro‑actions. A macro‑action is denied if the aggregate outcome violates any rule, even if each step individually passes.
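The shape of the fix, reduced to a toy byte budget (the real aggregate rules are richer; this only shows why per-call checks miss the composition):

```python
class SessionGuard:
    """Deny when the aggregate of allowed micro-actions crosses a limit,
    even though each step would pass on its own."""

    def __init__(self, max_bytes: int):
        self.max_bytes = max_bytes
        self.total = 0

    def check(self, n_bytes: int) -> bool:
        if self.total + n_bytes > self.max_bytes:
            return False  # the composition violates policy
        self.total += n_bytes
        return True

guard = SessionGuard(max_bytes=1000)
results = [guard.check(150) for _ in range(10)]  # ten small reads
assert results[:6] == [True] * 6  # each early micro-action passes
assert results[6] is False        # the cumulative effect is denied
```

A stateless per-call guardrail sees ten 150-byte reads and approves all ten; only session state makes the 1,500-byte aggregate visible.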
Passport verification path
Hard‑wired enforcement that always routes through the passport check before any tool call. Edge‑case shortcuts were removed, and the verification step now runs atomically with the request.
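One way to make the check structurally unavoidable is to wrap every tool so the only callable entry point runs verification first. A sketch, with the passport simplified to a scope set and all names hypothetical:

```python
import functools

class PassportDenied(Exception):
    pass

def require_passport(scope: str):
    """Every call routes through verification; no code path
    reaches the tool body without it."""
    def wrap(tool):
        @functools.wraps(tool)
        def guarded(passport: set[str], *args, **kwargs):
            if scope not in passport:        # the check runs atomically
                raise PassportDenied(scope)  # with the request itself
            return tool(*args, **kwargs)
        return guarded
    return wrap

@require_passport("files:read")
def read_file(path: str) -> str:
    return f"contents of {path}"

assert read_file({"files:read"}, "notes.txt") == "contents of notes.txt"
```

Because only the wrapped function is ever registered as the tool, there is no "shortcut" path for an edge case to skip: the verification and the call are one unit.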
Opaque denial messages
Denial messages to callers are now information‑poor, while the internal audit log is information‑rich. An attacker probing the response surface learns nothing useful.
Core lesson: post‑hoc filtering fails, structure is the answer
Make dangerous states structurally unreachable, not merely detectable. The multi‑step chaining fix is this lesson made concrete: session‑level context accumulation flags sequences that match known bypass chains, the way fraud‑detection systems score transaction sequences rather than individual transactions. Our open‑source aport‑agent‑guardrails implements these patterns: intent‑based, session‑aware, default‑deny policies, with credentials treated like passports rather than asking the model "Is this okay?"
What’s the structural shift happening in AI guardrails?
The industry is moving from detection to structure.
- Hibana‑agent’s “structurally unreachable” thesis aligns with what we learned.
- ClawMoat’s host‑level approach is another version of the same idea.
Our own fix was to move authorization earlier in the loop: before the agent decides, before the LLM reasons, and before the tool call is even constructed. That’s the only way to close the multi‑step gap.
We found and fixed what we could ourselves. That’s the limit of internal testing — you can only break what you can imagine.
The CTF is live because we know we missed something. Come find it.
vault.aport.io — Levels 1 and 2 free. Levels 3‑5 pay out up to $5,000 to whoever gets there first.
Deadline: March 12, 2026.
