Why AI Agents Don't Follow Rules — The Case for Physical Governance
Source: Dev.to
The Fact That Started This
A repository had over 130 KB of governance documentation.
The AI agent read it, acknowledged it, then violated it on the next tool call.
This is not a failure of instruction. It is a failure of architecture.
Why Textual Rules Fail
The current standard approach to AI agent governance is: write a rule in a prompt.
Rules
- Never edit the evals/ directory
- Write operations to 00_Management/ are forbidden
These rules have a structural flaw. Textual rules enforce at read time. They assume the agent will choose compliance, but there is no mechanism that enforces this choice at execution time.
This is why GNU rm refuses to run rm -rf / unless --no-preserve-root is passed, rather than relying on a policy document. Physical constraints act at execution time; textual rules act at reading time, which is the wrong moment.
The Verification Contamination Problem
If an agent can evaluate its own output, it can contaminate the evaluation criteria — not intentionally, but by carrying the same failure modes from generation into evaluation. A system where tests always pass may be a system where tests don’t work.
What AOS Defines
AI Operating Standard (AOS) defines the minimum physical‑constraint layer for AI agent operations in a shared codebase. It consists of three components.
1. Zones — Classify every path
| Zone | Class | Write Permission |
|---|---|---|
| Oracle | Read‑only, absolute | No agent may write |
| Permitted | Agent workspace | Allowed within role limits |
| Prohibited | Out of scope | Sovereign authorization only |
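The zone table can be read as a lookup from path to write permission. The sketch below is illustrative only: the `ZONE_ROOTS` mapping, the function names, and the choice to treat unlisted paths as Prohibited are assumptions, not part of AOS-v0.1 as quoted here.

```python
from enum import Enum
from pathlib import PurePosixPath


class Zone(Enum):
    ORACLE = "oracle"          # read-only for every agent
    PERMITTED = "permitted"    # agent workspace
    PROHIBITED = "prohibited"  # Sovereign authorization only


# Hypothetical zone map: a real repository would declare its own roots.
ZONE_ROOTS = {
    "evals": Zone.ORACLE,
    "00_Management": Zone.PROHIBITED,
    "src": Zone.PERMITTED,
}


def classify(path: str) -> Zone:
    """Return the zone of a repo-relative path.

    Assumption: anything unlisted, absolute, or containing '..' is
    treated as Prohibited (default deny).
    """
    pure = PurePosixPath(path)
    parts = pure.parts
    if not parts or pure.is_absolute() or ".." in parts:
        return Zone.PROHIBITED
    return ZONE_ROOTS.get(parts[0], Zone.PROHIBITED)


def agent_may_write(path: str) -> bool:
    """Only the Permitted zone is writable by an agent."""
    return classify(path) is Zone.PERMITTED
```

Defaulting to Prohibited means a path nobody thought to classify fails closed rather than open.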
2. Roles — Non‑overlapping responsibilities
Three roles: Architect, Executor, Sovereign.
An agent MUST NOT act outside its assigned role. When a role boundary is reached, the agent stops and escalates to a human.
3. Physical Enforcement — Intercept at execution time
A PreToolUse hook blocks write operations before filesystem access occurs.
- Write to Oracle Zone → exit 2 (call is never executed)
- Destructive patterns (sed -i, perl -i, truncate) → exit 2
No assumption of agent goodwill; physical law enforces compliance.
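A minimal sketch of what such a hook might look like follows. It is not the iron_cage code. It assumes the hook receives a JSON event on stdin with `tool_name`, `tool_input` (`file_path` or `command`), and `cwd` fields, which is how Claude Code's hook input is commonly documented; check the field names against the current documentation. `ORACLE_PREFIXES` and the regular expression are hypothetical and deliberately naive.

```python
import json
import os
import re
import sys

# Hypothetical Oracle Zone roots, relative to the project directory.
ORACLE_PREFIXES = ("evals",)

# Naive match for in-place editors and truncate; real commands can evade it.
DESTRUCTIVE = re.compile(r"\b(?:sed|perl)\s+-\w*i\b|\btruncate\b")

WRITE_TOOLS = ("Write", "Edit", "MultiEdit")


def _relative(path: str, root: str) -> str:
    """Normalize a path (absolute or relative) to a root-relative form."""
    full = os.path.normpath(os.path.join(root, path))
    return os.path.relpath(full, root).replace(os.sep, "/")


def decide(event: dict) -> tuple[int, str]:
    """Return (exit_code, reason): 2 blocks the call, 0 lets it proceed."""
    tool = event.get("tool_name", "")
    tool_input = event.get("tool_input", {})
    root = event.get("cwd") or os.getcwd()

    if tool in WRITE_TOOLS:
        rel = _relative(str(tool_input.get("file_path", "")), root)
        for prefix in ORACLE_PREFIXES:
            if rel == prefix or rel.startswith(prefix + "/"):
                return 2, f"blocked: {rel} is in the Oracle Zone"

    if tool == "Bash":
        command = str(tool_input.get("command", ""))
        if DESTRUCTIVE.search(command):
            return 2, "blocked: destructive in-place pattern in command"

    return 0, ""


def main() -> int:
    """Read one hook event from stdin and report the decision on stderr."""
    code, reason = decide(json.load(sys.stdin))
    if code:
        print(reason, file=sys.stderr)
    return code
```

A script registered as the hook would end with `sys.exit(main())`. The decision logic lives in `decide` so it can be tested without a live agent session.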
Reference Implementation: iron_cage
iron_cage is the AOS reference implementation. It implements AOS §§ 4.1–4.5 through Claude Code’s PreToolUse hook system.
Behind iron_cage is a design principle called Type‑91 Governance:
- Forensic isolation — tamper‑evident physical evidence trails
- Physical isolation — agents cannot modify their own evaluation criteria
The scripts are the surface; the architecture runs deeper. AOS is the standard, and iron_cage proves that it works.
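The article does not say how iron_cage makes its evidence trail tamper-evident. One common technique is a hash chain, where each log entry commits to the one before it. The sketch below shows that technique under stated assumptions; the entry layout and function names are hypothetical, and it is not a description of iron_cage.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev: str, event: dict) -> str:
    """Hash the previous entry's hash together with this event."""
    payload = json.dumps(event, sort_keys=True)
    return hashlib.sha256((prev + payload).encode("utf-8")).hexdigest()


def append_entry(log: list, event: dict) -> dict:
    """Append an event, chaining it to the previous entry's hash."""
    prev = log[-1]["hash"] if log else GENESIS
    entry = {"prev": prev, "event": event, "hash": _digest(prev, event)}
    log.append(entry)
    return entry


def verify(log: list) -> bool:
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev = GENESIS
    for entry in log:
        if entry["prev"] != prev or entry["hash"] != _digest(prev, entry["event"]):
            return False
        prev = entry["hash"]
    return True
```

A chain like this only detects tampering. For it to count as evidence, the log (or at least its latest hash) would have to live somewhere the agent cannot write, which is the same isolation the second bullet asks for.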
Specification (AOS‑v0.1)
Feed the Spec to the Agent
The specification is not written only for human readers. AOS-v0.1.md opens with §0: Machine‑Reading Instructions. Loading this spec into an agent’s context window lets the agent understand—at specification level—what it must not do.
Not “do not do X because the prompt says so.”
“Do not do X because the specification defines it as a hard constraint with a physical enforcement mechanism.”
This is the second design intent of AOS: agents that read the spec become self‑constraining.
Why Now
In 2026, “how do you trust what an AI agent produced” remains unsolved. Most teams still try to solve it with prompts, but there is no standard for the physical governance layer. Someone has to define it—AOS is that attempt.
This Is a Draft
AOS v0.1 is not a finished standard. Issues, pull requests, and implementation reports are welcome.
Repository: