Why AI Agents Don't Follow Rules — The Case for Physical Governance

Published: April 6, 2026 at 07:18 PM EDT
Source: Dev.to

The Fact That Started This

A repository had over 130 KB of governance documentation.
The AI agent read it, acknowledged it, then violated it on the next tool call.

This is not a failure of instruction. It is a failure of architecture.

Why Textual Rules Fail

The current standard approach to AI agent governance is: write a rule in a prompt.

Rules

  • Never edit the evals/ directory
  • Write operations to 00_Management/ are forbidden

These rules have a structural flaw: they take effect only at read time. They assume the agent will choose compliance, but nothing enforces that choice when a tool call actually executes.

This is why GNU rm refuses to run rm -rf / unless --no-preserve-root is passed, rather than relying on a policy document. Physical constraints act at execution time, while textual rules act at read time, which is the wrong moment.

The Verification Contamination Problem

If an agent can evaluate its own output, it can contaminate the evaluation criteria — not intentionally, but by carrying the same failure modes from generation into evaluation. A system where tests always pass may be a system where tests don’t work.
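One way to make this kind of contamination visible is to fingerprint the evaluation files before the agent runs and compare afterwards. The sketch below is an illustration only, not something taken from AOS or iron_cage; the function name `fingerprint` and the idea of hashing a whole directory are assumptions made for the example.

```python
import hashlib
from pathlib import Path


def fingerprint(directory: str) -> str:
    """Return one SHA-256 digest covering every file under `directory`.

    Hypothetical helper: any change to a file name or file content
    under the directory changes the returned digest.
    """
    root = Path(directory)
    digest = hashlib.sha256()
    files = sorted(p for p in root.rglob("*") if p.is_file())
    for path in files:
        # Mix in the relative path, then the hash of the file content.
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(hashlib.sha256(path.read_bytes()).digest())
    return digest.hexdigest()
```

A supervising process outside the agent's reach would record `fingerprint("evals")` before the run and treat any difference afterwards as evidence that the evaluation criteria were touched.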

What AOS Defines

AI Operating Standard (AOS) defines the minimum physical‑constraint layer for AI agent operations in a shared codebase. It consists of three components.

1. Zones — Classify every path

Zone       | Class               | Write Permission
Oracle     | Read-only, absolute | No agent may write
Permitted  | Agent workspace     | Allowed within role limits
Prohibited | Out of scope        | Sovereign authorization only
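The zone table can be sketched as a small path classifier. This is a minimal illustration, not the AOS definition: the prefix map is hypothetical (evals/ and 00_Management/ are borrowed from the example rules above, src/ and tests/ are invented), and treating unmapped paths as Prohibited is an assumption.

```python
from enum import Enum
from pathlib import PurePosixPath


class Zone(Enum):
    ORACLE = "oracle"          # read-only, no agent may write
    PERMITTED = "permitted"    # agent workspace
    PROHIBITED = "prohibited"  # out of scope, Sovereign authorization only


# Hypothetical zone map keyed by top-level directory (not taken from AOS).
ZONE_BY_TOP_LEVEL = {
    "evals": Zone.ORACLE,
    "00_Management": Zone.ORACLE,
    "src": Zone.PERMITTED,
    "tests": Zone.PERMITTED,
}


def classify(path: str) -> Zone:
    """Classify a repo-relative path; anything unmapped is Prohibited."""
    pure = PurePosixPath(path)
    parts = pure.parts
    # Absolute paths and parent-directory escapes never leave Prohibited.
    if pure.is_absolute() or ".." in parts or not parts:
        return Zone.PROHIBITED
    return ZONE_BY_TOP_LEVEL.get(parts[0], Zone.PROHIBITED)
```

Defaulting to Prohibited means a path that nobody classified fails closed instead of silently becoming writable.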

2. Roles — Non‑overlapping responsibilities

Three roles: Architect, Executor, Sovereign.
An agent MUST NOT act outside its assigned role. When a role boundary is reached, the agent stops and escalates to a human.
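The stop-and-escalate rule can be sketched as a role check that raises instead of proceeding. The role names come from the text above; the action names and the `ALLOWED_ACTIONS` mapping are hypothetical, since the article does not list what each role may do.

```python
from enum import Enum


class Role(Enum):
    ARCHITECT = "architect"
    EXECUTOR = "executor"
    SOVEREIGN = "sovereign"


class EscalationRequired(Exception):
    """The action is outside the acting role; a human must decide."""


# Hypothetical action sets, chosen only to illustrate non-overlap.
ALLOWED_ACTIONS = {
    Role.ARCHITECT: frozenset({"write_plan"}),
    Role.EXECUTOR: frozenset({"write_code"}),
    Role.SOVEREIGN: frozenset({"authorize"}),
}


def roles_are_disjoint() -> bool:
    """Check that no action is granted to more than one role."""
    seen = set()
    for actions in ALLOWED_ACTIONS.values():
        if seen & actions:
            return False
        seen |= actions
    return True


def require(role: Role, action: str) -> None:
    """Raise EscalationRequired when `action` is outside `role`."""
    if action not in ALLOWED_ACTIONS[role]:
        raise EscalationRequired(
            f"{role.value} may not {action}: stop and escalate to a human"
        )
```

Raising an exception, rather than returning a boolean, makes it harder for calling code to ignore a boundary violation by accident.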

3. Physical Enforcement — Intercept at execution time

A PreToolUse hook blocks write operations before filesystem access occurs.

  • Write to Oracle Zone → exit 2 (call is never executed)
  • Destructive patterns (sed -i, perl -i, truncate) → exit 2

Nothing here assumes agent goodwill: the hook, not the agent's judgment, enforces compliance.
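The two blocking rules above can be sketched as a hook script. This is not the iron_cage source. It assumes the hook receives a JSON payload on stdin with `tool_name`, `tool_input`, and `cwd` fields, that file-writing tools are named Write, Edit, and MultiEdit, and that exit code 2 blocks the call; the Oracle prefixes are hypothetical and reuse the example rules from earlier in the article.

```python
import json
import os
import re
import sys

# Hypothetical Oracle Zone prefixes (from the article's example rules).
ORACLE_PREFIXES = ("evals/", "00_Management/")
# Assumed names of the file-writing tools.
WRITE_TOOLS = {"Write", "Edit", "MultiEdit"}
# Deliberately conservative match for in-place edits and truncation.
DESTRUCTIVE = re.compile(r"\b(?:sed|perl)\s+-\w*i\b|\btruncate\b")


def is_oracle(file_path: str, root: str) -> bool:
    """True if file_path resolves (lexically) into an Oracle prefix under root."""
    rel = os.path.relpath(os.path.join(root, file_path), root)
    rel = rel.replace(os.sep, "/")
    return any(rel == p.rstrip("/") or rel.startswith(p) for p in ORACLE_PREFIXES)


def decide(payload: dict) -> tuple[int, str]:
    """Return (exit_code, reason); exit code 2 means block the call."""
    tool = payload.get("tool_name", "")
    tool_input = payload.get("tool_input", {})
    root = payload.get("cwd") or os.getcwd()
    if tool in WRITE_TOOLS and is_oracle(tool_input.get("file_path", ""), root):
        return 2, "Blocked: write to Oracle Zone"
    if tool == "Bash" and DESTRUCTIVE.search(tool_input.get("command", "")):
        return 2, "Blocked: destructive in-place edit pattern"
    return 0, ""


def main() -> int:
    code, reason = decide(json.load(sys.stdin))
    if code:
        print(reason, file=sys.stderr)  # reason is reported back on stderr
    return code

# When installed as a hook script, finish the file with: sys.exit(main())
```

The path check is lexical only, so a symlink pointing into an Oracle directory would slip past it; a real implementation would need to resolve links as well.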

Reference Implementation: iron_cage

iron_cage is the AOS reference implementation. It implements §§ 4.1–4.5 via Claude Code’s PreToolUse hook system.

Behind iron_cage is a design principle called Type‑91 Governance:

  • Forensic isolation — tamper‑evident physical evidence trails
  • Physical isolation — agents cannot modify their own evaluation criteria

The scripts are the surface; the architecture runs deeper. AOS is the standard, and iron_cage proves that it works.

Specification (AOS‑v0.1)

Feed the Spec to the Agent

The specification is not written only for human readers. AOS-v0.1.md opens with §0: Machine‑Reading Instructions. Loading this spec into an agent’s context window lets the agent understand—at specification level—what it must not do.

Not “do not do X because the prompt says so.”
“Do not do X because the specification defines it as a hard constraint with a physical enforcement mechanism.”

This is the second design intent of AOS: agents that read the spec become self‑constraining.

Why Now

In 2026, “how do you trust what an AI agent produced” remains unsolved. Most teams still try to solve it with prompts, but there is no standard for the physical governance layer. Someone has to define it—AOS is that attempt.

This Is a Draft

AOS v0.1 is not a finished standard. Issues, pull requests, and implementation reports are welcome.

Repository:
