Why AI Agents Don't Follow Rules — The Case for Physical Governance

Published: April 6, 2026 at 07:18 PM EDT

Source: Dev.to

The Fact That Started This

A repository had over 130 KB of governance documentation.
The AI agent read it, acknowledged it, then violated it on the next tool call.

This is not a failure of instruction. It is a failure of architecture.

Why Textual Rules Fail

The current standard approach to AI agent governance is: write a rule in a prompt.

Rules

Never edit the evals/ directory
Write operations to 00_Management/ are forbidden

These rules share a structural flaw: textual rules take effect at read time. They assume the agent will choose to comply, but nothing enforces that choice at execution time.

This is why GNU rm refuses to run rm -rf / unless --no-preserve-root is passed: the safeguard is built into the command, not written in a policy document. Physical constraints act at execution time, while textual rules act at reading time, which is the wrong moment.

The Verification Contamination Problem

If an agent can evaluate its own output, it can contaminate the evaluation criteria — not intentionally, but by carrying the same failure modes from generation into evaluation. A system where tests always pass may be a system where tests don’t work.
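One narrow, mechanical part of this problem can be checked directly: whether the files an agent is judged against changed while the agent was working. The sketch below is an illustration, not part of AOS or iron_cage. The function names are hypothetical, and it assumes the expected fingerprint is recorded somewhere the agent cannot write.

```python
import hashlib
from pathlib import Path


def fingerprint(directory: str) -> str:
    """SHA-256 over the relative path and content of every file, in sorted order."""
    root = Path(directory)
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        # Hash the name as well as the bytes, so renames are detected too.
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()


def evaluation_unchanged(directory: str, expected: str) -> bool:
    """True if the evaluation files still match the fingerprint taken earlier."""
    return fingerprint(directory) == expected
```

A mismatch does not say who changed the criteria or why. It only says that a passing result can no longer be taken at face value.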

What AOS Defines

The AI Operating Standard (AOS) defines the minimum physical-constraint layer for AI agent operations in a shared codebase. It consists of three components.

1. Zones — Classify every path

Zone	Class	Write Permission
Oracle	Read‑only, absolute	No agent may write
Permitted	Agent workspace	Allowed within role limits
Prohibited	Out of scope	Sovereign authorization only
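The table above can be read as a lookup from path to zone. The following is a minimal sketch of such a classifier, not the AOS or iron_cage implementation. The prefix lists are assumptions: evals/ and 00_Management/ are borrowed from the example rules earlier, src and docs are invented for illustration, and treating every unlisted path as Prohibited is a design choice of this sketch.

```python
from enum import Enum
from pathlib import PurePosixPath


class Zone(Enum):
    ORACLE = "oracle"          # read-only, absolute: no agent may write
    PERMITTED = "permitted"    # agent workspace: writes allowed within role limits
    PROHIBITED = "prohibited"  # out of scope: Sovereign authorization only


# Hypothetical zone map; a real repository would define its own prefixes.
ORACLE_PREFIXES = ("evals", "00_Management")
PERMITTED_PREFIXES = ("src", "docs")


def _is_under(path: PurePosixPath, prefix: str) -> bool:
    """True if path equals prefix or lies inside it."""
    base = PurePosixPath(prefix)
    return path == base or base in path.parents


def classify(repo_relative_path: str) -> Zone:
    """Classify a repository-relative path; anything unlisted is Prohibited."""
    path = PurePosixPath(repo_relative_path)
    # Absolute paths and ".." segments are refused rather than resolved.
    if path.is_absolute() or ".." in path.parts:
        return Zone.PROHIBITED
    if any(_is_under(path, p) for p in ORACLE_PREFIXES):
        return Zone.ORACLE
    if any(_is_under(path, p) for p in PERMITTED_PREFIXES):
        return Zone.PERMITTED
    return Zone.PROHIBITED
```

Defaulting to Prohibited means a path nobody thought to classify fails closed instead of becoming writable by accident.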

2. Roles — Non‑overlapping responsibilities

Three roles: Architect, Executor, Sovereign.
An agent MUST NOT act outside its assigned role. When a role boundary is reached, the agent stops and escalates to a human.

3. Physical Enforcement — Intercept at execution time

A PreToolUse hook blocks write operations before filesystem access occurs.

Write to Oracle Zone → exit 2 (call is never executed)
Destructive patterns (sed -i, perl -i, truncate) → exit 2

The design assumes no goodwill on the agent's part: the hook enforces compliance regardless of what the agent intends.
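As a rough illustration of the two blocking rules above, here is a minimal sketch of the decision logic such a hook might run. It is not iron_cage's code. It assumes the hook receives a JSON event on stdin with tool_name and tool_input fields (file_path for write tools, command for Bash) and that exit code 2 blocks the call. The /repo root, the Oracle prefixes, and the regular expression are illustrative assumptions.

```python
import json
import posixpath
import re
import sys

BLOCK, ALLOW = 2, 0

# Assumption: these directories form the Oracle Zone (taken from the example rules).
ORACLE_PREFIXES = ("evals", "00_Management")
WRITE_TOOLS = ("Write", "Edit", "MultiEdit")

# Heuristic for the in-place edit patterns named above: sed -i, perl -i, truncate.
DESTRUCTIVE = re.compile(
    r"\b(?:sed|perl)\b[^|;&]*?\s-[A-Za-z]*i|(?:^|[|;&])\s*truncate\b"
)


def _repo_relative(path: str, root: str) -> str:
    """Collapse '..' segments and express the path relative to the repo root."""
    full = posixpath.normpath(posixpath.join(root, path))
    return posixpath.relpath(full, root)


def _in_oracle(rel: str) -> bool:
    return any(rel == p or rel.startswith(p + "/") for p in ORACLE_PREFIXES)


def decide(event: dict, root: str = "/repo") -> tuple[int, str]:
    """Return (exit_code, reason) for one tool-call event."""
    tool = event.get("tool_name", "")
    tool_input = event.get("tool_input") or {}
    if tool in WRITE_TOOLS:
        rel = _repo_relative(str(tool_input.get("file_path", "")), root)
        if _in_oracle(rel):
            return BLOCK, f"write to Oracle Zone blocked: {rel}"
    if tool == "Bash" and DESTRUCTIVE.search(str(tool_input.get("command", ""))):
        return BLOCK, "destructive in-place edit blocked"
    return ALLOW, ""


def main(stdin=sys.stdin, stderr=sys.stderr, root: str = "/repo") -> int:
    """Read one event, report the reason on stderr, and return the exit code."""
    code, reason = decide(json.load(stdin), root)
    if reason:
        print(reason, file=stderr)
    return code
```

A hook script would end with raise SystemExit(main()). This sketch does not catch shell redirection into the Oracle Zone (for example echo x > evals/a.py), so pattern matching on commands is a supplement to filesystem permissions, not a replacement for them.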

Reference Implementation: `iron_cage`

iron_cage is the AOS reference implementation. It implements §§ 4.1–4.5 of the specification via Claude Code’s PreToolUse hook system.

Behind iron_cage is a design principle called Type‑91 Governance:

Forensic isolation — tamper‑evident physical evidence trails
Physical isolation — agents cannot modify their own evaluation criteria
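One common way to make an evidence trail tamper-evident is a hash chain, where each log entry commits to the hash of the entry before it. The sketch below shows the idea only. The article does not say how iron_cage records evidence, so the entry layout and function names here are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(prev_hash: str, record: dict) -> str:
    """Hash the previous entry's hash together with a canonical form of the record."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append(log: list[dict], record: dict) -> None:
    """Append a record that is chained to the current tail of the log."""
    prev = log[-1]["hash"] if log else GENESIS
    log.append({"record": record, "prev": prev, "hash": _digest(prev, record)})


def verify(log: list[dict]) -> bool:
    """Recompute the chain; any edited, removed, or reordered entry breaks it."""
    prev = GENESIS
    for entry in log:
        if entry["prev"] != prev or entry["hash"] != _digest(prev, entry["record"]):
            return False
        prev = entry["hash"]
    return True
```

The chain only shows that tampering happened, and only if the latest hash is kept where the agent cannot rewrite it. Otherwise the whole log could be regenerated consistently.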

The scripts are the surface; the architecture runs deeper. AOS is the standard, and iron_cage proves that it works.

Specification (AOS‑v0.1)

Feed the Spec to the Agent

The specification is not written only for human readers. AOS-v0.1.md opens with §0: Machine‑Reading Instructions. Loading this spec into an agent’s context window lets the agent understand—at specification level—what it must not do.

Not “do not do X because the prompt says so.”
“Do not do X because the specification defines it as a hard constraint with a physical enforcement mechanism.”

This is the second design intent of AOS: agents that read the spec become self‑constraining.

Why Now

In 2026, “how do you trust what an AI agent produced” remains unsolved. Most teams still try to solve it with prompts, but there is no standard for the physical governance layer. Someone has to define it—AOS is that attempt.

This Is a Draft

AOS v0.1 is not a finished standard. Issues, pull requests, and implementation reports are welcome.

Repository:
