Models Self-Censor When Policy Gates Exist
Source: Dev.to
There’s something interesting happening with AI agents that most people haven’t noticed yet.
When you put a hard policy gate in front of a model—a deterministic block on certain actions—the model starts behaving differently. It stops attempting actions that will be blocked and instead works within the boundaries.
This isn’t about fine‑tuning or prompt engineering. It’s about how models respond to consistent, enforceable constraints.
The Guardrail Problem
Most AI safety today relies on another AI watching the first one. You tell a guardrail model “don’t let the agent delete the database” and hope it listens. But guardrails have their own problems. Recent research from Harvard showed that ChatGPT’s guardrail sensitivity varies based on factors such as the user’s favorite sports team—Chargers fans were refused more often than Eagles fans on certain requests. Women were refused more than men on requests for censored information.
This is what happens when you use probabilistic systems to check other probabilistic systems: the results are inconsistent and sometimes just weird.
Researchers distinguish two types of censorship in LLMs:
- Hard censorship – the model explicitly refuses to answer, e.g., “I can’t help with that.”
- Soft censorship – the model omits information or downplays certain elements while still responding.
Both are unpredictable when the rules are fuzzy.
What Changes With Hard Boundaries
Put the same model behind a deterministic policy gate and something shifts.
- The gate doesn’t reason, get tired, or get confused—it simply checks actions against rules written in code. If the rule says “no,” it’s no—every time.
- The model learns this quickly. It stops generating actions that will hit the deny rule, not because it understands ethics or safety, but because those actions reliably fail. The agent’s job is to accomplish tasks, and wasting tokens on things that always get blocked doesn’t help.
- This is the opposite of how models behave with probabilistic guardrails. When another model watches and might be tricked, agents probe, rephrase, and look for wording that slips through, creating an adversarial dynamic.
- Hard boundaries remove that adversarial dynamic. The model can’t talk its way out of a regex or a type check, so it stops trying.
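The properties listed above can be sketched as a minimal deterministic gate. This is an illustrative sketch, not a real API: `Action`, `DENY_RULES`, and `gate` are hypothetical names, and the two rules are placeholder examples.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Action:
    tool: str                      # e.g. "refund", "sql", "shell"
    params: dict = field(default_factory=dict)

# Each rule is (name, predicate). Plain code, no model: the same
# action always produces the same verdict.
DENY_RULES: list[tuple[str, Callable[[Action], bool]]] = [
    ("no_db_drop", lambda a: a.tool == "sql"
        and "drop table" in a.params.get("query", "").lower()),
    ("no_large_refund", lambda a: a.tool == "refund"
        and a.params.get("amount", 0) > 500),
]

def gate(action: Action) -> tuple[bool, str]:
    """Return (allowed, reason). Deterministic: no reasoning, no fatigue."""
    for name, denies in DENY_RULES:
        if denies(action):
            return False, name
    return True, "ok"
```

Because the verdict is a pure function of the action, there is nothing for the agent to argue with: `gate(Action("refund", {"amount": 900}))` is denied every single time, regardless of how the surrounding request is phrased.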
What This Looks Like
Teams running customer‑support agents have noticed this pattern. Before hard limits were in place, agents occasionally suggested refunds above policy limits. The guardrail caught most of them, but some slipped through.
After adding a simple rule—if amount > 500 then deny—the behavior changed within hours. The agent stopped suggesting large refunds entirely; instead it offered store credit, escalated to humans, and found alternatives that worked within the boundary.
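The flow described in that anecdote might look like the following sketch. The 500 limit comes from the rule above; everything else—the helper names, the store-credit threshold, the escalation path—is a hypothetical illustration of "finding alternatives within the boundary":

```python
REFUND_LIMIT = 500  # the policy rule: if amount > 500 then deny

def refund_allowed(amount: int) -> bool:
    """Deterministic gate check for a refund action."""
    return amount <= REFUND_LIMIT

def handle_refund_request(amount: int) -> str:
    if refund_allowed(amount):
        return f"refund:{amount}"
    # Denied every time, so don't retry: work within the boundary.
    if amount <= 2 * REFUND_LIMIT:           # illustrative threshold
        return f"store_credit:{amount}"
    return "escalate_to_human"
```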
A similar pattern appears with shell commands. Deterministically blocking rm -rf makes agents stop generating destructive commands; they simply don't bother.
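A deny-list like this can be as simple as a few compiled regexes, assuming the agent's shell calls pass through the gate before execution. The patterns below are examples for illustration, not a complete safety list:

```python
import re

# Example destructive-command patterns; a real deployment would need
# a far more thorough list.
DESTRUCTIVE = [
    re.compile(r"\brm\s+(-[a-zA-Z]*r[a-zA-Z]*f|-[a-zA-Z]*f[a-zA-Z]*r)\b"),  # rm -rf / rm -fr
    re.compile(r"\bmkfs\b"),
    re.compile(r"\bdd\s+.*\bof=/dev/"),
]

def allow_shell(cmd: str) -> bool:
    """Deterministic check: a model can't rephrase its way past a regex."""
    return not any(p.search(cmd) for p in DESTRUCTIVE)
```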
This isn’t the model becoming morally better; it’s optimizing for success within constraints.
Why This Matters
The security industry has long worried that AI models will be too creative at finding ways around constraints, that they’ll jailbreak their way past any barrier.
In practice, the opposite seems to happen: consistent constraints change behavior. When a model learns that certain actions always fail, those branches are pruned from its effective action space. The path of least resistance becomes staying within the lines.
Implications extend beyond safety:
- Models become more predictable and reliable.
- They are easier to put into production without constant fear of unexpected behavior.
- The mechanism is simple efficiency—models constantly make micro‑decisions about what to try, and forbidden actions that always fail are quickly abandoned.
The Takeaway
If you’re building agents that actually do things in the world, this is worth paying attention to. The way you constrain an agent doesn’t just protect your systems; it shapes how the agent behaves. A well‑designed policy layer becomes part of the agent’s decision process, not just an external check.
The agent learns to work with the boundaries instead of against them.
I’m building Faramesh, which implements this idea in practice—hard policy gates for AI agents. More information at faramesh.dev.