Don't trust AI agents

Published: 3 days ago (February 28, 2026 at 07:39 AM EST)

6 min read

Source: Hacker News

Building with AI Agents: Assume They’re Untrusted

When you’re building with AI agents, they should be treated as untrusted and potentially malicious. Whether you’re worried about:

Prompt injection
A model trying to escape its sandbox
Or something nobody’s thought of yet

…regardless of your threat model, you shouldn’t be trusting the agent.

The Right Approach

The solution isn’t:

Better permission checks
Smarter allow‑lists

Instead, design architecture that assumes agents will misbehave and contains the damage when they do.

That’s the principle I built NanoClaw on.

Don’t Trust the Process

OpenClaw runs directly on the host machine by default. It offers an opt‑in Docker sandbox mode, but this mode is disabled out of the box, and most users never enable it. Consequently, security relies entirely on application‑level checks such as:

Allowlists
Confirmation prompts
A predefined set of “safe” commands

These checks assume implicit trust that the agent will not act maliciously. If you adopt the mindset that an agent could be hostile, it becomes clear that application‑level blocks are insufficient—they do not provide hermetic security. A determined or compromised agent can find ways to bypass them.

NanoClaw’s Approach: Container Isolation

In NanoClaw, container isolation is a core architectural principle:

Feature	Description
Per‑agent containers	Each agent runs in its own Docker (or Apple Container on macOS) instance.
Ephemeral lifecycle	Containers are created fresh for each invocation and destroyed afterward.
Unprivileged execution	The agent runs as a non‑root user inside the container.
Explicit mounts only	The container can see only the directories that are explicitly mounted.
OS‑enforced boundaries	The container boundary is enforced by the operating system, providing strong isolation.

By leveraging these container guarantees, NanoClaw ensures that even a malicious or compromised agent cannot escape its sandbox or affect the host system.

Don’t Trust Other Agents

Even when OpenClaw’s sandbox is enabled, all agents share the same container. You might have a personal‑assistant agent, a work agent, a family‑group agent, etc., each operating in different WhatsApp groups or Telegram channels. Because they run in the same environment, information can leak between agents that are supposed to access different data.

Why Per‑Agent Isolation Matters

In NanoClaw each agent gets:

its own container
a dedicated filesystem (/data/)
an independent Claude session history

Thus, your personal assistant cannot see the work agent’s data, and vice‑versa.

Comparison: Shared vs. Per‑Agent Containers

Feature	Shared Container	Per‑Agent Containers
Filesystem	Single shared FS	Separate `/data/` directories
Credentials	All credentials accessible	Each agent sees only its own data
Session histories	All visible	Each agent has its own session
Mounted data	All data shared	Mounts are scoped per agent
Isolation	None – agents see everything	Agents are isolated from each other
Example layout
Personal Assistant	`/data/personal` (ro)	`/data/personal` (ro)
Work Agent	`/data/work` (rw)	`/data/work` (rw)
Family Group Agent	`/data/family` (ro)	`/data/family` (ro)

The container boundary is the hard security layer – an agent cannot escape it regardless of configuration.

Defense‑in‑Depth: Mount Allowlist

A mount allowlist located at

~/.config/nanoclaw/mount-allowlist.json

provides an additional safeguard:

Purpose: Prevent the user from accidentally mounting sensitive paths, not to stop an agent from breaking out.
Defaults: Sensitive directories/files such as .ssh, .gnupg, .aws, .env, private_key, credentials are blocked.
Location: The allowlist lives outside the project directory, so a compromised agent cannot modify its own permissions.
Host code: The host application code is mounted read‑only, ensuring nothing an agent does can persist after the container is destroyed.

Trust Model for Group Chats

Non‑main groups are untrusted by default.
Members of other groups cannot:
- Send messages to chats they don’t belong to
- Schedule tasks for other groups
- View data belonging to other groups

Since anyone in a group could attempt a prompt‑injection attack, the security model assumes the worst‑case scenario and isolates groups accordingly.

Don’t Trust What You Can’t Read

OpenClaw contains nearly half a million lines of code, 53 configuration files, and more than 70 dependencies. This scale breaks the basic premise of open‑source security.

Chromium has 35 + million lines, yet we trust Google’s review processes.
Most open‑source projects stay small enough that many eyes can actually review them.

Nobody has reviewed OpenClaw’s 400 k lines. It was written in weeks with no proper review process. Complexity is where vulnerabilities hide, and Microsoft’s analysis confirms this: OpenClaw’s risks can emerge through normal API calls because no single person can see the full picture.

NanoClaw: Small, Auditable, and Extensible

Lines of code comparison: OpenClaw (~400,000 lines) vs NanoClaw (~3,000 lines)

Size – One process and a handful of files (~3 k lines).
Dependencies – Relies heavily on Anthropic’s Agent SDK (the wrapper around Claude Code) for session management, memory compaction, etc., instead of reinventing the wheel.
Reviewability – A competent developer can audit the entire codebase in an afternoon. This is a deliberate constraint, not a limitation (philosophy).

Our contribution guidelines accept only:

Bug fixes
Security fixes
Simplifications

Skills‑Based Extensibility

New functionality arrives as skills: instructions with a full, working reference implementation that a coding agent merges into your codebase.

You review exactly what code will be added before it lands.
Only the integrations you actually need are added.
Every installation ends up as a few thousand lines of code, tailored to the owner’s exact requirements.

With a monolithic 400 k‑line codebase, even if you enable only two integrations, the rest of the code remains loaded, part of the attack surface, and reachable by prompt injections or rogue agents. You cannot disentangle what’s active from what’s dormant, nor audit it because the boundary of “your code” is undefined.

With skills, the boundary is obvious: a few thousand lines you chose to add, all of which you can read. The core is actually getting smaller over time—for example, WhatsApp support is being extracted and packaged as a skill.

Design for Distrust

If a hallucination or a misbehaving agent can cause a security issue, then the security model is broken. Security must be enforced outside the agentic surface; it cannot rely on the agent behaving correctly.

Containers, mount restrictions, and filesystem isolation exist so that, even when an agent does something unexpected, the blast radius is contained.

Key Takeaways

Risk remains – an AI agent with access to your data is inherently high‑risk.
Narrow the trust surface – make the agent’s permissions as limited and as verifiable as possible.
Don’t trust the agent – build walls around it.