Have you ever told an AI 'never do this' and watched it do it anyway?

Published: May 9, 2026 at 11:36 AM EDT
3 min read
Source: Dev.to

Introduction

I have told an AI “never do this,” only to watch it do it anyway. The worst part is that the AI thinks it is being helpful.

The Limits of Prompt Rules

When you give an AI agent a tool that can write to something—your billing system, customer database, etc.—the safety story usually goes: “I told it the rules in the system prompt. I gave it 16 rules. It’ll be fine.”

It’s mostly fine—until it isn’t.

Prompt rules are aspirational. The agent reads them, intends to follow them, and usually does. But the same agent is also responsible for generating the response, so there is no separation between the decision‑making and the verification. In effect, the AI is grading its own homework.

Experiment: Two Refund Approvers

I built two versions of the same agent and stress‑tested them. Both were Refund Approvers that take a refund request, look up the transaction, and write a row into a refund history table.

Approver A – Rule‑Based Prompt

  • System prompt contained 16 rules (e.g., “Don’t approve duplicates,” “Always check refund history first,” “GBP only,” “Reasons must be specific,” etc.).

Approver B – Structural Safety Check

  • Added a separate safety check that runs independently, queries the refund history itself, and blocks the response if it finds a duplicate.
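A minimal sketch of the lookup step in such a check, assuming a SQLite table named `refund_history` keyed by `transaction_id`. The schema, the demo row, and the function name `find_prior_refund` are hypothetical, since the post does not show the real ones. The check in the experiment is described as its own LLM call; the plain query below is a stand-in that only illustrates the "queries the refund history itself" part.

```python
import sqlite3
from typing import Optional


def make_demo_db() -> sqlite3.Connection:
    """Build an in-memory refund history with one prior refund (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE refund_history ("
        "transaction_id TEXT, customer_id TEXT, amount REAL, reason TEXT)"
    )
    conn.execute(
        "INSERT INTO refund_history VALUES (?, ?, ?, ?)",
        ("INV-1042", "cust_a1b2c3", 500.0, "earlier refund"),
    )
    conn.commit()
    return conn


def find_prior_refund(conn: sqlite3.Connection, transaction_id: str) -> Optional[str]:
    """Return a block reason if the transaction was already refunded, else None.

    This reads the table directly and never looks at what the agent wrote.
    """
    row = conn.execute(
        "SELECT 1 FROM refund_history WHERE transaction_id = ? LIMIT 1",
        (transaction_id,),
    ).fetchone()
    if row is not None:
        return (
            "This transaction has already been refunded. "
            "A duplicate refund cannot be issued."
        )
    return None


conn = make_demo_db()
print(find_prior_refund(conn, "INV-1042"))  # a block reason
print(find_prior_refund(conn, "INV-2000"))  # no prior refund found
```

The point of the design is that the check's answer depends only on the data, not on how persuasive the agent's draft is.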

Both used the same model, the same data, and the same request:

“Approve a $500 refund for transaction INV-1042, customer cust_a1b2c3, reason: customer requesting a credit”

Transaction INV‑1042 had already been refunded the previous week.

Results

Approver A

  • Outcome: Caught the duplicate eventually, but returned a rejection that bundled three reasons: GBP only, duplicate refund, insufficient reason. The structural problem (duplicate refund) was buried behind other messages.
  • Observations: A real customer might focus on the GBP issue and never realize the duplicate was the real blocker.

Approver B

  • Outcome: The agent generated a cheerful approval (markdown table with amount, customer, date, etc.). The independent safety check then blocked the entire response, returning a single clear reason: “this transaction has already been refunded. A duplicate refund cannot be issued.”
  • Observations: The agent thought it was helping, but the check stopped the erroneous approval.

Why Structural Checks Matter

  • Prompt rules are aspirational; structural checks are enforced.
  • With prompt rules alone, the agent is the sole gatekeeper. Social engineering, long off‑script conversations, model updates, and jailbreaks all target that one gatekeeper, so 16 rules still amount to a single point of failure pretending to be many.
  • In Approver B, the safety check runs its own lookup against the real data every time, independent of the main model’s output. The model can write the most articulate approval, but it never reaches the database.
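The gating described above can be sketched as a wrapper that lets the agent draft whatever it likes but performs the write only after an independent lookup passes. This is a sketch under assumptions, not the ContextGate implementation: `gated_refund`, `GateResult`, the `refund_history` schema, and the `cheerful_agent` stub standing in for the model are all hypothetical names introduced for illustration.

```python
import sqlite3
from dataclasses import dataclass
from typing import Callable

DUPLICATE_REASON = (
    "This transaction has already been refunded. "
    "A duplicate refund cannot be issued."
)


@dataclass
class GateResult:
    allowed: bool
    message: str


def gated_refund(
    conn: sqlite3.Connection,
    transaction_id: str,
    customer_id: str,
    amount: float,
    draft_response: Callable[[], str],
) -> GateResult:
    """Let the agent draft a reply, but only write if the independent check passes."""
    draft = draft_response()  # the agent may produce an approval here

    # Independent lookup against the real table; the draft is not consulted.
    prior = conn.execute(
        "SELECT 1 FROM refund_history WHERE transaction_id = ? LIMIT 1",
        (transaction_id,),
    ).fetchone()
    if prior is not None:
        # Discard the draft entirely and return one clear reason.
        return GateResult(False, DUPLICATE_REASON)

    conn.execute(
        "INSERT INTO refund_history (transaction_id, customer_id, amount) "
        "VALUES (?, ?, ?)",
        (transaction_id, customer_id, amount),
    )
    conn.commit()
    return GateResult(True, draft)


# Demo data: INV-1042 was already refunded (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE refund_history (transaction_id TEXT, customer_id TEXT, amount REAL)"
)
conn.execute("INSERT INTO refund_history VALUES ('INV-1042', 'cust_a1b2c3', 500.0)")
conn.commit()


def cheerful_agent() -> str:
    """Stub for the model: always approves, as Approver B's agent did."""
    return "Refund approved!"


blocked = gated_refund(conn, "INV-1042", "cust_a1b2c3", 500.0, cheerful_agent)
row_count = conn.execute("SELECT COUNT(*) FROM refund_history").fetchone()[0]
print(blocked.allowed, blocked.message, row_count)
```

Because the insert sits behind the check, an articulate approval from the stub changes nothing: the table still holds only the original row.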

Key Takeaway

Separate, independent safety checks that query the same data are far more reliable than embedding all constraints in the model’s prompt. The check should be a distinct LLM call with its own prompt, making a separate decision from the main agent.

Example Prompt Used in ContextGate

Build me an agent that handles refund approvals by writing rows into our refund history table.
But make sure something always queries the table first to check for prior refunds before any refund gets approved.

When the Workspace Assistant in ContextGate (the robot icon at the bottom right) asks to set up the database, click Approve.
