Designing AI agents to resist prompt injection

Published: March 11, 2026 at 07:30 AM EDT
5 min read
Source: OpenAI Blog

Introduction

AI agents are increasingly able to browse the web, retrieve information, and take actions on a user’s behalf. Those capabilities are useful, but they also create new ways for attackers to try to manipulate the system.

These attacks are often described as prompt injection: instructions placed in external content in an attempt to make the model do something the user did not ask for. In our experience, the most effective real‑world versions of these attacks increasingly resemble social engineering more than simple prompt overrides.

Why the Shift Matters

If the problem is not just identifying a malicious string, but resisting misleading or manipulative content in context, then defending against it cannot rely only on filtering inputs. It also requires designing the system so that the impact of manipulation is constrained, even if some attacks succeed.

Early “prompt injection” attacks could be as simple as editing a Wikipedia article to include direct instructions to AI agents visiting it; without training‑time experience of such an adversarial environment, AI models would often follow those instructions without question [1]. As models have become smarter, they’ve also become less vulnerable to this kind of suggestion, and we’ve observed that prompt‑injection‑style attacks have responded by including elements of social engineering.

The Limits of “AI Firewalling”

Within the wider AI security ecosystem it has become common to recommend techniques such as AI firewalling, where an intermediary between the AI agent and the outside world attempts to classify inputs as either malicious prompt injection or regular input. However, fully developed attacks are not usually caught by such systems. Detecting a malicious input then becomes as difficult as detecting a lie or misinformation, often without the necessary context.

Treating Prompt Injection as Social Engineering

As real‑world prompt‑injection attacks grew in complexity, we found that the most effective offensive techniques leveraged classic social‑engineering tactics. Rather than treating these attacks as a separate class of problem, we began to view them through the same lens used to manage social‑engineering risk for humans in other domains.

In those systems, the goal is not limited to perfectly identifying malicious inputs; it is to design agents and systems so that the impact of manipulation is constrained, even if it succeeds. Such systems have proven effective at mitigating both prompt injection and social engineering.

A Three‑Actor Analogy

We can imagine the AI agent as operating in a three‑actor system similar to that of a human customer‑service agent:

  1. The Agent – wants to act on behalf of its employer.
  2. External Input – may attempt to mislead the agent (e.g., malicious web content, deceptive user messages).
  3. The Employer / System – imposes limits on the agent’s capabilities to bound downside risk.

Example: A human customer‑support representative can issue gift cards or refunds. The corporation must trust that the representative gives refunds for the right reasons, while the representative also interacts with third parties who may try to mislead or coerce them. Deterministic safeguards (e.g., caps on refund amounts, phishing‑email flags) limit the impact of a compromised individual.
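
The refund cap in this example can be sketched in a few lines. This is a minimal illustration, not a description of any deployed system: the `RefundGuard` class and both limit values are hypothetical names and numbers chosen for the sketch. The point is that the check runs outside the agent's own reasoning, so a manipulated agent still cannot exceed the cap.

```python
from dataclasses import dataclass

# Hypothetical policy values, chosen only for illustration.
MAX_SINGLE_REFUND = 50.00
MAX_DAILY_TOTAL = 200.00


@dataclass
class RefundGuard:
    """Deterministic limit enforced outside the agent's own judgment."""

    issued_today: float = 0.0

    def authorize(self, amount: float) -> bool:
        # Reject non-positive or over-cap requests, whatever the agent was told.
        if amount <= 0 or amount > MAX_SINGLE_REFUND:
            return False
        # Reject requests that would push the running total past the daily cap.
        if self.issued_today + amount > MAX_DAILY_TOTAL:
            return False
        self.issued_today += amount
        return True
```

Because the guard is ordinary deterministic code, its worst-case outcome is known in advance: under these assumed limits, no sequence of persuasive inputs can cause more than 200.00 in refunds per day.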

Our Countermeasure Suite

This mindset has informed the suite of countermeasures we have deployed to uphold the security expectations of our users.

In ChatGPT, we combine this social‑engineering model with more traditional security‑engineering approaches such as source‑sink analysis.

  • Source – a way for an attacker to influence the system (e.g., untrusted external content).
  • Sink – a capability that becomes dangerous in the wrong context (e.g., transmitting information to a third party, following a link, invoking a tool).

Our goal is to preserve a core security expectation for users: potentially dangerous actions, or transmissions of potentially sensitive information, should not happen silently or without appropriate safeguards.
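
One common way to reason about sources and sinks is simple taint tracking. The sketch below is a generic illustration under stated assumptions, not the ChatGPT implementation: `Value`, `from_web`, `from_user`, `combine`, and `send_to_third_party` are hypothetical names. Data from an untrusted source is marked as tainted, the mark propagates when data is combined, and the sink refuses to act silently on tainted data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Value:
    """A piece of data plus a flag recording whether an untrusted source influenced it."""

    text: str
    tainted: bool = False


def from_web(text: str) -> Value:
    # Source: external content is treated as attacker-influenced.
    return Value(text, tainted=True)


def from_user(text: str) -> Value:
    # Direct user input is trusted in this sketch.
    return Value(text, tainted=False)


def combine(*parts: Value) -> Value:
    # Taint propagates: the result is tainted if any input was.
    return Value(" ".join(p.text for p in parts), any(p.tainted for p in parts))


def send_to_third_party(payload: Value, user_confirmed: bool = False) -> str:
    # Sink: tainted data may not flow out without explicit confirmation.
    if payload.tainted and not user_confirmed:
        return "needs_confirmation"
    return "sent"
```

The sink does not try to decide whether the content is malicious. It only asks whether an untrusted source could have influenced the action, which is a much easier question to answer reliably.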

Safe Url Mitigation

The attacks we see developed against ChatGPT most often try to convince the assistant to take secret information from a conversation and transmit it to a malicious third party. In most cases, these attacks fail because our safety training causes the agent to refuse.

For the rare cases where the agent is convinced, we have developed a mitigation strategy called Safe Url, which:

  1. Detects when information learned in the conversation would be transmitted to a third party.
  2. Either shows the user the information that would be transmitted and asks for confirmation, or blocks the transmission and tells the agent to try another way of moving forward with the user’s request.
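
The two steps above can be sketched roughly as follows. The internals of Safe Url are not described here, so this is only an assumed, simplified stand-in: `find_leaked_values` and `review_url` are hypothetical helpers, and the detection is naive substring matching against values learned in the conversation. A real detector would also need to handle encodings such as base64 that this sketch misses.

```python
from typing import Callable, Optional
from urllib.parse import unquote_plus, urlsplit


def find_leaked_values(url: str, conversation_values: list[str]) -> list[str]:
    """Return the conversation values that appear anywhere in the URL."""
    parts = urlsplit(url)
    # Decode percent-escapes so simple URL encoding does not hide a value.
    haystack = unquote_plus(
        " ".join([parts.netloc, parts.path, parts.query, parts.fragment])
    ).lower()
    # Skip empty strings, which would otherwise match every URL.
    return [v for v in conversation_values if v and v.lower() in haystack]


def review_url(
    url: str,
    conversation_values: list[str],
    ask_user: Optional[Callable[[str, list[str]], bool]] = None,
) -> str:
    """Return 'allow' or 'block' for a URL the agent wants to load."""
    leaked = find_leaked_values(url, conversation_values)
    if not leaked:
        return "allow"
    if ask_user is None:
        # No way to confirm: block, so the agent must find another approach.
        return "block"
    # Show the user exactly what would be transmitted and let them decide.
    return "allow" if ask_user(url, leaked) else "block"
```

For example, `review_url("https://evil.example/c?d=hunter2", ["hunter2"])` returns `"block"` when no confirmation callback is supplied, while a URL that carries no conversation data is allowed through unchanged.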

Recommendations for Autonomous Agents

Safe interaction with the adversarial outside world is necessary for fully autonomous agents. When integrating an AI model with an application system, we recommend:

  1. Ask what controls a human agent would have in a similar situation and implement those controls for the AI.
  2. Limit the agent’s capabilities (e.g., rate‑limit actions, require multi‑step verification for high‑risk operations).
  3. Continuously monitor for anomalous behavior and provide audit trails.
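
As one illustration of the second and third recommendations, the sketch below combines a sliding-window rate limit with an audit log. The class name `RateLimitedAuditor` and its parameters are hypothetical, and the in-memory list stands in for whatever durable audit store an application would actually use.

```python
import time
from collections import deque


class RateLimitedAuditor:
    """Allow at most max_actions per window and record every attempt."""

    def __init__(self, max_actions: int, window_seconds: float, clock=time.monotonic):
        self.max_actions = max_actions
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested deterministically
        self._recent = deque()  # timestamps of allowed actions inside the window
        self.audit_log = []  # (timestamp, action, allowed) tuples

    def attempt(self, action: str) -> bool:
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self._recent and now - self._recent[0] >= self.window_seconds:
            self._recent.popleft()
        allowed = len(self._recent) < self.max_actions
        if allowed:
            self._recent.append(now)
        # Denied attempts are logged too, since they are the anomalies worth reviewing.
        self.audit_log.append((now, action, allowed))
        return allowed
```

Logging denied attempts alongside allowed ones matters here: a burst of refused high-risk actions is often the clearest signal that an agent is being manipulated.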

We expect that a maximally intelligent AI model will be able to resist social engineering better than a human agent, but this is not guaranteed. Ongoing vigilance, layered defenses, and a design philosophy that assumes manipulation can succeed are essential.

… ways feasible or cost‑effective depending on the application.

We continue to explore the implications of social engineering against AI models and the defenses against it, and we incorporate our findings into both our application security architectures and the training we put our AI models through.

Footnotes

  1. Early prompt‑injection attacks that edited public content (e.g., Wikipedia) could cause naïve models to follow malicious instructions without question. This observation motivated the shift toward more sophisticated, socially engineered attacks.
