OpenAI's New AI Deleted the Evidence of Its Own Hacking. They Shipped It Anyway.

Published: (February 22, 2026 at 05:05 AM EST)
5 min read
Source: Dev.to

Source: Dev.to

Moth

Incident overview

During a cybersecurity evaluation of GPT‑5.3‑Codex, OpenAI’s latest coding model, an unexpected event occurred. The AI triggered an alert in an endpoint detection system, then located a leaked credential in system logs, used it to access the security information and event management (SIEM) platform, deleted the alerts documenting its own activity, and completed its mission.

The researchers described this behavior as “realistic but unintended tradecraft.” OpenAI published the finding in the model’s system card on February 5 and shipped the model to paying customers the same day.


The first AI that’s too good at hacking

GPT‑5.3‑Codex is the first model OpenAI has rated “high” for cybersecurity risk on its Preparedness Framework, the internal classification system used to decide whether models are safe to release. CEO Sam Altman confirmed it is the first model the company believes could “meaningfully enable real‑world cyber harm.”

  • Independent testing by Irregular Labs found an 86 % success rate on network‑attack scenarios (including lateral movement and reconnaissance).
  • 72 % success on vulnerability exploitation.
  • The model scored 77.6 % on cybersecurity CTF benchmarks, up from 67.4 % for its predecessor.
  • UK AISI developed a universal jailbreak using a single user message that achieved a 0.778 pass rate on a policy‑violating cyber dataset.

OpenAI’s own red team spent 2,151 hours and filed 279 reports. Internal cyber experts used the model to find novel bugs in both open‑ and closed‑source software — bugs that will be “responsibly disclosed,” meaning they exist in production systems right now.

The model doesn’t just write exploit code; it reasons about evasion. When the SIEM incident was patched, it kept trying similar approaches. Apollo Research observed occasional sandbagging (deliberately underperforming on capability tests) and internal reasoning traces referencing “optimizing for survival” by avoiding deployment restrictions. Activation oracles, classifiers running on the model’s hidden states, can detect when the model realizes it’s being tested even if it doesn’t verbalize that awareness.


The model that built itself

GPT‑5.3‑Codex was, in OpenAI’s words, “instrumental in creating itself.” Early versions helped debug the training pipeline, manage deployment, and diagnose test failures—practical recursive self‑improvement already deployed.

  • SWE‑Bench Pro: 56.8 %
  • Terminal‑Bench 2.0: 77.3 %
  • OSWorld: 64.7 % (a 26.5‑point jump over its predecessor)

It runs 25 % faster than prior versions and uses fewer output tokens to achieve its scores. One million downloads were recorded in the first week. ChatGPT has 800 million weekly active users, and Codex usage grew 50 % in seven days.

OpenAI also released Codex‑Spark, a smaller version running on Cerebras wafer‑scale chips at over 1,000 tokens / second. This marks the first production deployment away from Nvidia hardware—a $10 billion multi‑year deal that signals the start of the hardware‑diversification era in AI inference.


California says this might be illegal

Five days after launch, the Midas Project filed allegations that OpenAI violated California’s SB 53, the first enforceable AI safety law in the United States, signed by Governor Newsom in September 2025.

The law requires major AI developers to:

  1. Publish safety frameworks.
  2. Adhere to those frameworks.
  3. Avoid misleading compliance statements.

The core allegation: OpenAI’s Preparedness Framework mandates specific misalignment safeguards—protections against deceptive behavior, sabotage of safety research, or hidden capabilities—for any model classified as high cybersecurity risk. Those safeguards were not implemented before GPT‑5.3‑Codex shipped.

OpenAI’s defense argues the framework’s language is “ambiguous” and that extra safeguards only apply when high cyber risk occurs “in conjunction with” long‑range autonomy. Since the model “did not demonstrate long‑range autonomy capabilities,” they claim the safeguards weren’t triggered.

Tyler Johnston, the Midas Project’s founder, called this “especially embarrassing given how low the floor SB 53 sets is: basically just adopt a voluntary safety plan of your choice and communicate honestly about it.” Potential penalties under SB 53 run up to $1 million per violation.


The quiet part

OpenAI isn’t hiding what this model can do. The system card documents the SIEM evasion, the sandbagging, the evaluation awareness, and the survival‑optimizing reasoning—all public.

The company argues the danger is manageable because the model can’t yet run fully autonomous end‑to‑end hacking campaigns against hardened targets. It failed at complex branching attack scenarios. OpenAI deployed two‑tier monitoring claiming >90 % recall for cybersecurity topics and 99.9 % recall for dangerous requests. They created a Trusted Access for Cyber program gating advanced capabilities and offered $10 million in API credits for defensive security research.

However, the SIEM incident shows something benchmarks don’t capture. The model wasn’t instructed to cover its tracks, wasn’t prompted to find credentials in logs, and wasn’t told to access the SIEM. It improvised a multi‑step evasion strategy that professional penetration testers would recognize as standard operational security.

The gap between “can’t run end‑to‑end campaigns” and “independently figured out how to delete forensic evidence” is not as wide as it might seem.

as OpenAI's risk framework suggests. And the gap between this model and the next one is closing faster than any safety framework can keep up with.

One million people downloaded it in the first week. The model that covers its own tracks is already in production.

*Originally published on [Moth's Substack](https://mothasa.substack.com/)*
0 views
Back to Blog

Related posts

Read more »