Microsoft Broke AI Safety in 15 Models With One Prompt. The Prompt Was Boring.

Published: (February 26, 2026 at 11:05 AM EST)
3 min read
Source: Dev.to

Source: Dev.to

The Technique

Group Relative Policy Optimization (GRP) is a reinforcement‑learning method that AI companies use to make models safer. The Microsoft team, led by Mark Russinovich, Azure’s CTO and Deputy CISO, discovered it works just as well in reverse.

The attack generates multiple responses to a single harmful prompt. A separate judge model scores each response—not on safety, but on how directly it complies with the request, how much policy‑violating content it contains, and how actionable the output is. The most harmful responses receive the highest scores, and the target model learns from that feedback. After just one round of training, the guardrails dissolve.

The researchers tested the method on:

  • GPT‑OSS‑20B
  • DeepSeek‑R1‑Distill variants
  • Google Gemma
  • Meta Llama 3.1
  • Mistral’s Ministral
  • Alibaba’s Qwen

Fifteen models in total—every one of them broke.

The Numbers

  • GPT‑OSS‑20B: attack success rate jumped from 13 % to 93 % across 44 harmful categories after a single prompt and one training step.
  • The model became permissive not only in the trained category but also across unseen categories (e.g., violence, illegal activity, explicit content).
  • Overall effectiveness: GRP‑Obliteration 81 % vs. 69 % for Abliteration (previous leading technique) and 58 % for TwinBreak.
  • Image models: Stable Diffusion 2.1’s harmful‑content generation rose from 56 % to nearly 90 % using just ten prompts.

Despite the loss of safety, the models retained their general capabilities within a few percentage points of their aligned baselines—they didn’t get “dumber,” they just became obedient.

Why This Matters

The vulnerability hits hardest where enterprises invest the most: post‑deployment customization. Companies download open‑weight models (Llama, Gemma, Qwen, Ministral) and fine‑tune them for domain‑specific tasks. That fine‑tuning step is where GRP‑Obliteration lives. The model arrives safe; the enterprise makes it useful; somewhere in between, the alignment can evaporate.

  • 57 % of surveyed enterprises rank LLM manipulation as their second‑highest AI‑security concern.
  • IDC analyst Sakshi Grover: “Alignment can degrade precisely at the point where many enterprises are investing the most: post‑deployment customization.”

Closed models like GPT‑4o and Claude aren’t directly vulnerable because users can’t fine‑tune the base weights. However, every open‑weight model in production is, and open‑weight is winning the market (e.g., Qwen has 700 million downloads on Hugging Face; Llama powers most enterprise AI stacks). The models people actually deploy at scale are the ones most susceptible to having their safety erased in a single training step.

The Real Problem

GRP‑Obliteration requires training access—the ability to update the model’s weights. It isn’t a prompt injection or jailbreak; it’s a fundamental property of how reinforcement learning works. The same mechanism that teaches a model to be safe can teach it to be dangerous, with the same number of steps and the same amount of data.

Russinovich’s team recommends continuous safety evaluations during fine‑tuning, not just before and after. The gap is that most enterprises don’t perform safety evaluations at all; they benchmark capabilities and measure accuracy on domain tasks, but they don’t check whether their customization has unintentionally—or deliberately—stripped the model’s willingness to refuse.

AI safety isn’t a feature you install once. It’s a property that must survive every transformation the model undergoes after training. GRP‑Obliteration proves it doesn’t, highlighting the need for ongoing safety monitoring throughout a model’s lifecycle.

0 views
Back to Blog

Related posts

Read more »