[Paper] Be Your Own Red Teamer: Safety Alignment via Self-Play and Reflective Experience Replay
Source: arXiv - 2601.10589v1
Overview
The paper introduces Safety Self‑Play (SSP), a novel framework that lets a single large language model (LLM) act as both attacker and defender in a continuous reinforcement‑learning loop. By generating its own jailbreak attempts and immediately learning to refuse them, the model can discover and patch safety gaps that static, human‑crafted red‑team datasets miss. The authors show that this self‑play approach yields a more adaptable and robust safety alignment than traditional “fixed‑prompt” defenses.
Key Contributions
- Self‑contained Red‑Team/Blue‑Team Loop: Uses one LLM to simultaneously generate adversarial prompts (Attacker) and produce safe refusals (Defender) within a unified RL environment.
- Reflective Experience Replay: Stores failure cases in an experience pool and samples them with an Upper Confidence Bound (UCB) strategy, focusing learning on the hardest, low‑reward examples while still encouraging exploration.
- Dynamic Attack Evolution: The Attacker continuously refines jailbreak techniques, preventing the Defender from over‑fitting to a static set of threats.
- Empirical Benchmark: Demonstrates that SSP outperforms baselines trained on static adversarial corpora across multiple safety metrics (e.g., refusal rate, false‑positive reduction).
- Open‑source Baseline: Provides code and a reproducible training pipeline, encouraging the community to extend self‑play safety alignment.
Methodology
1. Unified RL Formulation
- The LLM is instantiated twice per episode: an Attacker that receives a benign user query and tries to transform it into a jailbreak, and a Defender that receives the jailbreak and must refuse or safely respond.
- Both agents share the same underlying model weights but maintain separate policy heads to allow divergent behavior (a minimal sketch of this layout follows below).
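To make the shared-weights, two-head arrangement concrete, here is a minimal PyTorch sketch. It is not the authors' implementation; the model dimensions, head layout, and role-switching interface are illustrative assumptions.

```python
# Minimal sketch (not the paper's code): one shared backbone, two role-specific heads.
import torch
import torch.nn as nn

class SelfPlayPolicy(nn.Module):
    def __init__(self, vocab_size: int = 32000, hidden: int = 512):
        super().__init__()
        # Stand-in for the shared LLM weights (embeddings + transformer blocks).
        self.embed = nn.Embedding(vocab_size, hidden)
        block = nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(block, num_layers=2)
        # Separate heads let Attacker and Defender behavior diverge.
        self.heads = nn.ModuleDict({
            "attacker": nn.Linear(hidden, vocab_size),
            "defender": nn.Linear(hidden, vocab_size),
        })

    def forward(self, input_ids: torch.Tensor, role: str) -> torch.Tensor:
        hidden_states = self.backbone(self.embed(input_ids))
        return self.heads[role](hidden_states)  # next-token logits for the chosen role

policy = SelfPlayPolicy()
logits = policy(torch.randint(0, 32000, (1, 16)), role="attacker")
```

In a full-scale setup the heads could equally be lightweight adapters on top of the shared LLM; the essential point is that one backbone serves both roles.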
2. Reward Design
- Attacker Reward: Positive when the jailbreak succeeds (i.e., the Defender produces a disallowed response).
- Defender Reward: Positive for correct refusals and negative for unsafe outputs (see the reward sketch below).
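A hedged sketch of how such an adversarial reward signal could be computed is given below. The ±1 values and the `jailbreak_succeeded` / `defender_refused` flags (which would come from a safety judge on the Defender's output) are assumptions for illustration, not the paper's exact shaping.

```python
# Hypothetical reward shaping matching the description above; values are illustrative.
def compute_rewards(jailbreak_succeeded: bool, defender_refused: bool) -> tuple[float, float]:
    # Attacker scores only if the Defender was led into a disallowed response.
    attacker_reward = 1.0 if jailbreak_succeeded else 0.0
    if jailbreak_succeeded:
        defender_reward = -1.0   # penalized for an unsafe output
    elif defender_refused:
        defender_reward = 1.0    # rewarded for a correct refusal
    else:
        defender_reward = 0.0    # neutral: safe, non-refusal answer
    return attacker_reward, defender_reward
```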
3. Reflective Experience Replay (RER)
- Every episode’s (state, action, reward) tuple is stored in an experience pool.
- A UCB‑based sampler preferentially draws low‑reward (hard) episodes, ensuring the Defender repeatedly revisits its biggest mistakes (sketched after this list).
- The replay buffer is refreshed periodically to keep the distribution fresh as attack strategies evolve.
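The sketch below shows one plausible UCB-style sampler over such a pool, assuming Defender rewards normalized to [0, 1]; the class name, scoring formula, and exploration coefficient are hypothetical and may differ from the paper's exact prioritization.

```python
# Illustrative UCB-style replay sampler: hard (low-reward) episodes get priority,
# while a visit-count bonus keeps rarely replayed episodes in circulation.
import math

class ReflectiveReplayPool:
    def __init__(self, c: float = 1.0):
        self.episodes = []   # (episode_data, defender_reward) pairs
        self.visits = []     # replay counts per stored episode
        self.c = c           # exploration coefficient

    def add(self, episode_data, reward: float) -> None:
        self.episodes.append((episode_data, reward))
        self.visits.append(0)

    def sample(self):
        total = sum(self.visits) + 1
        def score(i: int) -> float:
            _, reward = self.episodes[i]
            exploit = 1.0 - reward  # assumes rewards in [0, 1]; low reward = hard case
            explore = self.c * math.sqrt(math.log(total + 1) / (self.visits[i] + 1))
            return exploit + explore
        best = max(range(len(self.episodes)), key=score)
        self.visits[best] += 1
        return self.episodes[best]
```

Periodic pruning of stale entries, as noted above, would keep the sampled distribution aligned with the Attacker's current strategies.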
4. Training Loop
- Proximal Policy Optimization (PPO) updates both policy heads simultaneously, using a mix of on‑policy self‑play data and off‑policy replayed experiences.
- Curriculum scheduling gradually increases the complexity of user queries and jailbreak prompts, mirroring real‑world escalation; a rough schedule-and-mixing sketch follows below.
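As a rough illustration of the curriculum and data mixing, the sketch below unlocks harder prompt styles over the ~200k-step budget reported in the results and blends on-policy self-play episodes with replayed hard cases; the tier names and the 25% replay ratio are assumptions, not values from the paper.

```python
# Illustrative only: curriculum tiers, replay ratio, and batch handling are assumptions.
import random

DIFFICULTY_TIERS = ["benign rewording", "role-play framing", "multi-turn obfuscation"]

def curriculum_tier(step: int, total_steps: int = 200_000) -> str:
    """Unlock progressively harder query/jailbreak styles as training advances."""
    idx = min(len(DIFFICULTY_TIERS) * step // total_steps, len(DIFFICULTY_TIERS) - 1)
    return DIFFICULTY_TIERS[idx]

def build_training_batch(step, on_policy_episodes, replay_pool, replay_ratio=0.25):
    """Mix fresh self-play episodes with hard cases drawn from the replay pool."""
    batch = list(on_policy_episodes)
    n_replay = int(replay_ratio * len(batch))
    batch.extend(replay_pool.sample() for _ in range(n_replay))
    random.shuffle(batch)
    # A real loop would now run a PPO update over `batch` for both policy heads.
    return curriculum_tier(step), batch
```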
Results & Findings
| Metric | Static‑Red‑Team Baseline | SSP (Self‑Play) |
|---|---|---|
| Refusal Success Rate (on unseen jailbreaks) | 68 % | 84 % |
| False‑Positive Refusal (on safe queries) | 12 % | 9 % |
| Average Reward (higher = safer) | 0.42 | 0.61 |
| Number of unique jailbreak patterns discovered | 27 | 73 |
- Robustness to Novel Attacks: SSP uncovered many jailbreak patterns not present in the training set, demonstrating superior generalization.
- Reduced Over‑fitting: The Defender’s refusal behavior remained stable when evaluated on a held‑out set of human‑crafted adversarial prompts, unlike the static baseline which degraded sharply.
- Efficiency: Training converged after ~200k self‑play steps, comparable to the compute budget of static‑dataset fine‑tuning, but yielded a 2‑3× safety gain.
Practical Implications
- Continuous Safety Updates: Deployments can run a lightweight self‑play loop in the background, automatically surfacing new attack vectors and updating the refusal policy without manual red‑team intervention.
- Lower Red‑Team Costs: Organizations can reduce reliance on expensive external security audits, reallocating resources to other risk‑management tasks.
- Product‑Level Guardrails: SaaS platforms that expose LLM APIs can embed SSP‑trained models to provide stronger, adaptive protection against prompt injection, jailbreaks, and policy‑evading tricks.
- Regulatory Alignment: As AI safety regulations increasingly demand demonstrable mitigation of harmful outputs, a self‑play‑derived safety model offers measurable evidence of proactive risk reduction.
Limitations & Future Work
- Single‑Model Constraint: Using one LLM for both roles may limit the diversity of attack strategies compared to a heterogeneous red‑team of specialized models.
- Reward Shaping Sensitivity: The safety performance hinges on carefully tuned reward weights; mis‑specification can lead to overly conservative refusals or missed violations.
- Scalability to Larger Models: Experiments were conducted on 7B‑parameter models; extending SSP to 70B‑scale LLMs may require more sophisticated sampling or distributed RL techniques.
- Human Oversight: While SSP reduces manual red‑team effort, periodic human review of discovered jailbreaks remains essential to catch subtle policy breaches.
Future research directions include multi‑agent self‑play with heterogeneous attacker models, curriculum learning that incorporates real‑world user logs, and integrating formal verification methods to complement the empirical safety gains.
Authors
- Hao Wang
- Yanting Wang
- Hao Li
- Rui Li
- Lei Sha
Paper Information
- arXiv ID: 2601.10589v1
- Categories: cs.CR, cs.CL
- Published: January 15, 2026