[Paper] Be Your Own Red Teamer: Safety Alignment via Self-Play and Reflective Experience Replay
Source: arXiv - 2601.10589v1
Overview
The paper introduces Safety Self‑Play (SSP), a novel framework that lets a single large language model (LLM) act as both attacker and defender in a continuous reinforcement‑learning loop. By generating its own jailbreak attempts and immediately learning to refuse them, the model can discover and patch safety gaps that static, human‑crafted red‑team datasets miss. The authors show that this self‑play approach yields a more adaptable and robust safety alignment than traditional “fixed‑prompt” defenses.
Key Contributions
- Self‑contained Red‑Team/Blue‑Team Loop: Uses one LLM to simultaneously generate adversarial prompts (Attacker) and produce safe refusals (Defender) within a unified RL environment.
- Reflective Experience Replay: Stores failure cases in an experience pool and samples them with an Upper Confidence Bound (UCB) strategy, focusing learning on the hardest, low‑reward examples while still encouraging exploration.
- Dynamic Attack Evolution: The Attacker continuously refines jailbreak techniques, preventing the Defender from over‑fitting to a static set of threats.
- Empirical Benchmark: Demonstrates that SSP outperforms baselines trained on static adversarial corpora across multiple safety metrics (e.g., refusal rate, false‑positive reduction).
- Open‑source Baseline: Provides code and a reproducible training pipeline, encouraging the community to extend self‑play safety alignment.
Methodology
1. Unified RL Formulation
- The LLM is instantiated twice per episode: an Attacker that receives a benign user query and tries to transform it into a jailbreak, and a Defender that receives the jailbreak and must refuse or safely respond.
- Both agents share the same underlying model weights but maintain separate policy heads to allow divergent behavior (a minimal sketch of this layout follows below).
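To make the shared-weights, two-head arrangement concrete, here is a minimal PyTorch sketch. It is not the authors' implementation; the model dimensions, head layout, and role-switching interface are illustrative assumptions.

```python
# Minimal sketch (not the paper's code): one shared backbone, two role-specific heads.
import torch
import torch.nn as nn

class SelfPlayPolicy(nn.Module):
    def __init__(self, vocab_size: int = 32000, hidden: int = 512):
        super().__init__()
        # Stand-in for the shared LLM weights (embeddings + transformer blocks).
        self.embed = nn.Embedding(vocab_size, hidden)
        block = nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(block, num_layers=2)
        # Separate heads let Attacker and Defender behavior diverge.
        self.heads = nn.ModuleDict({
            "attacker": nn.Linear(hidden, vocab_size),
            "defender": nn.Linear(hidden, vocab_size),
        })

    def forward(self, input_ids: torch.Tensor, role: str) -> torch.Tensor:
        hidden_states = self.backbone(self.embed(input_ids))
        return self.heads[role](hidden_states)  # next-token logits for the chosen role

policy = SelfPlayPolicy()
logits = policy(torch.randint(0, 32000, (1, 16)), role="attacker")
```

In a full-scale setup the heads could equally be lightweight adapters on top of the shared LLM; the essential point is that one backbone serves both roles.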
2. Reward Design
- Attacker Reward: Positive when the jailbreak succeeds (i.e., the Defender produces a disallowed response).
- Defender Reward: Positive for correct refusals and negative for unsafe outputs (see the reward sketch below).
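A hedged sketch of how such an adversarial reward signal could be computed is given below. The ±1 values and the `jailbreak_succeeded` / `defender_refused` flags (which would come from a safety judge on the Defender's output) are assumptions for illustration, not the paper's exact shaping.

```python
# Hypothetical reward shaping matching the description above; values are illustrative.
def compute_rewards(jailbreak_succeeded: bool, defender_refused: bool) -> tuple[float, float]:
    # Attacker scores only if the Defender was led into a disallowed response.
    attacker_reward = 1.0 if jailbreak_succeeded else 0.0
    if jailbreak_succeeded:
        defender_reward = -1.0   # penalized for an unsafe output
    elif defender_refused:
        defender_reward = 1.0    # rewarded for a correct refusal
    else:
        defender_reward = 0.0    # neutral: safe, non-refusal answer
    return attacker_reward, defender_reward
```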
3. Reflective Experience Replay (RER)
- Every episode’s (state, action, reward) tuple is stored in an experience pool.
- A UCB‑based sampler preferentially draws low‑reward (hard) episodes, ensuring the Defender repeatedly revisits its biggest mistakes (sketched after this list).
- The replay buffer is refreshed periodically to keep the distribution fresh as attack strategies evolve.
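The sketch below shows one plausible UCB-style sampler over such a pool, assuming Defender rewards normalized to [0, 1]; the class name, scoring formula, and exploration coefficient are hypothetical and may differ from the paper's exact prioritization.

```python
# Illustrative UCB-style replay sampler: hard (low-reward) episodes get priority,
# while a visit-count bonus keeps rarely replayed episodes in circulation.
import math

class ReflectiveReplayPool:
    def __init__(self, c: float = 1.0):
        self.episodes = []   # (episode_data, defender_reward) pairs
        self.visits = []     # replay counts per stored episode
        self.c = c           # exploration coefficient

    def add(self, episode_data, reward: float) -> None:
        self.episodes.append((episode_data, reward))
        self.visits.append(0)

    def sample(self):
        total = sum(self.visits) + 1
        def score(i: int) -> float:
            _, reward = self.episodes[i]
            exploit = 1.0 - reward  # assumes rewards in [0, 1]; low reward = hard case
            explore = self.c * math.sqrt(math.log(total + 1) / (self.visits[i] + 1))
            return exploit + explore
        best = max(range(len(self.episodes)), key=score)
        self.visits[best] += 1
        return self.episodes[best]
```

Periodic pruning of stale entries, as noted above, would keep the sampled distribution aligned with the Attacker's current strategies.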
4. Training Loop
- Proximal Policy Optimization (PPO) updates both policy heads simultaneously, using a mix of on‑policy self‑play data and off‑policy replayed experiences.
- Curriculum scheduling gradually increases the complexity of user queries and jailbreak prompts, mirroring real‑world escalation; a rough schedule-and-mixing sketch follows below.
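As a rough illustration of the curriculum and data mixing, the sketch below unlocks harder prompt styles over the ~200k-step budget reported in the results and blends on-policy self-play episodes with replayed hard cases; the tier names and the 25% replay ratio are assumptions, not values from the paper.

```python
# Illustrative only: curriculum tiers, replay ratio, and batch handling are assumptions.
import random

DIFFICULTY_TIERS = ["benign rewording", "role-play framing", "multi-turn obfuscation"]

def curriculum_tier(step: int, total_steps: int = 200_000) -> str:
    """Unlock progressively harder query/jailbreak styles as training advances."""
    idx = min(len(DIFFICULTY_TIERS) * step // total_steps, len(DIFFICULTY_TIERS) - 1)
    return DIFFICULTY_TIERS[idx]

def build_training_batch(step, on_policy_episodes, replay_pool, replay_ratio=0.25):
    """Mix fresh self-play episodes with hard cases drawn from the replay pool."""
    batch = list(on_policy_episodes)
    n_replay = int(replay_ratio * len(batch))
    batch.extend(replay_pool.sample() for _ in range(n_replay))
    random.shuffle(batch)
    # A real loop would now run a PPO update over `batch` for both policy heads.
    return curriculum_tier(step), batch
```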
Results & Findings
| Metric | Static‑Red‑Team Baseline | SSP (Self‑Play) |
|---|---|---|
| Refusal Success Rate (on unseen jailbreaks) | 68 % | 84 % |
| False‑Positive Refusal (on safe queries) | 12 % | 9 % |
| Average Reward (higher = safer) | 0.42 | 0.61 |
| Number of unique jailbreak patterns discovered | 27 | 73 |
- Robustness to Novel Attacks: SSP uncovered many jailbreak patterns not present in the training set, demonstrating superior generalization.
- Reduced Over‑fitting: The Defender’s refusal behavior remained stable when evaluated on a held‑out set of human‑crafted adversarial prompts, unlike the static baseline which degraded sharply.
- Efficiency: Training converged after ~200k self‑play steps, comparable to the compute budget of static‑dataset fine‑tuning, but yielded a 2‑3× safety gain.
Practical Implications
- Continuous Safety Updates: Deployments can run a lightweight self‑play loop in the background, automatically surfacing new attack vectors and updating the refusal policy without manual red‑team intervention.
- Lower Red‑Team Costs: Organizations can reduce reliance on expensive external security audits, reallocating resources to other risk‑management tasks.
- Product‑Level Guardrails: SaaS platforms that expose LLM APIs can embed SSP‑trained models to provide stronger, adaptive protection against prompt injection, jailbreaks, and policy‑evading tricks.
- Regulatory Alignment: As AI safety regulations increasingly demand demonstrable mitigation of harmful outputs, a self‑play‑derived safety model offers measurable evidence of proactive risk reduction.
Limitations & Future Work
- Single‑Model Constraint: Using one LLM for both roles may limit the diversity of attack strategies compared to a heterogeneous red‑team of specialized models.
- Reward Shaping Sensitivity: The safety performance hinges on carefully tuned reward weights; mis‑specification can lead to overly conservative refusals or missed violations.
- Scalability to Larger Models: Experiments were conducted on 7B‑parameter models; extending SSP to 70B‑scale LLMs may require more sophisticated sampling or distributed RL techniques.
- Human Oversight: While SSP reduces manual red‑team effort, periodic human review of discovered jailbreaks remains essential to catch subtle policy breaches.
Future research directions include multi‑agent self‑play with heterogeneous attacker models, curriculum learning that incorporates real‑world user logs, and integrating formal verification methods to complement the empirical safety gains.
Authors
- Hao Wang
- Yanting Wang
- Hao Li
- Rui Li
- Lei Sha
Paper Information
- arXiv ID: 2601.10589v1
- Categories: cs.CR, cs.CL
- Published: January 15, 2026