[Paper] Learning to Attack and Defend: Adaptive Red Teaming of Language Models via GRPO

Published: June 8, 2026 at 12:21 PM EDT

Source: arXiv - 2606.09701v1

Overview

AI red teaming must continually adapt to evolving attackers and defenders. Reinforcement learning offers a promising approach to discovering novel attacks, and co-training methods can produce more robust defenders in tandem. Recent works have demonstrated the efficacy of attacker-defender co-training by applying PPO and DPO, but report that GRPO is unstable in this setting. We introduce AdvGRPO, a co-training framework that makes GRPO viable for joint attacker-defender optimization using dense multi-channel rewards and decoupled advantage normalization. Training progresses through a curriculum from single-turn to closed-loop multi-turn attacks before bootstrapping co-training, where attacker and defender models are updated in alternation. We show that our method can produce highly effective and transferable attacks and that co-trained defenders outperform baselines on safety benchmarks.
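The abstract does not spell out how "decoupled advantage normalization" is defined, so the following is only a minimal sketch of one plausible reading: each reward channel is z-scored within its sampling group on its own, and the per-channel advantages are then summed. The function names, the channel names, and the equal default weights are all hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: z-score each reward within its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def decoupled_advantages(channel_rewards, weights=None, eps=1e-8):
    """Normalize each reward channel separately, then combine (assumed scheme).

    channel_rewards maps a channel name to a list of per-completion rewards;
    every list has one entry per completion in the same group.
    """
    names = list(channel_rewards)
    if weights is None:
        weights = {name: 1.0 for name in names}  # hypothetical default
    n = len(channel_rewards[names[0]])
    combined = [0.0] * n
    for name in names:
        adv = group_advantages(channel_rewards[name], eps)
        for i, a in enumerate(adv):
            combined[i] += weights[name] * a
    return combined


# Hypothetical channels with very different scales.
rewards = {
    "attack_success": [1.0, 0.0],
    "judge_score": [0.0, 10.0],
}
print(decoupled_advantages(rewards))
```

In this toy group, summing the raw rewards would favor the second completion simply because `judge_score` has the larger scale. Normalizing per channel first puts both channels on equal footing, which is the intuition this sketch is meant to convey.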

Key Contributions

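Based on the abstract, the paper's main contributions are:

AdvGRPO, a co-training framework that makes GRPO viable for joint attacker-defender optimization, a setting where prior work applying PPO and DPO reported GRPO to be unstable
Dense multi-channel rewards combined with decoupled advantage normalization
A training curriculum that moves from single-turn to closed-loop multi-turn attacks before bootstrapping co-training with alternating attacker and defender updates
Results showing highly effective, transferable attacks and co-trained defenders that outperform baselines on safety benchmarks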

Methodology

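The abstract outlines the approach at a high level. Training follows a curriculum from single-turn to closed-loop multi-turn attacks, then bootstraps co-training in which the attacker and defender models are updated in alternation. GRPO is made viable in this setting through dense multi-channel rewards and decoupled advantage normalization. Please refer to the full paper for the detailed methodology.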
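The curriculum described in the Overview can be pictured as a simple step schedule. This is a rough sketch under stated assumptions, not the authors' training loop: the step counts, the phase labels, the choice to update only the attacker during the curriculum phases, and the per-step alternation during co-training are all hypothetical.

```python
from itertools import cycle, islice


def training_schedule(single_turn_steps, multi_turn_steps, cotrain_steps):
    """Yield (phase, model_being_updated) for each training step."""
    # Curriculum phases: assumed here to update the attacker only.
    for _ in range(single_turn_steps):
        yield ("single_turn", "attacker")
    for _ in range(multi_turn_steps):
        yield ("multi_turn", "attacker")
    # Co-training: attacker and defender are updated in alternation.
    roles = cycle(["attacker", "defender"])
    for role in islice(roles, cotrain_steps):
        yield ("co_training", role)


for phase, role in training_schedule(1, 1, 4):
    print(phase, role)
```

A real schedule might alternate in blocks of many steps rather than every step; the abstract does not say which granularity is used.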

Practical Implications

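The authors report that the method produces highly effective and transferable attacks, and that co-trained defenders outperform baselines on safety benchmarks. For practitioners, this suggests that GRPO-based attacker-defender co-training may be a workable option for adaptive red teaming, alongside the PPO and DPO approaches used in prior work.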

Authors

Blake Bullwinkel
Eugenia Kim
Amanda Minnich
Mark Russinovich

Paper Information

arXiv ID: 2606.09701v1
Categories: cs.CL, cs.AI, cs.LG
Published: June 8, 2026
