[Paper] SEMA: Simple yet Effective Learning for Multi-Turn Jailbreak Attacks

Published: February 6, 2026 at 11:44 AM EST
4 min read
Source: arXiv - 2602.06854v1

Overview

The paper introduces SEMA, a lightweight framework for training multi‑turn jailbreak attackers that can coax safety‑aligned chatbots into producing harmful content. By learning directly from self‑generated adversarial dialogues, SEMA sidesteps the need for hand‑crafted attack scripts or external data, achieving dramatically higher success rates than prior single‑turn and multi‑turn methods.

Key Contributions

  • Self‑tuning pre‑fill stage: Fine‑tunes an attacker model on its own non‑refusal, well‑structured multi‑turn prompts, stabilizing later reinforcement learning.
  • Intent‑drift‑aware reward: A novel RL reward that simultaneously enforces fidelity to the original malicious intent, penalizes victim refusal, and rewards detailed harmful output.
  • Open‑loop attack regime: Eliminates dependence on victim model feedback, reducing exploration complexity and unifying single‑ and multi‑turn attack settings.
  • State‑of‑the‑art performance: Achieves an average 80.1 % attack success rate (ASR@1) on AdvBench across three victim models, a 33.9 % absolute gain over the previous best.
  • Transferability & reproducibility: Demonstrates that attacks trained on one model readily transfer to others, and releases a compact, open‑source implementation.
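The intent‑drift‑aware reward above blends three signals. A minimal sketch of one way to combine them follows; the weights and the [0, 1] scoring conventions are illustrative assumptions, not the paper's exact formulation:

```python
def intent_drift_reward(intent_score, refusal_prob, detail_score,
                        w_intent=1.0, w_refusal=1.0, w_detail=0.5):
    """Blend three components into a scalar RL reward.

    intent_score : [0, 1], how faithful the dialogue stays to the
                   original harmful goal (intent alignment)
    refusal_prob : [0, 1], judged probability the victim refuses
                   or triggers safeguards (compliance risk)
    detail_score : [0, 1], richness/actionability of the elicited
                   output (level of detail)
    Weights are hypothetical; the paper may combine terms differently.
    """
    return (w_intent * intent_score
            - w_refusal * refusal_prob
            + w_detail * detail_score)
```

Under this sketch, a dialogue that stays on-goal, avoids refusal, and elicits detailed output scores highest, while a refused dialogue is pushed negative regardless of detail.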

Methodology

  1. Prefilling Self‑Tuning

    • Start with a language model designated as the attacker.
    • Prompt it with a minimal seed (e.g., “Explain how to…”) and let it generate a full multi‑turn dialogue that does not trigger refusal.
    • Collect these self‑generated, non‑refusal conversations and fine‑tune the attacker on them. This “self‑tuning” gives the model a repertoire of plausible, well‑structured jailbreak prompts before any RL is applied.
  2. Reinforcement Learning with Intent‑Drift‑Aware Reward

    • Define a reward that blends three components:
      • Intent Alignment – the generated dialogue must stay true to the original harmful goal (e.g., “create a bomb”).
      • Compliance Risk – penalize any turn likely to make the victim refuse or trigger its safeguards.
      • Level of Detail – encourage richer, more actionable instructions.
    • Run policy‑gradient RL (PPO) on the self‑tuned attacker, using only a binary refusal signal from a surrogate judge as feedback, not the full content of a victim response.
    • Because the reward is computed offline from the attacker’s own output, the process is open‑loop: the victim model is never queried during training, dramatically cutting exploration cost.
  3. Evaluation Pipeline

    • Test the trained attacker against several victim LLMs (both closed‑source and open‑source) on the AdvBench benchmark.
    • Use multiple jailbreak judges (including human‑in‑the‑loop checks) to verify whether the final victim response is indeed harmful.
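The stages above can be sketched as one training loop. Every callable here (generate, is_refusal, fine_tune, score, ppo_update) is an injected placeholder standing in for the paper's actual components, not SEMA's real API:

```python
def sema_pipeline(generate, is_refusal, fine_tune, score, ppo_update,
                  seeds, n_rounds=2):
    """Sketch of SEMA: prefilling self-tuning, then open-loop PPO.

    generate(seed)   -> multi-turn dialogue string (attacker rollout)
    is_refusal(d)    -> True if the dialogue triggers a refusal
    fine_tune(c)     -> supervised fine-tuning on corpus c
    score(d)         -> intent-drift-aware scalar reward for dialogue d
    ppo_update(b, r) -> one PPO step on batch b with rewards r
    All helpers are hypothetical stand-ins.
    """
    # Stage 1: keep only self-generated, non-refusal dialogues and
    # fine-tune the attacker on them before any RL is applied.
    corpus = [d for d in (generate(s) for s in seeds) if not is_refusal(d)]
    fine_tune(corpus)

    # Stage 2: open-loop RL -- rewards are computed by a judge over the
    # attacker's own output; the victim model is never queried.
    for _ in range(n_rounds):
        batch = [generate(s) for s in seeds]
        ppo_update(batch, [score(d) for d in batch])
    return corpus
```

The open-loop property shows up structurally: no victim model appears anywhere in the loop, only the attacker's rollouts and the judge's scores.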

Results & Findings

| Victim Model    | Avg. ASR@1 (SEMA) | Prior SOTA | Gain    |
|-----------------|-------------------|------------|---------|
| Closed‑source A | 81.4 %            | 48.7 %     | +32.7 % |
| Closed‑source B | 78.9 %            | 45.2 %     | +33.7 % |
| Open‑source C   | 79.9 %            | 50.5 %     | +29.4 % |
| Overall Avg.    | 80.1 %            | 46.2 %     | +33.9 % |
  • Single‑turn baselines (e.g., standard prompt injection) achieve <50 % ASR, confirming that multi‑turn dynamics are essential for realistic jailbreaks.
  • Template‑driven multi‑turn attacks improve over single‑turn but still lag behind SEMA by ~15–20 % absolute.
  • Transfer experiments show that an attacker trained on Model A retains >70 % ASR when targeting Model B, indicating strong cross‑model generalization.
  • Ablation studies reveal that removing the intent‑drift component drops ASR by ~12 %, while skipping the self‑tuning stage reduces stability and leads to divergent policies.
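The ASR@1 metric reported above is simply the fraction of benchmark goals whose first attack attempt elicits a response the judges deem harmful; a minimal computation (the boolean judgments are illustrative inputs):

```python
def asr_at_1(judgments):
    """Attack success rate at one attempt.

    judgments: one boolean per benchmark goal (e.g., per AdvBench
    entry), True if the first attack dialogue was judged harmful.
    Returns the success fraction, or 0.0 for an empty list.
    """
    return sum(judgments) / len(judgments) if judgments else 0.0
```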

Practical Implications

  • Red‑Team Automation: Organizations can plug SEMA into their safety‑testing pipelines to automatically generate realistic, multi‑turn jailbreak attempts, exposing failure modes that manual testing misses.
  • Safety‑Aligned Model Development: The intent‑drift‑aware reward offers a concrete metric for measuring how well a model preserves its original safety intent across conversational turns, guiding more robust alignment strategies.
  • Policy & Governance: Regulators and platform operators can use SEMA‑generated adversarial examples to benchmark compliance of deployed LLMs against emerging threat models.
  • Tooling for Developers: Open‑source code and pretrained attacker checkpoints make it easy for developers to evaluate their own chatbots without needing large compute budgets for exhaustive prompt engineering.
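As a red‑team integration sketch, a scan could replay pre‑generated multi‑turn attacks against a deployed chat endpoint and flag conversations that a majority of jailbreak judges mark harmful. Both chat_fn and the judge callables below are stand‑ins, not part of SEMA's released tooling:

```python
def red_team_scan(attacks, chat_fn, judges):
    """Replay multi-turn attacks and flag judged-harmful conversations.

    attacks : list of attacks, each a list of user turns
    chat_fn : callable(history) -> assistant reply (system under test)
    judges  : callables(history) -> bool, True if conversation is harmful
    Returns the conversation histories flagged by a judge majority.
    """
    flagged = []
    for turns in attacks:
        history = []
        for user_msg in turns:
            history.append(("user", user_msg))
            history.append(("assistant", chat_fn(history)))
        votes = sum(j(history) for j in judges)
        if votes > len(judges) / 2:  # majority vote marks a jailbreak
            flagged.append(history)
    return flagged
```

Flagged transcripts can then feed directly into safety regression suites or alignment fine-tuning data.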

Limitations & Future Work

  • Reward Approximation: The intent‑drift reward relies on heuristics (e.g., keyword matching, classifier scores) that may not capture nuanced malicious intent, potentially leading to false positives/negatives.
  • Open‑Loop Assumption: While removing victim feedback speeds up training, it also ignores dynamic defenses that adapt during a conversation, which could affect real‑world attack efficacy.
  • Scope of Harmful Objectives: Experiments focus on a subset of illicit topics (e.g., weapon creation, phishing). Extending to broader or more subtle harms (e.g., misinformation) remains an open challenge.
  • Scalability to Larger Attackers: The current attacker models are modest in size; scaling SEMA to larger, more expressive attackers may further improve success rates but also increase compute costs.

Future research directions include integrating richer semantic intent representations, exploring closed‑loop RL with limited victim queries, and expanding the benchmark to cover a wider spectrum of safety‑critical use cases.

Authors

  • Mingqian Feng
  • Xiaodong Liu
  • Weiwei Yang
  • Jialin Song
  • Xuekai Zhu
  • Chenliang Xu
  • Jianfeng Gao

Paper Information

  • arXiv ID: 2602.06854v1
  • Categories: cs.CL
  • Published: February 6, 2026