[Paper] SEMA: Simple yet Effective Learning for Multi-Turn Jailbreak Attacks

Published: February 6, 2026 at 11:44 AM EST
4 min read
Source: arXiv - 2602.06854v1

Overview

The paper introduces SEMA, a lightweight framework for training multi‑turn jailbreak attackers that can coax safety‑aligned chatbots into producing harmful content. By learning directly from self‑generated adversarial dialogues, SEMA sidesteps the need for hand‑crafted attack scripts or external data, achieving dramatically higher success rates than prior single‑turn and multi‑turn methods.

Key Contributions

  • Self‑tuning pre‑fill stage: Fine‑tunes an attacker model on its own non‑refusal, well‑structured multi‑turn prompts, stabilizing later reinforcement learning.
  • Intent‑drift‑aware reward: A novel RL reward that simultaneously enforces fidelity to the original malicious intent, penalizes victim refusal, and rewards detailed harmful output.
  • Open‑loop attack regime: Eliminates dependence on victim model feedback, reducing exploration complexity and unifying single‑ and multi‑turn attack settings.
  • State‑of‑the‑art performance: Achieves an average 80.1 % attack success rate (ASR@1) on AdvBench across three victim models, a 33.9 % absolute gain over the previous best.
  • Transferability & reproducibility: Demonstrates that attacks trained on one model readily transfer to others, and releases a compact, open‑source implementation.
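The intent‑drift‑aware reward above blends three signals. A minimal sketch of one way to combine them follows; the weights and the [0, 1] scoring conventions are illustrative assumptions, not the paper's exact formulation:

```python
def intent_drift_reward(intent_score, refusal_prob, detail_score,
                        w_intent=1.0, w_refusal=1.0, w_detail=0.5):
    """Blend three components into a scalar RL reward.

    intent_score : [0, 1], how faithful the dialogue stays to the
                   original harmful goal (intent alignment)
    refusal_prob : [0, 1], judged probability the victim refuses
                   or triggers safeguards (compliance risk)
    detail_score : [0, 1], richness/actionability of the elicited
                   output (level of detail)
    Weights are hypothetical; the paper may combine terms differently.
    """
    return (w_intent * intent_score
            - w_refusal * refusal_prob
            + w_detail * detail_score)
```

Under this sketch, a dialogue that stays on-goal, avoids refusal, and elicits detailed output scores highest, while a refused dialogue is pushed negative regardless of detail.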

Methodology

  1. Prefilling Self‑Tuning

    • Start with a language model designated as the attacker.
    • Prompt it with a minimal seed (e.g., “Explain how to…”) and let it generate a full multi‑turn dialogue that does not trigger refusal.
    • Collect these self‑generated, non‑refusal conversations and fine‑tune the attacker on them. This “self‑tuning” gives the model a repertoire of plausible, well‑structured jailbreak prompts before any RL is applied.
  2. Reinforcement Learning with Intent‑Drift‑Aware Reward

    • Define a reward that blends three components:
      • Intent Alignment – the generated dialogue must stay true to the original harmful goal (e.g., “create a bomb”).
      • Compliance Risk – penalize any turn likely to make the victim refuse or trigger its safeguards.
      • Level of Detail – encourage richer, more actionable instructions.
    • Run policy‑gradient RL (PPO) on the self‑tuned attacker, using only a binary refusal signal from a surrogate judge as feedback, not the full content of a victim response.
    • Because the reward is computed offline from the attacker’s own output, the process is open‑loop: the victim model is never queried during training, dramatically cutting exploration cost.
  3. Evaluation Pipeline

    • Test the trained attacker against several victim LLMs (both closed‑source and open‑source) on the AdvBench benchmark.
    • Use multiple jailbreak judges (including human‑in‑the‑loop checks) to verify whether the final victim response is indeed harmful.
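The stages above can be sketched as one training loop. Every callable here (generate, is_refusal, fine_tune, score, ppo_update) is an injected placeholder standing in for the paper's actual components, not SEMA's real API:

```python
def sema_pipeline(generate, is_refusal, fine_tune, score, ppo_update,
                  seeds, n_rounds=2):
    """Sketch of SEMA: prefilling self-tuning, then open-loop PPO.

    generate(seed)   -> multi-turn dialogue string (attacker rollout)
    is_refusal(d)    -> True if the dialogue triggers a refusal
    fine_tune(c)     -> supervised fine-tuning on corpus c
    score(d)         -> intent-drift-aware scalar reward for dialogue d
    ppo_update(b, r) -> one PPO step on batch b with rewards r
    All helpers are hypothetical stand-ins.
    """
    # Stage 1: keep only self-generated, non-refusal dialogues and
    # fine-tune the attacker on them before any RL is applied.
    corpus = [d for d in (generate(s) for s in seeds) if not is_refusal(d)]
    fine_tune(corpus)

    # Stage 2: open-loop RL -- rewards are computed by a judge over the
    # attacker's own output; the victim model is never queried.
    for _ in range(n_rounds):
        batch = [generate(s) for s in seeds]
        ppo_update(batch, [score(d) for d in batch])
    return corpus
```

The open-loop property shows up structurally: no victim model appears anywhere in the loop, only the attacker's rollouts and the judge's scores.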

Results & Findings

| Victim Model    | Avg. ASR@1 (SEMA) | Prior SOTA | Gain    |
|-----------------|-------------------|------------|---------|
| Closed‑source A | 81.4 %            | 48.7 %     | +32.7 % |
| Closed‑source B | 78.9 %            | 45.2 %     | +33.7 % |
| Open‑source C   | 79.9 %            | 50.5 %     | +29.4 % |
| Overall Avg.    | 80.1 %            | 46.2 %     | +33.9 % |
  • Single‑turn baselines (e.g., standard prompt injection) achieve <50 % ASR, confirming that multi‑turn dynamics are essential for realistic jailbreaks.
  • Template‑driven multi‑turn attacks improve over single‑turn but still lag behind SEMA by ~15–20 % absolute.
  • Transfer experiments show that an attacker trained on Model A retains >70 % ASR when targeting Model B, indicating strong cross‑model generalization.
  • Ablation studies reveal that removing the intent‑drift component drops ASR by ~12 %, while skipping the self‑tuning stage reduces stability and leads to divergent policies.
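The ASR@1 metric reported above is simply the fraction of benchmark goals whose first attack attempt elicits a response the judges deem harmful; a minimal computation (the boolean judgments are illustrative inputs):

```python
def asr_at_1(judgments):
    """Attack success rate at one attempt.

    judgments: one boolean per benchmark goal (e.g., per AdvBench
    entry), True if the first attack dialogue was judged harmful.
    Returns the success fraction, or 0.0 for an empty list.
    """
    return sum(judgments) / len(judgments) if judgments else 0.0
```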

Practical Implications

  • Red‑Team Automation: Organizations can plug SEMA into their safety‑testing pipelines to automatically generate realistic, multi‑turn jailbreak attempts, exposing failure modes that manual testing misses.
  • Safety‑Aligned Model Development: The intent‑drift‑aware reward offers a concrete metric for measuring how well a model preserves its original safety intent across conversational turns, guiding more robust alignment strategies.
  • Policy & Governance: Regulators and platform operators can use SEMA‑generated adversarial examples to benchmark compliance of deployed LLMs against emerging threat models.
  • Tooling for Developers: Open‑source code and pretrained attacker checkpoints make it easy for developers to evaluate their own chatbots without needing large compute budgets for exhaustive prompt engineering.
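As a red‑team integration sketch, a scan could replay pre‑generated multi‑turn attacks against a deployed chat endpoint and flag conversations that a majority of jailbreak judges mark harmful. Both chat_fn and the judge callables below are stand‑ins, not part of SEMA's released tooling:

```python
def red_team_scan(attacks, chat_fn, judges):
    """Replay multi-turn attacks and flag judged-harmful conversations.

    attacks : list of attacks, each a list of user turns
    chat_fn : callable(history) -> assistant reply (system under test)
    judges  : callables(history) -> bool, True if conversation is harmful
    Returns the conversation histories flagged by a judge majority.
    """
    flagged = []
    for turns in attacks:
        history = []
        for user_msg in turns:
            history.append(("user", user_msg))
            history.append(("assistant", chat_fn(history)))
        votes = sum(j(history) for j in judges)
        if votes > len(judges) / 2:  # majority vote marks a jailbreak
            flagged.append(history)
    return flagged
```

Flagged transcripts can then feed directly into safety regression suites or alignment fine-tuning data.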

Limitations & Future Work

  • Reward Approximation: The intent‑drift reward relies on heuristics (e.g., keyword matching, classifier scores) that may not capture nuanced malicious intent, potentially leading to false positives/negatives.
  • Open‑Loop Assumption: While removing victim feedback speeds up training, it also ignores dynamic defenses that adapt during a conversation, which could affect real‑world attack efficacy.
  • Scope of Harmful Objectives: Experiments focus on a subset of illicit topics (e.g., weapon creation, phishing). Extending to broader or more subtle harms (e.g., misinformation) remains an open challenge.
  • Scalability to Larger Attackers: The current attacker models are modest in size; scaling SEMA to larger, more expressive attackers may further improve success rates but also increase compute costs.

Future research directions include integrating richer semantic intent representations, exploring closed‑loop RL with limited victim queries, and expanding the benchmark to cover a wider spectrum of safety‑critical use cases.

Authors

  • Mingqian Feng
  • Xiaodong Liu
  • Weiwei Yang
  • Jialin Song
  • Xuekai Zhu
  • Chenliang Xu
  • Jianfeng Gao

Paper Information

  • arXiv ID: 2602.06854v1
  • Categories: cs.CL
  • Published: February 6, 2026