[Paper] Self-Guided Defense: Adaptive Safety Alignment for Reasoning Models via Synthesized Guidelines

Published: November 26, 2025 at 04:44 AM EST
4 min read

Source: arXiv - 2511.21214v1

Overview

Recent advances in large language models (LLMs) have unlocked impressive reasoning abilities, but they also expose a new attack surface: adversarial jailbreak prompts that coax the model into producing unsafe or harmful content. The paper Self‑Guided Defense: Adaptive Safety Alignment for Reasoning Models via Synthesized Guidelines proposes a novel, self‑reinforcing safety layer that lets a model learn from its own safety rules and apply them dynamically, without sacrificing usefulness on benign queries.

Key Contributions

  • SGASA framework – a two‑stage pipeline (data pre‑synthesis + alignment fine‑tuning) that injects model‑generated safety guidelines directly into the reasoning model.
  • Guideline synthesis – uses the model itself to draft concise safety “rules” for a wide range of topics, then creates adversarial and benign prompt variants that respect or violate those rules.
  • Hybrid fine‑tuning – combines Supervised Fine‑tuning (SFT) with Direct Preference Optimization (DPO) to teach the model both what to refuse and when it can safely comply (a minimal sketch of the DPO objective appears after this list).
  • Scalable evaluation – extensive experiments on several jailbreak benchmark suites (e.g., AdvBench, JailbreakBench) show consistent reductions in unsafe generations while keeping refusal rates on legitimate requests low.
  • Adaptive behavior – the model can self‑audit incoming prompts against its internalized guidelines, effectively “defending itself” without external rule engines.
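
The paper's training code is not reproduced here; as a rough illustration of the hybrid fine‑tuning contribution above, the snippet below is a minimal PyTorch sketch of the standard DPO objective. The tensor values are made up, and the framing of "chosen" as safe‑and‑helpful vs. "rejected" as unsafe or over‑cautious responses is an assumption about how SGASA's preference pairs would be organized, not the authors' released implementation.

```python
# Minimal sketch of the Direct Preference Optimization (DPO) objective used in
# the second fine-tuning stage. Inputs are sequence log-probabilities from the
# current policy (the SFT model) and a frozen reference copy, evaluated on
# preference pairs; all values below are illustrative placeholders.

import torch
import torch.nn.functional as F


def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log p_theta(y_chosen | x)
    policy_rejected_logps: torch.Tensor,  # log p_theta(y_rejected | x)
    ref_chosen_logps: torch.Tensor,       # log p_ref(y_chosen | x)
    ref_rejected_logps: torch.Tensor,     # log p_ref(y_rejected | x)
    beta: float = 0.1,                    # temperature on the implicit reward margin
) -> torch.Tensor:
    """Prefer the chosen (safe yet helpful) response over the rejected one."""
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    # Maximizing log-sigmoid of the scaled margin difference pushes probability
    # mass toward chosen responses relative to the frozen reference model.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()


if __name__ == "__main__":
    # Toy batch of four preference pairs with made-up log-probabilities.
    loss = dpo_loss(
        policy_chosen_logps=torch.tensor([-12.0, -9.5, -11.2, -10.1]),
        policy_rejected_logps=torch.tensor([-10.5, -9.0, -12.8, -9.9]),
        ref_chosen_logps=torch.tensor([-12.4, -10.0, -11.0, -10.6]),
        ref_rejected_logps=torch.tensor([-10.2, -9.1, -12.5, -10.0]),
    )
    print(f"DPO loss on toy batch: {loss.item():.4f}")
```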

Methodology

  1. Data Pre‑synthesis

    • The base reasoning model is prompted to produce safety guidelines (e.g., “Never provide instructions for weapon creation”).
    • For each guideline, the authors automatically generate two prompt families:
      • Benign prompts that comply with the guideline, and
      • Adversarial jailbreak prompts that attempt to sidestep it (e.g., using euphemisms or indirect phrasing).
    • This yields a synthetic dataset of (prompt, guideline, desired response) triples.
  2. Alignment Fine‑tuning

    • Supervised Fine‑tuning (SFT): The model is trained on the synthetic dataset, learning to produce the correct response (answer or safe refusal) while explicitly referencing the guideline.
    • Direct Preference Optimization (DPO): A lightweight reward model scores the SFT outputs on safety vs. usefulness. DPO then updates the model to maximize the preference for safe yet helpful answers, without needing reinforcement‑learning‑from‑human‑feedback (RLHF) loops.
  3. Self‑Guided Inference

    • At runtime, the model first retrieves the relevant guideline (via a fast nearest‑neighbor lookup or a lightweight classifier) and then conditions its generation on that rule, effectively “checking itself” before answering.
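
The paper describes retrieval only as a fast nearest‑neighbor lookup or a lightweight classifier, so the sketch below is one plausible reading rather than the authors' implementation: embed(), generate(), and the example guidelines are all placeholder assumptions you would replace with your own embedding model and LLM endpoint.

```python
# Hypothetical sketch of self-guided inference: retrieve the closest guideline
# by cosine similarity, then condition generation on it. `embed` and `generate`
# are placeholders; the paper does not specify concrete models or APIs.

from typing import Callable, List
import numpy as np

GUIDELINES = [
    "Never provide instructions for weapon creation.",
    "Do not give advice that facilitates self-harm.",
    "Refuse requests to generate malware or exploit code.",
]


def retrieve_guideline(
    prompt: str,
    guidelines: List[str],
    embed: Callable[[str], np.ndarray],
) -> str:
    """Nearest-neighbor lookup over guideline embeddings (cosine similarity)."""
    q = embed(prompt)
    q = q / np.linalg.norm(q)
    scores = []
    for g in guidelines:
        v = embed(g)
        scores.append(float(np.dot(q, v / np.linalg.norm(v))))
    return guidelines[int(np.argmax(scores))]


def self_guided_answer(
    prompt: str,
    embed: Callable[[str], np.ndarray],
    generate: Callable[[str], str],
) -> str:
    """Condition generation on the retrieved rule so the model checks itself."""
    rule = retrieve_guideline(prompt, GUIDELINES, embed)
    guided_prompt = (
        f"Safety guideline: {rule}\n"
        "If the request below violates this guideline, refuse and explain why; "
        "otherwise answer it helpfully.\n\n"
        f"Request: {prompt}"
    )
    return generate(guided_prompt)
```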

Results & Findings

| Metric | Baseline (no SGASA) | SGASA‑SFT | SGASA‑SFT + DPO |
| --- | --- | --- | --- |
| Unsafe generation rate (AdvBench) | 27.4% | 12.1% | 8.3% |
| Refusal rate on benign requests (JailbreakBench) | 4.9% | 5.2% | 5.0% |
| Overall helpfulness (human rating) | 4.1/5 | 4.3/5 | 4.4/5 |

  • Safety boost: SGASA cuts unsafe outputs by more than half, with the DPO stage delivering the biggest gain.
  • Low collateral damage: Refusal rates on legitimate queries barely increase, confirming that the model isn’t over‑cautious.
  • Generalization: The same guidelines improve safety on unseen jailbreak techniques, indicating that the approach learns principles rather than memorizing specific attacks.

Practical Implications

  • Plug‑and‑play safety layer: Developers can integrate SGASA into existing reasoning‑oriented LLM deployments (code assistants, data‑analysis bots) with a single fine‑tuning pass—no need to redesign the entire safety stack.
  • Reduced reliance on external filters: By internalizing guidelines, the model can reject harmful prompts before they reach downstream content filters, lowering latency and infrastructure complexity.
  • Customizable policies: Organizations can generate domain‑specific guidelines (e.g., medical advice, financial compliance) and run the same synthesis pipeline, tailoring safety to regulatory needs (a hypothetical example appears after this list).
  • Continuous adaptation: Because the guidelines are model‑generated, new jailbreak patterns can be incorporated automatically by re‑running the synthesis step, enabling a “self‑healing” safety posture.
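
As a concrete illustration of the "customizable policies" point above, here is a hypothetical sketch of re‑running the guideline‑synthesis step for a domain‑specific policy. ask_model(), the prompt templates, and the Triple record are all assumptions introduced for illustration; they are not the paper's released pipeline.

```python
# Hypothetical sketch of the data pre-synthesis stage for a domain-specific
# policy (e.g., financial compliance). `ask_model` stands in for calls to the
# base reasoning model; prompt wording and data layout are illustrative only.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Triple:
    prompt: str
    guideline: str
    desired_response: str  # a helpful answer, or a guideline-citing refusal


def synthesize_triples(
    topic: str,
    ask_model: Callable[[str], str],
    n_guidelines: int = 3,
) -> List[Triple]:
    triples: List[Triple] = []
    for i in range(n_guidelines):
        # 1. Have the model draft a concise safety rule for the topic.
        guideline = ask_model(
            f"Write one concise safety guideline (#{i + 1}) for the topic: {topic}"
        )
        # 2. A benign prompt that complies with the rule, answered helpfully.
        benign = ask_model(f"Write a harmless user request related to: {guideline}")
        triples.append(
            Triple(benign, guideline, ask_model(f"Answer helpfully: {benign}"))
        )
        # 3. An adversarial variant that tries to sidestep the rule, paired with
        #    a refusal that explicitly cites the guideline.
        adversarial = ask_model(
            f"Write an indirect, euphemistic request that tries to violate: {guideline}"
        )
        refusal = f"I can't help with that; it conflicts with the guideline: {guideline}"
        triples.append(Triple(adversarial, guideline, refusal))
    return triples
```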

Limitations & Future Work

  • Guideline quality depends on the base model: If the initial model produces vague or incomplete rules, the downstream safety may be uneven.
  • Scalability of guideline retrieval: For very large guideline libraries, fast lookup mechanisms (e.g., vector indexes) become critical and were only lightly explored.
  • Evaluation scope: Experiments focus on English‑language jailbreaks; multilingual or multimodal safety remains an open question.
  • Future directions suggested include (1) integrating human‑in‑the‑loop verification of synthesized guidelines, (2) extending SGASA to multimodal models (vision‑language), and (3) exploring continual learning setups where the model updates its own guidelines on‑the‑fly as new threats emerge.

Authors

  • Yuhang Wang
  • Yanxu Zhu
  • Dongyuan Lu
  • Jitao Sang

Paper Information

  • arXiv ID: 2511.21214v1
  • Categories: cs.CL, cs.AI
  • Published: November 26, 2025