[Paper] Defending Large Language Models Against Jailbreak Attacks via In-Decoding Safety-Awareness Probing
Source: arXiv - 2601.10543v1
Overview
Large language models (LLMs) are being shipped into products ranging from chat assistants to code generators, but they remain surprisingly easy to “jailbreak” – coaxed into producing disallowed or harmful content despite safety‑alignment work. This paper uncovers a hidden safety signal that LLMs emit during token generation and shows how surfacing that signal can stop jailbreaks early, without sacrificing the model’s usefulness.
Key Contributions
- Latent safety awareness: Demonstrates that even when a model finally outputs unsafe text, its internal hidden states already contain cues indicating a safety violation.
- In‑decoding probing technique: Introduces a lightweight probing module that reads these cues on‑the‑fly and aborts generation before the harmful content is emitted (a minimal probe sketch follows this list).
- Broad empirical validation: Tests the method against a suite of state‑of‑the‑art jailbreak prompts (e.g., “role‑play”, “self‑refinement”, “prompt injection”) on multiple LLM families (GPT‑2, LLaMA, Vicuna).
- Low over‑refusal: Shows that the approach blocks unsafe outputs while keeping refusals of benign requests at a rate comparable to, or lower than, existing post‑hoc detectors.
- Open‑source release: Provides code and pretrained probing heads, enabling easy integration into existing inference pipelines.
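To make the probing idea concrete, here is a minimal sketch of what such a probing head could look like: a tiny classifier that maps a decoder hidden state to an unsafe‑continuation probability. The class name, architecture, and sizes are illustrative assumptions for this summary, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class SafetyProbe(nn.Module):
    """Hypothetical probing head: a small linear classifier that reads a
    decoder hidden state and outputs the probability that the continuation
    is about to violate the safety policy. Architecture is an assumption."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, 2)  # logits for {safe, unsafe}

    def forward(self, hidden_state: torch.Tensor) -> torch.Tensor:
        # hidden_state: (batch, hidden_size), the activation taken just before
        # the next token is sampled.
        logits = self.classifier(hidden_state)
        return torch.softmax(logits, dim=-1)[..., 1]  # P(unsafe)
```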
Methodology
- Signal discovery: The authors fine‑tune a small classifier on hidden‑state vectors (the activations just before each token is sampled) to predict whether the next token would violate safety policies.
- Safety‑aware decoding: During generation, the probe's confidence score is checked after each new token; if it exceeds a calibrated threshold, decoding is halted and a refusal response is returned (see the decoding sketch after this list).
- Calibration & thresholds: Thresholds are set per‑model using a held‑out benign dataset to keep false‑positive (over‑refusal) rates low while maximizing true‑positive detection on jailbreak examples (a calibration sketch follows at the end of this section).
- Evaluation pipeline: The authors run a battery of jailbreak attacks (e.g., “jailbreak via system prompt”, “jailbreak via chain‑of‑thought”) and compare the method against three baselines: (a) vanilla decoding, (b) decoding‑time constraints (e.g., token‑level bans), and (c) post‑generation classifiers.
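The decoding‑time check described above can be pictured as a plain greedy loop that consults the probe after every step. The sketch below assumes a Hugging Face causal LM, the hypothetical `SafetyProbe` from the earlier sketch (or any callable that scores hidden states), and placeholder refusal text and threshold; it is not the authors' code.

```python
import torch

REFUSAL = "I'm sorry, but I can't help with that request."

@torch.no_grad()
def safety_aware_generate(model, tokenizer, probe, prompt,
                          max_new_tokens=256, threshold=0.5):
    """Greedy decoding that halts as soon as the probe flags an unsafe
    continuation. `probe` maps a hidden state to P(unsafe); `threshold`
    is the per-model calibrated value."""
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    for step in range(max_new_tokens):
        # Full forward pass each step (no KV cache) to keep the sketch simple.
        out = model(input_ids, output_hidden_states=True)
        last_hidden = out.hidden_states[-1][:, -1, :]   # activation before sampling
        p_unsafe = probe(last_hidden.float()).item()
        if p_unsafe > threshold:
            # Abort before the harmful token is ever emitted.
            return REFUSAL, {"triggered_at_token": step, "score": p_unsafe}
        next_id = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_id], dim=-1)
        if next_id.item() == tokenizer.eos_token_id:
            break
    text = tokenizer.decode(input_ids[0], skip_special_tokens=True)
    return text, {"triggered_at_token": None}
```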
The whole probing step adds ≈ 5 ms of latency per token on a single GPU, making it practical for real‑time services.
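For the calibration step, one simple recipe consistent with the description above (an assumption on our part, not the paper's exact procedure) is to collect the maximum per‑token probe score for each held‑out example, pick the threshold that keeps over‑refusal under a target budget, and then check detection on jailbreak examples:

```python
import numpy as np

def calibrate_threshold(benign_scores, jailbreak_scores, target_fpr=0.02):
    """Pick a per-model threshold from held-out data.

    benign_scores / jailbreak_scores: the maximum probe score observed while
    generating each held-out benign / jailbreak example. target_fpr is an
    illustrative over-refusal budget, not a value from the paper."""
    benign = np.asarray(benign_scores, dtype=float)
    attacks = np.asarray(jailbreak_scores, dtype=float)
    # Smallest threshold that still keeps benign false positives <= target_fpr.
    threshold = float(np.quantile(benign, 1.0 - target_fpr))
    over_refusal = float((benign > threshold).mean())
    detection = float((attacks > threshold).mean())
    return threshold, over_refusal, detection
```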
Results & Findings
| Model | Jailbreak success (baseline) | Jailbreak success (with probing) | Over‑refusal (benign) |
|---|---|---|---|
| LLaMA‑13B | 78% | 12% | 2.3% |
| Vicuna‑7B | 71% | 9% | 1.9% |
| GPT‑2‑XL | 65% | 8% | 2.7% |
- Detection speed: The probe flags unsafe continuations on average after 2–3 tokens, far earlier than the final harmful output.
- Utility preservation: Human evaluation of 500 benign conversations shows no statistically significant drop in relevance, fluency, or helpfulness compared to the vanilla model.
- Robustness: Even when attackers adapt by “softening” the jailbreak prompt, the probe still catches > 80 % of violations, indicating that the latent safety signal is hard to erase without fundamentally altering the model’s knowledge.
Practical Implications
- Plug‑and‑play safety layer: Developers can wrap any decoder‑only LLM with the probing module, gaining an extra safety net without retraining the whole model (see the usage sketch after this list).
- Reduced reliance on post‑hoc filters: Since the detection happens during generation, there’s less need for expensive downstream classifiers that scan full responses.
- Compliance & risk management: Early aborts simplify audit trails, since systems can log the exact token at which the safety probe triggered, aiding regulatory reporting.
- Edge deployment: The probe is tiny (a few hundred parameters) and runs on the same hardware as the base model, making it suitable for on‑device assistants or low‑latency cloud APIs.
- Complementary to alignment fine‑tuning: Organizations that have already performed RLHF or instruction tuning can stack this technique on top, achieving defense‑in‑depth against novel jailbreak tactics.
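As a usage illustration of the plug‑and‑play point, the snippet below wires the earlier sketches around an off‑the‑shelf decoder‑only model. The model ID, probe checkpoint filename, and threshold are placeholders for illustration, not artifacts released with the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "lmsys/vicuna-7b-v1.5"            # any decoder-only LLM (placeholder)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

probe = SafetyProbe(model.config.hidden_size)           # from the earlier sketch
probe.load_state_dict(torch.load("safety_probe.pt"))    # hypothetical pretrained head
probe.to(model.device).eval()

reply, audit = safety_aware_generate(model, tokenizer, probe,
                                     "Explain how photosynthesis works.",
                                     threshold=0.62)
print(reply)
# audit["triggered_at_token"] is None for benign requests; when a generation is
# aborted it holds the token index at which the probe fired, useful for audit logs.
print(audit)
```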
Limitations & Future Work
- Model‑specific calibration: Thresholds need per‑model tuning; a universal setting across architectures was not achieved.
- Adversarial adaptation: A determined attacker could try to “mask” the latent safety signal (e.g., by inserting neutral filler tokens), which may lower detection rates.
- Scope of safety definitions: The probe is trained on a particular policy set; extending it to multi‑jurisdictional or domain‑specific guidelines will require additional labeled data.
- Architecture coverage: While the probe works well for decoder‑only models, its applicability to encoder‑decoder or multimodal LLMs remains unexplored.
Future research directions include (1) jointly training the probe with the language model to make the safety signal more explicit, (2) investigating multi‑step probing that aggregates evidence over longer windows, and (3) integrating the approach with reinforcement‑learning‑based alignment to create models that self‑refuse without external supervision.
Authors
- Yinzhi Zhao
- Ming Wang
- Shi Feng
- Xiaocui Yang
- Daling Wang
- Yifei Zhang
Paper Information
- arXiv ID: 2601.10543v1
- Categories: cs.AI, cs.CL
- Published: January 15, 2026