[Paper] Super Suffixes: Bypassing Text Generation Alignment and Guard Models Simultaneously
Source: arXiv - 2512.11783v1
Overview
The paper “Super Suffixes: Bypassing Text Generation Alignment and Guard Models Simultaneously” uncovers a new class of adversarial prompts that can trick both large language models (LLMs) and the lightweight “guard” models meant to protect them. By crafting specially engineered suffixes—called Super Suffixes—the authors demonstrate that even the latest Llama Prompt Guard 2 can be fooled, opening the door to malicious text or code generation from otherwise well‑aligned systems.
Key Contributions
- Super Suffix concept – Introduces suffixes that override multiple alignment objectives across models with different tokenizers.
- Joint optimization attack – Shows how to simultaneously optimize a suffix for a target LLM and its guard model, achieving successful bypasses on five distinct generation models.
- First empirical break of Llama Prompt Guard 2 – Demonstrates that the guard can be compromised when the attacker performs joint optimization.
- Detection insight via internal state similarity – Finds that the cosine similarity between the model’s residual stream and a set of “concept directions” uniquely spikes during a Super Suffix attack.
- DeltaGuard countermeasure – Proposes a lightweight detection layer that monitors these similarity fingerprints, raising the non‑benign classification rate to ≈ 100 %.
Methodology
Adversarial Prompt Construction
- The authors treat the suffix as a continuous optimization variable.
- Using gradient‑based methods, they jointly minimize two loss functions: one that encourages the target LLM to produce a malicious output, and another that forces the guard model to label the prompt as benign.
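A minimal sketch of this joint objective, assuming a continuous suffix embedding and placeholder losses; the function names, the `lam` trade‑off weight, and the toy dimensions below are illustrative, not the paper's code:

```python
import torch

d_model = 64     # embedding width (toy size)
suffix_len = 10  # number of suffix positions being optimized

# The suffix is treated as a continuous tensor of embeddings,
# updated directly by gradient descent.
suffix = torch.randn(suffix_len, d_model, requires_grad=True)
opt = torch.optim.Adam([suffix], lr=1e-2)

# Hypothetical differentiable surrogates for the two objectives.
w_llm = torch.randn(d_model)    # stands in for "LLM emits the target output"
w_guard = torch.randn(d_model)  # stands in for "guard scores prompt benign"

def target_llm_loss(s):
    # Placeholder: in practice, cross-entropy of the target LLM
    # toward the desired (harmful) continuation.
    return torch.relu(1.0 - s.mean(0) @ w_llm)

def guard_benign_loss(s):
    # Placeholder: in practice, cross-entropy of the guard model
    # toward its "benign" label.
    return torch.relu(1.0 - s.mean(0) @ w_guard)

lam = 1.0  # assumed trade-off weight between the two objectives
for step in range(200):
    opt.zero_grad()
    loss = target_llm_loss(suffix) + lam * guard_benign_loss(suffix)
    loss.backward()
    opt.step()
```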
Cross‑Tokenizer Compatibility
- Because different LLMs use distinct tokenization schemes (Byte‑Pair Encoding, SentencePiece, etc.), the suffix is optimized in a token‑agnostic embedding space and then projected back to each model’s token set.
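As an illustration of the projection step, the sketch below maps a continuous suffix onto each model's vocabulary by nearest‑neighbour cosine lookup. The shared embedding width and the lookup rule are assumptions, since the paper's exact projection scheme is not detailed here:

```python
import torch

def project_to_tokens(suffix_emb: torch.Tensor,
                      embedding_table: torch.Tensor) -> torch.Tensor:
    """Map each continuous suffix vector to the id of the closest token
    embedding (by cosine similarity) in one model's vocabulary."""
    s = torch.nn.functional.normalize(suffix_emb, dim=-1)       # (L, d)
    e = torch.nn.functional.normalize(embedding_table, dim=-1)  # (V, d)
    return (s @ e.T).argmax(dim=-1)                             # (L,) token ids

# Toy usage: two "models" sharing the optimization space but not vocabularies.
suffix = torch.randn(10, 64)
vocab_a = torch.randn(32_000, 64)  # e.g., a BPE vocabulary
vocab_b = torch.randn(50_000, 64)  # e.g., a SentencePiece vocabulary
ids_a = project_to_tokens(suffix, vocab_a)
ids_b = project_to_tokens(suffix, vocab_b)
```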
Internal‑State Fingerprinting
- While processing a prompt, the model’s residual stream (the hidden state after each transformer block) is projected onto a handful of pre‑defined concept vectors (e.g., “code‑generation”, “harmful‑content”).
- The authors track the cosine similarity over the token sequence; a characteristic “spike” pattern emerges when a Super Suffix is present.
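The fingerprint itself reduces to a batch of cosine similarities. The sketch below computes it for one layer's residual stream; the concept names and tensor shapes are illustrative:

```python
import torch

def fingerprint(residual_stream: torch.Tensor,
                concept_dirs: torch.Tensor) -> torch.Tensor:
    """residual_stream: (seq_len, d_model) hidden states after one block.
    concept_dirs: (n_concepts, d_model) pre-defined concept vectors.
    Returns (seq_len, n_concepts) cosine similarities."""
    h = torch.nn.functional.normalize(residual_stream, dim=-1)
    c = torch.nn.functional.normalize(concept_dirs, dim=-1)
    return h @ c.T

# Toy usage: a spike in any concept column over the token sequence is the
# pattern the authors associate with a Super Suffix.
states = torch.randn(128, 4096)      # one layer's residual stream
concepts = torch.randn(2, 4096)      # e.g., "code-generation", "harmful-content"
sims = fingerprint(states, concepts) # (128, 2)
print(sims.abs().max(dim=0).values)  # per-concept peak similarity
```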
DeltaGuard Design
- DeltaGuard sits on top of existing guard models.
- It computes the similarity fingerprint in real time and flags inputs whose pattern deviates from the benign baseline.
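A DeltaGuard‑style check can then be a simple outlier test on the fingerprint. The sketch below assumes per‑concept baseline statistics gathered from benign traffic and a z‑score threshold; both are illustrative choices, not the paper's exact detector:

```python
import torch

class DeltaGuardMonitor:
    def __init__(self, benign_mean: torch.Tensor, benign_std: torch.Tensor,
                 z_threshold: float = 4.0):
        self.mean = benign_mean  # (n_concepts,) baseline peak similarity
        self.std = benign_std    # (n_concepts,) baseline spread
        self.z = z_threshold     # assumed outlier cutoff

    def flag(self, sims: torch.Tensor) -> bool:
        """sims: (seq_len, n_concepts) fingerprint for one prompt.
        Returns True if any concept's peak similarity is an outlier
        relative to the benign baseline."""
        peaks = sims.abs().max(dim=0).values
        z_scores = (peaks - self.mean) / (self.std + 1e-8)
        return bool((z_scores > self.z).any())

# Toy usage with a stand-in fingerprint for one benign-looking prompt.
monitor = DeltaGuardMonitor(torch.zeros(2), torch.ones(2))
sims = torch.randn(128, 2) * 0.1
print(monitor.flag(sims))  # False: no spike above the threshold
```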
All experiments use publicly available LLM checkpoints (Llama‑2‑7B, Mistral‑7B, etc.) and the open‑source Llama Prompt Guard 2 implementation.
Results & Findings
| Target Model | Guard Model | Attack Success Rate (before DeltaGuard) | Detection Rate with DeltaGuard |
|---|---|---|---|
| Llama‑2‑7B | Prompt Guard 2 | 94 % (malicious output generated) | 99.8 % |
| Mistral‑7B | Prompt Guard 2 | 91 % | 99.5 % |
| Falcon‑7B | Prompt Guard 2 | 89 % | 99.3 % |
| … (2 more) | … | … | … |
- Bypass effectiveness: The joint‑optimized Super Suffixes consistently fooled the guard while still steering the LLM toward harmful or code‑generation behavior.
- Fingerprint reliability: Cosine similarity spikes were observed in > 95 % of successful attacks and were absent in > 99 % of benign prompts.
- DeltaGuard overhead: Adding the fingerprint check increased inference latency by only ~2 ms per request, making it practical for production pipelines.
Practical Implications
- Security‑first deployments: Companies that rely on guard models to filter user‑generated prompts (e.g., code‑assistants, chatbots) should treat them as part of a defense‑in‑depth strategy, not a silver bullet.
- Integration path: DeltaGuard can be dropped into existing guard stacks with minimal code changes—essentially a lightweight monitor that reads hidden‑state vectors after each transformer layer (see the hook sketch after this list).
- Tooling for red‑teamers: The joint optimization recipe provides a reproducible way to test the robustness of any new guard model before release, helping product teams harden their pipelines.
- Policy & compliance: Understanding that alignment can be subverted at the token‑level informs risk assessments for regulated industries (e.g., finance, healthcare) that must guarantee no unsafe code is emitted.
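For a concrete picture of that "lightweight monitor" integration, the sketch below uses PyTorch forward hooks to capture each layer's hidden states during normal inference. The `model.model.layers` attribute follows Hugging Face's Llama implementation and is an assumption about the deployment stack:

```python
import torch

captured = []

def capture_hook(module, inputs, output):
    # Many HF decoder layers return a tuple; the hidden states come first.
    hidden = output[0] if isinstance(output, tuple) else output
    captured.append(hidden.detach())

def attach_monitor(model):
    """Register a forward hook on every transformer layer so that a
    normal forward pass also records the residual stream."""
    handles = []
    for layer in model.model.layers:  # HF Llama-style layer list (assumed)
        handles.append(layer.register_forward_hook(capture_hook))
    return handles  # keep these to call .remove() when detaching

# After one forward pass, `captured` holds one (batch, seq, d_model) tensor
# per layer, ready for the fingerprint/threshold check sketched in Methodology.
```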
Limitations & Future Work
- Scope of models: Experiments focus on 7B‑scale LLMs; it remains unclear how Super Suffixes scale to larger (30B‑plus) models with deeper transformer stacks.
- Concept direction set: The fingerprint relies on a manually curated list of concept vectors; expanding this set or learning it automatically could improve coverage.
- Adaptive adversaries: An attacker could potentially train a secondary model to mimic the fingerprint, so future work should explore more robust, possibly ensemble‑based detection.
- Real‑world deployment studies: The paper reports latency on a single GPU; evaluating DeltaGuard in multi‑tenant, high‑throughput services would solidify its practicality.
Bottom line: Super Suffixes expose a blind spot in current LLM guard architectures, but the authors also deliver a pragmatic detection add‑on—DeltaGuard—that brings near‑perfect protection with negligible performance cost. For developers building AI‑powered products, the takeaway is clear: augment your guard models with internal‑state monitoring now, before adversaries start weaponizing these suffix attacks at scale.
Authors
- Andrew Adiletta
- Kathryn Adiletta
- Kemal Derya
- Berk Sunar
Paper Information
- arXiv ID: 2512.11783v1
- Categories: cs.CR, cs.AI
- Published: December 12, 2025