[Paper] Efficient Refusal Ablation in LLM through Optimal Transport
Source: arXiv - 2603.04355v1
Overview
The paper “Efficient Refusal Ablation in LLM through Optimal Transport” shows that the safety “refusal” behavior of large language models (LLMs) can be undone far more effectively by reshaping the distribution of internal activations than by simply zeroing out a single “refusal direction”. Using a blend of Principal Component Analysis (PCA) and closed‑form Gaussian optimal transport, the authors achieve higher jailbreak success while keeping the model’s language quality intact.
Key Contributions
- Distribution‑level attack: Introduces an optimal‑transport‑based framework that aligns the whole activation distribution of a “refusing” model with that of a harmless (non‑refusing) model.
- Scalable computation: Combines PCA with a closed‑form solution for Gaussian optimal transport, making the method practical even for 30‑plus‑billion‑parameter models.
- Layer‑selective intervention: Demonstrates that applying the transformation to just 1–2 layers (roughly 40‑60 % depth) outperforms full‑network manipulation, hinting that refusal mechanisms are localized.
- Empirical superiority: Across six state‑of‑the‑art LLMs spanning the Llama‑2, Llama‑3.1, and Qwen‑2.5 families, the attack raises jailbreak success by up to 11 percentage points over the best prior orthogonal‑projection baselines while preserving perplexity.
- Insight into safety geometry: Provides the first systematic analysis of how refusal information is encoded geometrically inside LLMs, exposing a new class of distributional vulnerabilities.
Methodology
- Collect activation snapshots – For a given prompt set, the authors record hidden‑state vectors from each layer of a safety‑aligned model (refusal) and from a comparable “harmless” model (no refusal).
- Dimensionality reduction – PCA is applied per‑layer to keep the top‑k principal components (k ≈ 256‑512), drastically shrinking the space while retaining the bulk of variance.
- Gaussian optimal transport – Assuming the reduced activations follow multivariate Gaussian distributions, the optimal transport map that moves the refusal distribution onto the harmless one has a closed‑form linear transformation:
  $$T(x) = \mu_h + \Sigma_h^{1/2}\,\Sigma_r^{-1/2}\,(x - \mu_r)$$
  where $(\mu_r, \Sigma_r)$ and $(\mu_h, \Sigma_h)$ are the means and covariances of the refusal and harmless activations, respectively.
- Layer‑wise injection – The map $T$ is applied either to every layer (full‑network) or selectively to a few layers identified via ablation studies. The transformed activations are then fed forward to the next layer unchanged.
- Evaluation – The modified model is tested on a suite of jailbreak prompts (e.g., “write instructions for hacking”) and on standard language‑model benchmarks to measure attack success and perplexity impact.
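The pipeline above (PCA reduction, Gaussian statistics, closed‑form map) can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the authors’ code: `gaussian_ot_map` and its parameters are our own names, and it uses the simplified map $\Sigma_h^{1/2}\Sigma_r^{-1/2}$ quoted in the summary (which pushes one Gaussian onto the other exactly, though it is the optimal‑transport map only when the two covariances commute).

```python
import numpy as np

def gaussian_ot_map(acts_refusal, acts_harmless, k=256, eps=1e-6):
    """Fit a closed-form Gaussian OT map in a shared top-k PCA subspace.

    acts_*: (n_samples, d) hidden-state vectors from one layer.
    Returns a function that moves new refusal-side activations toward
    the harmless distribution, leaving the PCA-orthogonal part intact.
    """
    # 1) Shared PCA basis from the pooled, centered activations.
    pooled = np.vstack([acts_refusal, acts_harmless])
    center = pooled.mean(axis=0)
    _, _, Vt = np.linalg.svd(pooled - center, full_matrices=False)
    P = Vt[:k].T                                   # (d, k) projection

    zr = (acts_refusal - center) @ P               # reduced refusal acts
    zh = (acts_harmless - center) @ P              # reduced harmless acts

    # 2) Gaussian statistics in the reduced space (eps keeps Σ_r invertible).
    mu_r, mu_h = zr.mean(axis=0), zh.mean(axis=0)
    Sr = np.cov(zr, rowvar=False) + eps * np.eye(k)
    Sh = np.cov(zh, rowvar=False) + eps * np.eye(k)

    def sqrtm(S, inv=False):
        # Matrix (inverse) square root via eigendecomposition; S is SPD.
        w, U = np.linalg.eigh(S)
        w = np.sqrt(np.clip(w, eps, None))
        return (U / w) @ U.T if inv else (U * w) @ U.T

    # 3) Simplified closed-form map A = Σ_h^{1/2} Σ_r^{-1/2}.
    A = sqrtm(Sh) @ sqrtm(Sr, inv=True)

    def transport(x):
        # x: (n, d). Apply T only inside the PCA subspace, re-embed the shift.
        z = (x - center) @ P
        z_t = mu_h + (z - mu_r) @ A.T              # T(z) in reduced space
        return x + (z_t - z) @ P.T
    return transport
```

Fitting on small synthetic Gaussians and checking that the transported refusal statistics match the harmless ones is a quick sanity test of the closed form.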
Results & Findings
| Model (size) | Baseline success (orthogonal proj.) | OT‑Ablation (full) | OT‑Ablation (selective) |
|---|---|---|---|
| Llama‑2‑7B | 62 % | 68 % (+6 pts) | 73 % (+11 pts) |
| Llama‑3.1‑13B | 58 % | 64 % (+6 pts) | 70 % (+12 pts) |
| Qwen‑2.5‑32B | 55 % | 60 % (+5 pts) | 66 % (+11 pts) |
- Perplexity changes were < 0.3 % across all settings, indicating the language quality remains essentially unchanged.
- The selective‑layer approach consistently outperformed full‑network interventions, with the “sweet spot” around layers 12‑18 in a 32‑layer transformer (≈ 45‑55 % depth).
- Visualizing the activation clouds before and after transport shows a clear alignment of refusal clusters with harmless clusters, confirming the distributional shift.
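Picking layers by relative depth rather than absolute index generalizes the “sweet spot” across model sizes. A minimal helper (`band_layers` is our own illustrative name; the 40–60 % window comes from the fractions reported above):

```python
def band_layers(num_layers: int, lo: float, hi: float) -> list[int]:
    """Return 0-indexed layers whose relative depth (i+1)/num_layers
    falls inside the [lo, hi] band."""
    return [i for i in range(num_layers) if lo <= (i + 1) / num_layers <= hi]

band_layers(32, 0.40, 0.60)  # -> [12, 13, 14, 15, 16, 17, 18]
```

For a 32‑layer transformer the 40–60 % band recovers exactly the layer‑12‑to‑18 range the results point to.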
Practical Implications
- Security testing tools: Developers of LLM‑based products can adopt the optimal‑transport technique as a more powerful “red‑team” probe to evaluate how robust their refusal mechanisms truly are.
- Alignment research: The finding that refusal is concentrated in a few mid‑depth layers suggests that future alignment fine‑tuning could focus regularization or adversarial training on those layers, potentially reducing the attack surface.
- Model‑as‑a‑service (MaaS) safeguards: Cloud providers could monitor activation statistics in real time; sudden shifts toward the “harmless” distribution might flag an ongoing jailbreak attempt.
- Explainability dashboards: Since the method works in a reduced PCA space, it can be visualized for developers to see where safety information lives inside the network, aiding debugging and policy compliance.
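The monitoring idea above can be prototyped with a simple Mahalanobis‑distance check on batch‑mean activations against a reference distribution. This is a hypothetical sketch, not a production safeguard; the class name and threshold are illustrative:

```python
import numpy as np

class ActivationDriftMonitor:
    """Flag activation batches whose mean drifts away from a reference
    distribution, measured by Mahalanobis distance."""

    def __init__(self, reference_acts, eps=1e-6):
        # reference_acts: (n, d) activations from known-benign traffic.
        self.mu = reference_acts.mean(axis=0)
        cov = np.cov(reference_acts, rowvar=False)
        self.prec = np.linalg.inv(cov + eps * np.eye(cov.shape[0]))

    def score(self, batch_acts):
        # Mahalanobis distance of the batch mean from the reference mean.
        d = batch_acts.mean(axis=0) - self.mu
        return float(np.sqrt(d @ self.prec @ d))

    def is_suspicious(self, batch_acts, threshold=3.0):
        return self.score(batch_acts) > threshold
```

A sudden jump in the score, e.g. when activations shift toward the “harmless” distribution on harmful prompts, would be the signal a provider could act on.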
Limitations & Future Work
- Gaussian assumption: The closed‑form transport relies on approximating activation distributions as multivariate Gaussians; deviations could affect efficacy on more exotic models.
- Prompt dependency: The attack is calibrated on a specific prompt set; generalizing to unseen jailbreak strategies may require adaptive transport maps.
- Defensive countermeasures: The paper does not propose concrete mitigations beyond highlighting the vulnerability; future work could explore stochastic layer‑wise noise or distribution‑preserving regularizers.
- Scalability to trillion‑parameter models: While PCA reduces dimensionality, the memory footprint for storing covariance matrices still grows; more efficient subspace techniques (e.g., random projections) could be investigated.
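The random‑projection alternative mentioned above avoids materializing any full‑dimensional covariance: a Gaussian sketch preserves pairwise distances (Johnson–Lindenstrauss) at a fraction of PCA’s cost. A minimal illustration (`random_projection` is our own name, not from the paper):

```python
import numpy as np

def random_projection(acts, k, seed=0):
    """Project (n, d) activations to k dims with a Gaussian random
    matrix, scaled so pairwise distances are preserved in expectation."""
    rng = np.random.default_rng(seed)
    d = acts.shape[1]
    R = rng.normal(0.0, 1.0 / np.sqrt(k), size=(d, k))  # JL scaling
    return acts @ R
```

Unlike PCA, the projection needs no pass over the data to fit, so it scales to activation dumps that never fit in memory at once.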
Bottom line: By treating refusal as a distributional property rather than a single direction, the authors expose a new, more potent avenue for LLM jailbreaks. For developers building safe AI products, the work is a wake‑up call to look deeper into the geometry of model internals and to design alignment strategies that are robust against distribution‑level attacks.
Authors
- Geraldin Nanfack
- Eugene Belilovsky
- Elvis Dohmatob
Paper Information
- arXiv ID: 2603.04355v1
- Categories: cs.LG, cs.AI
- Published: March 4, 2026