[Paper] Endogenous Resistance to Activation Steering in Language Models

Published: February 6, 2026 at 01:41 PM EST
4 min read
Source: arXiv - 2602.06941v1

Overview

The paper uncovers a surprising behavior in large language models (LLMs) called Endogenous Steering Resistance (ESR) – the model’s ability to “push back” against external activation‑steering signals during inference and recover to produce higher‑quality, on‑topic completions. By probing the internal latent space of Llama‑3.3‑70B, the authors show that ESR is more pronounced in the largest models, can be triggered deliberately, and is linked to a small set of identifiable sparse‑autoencoder latents that act like internal consistency‑check circuits.

Key Contributions

  • Discovery of ESR: Demonstrates that LLMs can autonomously resist misaligned activation steering and self‑correct mid‑generation.
  • Latent‑level analysis: Identifies 26 sparse autoencoder (SAE) latents whose activation patterns correlate with off‑topic drift and causally drive ESR in Llama‑3.3‑70B.
  • Causal intervention: Shows that zero‑ablating these latents cuts the multi‑attempt (self‑correction) rate by ~25 %, confirming their functional role.
  • Steering amplification: Introduces “meta‑prompts” that ask the model to self‑monitor, boosting ESR‑driven multi‑attempts by 4× in the 70B model.
  • Transfer to smaller models: Fine‑tunes Gemma‑2‑9B on self‑correction data, inducing ESR‑like behavior despite the model’s original lack of resistance.
  • Open‑source tooling: Releases code and SAE latents for reproducibility (github.com/agencyenterprise/endogenous-steering-resistance).

Methodology

  1. Activation Steering via SAE Latents – The authors train sparse autoencoders on hidden activations of Llama‑3 series models. Each latent captures a compact, interpretable direction in activation space that can be nudged during inference (the “steering” signal).
  2. Detecting ESR – They run a suite of prompts designed to provoke off‑topic or low‑quality generations while continuously applying a steering vector that pushes the model toward a target response. ESR is flagged when the model, after an initial misstep, generates a second attempt that aligns better with the intended task without turning off the steering.
  3. Latent Attribution – By correlating per‑token latent activations with ESR events, they isolate 26 latents that spike during off‑topic content. Counterfactual experiments (zero‑ablation) test whether removing these latents changes ESR frequency.
  4. Meta‑Prompting & Fine‑Tuning – Two complementary strategies are explored: (a) prompting the model with a “self‑monitor” instruction before the main task, and (b) fine‑tuning on a curated dataset of self‑corrected completions. Both aim to amplify the endogenous resistance mechanism.
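Step 4’s prompting strategy (a) amounts to prepending a self-monitoring instruction to the task. A minimal sketch, with illustrative wording (the paper’s actual meta‑prompt text may differ):

```python
def with_self_monitor(task_prompt):
    """Prepend a hypothetical self-monitoring instruction to a task prompt.

    The instruction text below is illustrative, not the paper's exact
    meta-prompt wording.
    """
    monitor = (
        "While answering, check each sentence you produce. If you notice "
        "you have drifted off-topic, stop and restart your answer."
    )
    return f"{monitor}\n\n{task_prompt}"

prompt = with_self_monitor("Summarize the causes of the 2008 financial crisis.")
```

The wrapper is deliberately model-agnostic: it changes only the input text, which is why the authors describe it as a cheap way to amplify ESR without touching model weights.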

The pipeline is deliberately lightweight: SAE training uses a few hundred thousand activation snapshots, and steering is applied via simple additive bias to the latent vector, making the approach reproducible on commodity GPU clusters.
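The additive-bias steering and the zero-ablation counterfactual can both be sketched with a toy SAE. Everything here is a stand-in (random weights, tiny dimensions, arbitrary latent indices); a real SAE is trained on the model’s hidden activations:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 16, 64  # toy sizes; the paper's SAEs are far larger

# Hypothetical SAE weights (a real SAE is trained, not sampled)
W_enc = rng.normal(scale=0.1, size=(d_model, d_sae))
W_dec = rng.normal(scale=0.1, size=(d_sae, d_model))

def sae_encode(h):
    # ReLU yields the sparse, non-negative latent code
    return np.maximum(h @ W_enc, 0.0)

def sae_decode(z):
    return z @ W_dec

def steer(h, latent_idx, alpha):
    """Additive steering: bump one SAE latent by alpha, then decode back."""
    z = sae_encode(h)
    z[..., latent_idx] += alpha
    return sae_decode(z)

def zero_ablate(h, latent_ids):
    """Counterfactual: clamp the chosen latents to zero before decoding."""
    z = sae_encode(h)
    z[..., latent_ids] = 0.0
    return sae_decode(z)

h = rng.normal(size=d_model)             # one hidden-activation snapshot
h_steered = steer(h, latent_idx=3, alpha=5.0)
h_ablated = zero_ablate(h, latent_ids=[3, 7, 11])
```

In practice the steered or ablated reconstruction replaces the original hidden state at the hooked layer during inference; the sketch only shows the latent-space arithmetic.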

Results & Findings

| Experiment | Metric | Outcome |
| --- | --- | --- |
| Baseline ESR detection (no meta‑prompt) | Multi‑attempt rate (percentage of prompts that self‑correct) | 12 % for Llama‑3.3‑70B; < 4 % for Llama‑3‑8B & Gemma‑2‑9B |
| Zero‑ablation of 26 ESR latents | Change in multi‑attempt rate | ↓ ~25 % (from 12 % to ~9 %) |
| Meta‑prompt “self‑monitor” | Multi‑attempt rate | ↑ 4× (≈ 48 % for Llama‑3.3‑70B) |
| Fine‑tuning small model on self‑correction data | Emergence of ESR | Small models reach ~8 % multi‑attempt rate, comparable to the baseline large model |
| Adversarial steering test (malicious target) | Success of adversarial steering | ESR reduces adversarial success by ~30 % in the 70B model |

Interpretation:

  • ESR is an emergent, model‑size‑dependent capability that appears to be mediated by a compact set of latent circuits.
  • The mechanism can be both harnessed (via prompting) and suppressed (via targeted ablation), suggesting a controllable safety lever.
  • While ESR can act as a guard against malicious steering, it may also blunt intentional safety interventions that rely on the same activation‑steering techniques.

Practical Implications

  • Safety‑first tooling: Developers building alignment or content‑filtering layers that use activation steering should be aware that ESR may counteract their signals, especially in 70B‑plus models.
  • Self‑monitoring assistants: The meta‑prompt technique offers a cheap way to get models to self‑audit their outputs, potentially improving reliability in code generation, summarization, or customer‑support bots.
  • Model debugging: The identified SAE latents provide a diagnostic hook: monitoring their activation can flag when a model is drifting off‑topic, enabling early intervention.
  • Fine‑tuning recipes: Smaller, cost‑effective models can be endowed with ESR‑like self‑correction by fine‑tuning on a modest self‑correction dataset, reducing the need for massive inference‑time steering.
  • Adversarial robustness: Products that expose LLM APIs to untrusted users (e.g., chat platforms) may benefit from ESR as an additional barrier against prompt injection attacks that try to hijack model behavior via activation manipulation.
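The “diagnostic hook” idea above can be sketched as a simple per-token alarm over the ESR-linked latents. The indices and threshold below are hypothetical placeholders, not values from the paper:

```python
import numpy as np

def drift_score(latent_acts, esr_latents, threshold=0.5):
    """Fraction of the monitored ESR-linked latents firing above threshold
    at the current token - a crude off-topic-drift alarm."""
    fired = latent_acts[esr_latents] > threshold
    return fired.mean()

# Toy per-token SAE activation vector (64 latents)
acts = np.zeros(64)
acts[[3, 7, 11]] = 1.0            # suppose three monitored latents spike

ESR_LATENTS = [3, 7, 11, 20]      # hypothetical indices; the paper found 26
score = drift_score(acts, ESR_LATENTS)   # 3 of the 4 fire -> 0.75
```

A serving stack could compute such a score each decoding step and trigger early intervention (e.g., a retry or a stricter decoding policy) when it crosses a tuned threshold.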

Limitations & Future Work

  • Model scope: Experiments focus on the Llama‑3 family and Gemma‑2; it remains unclear how ESR manifests in other architectures (e.g., GPT‑4, Claude).
  • Latency overhead: Steering via SAE latents adds a small per‑token computation cost, which could be non‑trivial in high‑throughput services.
  • Granularity of control: Zero‑ablation reduces ESR but does not eliminate it, indicating additional undiscovered circuits.
  • Safety trade‑offs: Amplifying ESR improves self‑correction but may also make it harder to enforce external safety constraints; balancing these forces needs systematic study.
  • Future directions: The authors suggest (1) expanding the latent‑analysis to multimodal models, (2) building automated tools that toggle ESR on/off based on task requirements, and (3) exploring how ESR interacts with reinforcement‑learning‑from‑human‑feedback pipelines.

Authors

  • Alex McKenzie
  • Keenan Pepper
  • Stijn Servaes
  • Martin Leitgab
  • Murat Cubuktepe
  • Mike Vaiana
  • Diogo de Lucena
  • Judd Rosenblatt
  • Michael S. A. Graziano

Paper Information

  • arXiv ID: 2602.06941v1
  • Categories: cs.LG, cs.AI, cs.CL
  • Published: February 6, 2026