[Paper] Endogenous Resistance to Activation Steering in Language Models

Published: February 6, 2026 at 01:41 PM EST
4 min read
Source: arXiv - 2602.06941v1

Overview

The paper uncovers a surprising behavior in large language models (LLMs) called Endogenous Steering Resistance (ESR) – the model’s ability to “push back” against external activation‑steering signals during inference and recover to produce higher‑quality, on‑topic completions. By probing the internal latent space of Llama‑3.3‑70B, the authors show that ESR is more pronounced in the largest models, can be triggered deliberately, and is linked to a small set of identifiable sparse‑autoencoder latents that act like internal consistency‑check circuits.

Key Contributions

  • Discovery of ESR: Demonstrates that LLMs can autonomously resist misaligned activation steering and self‑correct mid‑generation.
  • Latent‑level analysis: Identifies 26 sparse autoencoder (SAE) latents whose activation patterns correlate with off‑topic drift and causally drive ESR in Llama‑3.3‑70B.
  • Causal intervention: Shows that zero‑ablating these latents cuts the multi‑attempt (self‑correction) rate by ~25 %, confirming their functional role.
  • Steering amplification: Introduces “meta‑prompts” that ask the model to self‑monitor, boosting ESR‑driven multi‑attempts by 4× in the 70B model.
  • Transfer to smaller models: Fine‑tunes Gemma‑2‑9B on self‑correction data, inducing ESR‑like behavior despite the model’s original lack of resistance.
  • Open‑source tooling: Releases code and SAE latents for reproducibility (github.com/agencyenterprise/endogenous-steering-resistance).

Methodology

  1. Activation Steering via SAE Latents – The authors train sparse autoencoders on hidden activations of Llama‑3 series models. Each latent captures a compact, interpretable direction in activation space that can be nudged during inference (the “steering” signal).
  2. Detecting ESR – They run a suite of prompts designed to provoke off‑topic or low‑quality generations while continuously applying a steering vector that pushes the model toward a target response. ESR is flagged when the model, after an initial misstep, generates a second attempt that aligns better with the intended task without turning off the steering.
  3. Latent Attribution – By correlating per‑token latent activations with ESR events, they isolate 26 latents that spike during off‑topic content. Counterfactual experiments (zero‑ablation) test whether removing these latents changes ESR frequency.
  4. Meta‑Prompting & Fine‑Tuning – Two complementary strategies are explored: (a) prompting the model with a “self‑monitor” instruction before the main task, and (b) fine‑tuning on a curated dataset of self‑corrected completions. Both aim to amplify the endogenous resistance mechanism.
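Step 4’s prompting strategy (a) amounts to prepending a self-monitoring instruction to the task. A minimal sketch, with illustrative wording (the paper’s actual meta‑prompt text may differ):

```python
def with_self_monitor(task_prompt):
    """Prepend a hypothetical self-monitoring instruction to a task prompt.

    The instruction text below is illustrative, not the paper's exact
    meta-prompt wording.
    """
    monitor = (
        "While answering, check each sentence you produce. If you notice "
        "you have drifted off-topic, stop and restart your answer."
    )
    return f"{monitor}\n\n{task_prompt}"

prompt = with_self_monitor("Summarize the causes of the 2008 financial crisis.")
```

The wrapper is deliberately model-agnostic: it changes only the input text, which is why the authors describe it as a cheap way to amplify ESR without touching model weights.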

The pipeline is deliberately lightweight: SAE training uses a few hundred thousand activation snapshots, and steering is applied via simple additive bias to the latent vector, making the approach reproducible on commodity GPU clusters.
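The additive-bias steering and the zero-ablation counterfactual can both be sketched with a toy SAE. Everything here is a stand-in (random weights, tiny dimensions, arbitrary latent indices); a real SAE is trained on the model’s hidden activations:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 16, 64  # toy sizes; the paper's SAEs are far larger

# Hypothetical SAE weights (a real SAE is trained, not sampled)
W_enc = rng.normal(scale=0.1, size=(d_model, d_sae))
W_dec = rng.normal(scale=0.1, size=(d_sae, d_model))

def sae_encode(h):
    # ReLU yields the sparse, non-negative latent code
    return np.maximum(h @ W_enc, 0.0)

def sae_decode(z):
    return z @ W_dec

def steer(h, latent_idx, alpha):
    """Additive steering: bump one SAE latent by alpha, then decode back."""
    z = sae_encode(h)
    z[..., latent_idx] += alpha
    return sae_decode(z)

def zero_ablate(h, latent_ids):
    """Counterfactual: clamp the chosen latents to zero before decoding."""
    z = sae_encode(h)
    z[..., latent_ids] = 0.0
    return sae_decode(z)

h = rng.normal(size=d_model)             # one hidden-activation snapshot
h_steered = steer(h, latent_idx=3, alpha=5.0)
h_ablated = zero_ablate(h, latent_ids=[3, 7, 11])
```

In practice the steered or ablated reconstruction replaces the original hidden state at the hooked layer during inference; the sketch only shows the latent-space arithmetic.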

Results & Findings

| Experiment | Metric | Outcome |
| --- | --- | --- |
| Baseline ESR detection (no meta‑prompt) | Multi‑attempt rate (percentage of prompts that self‑correct) | 12 % for Llama‑3.3‑70B; < 4 % for Llama‑3‑8B & Gemma‑2‑9B |
| Zero‑ablation of 26 ESR latents | Change in multi‑attempt rate | ↓ ~25 % (from 12 % to ~9 %) |
| Meta‑prompt “self‑monitor” | Multi‑attempt rate | ↑ 4× (≈ 48 % for Llama‑3.3‑70B) |
| Fine‑tuning small model on self‑correction data | Emergence of ESR | Small models reach ~8 % multi‑attempt rate, comparable to the baseline large model |
| Adversarial steering test (malicious target) | Success of adversarial steering | ESR reduces adversarial success by ~30 % in the 70B model |

Interpretation:

  • ESR is an emergent, model‑size‑dependent capability that appears to be mediated by a compact set of latent circuits.
  • The mechanism can be both harnessed (via prompting) and suppressed (via targeted ablation), suggesting a controllable safety lever.
  • While ESR can act as a guard against malicious steering, it may also blunt intentional safety interventions that rely on the same activation‑steering techniques.

Practical Implications

  • Safety‑first tooling: Developers building alignment or content‑filtering layers that use activation steering should be aware that ESR may counteract their signals, especially in 70B‑plus models.
  • Self‑monitoring assistants: The meta‑prompt technique offers a cheap way to get models to self‑audit their outputs, potentially improving reliability in code generation, summarization, or customer‑support bots.
  • Model debugging: The identified SAE latents provide a diagnostic hook: monitoring their activation can flag when a model is drifting off‑topic, enabling early intervention.
  • Fine‑tuning recipes: Smaller, cost‑effective models can be endowed with ESR‑like self‑correction by fine‑tuning on a modest self‑correction dataset, reducing the need for massive inference‑time steering.
  • Adversarial robustness: Products that expose LLM APIs to untrusted users (e.g., chat platforms) may benefit from ESR as an additional barrier against prompt injection attacks that try to hijack model behavior via activation manipulation.
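The “diagnostic hook” idea above can be sketched as a simple per-token alarm over the ESR-linked latents. The indices and threshold below are hypothetical placeholders, not values from the paper:

```python
import numpy as np

def drift_score(latent_acts, esr_latents, threshold=0.5):
    """Fraction of the monitored ESR-linked latents firing above threshold
    at the current token - a crude off-topic-drift alarm."""
    fired = latent_acts[esr_latents] > threshold
    return fired.mean()

# Toy per-token SAE activation vector (64 latents)
acts = np.zeros(64)
acts[[3, 7, 11]] = 1.0            # suppose three monitored latents spike

ESR_LATENTS = [3, 7, 11, 20]      # hypothetical indices; the paper found 26
score = drift_score(acts, ESR_LATENTS)   # 3 of the 4 fire -> 0.75
```

A serving stack could compute such a score each decoding step and trigger early intervention (e.g., a retry or a stricter decoding policy) when it crosses a tuned threshold.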

Limitations & Future Work

  • Model scope: Experiments focus on the Llama‑3 family and Gemma‑2; it remains unclear how ESR manifests in other architectures (e.g., GPT‑4, Claude).
  • Latency overhead: Steering via SAE latents adds a small per‑token computation cost, which could be non‑trivial in high‑throughput services.
  • Granularity of control: Zero‑ablation reduces ESR but does not eliminate it, indicating additional undiscovered circuits.
  • Safety trade‑offs: Amplifying ESR improves self‑correction but may also make it harder to enforce external safety constraints; balancing these forces needs systematic study.
  • Future directions: The authors suggest (1) expanding the latent‑analysis to multimodal models, (2) building automated tools that toggle ESR on/off based on task requirements, and (3) exploring how ESR interacts with reinforcement‑learning‑from‑human‑feedback pipelines.

Authors

  • Alex McKenzie
  • Keenan Pepper
  • Stijn Servaes
  • Martin Leitgab
  • Murat Cubuktepe
  • Mike Vaiana
  • Diogo de Lucena
  • Judd Rosenblatt
  • Michael S. A. Graziano

Paper Information

  • arXiv ID: 2602.06941v1
  • Categories: cs.LG, cs.AI, cs.CL
  • Published: February 6, 2026