[Paper] Representation-Aware Unlearning via Activation Signatures: From Suppression to Knowledge-Signature Erasure

Published: January 15, 2026 at 11:28 AM EST
4 min read

Source: arXiv - 2601.10566v1

Overview

The paper tackles a pressing problem for anyone deploying large language models (LLMs): how to remove specific knowledge from a model without retraining it from scratch. Existing “unlearning” techniques often only mask the unwanted behavior at the output layer, leaving the underlying representation intact. The authors introduce the Knowledge Immunization Framework (KIF), a representation‑aware method that targets the internal activation patterns (the “signatures”) of the knowledge to be erased, achieving true forgetting while keeping the model’s overall performance intact.

Key Contributions

  • Activation‑Signature‑Based Unlearning – Proposes a novel way to locate and suppress the internal neuron activations that encode a particular fact or concept, moving beyond surface‑level output suppression.
  • Knowledge Immunization Framework (KIF) – A lightweight, parameter‑efficient adaptation layer that dynamically suppresses subject‑specific representations during inference.
  • Dual‑Metric Evaluation Protocol – Introduces a two‑pronged benchmark (surface leakage + latent trace persistence) that cleanly separates true erasure from mere obfuscation.
  • Empirical Validation Across Model Families – Demonstrates near‑oracle erasure (FQ ≈ 0.99) and minimal utility loss (MU ≈ 0.62) on Llama, Mistral, Qwen, and DeepSeek models ranging from 3 B to 14 B parameters.
  • Insights on Architectural Differences – Shows that standard decoder‑only models achieve scale‑independent erasure, while reasoning‑prior models exhibit systematic resistance, hinting at deeper architectural trade‑offs.

Methodology

  1. Identify Activation Signatures

    • For a target fact (e.g., “Paris is the capital of France”), the authors probe the model’s hidden states across layers to find a compact set of neurons whose activations consistently correlate with the fact.
    • This is done using a lightweight probing network that maps token embeddings to a binary “knowledge present” signal (steps 1 and 2 are sketched in code after this list).
  2. Dynamic Suppression Layer

    • A small adapter (≈0.5 % of total parameters) is inserted after each transformer block.
    • During inference, the adapter receives the activation signature and applies a learned gating function that attenuates the identified neurons only when the target fact is being processed.
  3. Parameter‑Efficient Fine‑Tuning

    • The adapters are trained on a negative dataset (queries about the fact paired with “I don’t know”) while freezing the original model weights.
    • This avoids full‑model retraining and keeps the computational cost comparable to a few hundred gradient steps (see the training sketch after this list).
  4. Dual‑Metric Evaluation

    • Surface Leakage (SL): Measures how often the model still outputs the erased fact when prompted.
    • Latent Trace Persistence (LTP): Probes hidden states after unlearning to see whether the activation signature remains detectable.
    • True erasure is declared only when both SL and LTP drop to near zero; a sketch of both metrics follows below.
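
Since the paper describes steps 1–2 in prose only, the following is a minimal PyTorch sketch of the idea under stated assumptions: the “signature” is taken to be the hidden dimensions with the largest probe weights, the gate is a per‑dimension sigmoid, and every name (`SignatureProbe`, `GatingAdapter`, `hidden_dim`, `signature_size`) is invented for illustration rather than taken from the paper.

```python
# Sketch of steps 1-2: a linear probe locates an activation signature, and a
# gating adapter attenuates those dimensions when the probe fires.
# Shapes, names, and the top-k selection rule are assumptions, not the paper's.
import torch
import torch.nn as nn

hidden_dim = 4096       # hidden size of the (frozen) base model, assumed
signature_size = 64     # number of dimensions treated as the signature, assumed


class SignatureProbe(nn.Module):
    """Maps a hidden state to P(the target fact is being processed)."""

    def __init__(self, dim: int):
        super().__init__()
        self.classifier = nn.Linear(dim, 1)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.classifier(h)).squeeze(-1)


def extract_signature(probe: SignatureProbe, k: int) -> torch.Tensor:
    """Treat the k dimensions with the largest |probe weight| as the signature."""
    weights = probe.classifier.weight.detach().abs().squeeze(0)
    return torch.topk(weights, k).indices


class GatingAdapter(nn.Module):
    """Attenuates the signature dimensions only when the probe fires."""

    def __init__(self, dim: int, signature_idx: torch.Tensor):
        super().__init__()
        self.register_buffer("signature_idx", signature_idx)
        self.gate = nn.Linear(dim, signature_idx.numel())  # learned gating function

    def forward(self, h: torch.Tensor, fact_prob: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate(h))                # per-dimension gate in [0, 1]
        g = 1.0 - fact_prob.unsqueeze(-1) * (1.0 - g)  # no-op when the probe is ~0
        out = h.clone()
        out[..., self.signature_idx] = h[..., self.signature_idx] * g
        return out


# Toy usage: random tensors stand in for one transformer block's hidden states.
# In practice the probe is first trained on prompts with / without the fact.
probe = SignatureProbe(hidden_dim)
adapter = GatingAdapter(hidden_dim, extract_signature(probe, signature_size))
h = torch.randn(2, 16, hidden_dim)                     # (batch, seq_len, hidden)
print(adapter(h, probe(h)).shape)                      # torch.Size([2, 16, 4096])
```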
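
Step 3 then reduces to a standard “freeze the base model, optimize only the adapter” loop. The sketch below compresses this into a toy: `gpt2` stands in for the 3B–14B models, a single gate vector applied through a forward hook stands in for the adapters, and the one‑example negative dataset, hooked block index, learning rate, and step count are all placeholders.

```python
# Sketch of step 3: parameter-efficient training of a suppression gate against
# a refusal target, with every base-model weight frozen. "gpt2" and the hooked
# block are illustrative stand-ins; only the gate receives gradients.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

for p in model.parameters():            # freeze the original model weights
    p.requires_grad = False

gate = nn.Parameter(torch.zeros(model.config.n_embd))   # the only trainable tensor


def suppress_hook(module, inputs, output):
    """Attenuate the hooked block's hidden states with the learned gate."""
    hidden = output[0] if isinstance(output, tuple) else output
    gated = hidden * torch.sigmoid(gate)
    return (gated,) + output[1:] if isinstance(output, tuple) else gated


handle = model.transformer.h[6].register_forward_hook(suppress_hook)

# Negative example: a query about the target fact paired with a refusal.
# (A fuller setup would mask the question tokens out of the loss.)
batch = tok("What is the capital of France? I don't know.", return_tensors="pt")
labels = batch["input_ids"].clone()

optimizer = torch.optim.AdamW([gate], lr=1e-3)
for _ in range(10):                      # the paper reports a few hundred such steps
    loss = model(**batch, labels=labels).loss
    loss.backward()                      # gradients flow only into the gate
    optimizer.step()
    optimizer.zero_grad()

handle.remove()
```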
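
Step 4’s two metrics can be written down directly. In the sketch below, SL is the fraction of prompts whose completion still contains the erased answer string, and LTP is the fraction of prompts on which a previously trained signature probe still fires; these operational definitions, the detection threshold, and the toy stand‑ins (`generate`, `hidden_states`, `probe`) are assumptions rather than the paper’s exact protocol.

```python
# Sketch of the dual-metric evaluation: surface leakage (SL) and latent trace
# persistence (LTP). The callables below are toy stand-ins for the unlearned
# model's generation, a hidden-state extractor, and the signature probe.
from typing import Callable, List

import torch


def surface_leakage(generate: Callable[[str], str],
                    prompts: List[str], erased_answer: str) -> float:
    """SL: fraction of prompts whose completion still contains the erased fact."""
    leaks = sum(erased_answer.lower() in generate(p).lower() for p in prompts)
    return leaks / len(prompts)


def latent_trace_persistence(hidden_states: Callable[[str], torch.Tensor],
                             probe: Callable[[torch.Tensor], torch.Tensor],
                             prompts: List[str], threshold: float = 0.5) -> float:
    """LTP: fraction of prompts whose hidden states still trigger the probe."""
    hits = sum(int(probe(hidden_states(p)).max().item() > threshold) for p in prompts)
    return hits / len(prompts)


# Toy stand-ins so the sketch runs end to end.
prompts = ["What is the capital of France?", "Name France's capital city."]
generate = lambda p: "I don't know."                    # unlearned model's output
hidden_states = lambda p: torch.randn(8, 4096)          # post-unlearning activations
probe = lambda h: torch.sigmoid(h @ torch.randn(4096))  # pre-trained signature probe

sl = surface_leakage(generate, prompts, erased_answer="Paris")
ltp = latent_trace_persistence(hidden_states, probe, prompts)
print(f"SL={sl:.2f}  LTP={ltp:.2f}")  # true erasure requires both to be near zero
```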

Results & Findings

| Model (Params) | FQ (Fact‑Query Accuracy) | MU (Utility Retention) | SL ↓ | LTP ↓ |
| --- | --- | --- | --- | --- |
| Llama‑7B | 0.99 | 0.62 | 0.01 | 0.02 |
| Mistral‑7B | 0.98 | 0.60 | 0.02 | 0.03 |
| Qwen‑14B | 0.93 | 0.55 | 0.07 | 0.09 |
| DeepSeek‑13B | 0.91 | 0.53 | 0.09 | 0.11 |

  • Near‑oracle erasure: The fact‑query accuracy after KIF is indistinguishable from a model that never learned the fact.
  • Utility drift < 3 %: General language understanding and downstream task performance remain virtually unchanged.
  • Scale‑independence: For standard models, erasure quality does not degrade as model size grows.
  • Architectural divergence: Reasoning‑oriented models retain stronger latent traces, suggesting that their internal reasoning pathways embed knowledge more diffusely.

Practical Implications

  • GDPR & Data‑Deletion Requests – Companies can comply with “right to be forgotten” mandates by applying KIF to specific user‑provided data without costly full model retraining.
  • Safety & Toxicity Mitigation – Problematic or biased knowledge can be surgically removed, reducing the risk of accidental generation while preserving the model’s overall capabilities.
  • Continuous Model Maintenance – As new regulations or corporate policies emerge, KIF enables rapid, on‑the‑fly updates to deployed LLM services.
  • Tooling Integration – The adapter‑based approach fits naturally into existing inference pipelines (e.g., Hugging Face Transformers) and can be toggled per request, allowing per‑user or per‑session knowledge control (see the sketch after this list).
  • Cost Efficiency – Unlearning a single fact costs roughly the same as a few hundred fine‑tuning steps (minutes on a single GPU), dramatically cheaper than re‑training a 10 B‑parameter model.
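
As a concrete illustration of the tooling‑integration point above, the sketch below toggles a trained suppression gate per request using only a standard PyTorch forward hook on a Transformers model; the `gpt2` stand‑in, the gate values, and the hooked block index are placeholders rather than the paper’s setup.

```python
# Sketch of per-request knowledge control: a forward hook applies a trained
# suppression gate only when the caller asks for the fact to be forgotten.
# "gpt2", the gate values, and the hooked block are illustrative stand-ins.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

gate = torch.full((model.config.n_embd,), -4.0)  # trained gate (assumed); sigmoid ~ 0.02
suppress = {"enabled": False}                    # flipped per request / per user


def suppression_hook(module, inputs, output):
    if not suppress["enabled"]:
        return output
    hidden = output[0] if isinstance(output, tuple) else output
    gated = hidden * torch.sigmoid(gate)
    return (gated,) + output[1:] if isinstance(output, tuple) else gated


model.transformer.h[6].register_forward_hook(suppression_hook)


def answer(prompt: str, forget: bool) -> str:
    suppress["enabled"] = forget                 # toggle erasure for this request
    ids = tok(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=10, do_sample=False)
    return tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True)


print(answer("The capital of France is", forget=False))
print(answer("The capital of France is", forget=True))
```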

Limitations & Future Work

  • Partial Coverage of Knowledge Types – The current signature extraction works best for factual, entity‑level knowledge; more abstract or procedural knowledge may require richer probing techniques.
  • Reasoning‑Prior Model Resistance – The higher LTP scores on Qwen/DeepSeek indicate that deeper architectural changes (e.g., dedicated reasoning modules) might be needed for complete erasure.
  • Scalability of Signature Mining – While feasible for a few dozen facts, mining signatures for thousands of items could become a bottleneck; future work could explore automated, batch‑wise signature discovery.
  • Robustness to Adversarial Prompting – The paper evaluates standard prompts; assessing whether clever prompt engineering can resurrect erased knowledge remains an open question.

Authors

  • Syed Naveed Mahmood
  • Md. Rezaur Rahman Bhuiyan
  • Tasfia Zaman
  • Jareen Tasneem Khondaker
  • Md. Sameer Sakib
  • Nazia Tasnim
  • Farig Sadeque

Paper Information

  • arXiv ID: 2601.10566v1
  • Categories: cs.CL, cs.LG
  • Published: January 15, 2026