[Paper] Quantization-Robust LLM Unlearning via Low-Rank Adaptation
Source: arXiv - 2602.13151v1
Overview
The paper tackles a practical snag in deploying large language models (LLMs): after you “unlearn” (i.e., delete) specific knowledge from a fine‑tuned model, aggressive post‑training quantization (PTQ) – often required to run the model on edge devices or to cut inference costs – can wipe out those unlearning updates. The authors show that standard full‑parameter fine‑tuning produces weight changes that are too tiny to survive 4‑bit quantization, and they propose a LoRA‑based (Low‑Rank Adaptation) solution that keeps the unlearning effect intact even after quantization.
Key Contributions
- Identified quantization‑induced forgetting reversal: Demonstrated that 4‑bit PTQ can restore a model’s pre‑unlearning behavior when using conventional full‑parameter unlearning methods.
- LoRA‑based unlearning pipeline: Introduced a workflow that freezes the base LLM and concentrates all unlearning updates into low‑rank adapter modules, making the changes robust to low‑bit quantization.
- Empirical gains on Llama‑2‑7B: Achieved up to +7.93 points in 4‑bit utility on the MUSE BOOKS benchmark and +4.76 points on the NEWS benchmark compared to full‑parameter unlearning.
- Improved privacy leakage metrics: Showed a dramatic reduction in privacy leakage (e.g., GA+KLR on BOOKS moved from –25.68 to –5.86) while preserving strong forgetting (VerbMem & KnowMem ≈ 0).
- Open‑source‑ready recipe: Provided a reproducible pipeline that can be plugged into existing PTQ toolchains (e.g., GPTQ, AWQ) with minimal code changes.
Methodology
- Baseline unlearning (full‑parameter fine‑tuning):
  - The entire LLM is fine‑tuned on a “forget” dataset, aiming to reduce the model’s ability to recall that data.
  - After fine‑tuning, the model is quantized to 4‑bit using a standard PTQ algorithm.
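The failure mode this sets up — tiny full‑parameter updates being rounded away by 4‑bit quantization — can be illustrated with a toy round‑to‑nearest int4 quantizer. This is a hand‑rolled simulation with made‑up magnitudes, not the paper's actual PTQ algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(0, 0.05, size=4096)   # one row of base weights (toy magnitudes)
scale = np.abs(w).max() / 7          # symmetric int4: codes in -7..7

def int4_codes(x):
    """Round-to-nearest symmetric 4-bit quantization codes."""
    return np.clip(np.round(x / scale), -7, 7).astype(int)

def survival(delta):
    """Fraction of weights whose 4-bit code changes after adding `delta`."""
    return np.mean(int4_codes(w + delta) != int4_codes(w))

# Full-parameter unlearning: a tiny update smeared over every weight.
tiny = rng.normal(0, 1e-4, size=w.size)
# A concentrated, larger-magnitude update of the kind adapters produce.
large = rng.normal(0, 1e-2, size=w.size)

print(f"quantization step: {scale:.4f}")
print(f"tiny updates change {survival(tiny):.1%} of 4-bit codes")
print(f"large updates change {survival(large):.1%} of 4-bit codes")
```

Because the per‑weight update is far smaller than the quantization step, almost every tiny change rounds back to the original 4‑bit code, while the larger concentrated update frequently crosses a rounding boundary and survives.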
- LoRA‑based unlearning:
  - Freeze the base model (the 7B Llama‑2 weights stay untouched).
  - Insert low‑rank adapter matrices (typically rank = 4–8) into each transformer layer.
  - Train only the adapters on the forget dataset. Because the updates are concentrated in the small adapter matrices rather than spread across billions of weights, their per‑parameter magnitude is orders of magnitude larger.
  - After adapter training, apply 4‑bit PTQ to the combined model (base + adapters). The adapters’ larger‑magnitude updates survive quantization, preserving the unlearning effect.
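The freeze‑and‑adapt step can be sketched as a standalone PyTorch module (written from scratch here rather than with a LoRA framework; the rank, alpha, and layer width are illustrative choices, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update: W x + s * B A x."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                        # base weights stay untouched
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no-op before training
        self.scaling = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(4096, 4096), rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable fraction: {trainable / total:.2%}")      # adapters only
```

Only `A` and `B` receive gradients during unlearning, so the entire update lives in a parameter set well under 1% of the layer's size.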
- Evaluation suite:
  - Utility: Scores for unlearning methods such as NPO (Negative Preference Optimization) and GA (Gradient Ascent), each regularized on retained data via GDR (gradient descent on the retain set) or KLR (KL‑divergence regularization), reported on the MUSE BOOKS and NEWS subsets.
  - Forgetting: Assessed via VerbMem (verbatim memorization) and KnowMem (knowledge memorization) – both should approach zero after successful unlearning.
  - Privacy leakage: Quantified with the PrivLeak metric (closer to 0 = less leakage).
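To make the verbatim‑memorization idea concrete, here is a toy continuation‑overlap score: prompt the model with a forget‑set prefix and compare its continuation to the true text. The LCS‑based `verbatim_score` below is a hypothetical stand‑in for the ROUGE‑style scoring the benchmark actually uses, not MUSE code:

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def verbatim_score(generated: str, reference: str) -> float:
    """1.0 = perfect verbatim recall of the reference continuation."""
    g, r = generated.split(), reference.split()
    return lcs_len(g, r) / max(len(r), 1)

# Before unlearning the model reproduces the passage; after, it should not.
ref = "the quick brown fox jumps over the lazy dog"
print(verbatim_score("the quick brown fox jumps over the lazy dog", ref))  # 1.0
print(verbatim_score("a completely unrelated sentence", ref))              # 0.0
```

Successful unlearning should drive this score toward zero on forget‑set prefixes while leaving it unchanged on retained data.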
The pipeline is deliberately lightweight: training LoRA adapters typically requires < 1 % of the compute of full‑model fine‑tuning, and the adapters add only a few megabytes to the model size.
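The "few megabytes" claim checks out on a back‑of‑the‑envelope calculation, assuming rank‑8 adapters on two attention projections per layer of a Llama‑2‑7B‑sized model (32 layers, hidden size 4096 are public model specs; which modules carry adapters is an assumption here):

```python
# Adapter size estimate for a Llama-2-7B-sized model.
hidden, rank, layers, adapted_modules = 4096, 8, 32, 2

params_per_module = 2 * rank * hidden          # A (r x d) + B (d x r)
total_adapter_params = params_per_module * adapted_modules * layers
size_mb = total_adapter_params * 2 / 2**20     # fp16 = 2 bytes per parameter

print(f"adapter parameters: {total_adapter_params:,}")   # ~4.2M
print(f"adapter size: {size_mb:.1f} MB vs ~13,000 MB for the fp16 base model")
```

About 8 MB of adapters against roughly 13 GB of base weights, consistent with the lightweight‑pipeline claim above.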
Results & Findings
| Benchmark | Method / Metric | Full‑param (4‑bit) | LoRA (4‑bit) | Δ |
|---|---|---|---|---|
| MUSE BOOKS | NPO+GDR (utility) | 50.17 | 58.10 | +7.93 |
| MUSE NEWS | GA+GDR (utility) | 40.06 | 44.82 | +4.76 |
| MUSE BOOKS | GA+KLR (PrivLeak) | –25.68 | –5.86 | +19.82 (much less leakage) |
| Forgetting | VerbMem / KnowMem | ≈ 0 (both) | ≈ 0 (both) | – |
Key takeaways
- Utility improves despite the aggressive 4‑bit quantization, indicating that LoRA adapters retain more of the model’s expressive power after unlearning.
- Privacy leakage drops dramatically, meaning that an adversary probing the quantized model is far less likely to recover the forgotten data.
- Training cost is slashed – LoRA adapters converge in a few hundred steps, whereas full‑parameter fine‑tuning can take thousands.
Practical Implications
- Edge & mobile deployment: Companies that ship LLM‑powered features on devices (e.g., on‑device assistants, code completion tools) can now comply with “right‑to‑be‑forgotten” requests without sacrificing the low‑memory footprint that quantization provides.
- Regulatory compliance: GDPR‑style data erasure mandates can be met more reliably because the unlearning effect survives the quantization step that is often mandatory for production inference pipelines.
- Cost‑effective model updates: Instead of re‑training or fine‑tuning the entire model each time a piece of data must be removed, teams can simply update a small set of adapters and re‑quantize, cutting GPU hours and cloud spend.
- Toolchain integration: The approach plugs into existing PTQ libraries (e.g., bitsandbytes, GPTQ) and LoRA frameworks (peft, loralib), making adoption straightforward for developers already familiar with these ecosystems.
Limitations & Future Work
- Scope limited to 4‑bit PTQ: The study focuses on 4‑bit quantization; behavior under more extreme quantization (e.g., 2‑bit) or mixed‑precision schemes remains unexplored.
- Adapter rank selection: While the paper uses a fixed low rank, optimal rank may vary across model sizes and downstream tasks; an automated rank‑search could improve robustness.
- Generalization to other architectures: Experiments are confined to Llama‑2‑7B; applying the method to encoder‑only models (e.g., BERT) or multimodal LLMs may require additional tweaks.
- Long‑term forgetting stability: The paper evaluates forgetting shortly after unlearning; future work should assess whether the effect persists after further fine‑tuning or continual learning cycles.
Authors
- João Vitor Boer Abitante
- Joana Meneguzzo Pasquali
- Luan Fonseca Garcia
- Ewerton de Oliveira
- Thomas da Silva Paula
- Rodrigo C. Barros
- Lucas S. Kupssinskü
Paper Information
- arXiv ID: 2602.13151v1
- Categories: cs.LG, cs.CL
- Published: February 13, 2026