[Paper] Fine-Tuning Without Forgetting In-Context Learning: A Theoretical Analysis of Linear Attention Models
Source: arXiv - 2602.23197v1
Overview
Large language models (LLMs) can “learn” a new task on the fly by seeing just a few examples in the prompt—a capability known as in‑context learning (ICL). Practitioners often fine‑tune these models to boost zero‑shot performance on specific downstream tasks, but doing so can unintentionally cripple the model’s ability to perform ICL on unseen tasks. This paper provides a clean theoretical lens—using linear attention models—to explain why fine‑tuning sometimes erases ICL and proposes simple remedies that keep both zero‑shot strength and few‑shot flexibility.
Key Contributions
- Theoretical characterization of how fine‑tuning modifies the three attention matrices (query, key, value) in linear attention models.
- Proof that updating all attention parameters can degrade ICL, whereas restricting updates to the value matrix preserves ICL and still improves zero‑shot performance.
- Analysis of an auxiliary few‑shot loss: adding a few‑shot objective during fine‑tuning helps the target task’s ICL but harms general ICL on other tasks.
- Empirical validation on synthetic and real‑world benchmarks confirming the theoretical predictions.
- Practical guidelines for developers who want to fine‑tune LLMs without sacrificing their prompt‑based adaptability.
Methodology
The authors focus on a linear attention variant of the Transformer, where the softmax over the query–key similarity is replaced by a linear kernel (e.g., using feature maps). This simplification makes the mathematics tractable while still capturing the core behavior of attention.
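To make the setup concrete, here is a minimal sketch of single‑head linear attention in NumPy. The function and variable names are illustrative, not from the paper, and the 1/n normalization is one common convention; the key point is simply that the softmax over query–key similarities is dropped, leaving a bilinear map.

```python
import numpy as np

def linear_attention(X, W_Q, W_K, W_V):
    """Single-head linear attention: softmax(Q K^T) is replaced by the
    raw dot-product Q K^T, making the output bilinear in the input."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V      # each (n, d)
    return (Q @ K.T) @ V / X.shape[0]         # (n, d); no softmax applied

rng = np.random.default_rng(0)
n, d = 8, 4                                   # toy sequence length and width
X = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))
out = linear_attention(X, W_Q, W_K, W_V)
```

Because there is no softmax, the output is linear in each of W_Q, W_K, and W_V separately, which is what makes the gradient‑descent analysis in the paper tractable.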
- Model decomposition – The attention operation is parameterized by three learnable matrices: W_Q (query), W_K (key), and W_V (value).
- Fine‑tuning objectives – Two regimes are studied:
- Standard fine‑tuning: minimize a task‑specific loss (e.g., cross‑entropy) by updating all three matrices.
- Constrained fine‑tuning: restrict gradient updates to the value matrix W_V only.
- Auxiliary few‑shot loss – An extra term that explicitly encourages good performance on a few‑shot version of the target task during fine‑tuning.
- Theoretical analysis – By tracking how the matrices evolve under gradient descent, the authors derive conditions under which the ICL kernel (the effective mapping from prompt examples to predictions) stays close to its pre‑fine‑tuned form.
- Experiments – Synthetic linear regression tasks and real LLM benchmarks (e.g., sentiment analysis, natural language inference) are used to test the predictions.
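The constrained (value‑only) regime can be sketched on a synthetic linear‑regression task. This is my own toy reconstruction, not the paper's code: W_Q and W_K are frozen, so the "prompt‑reading" kernel A is fixed, and gradient descent moves only W_V; the step size is set from the gradient's Lipschitz constant to guarantee the loss decreases.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 16, 4
X = rng.normal(size=(n, d))
Y = X @ rng.normal(size=(d, d))            # synthetic regression targets

W_Q, W_K = rng.normal(size=(d, d)), rng.normal(size=(d, d))  # frozen
W_V = np.zeros((d, d))                     # the only trainable parameter

A = (X @ W_Q) @ (X @ W_K).T / n            # fixed prompt-reading kernel
M = A @ X                                  # features actually seen by W_V
L_smooth = 2 * np.linalg.norm(M, 2) ** 2 / Y.size  # gradient Lipschitz const.
lr = 1.0 / L_smooth                        # step size ensuring monotone descent

loss_before = np.mean((M @ W_V - Y) ** 2)
for _ in range(200):
    err = M @ W_V - Y
    W_V -= lr * (2 / Y.size) * M.T @ err   # update W_V; W_Q, W_K untouched
loss_after = np.mean((M @ W_V - Y) ** 2)
```

Because the loss is a convex quadratic in W_V, this inner problem is well behaved; the paper's point is that the frozen A is exactly the part that implements ICL, so it survives the fine‑tuning intact.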
Results & Findings
| Setting | Zero‑shot performance | In‑context (few‑shot) performance on seen task | In‑context performance on unseen tasks |
|---|---|---|---|
| Pre‑trained (no fine‑tuning) | Baseline | Strong on many tasks | Strong |
| Full fine‑tuning (all matrices) | ↑ (task‑specific gain) | ↓ (degraded) | ↓ (significant drop) |
| Value‑only fine‑tuning | ↑ (similar gain) | ↔ (almost unchanged) | ↔ (preserved) |
| Fine‑tuning + auxiliary few‑shot loss | ↑↑ (best on target) | ↑ (improved on target) | ↓ (worse on other tasks) |
- Full fine‑tuning improves the target task’s zero‑shot accuracy but collapses the ICL kernel, making the model forget how to generalize from prompts.
- Value‑only updates achieve comparable zero‑shot gains without harming the ICL kernel, confirming the theoretical claim that the query/key matrices encode the “prompt‑reading” ability.
- Adding an auxiliary few‑shot loss further boosts ICL on the fine‑tuned task but at the cost of over‑specializing the prompt‑reading mechanism, hurting transfer to other tasks.
Practical Implications
- Fine‑tune with restraint – When you need a model that can both answer zero‑shot queries and still respond to few‑shot prompts, restrict gradient updates to the value matrix (or equivalently, freeze query/key layers). Many modern libraries already support layer‑wise learning‑rate schedules or parameter freezing, making this easy to implement.
- Auxiliary few‑shot loss as a trade‑off – If your product only cares about a single downstream task (e.g., a specialized chatbot), adding a few‑shot loss can give you the best of both worlds for that task. Just be aware you’ll lose general prompt flexibility.
- Model selection – Linear‑attention approximations (e.g., Performer, Linformer) are not just speed tricks; they expose a clean separation between “reading the prompt” (query/key) and “producing the answer” (value). This insight can guide architecture choices for latency‑critical services that still need ICL.
- Debugging fine‑tuned models – If a fine‑tuned LLM suddenly fails on few‑shot prompts, check whether query/key weights were unintentionally updated (e.g., due to optimizer bugs or weight decay). Re‑freezing them often restores ICL.
- Tooling – The paper’s theoretical formulas can be turned into diagnostic metrics (e.g., measuring the distance between pre‑ and post‑fine‑tuned query/key matrices) to automatically flag when ICL is being compromised.
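One such diagnostic can be sketched in a few lines: the relative Frobenius distance between a weight matrix before and after fine‑tuning. The threshold and names here are my own illustrative choices, not metrics defined in the paper.

```python
import numpy as np

def qk_drift(W_before, W_after):
    """Relative Frobenius distance between pre- and post-fine-tuning weights.
    A large value flags that prompt-reading (query/key) weights have moved."""
    return np.linalg.norm(W_after - W_before) / np.linalg.norm(W_before)

rng = np.random.default_rng(2)
W_Q_pre = rng.normal(size=(8, 8))
W_Q_post = W_Q_pre + 0.01 * rng.normal(size=(8, 8))  # hypothetical small update
drift = qk_drift(W_Q_pre, W_Q_post)
```

In practice one would compute this per attention layer after each fine‑tuning run and alert when query/key drift exceeds a tuned threshold, since value‑matrix drift is expected and benign under the paper's analysis.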
Limitations & Future Work
- Linear attention simplification – Real‑world LLMs use full softmax attention; while the authors argue the insights transfer, empirical confirmation on full‑scale Transformers remains an open step.
- Scope of tasks – Experiments focus on classification‑style benchmarks; generative tasks (code synthesis, story continuation) may exhibit different dynamics.
- Optimization dynamics – The analysis assumes standard gradient descent; alternative optimizers (Adam, LoRA adapters) could interact with the query/key/value separation in non‑trivial ways.
- Scalability of value‑only fine‑tuning – Freezing large query/key layers may limit the capacity to adapt to tasks that truly require representation changes (e.g., domain shift). Future work could explore hybrid schemes (partial fine‑tuning, low‑rank adapters).
By exposing the hidden trade‑off between zero‑shot gains and prompt‑based flexibility, this work equips developers with a principled roadmap for fine‑tuning LLMs that retain their hallmark in‑context learning ability.
Authors
- Chungpa Lee
- Jy‑yong Sohn
- Kangwook Lee
Paper Information
- arXiv ID: 2602.23197v1
- Categories: cs.CL, cs.LG, stat.ML
- Published: February 26, 2026