[Paper] Fine-Tuning Without Forgetting In-Context Learning: A Theoretical Analysis of Linear Attention Models
Source: arXiv - 2602.23197v1
Overview
Large language models (LLMs) can “learn” a new task on the fly by seeing just a few examples in the prompt—a capability known as in‑context learning (ICL). Practitioners often fine‑tune these models to boost zero‑shot performance on specific downstream tasks, but doing so can unintentionally cripple the model’s ability to perform ICL on unseen tasks. This paper provides a clean theoretical lens—using linear attention models—to explain why fine‑tuning sometimes erases ICL and proposes simple remedies that keep both zero‑shot strength and few‑shot flexibility.
Key Contributions
- Theoretical characterization of how fine‑tuning modifies the three attention matrices (query, key, value) in linear attention models.
- Proof that updating all attention parameters can degrade ICL, whereas restricting updates to the value matrix preserves ICL and still improves zero‑shot performance.
- Analysis of an auxiliary few‑shot loss: adding a few‑shot objective during fine‑tuning helps the target task’s ICL but harms general ICL on other tasks.
- Empirical validation on synthetic and real‑world benchmarks confirming the theoretical predictions.
- Practical guidelines for developers who want to fine‑tune LLMs without sacrificing their prompt‑based adaptability.
Methodology
The authors focus on a linear attention variant of the Transformer, where the softmax over the query–key similarity is replaced by a linear kernel (e.g., using feature maps). This simplification makes the mathematics tractable while still capturing the core behavior of attention.
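To make the setup concrete, here is a minimal sketch of single‑head linear attention in NumPy. The function and variable names are illustrative, not from the paper, and the 1/n normalization is one common convention; the key point is simply that the softmax over query–key similarities is dropped, leaving a bilinear map.

```python
import numpy as np

def linear_attention(X, W_Q, W_K, W_V):
    """Single-head linear attention: softmax(Q K^T) is replaced by the
    raw dot-product Q K^T, making the output bilinear in the input."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V      # each (n, d)
    return (Q @ K.T) @ V / X.shape[0]         # (n, d); no softmax applied

rng = np.random.default_rng(0)
n, d = 8, 4                                   # toy sequence length and width
X = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))
out = linear_attention(X, W_Q, W_K, W_V)
```

Because there is no softmax, the output is linear in each of W_Q, W_K, and W_V separately, which is what makes the gradient‑descent analysis in the paper tractable.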
- Model decomposition – The attention operation is parameterized by three learnable matrices: W_Q (query), W_K (key), and W_V (value).
- Fine‑tuning objectives – Two regimes are studied:
- Standard fine‑tuning: minimize a task‑specific loss (e.g., cross‑entropy) by updating all three matrices.
- Constrained fine‑tuning: restrict gradient updates to the value matrix W_V only.
- Auxiliary few‑shot loss – An extra term that explicitly encourages good performance on a few‑shot version of the target task during fine‑tuning.
- Theoretical analysis – By tracking how the matrices evolve under gradient descent, the authors derive conditions under which the ICL kernel (the effective mapping from prompt examples to predictions) stays close to its pre‑fine‑tuned form.
- Experiments – Synthetic linear regression tasks and real LLM benchmarks (e.g., sentiment analysis, natural language inference) are used to test the predictions.
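The constrained (value‑only) regime can be sketched on a synthetic linear‑regression task. This is my own toy reconstruction, not the paper's code: W_Q and W_K are frozen, so the "prompt‑reading" kernel A is fixed, and gradient descent moves only W_V; the step size is set from the gradient's Lipschitz constant to guarantee the loss decreases.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 16, 4
X = rng.normal(size=(n, d))
Y = X @ rng.normal(size=(d, d))            # synthetic regression targets

W_Q, W_K = rng.normal(size=(d, d)), rng.normal(size=(d, d))  # frozen
W_V = np.zeros((d, d))                     # the only trainable parameter

A = (X @ W_Q) @ (X @ W_K).T / n            # fixed prompt-reading kernel
M = A @ X                                  # features actually seen by W_V
L_smooth = 2 * np.linalg.norm(M, 2) ** 2 / Y.size  # gradient Lipschitz const.
lr = 1.0 / L_smooth                        # step size ensuring monotone descent

loss_before = np.mean((M @ W_V - Y) ** 2)
for _ in range(200):
    err = M @ W_V - Y
    W_V -= lr * (2 / Y.size) * M.T @ err   # update W_V; W_Q, W_K untouched
loss_after = np.mean((M @ W_V - Y) ** 2)
```

Because the loss is a convex quadratic in W_V, this inner problem is well behaved; the paper's point is that the frozen A is exactly the part that implements ICL, so it survives the fine‑tuning intact.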
Results & Findings
| Setting | Zero‑shot performance | In‑context (few‑shot) performance on seen task | In‑context performance on unseen tasks |
|---|---|---|---|
| Pre‑trained (no fine‑tuning) | Baseline | Strong on many tasks | Strong |
| Full fine‑tuning (all matrices) | ↑ (task‑specific gain) | ↓ (degraded) | ↓ (significant drop) |
| Value‑only fine‑tuning | ↑ (similar gain) | ↔ (almost unchanged) | ↔ (preserved) |
| Fine‑tuning + auxiliary few‑shot loss | ↑↑ (best on target) | ↑ (improved on target) | ↓ (worse on other tasks) |
- Full fine‑tuning improves the target task’s zero‑shot accuracy but collapses the ICL kernel, making the model forget how to generalize from prompts.
- Value‑only updates achieve comparable zero‑shot gains without harming the ICL kernel, confirming the theoretical claim that the query/key matrices encode the “prompt‑reading” ability.
- Adding an auxiliary few‑shot loss further boosts ICL on the fine‑tuned task but at the cost of over‑specializing the prompt‑reading mechanism, hurting transfer to other tasks.
Practical Implications
- Fine‑tune with restraint – When you need a model that can both answer zero‑shot queries and still respond to few‑shot prompts, restrict gradient updates to the value matrix (or equivalently, freeze query/key layers). Many modern libraries already support layer‑wise learning‑rate schedules or parameter freezing, making this easy to implement.
- Auxiliary few‑shot loss as a trade‑off – If your product only cares about a single downstream task (e.g., a specialized chatbot), adding a few‑shot loss can give you the best of both worlds for that task. Just be aware you’ll lose general prompt flexibility.
- Model selection – Linear‑attention approximations (e.g., Performer, Linformer) are not just speed tricks; they expose a clean separation between “reading the prompt” (query/key) and “producing the answer” (value). This insight can guide architecture choices for latency‑critical services that still need ICL.
- Debugging fine‑tuned models – If a fine‑tuned LLM suddenly fails on few‑shot prompts, check whether query/key weights were unintentionally updated (e.g., due to optimizer bugs or weight decay). Re‑freezing them often restores ICL.
- Tooling – The paper’s theoretical formulas can be turned into diagnostic metrics (e.g., measuring the distance between pre‑ and post‑fine‑tuned query/key matrices) to automatically flag when ICL is being compromised.
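One such diagnostic can be sketched in a few lines: the relative Frobenius distance between a weight matrix before and after fine‑tuning. The threshold and names here are my own illustrative choices, not metrics defined in the paper.

```python
import numpy as np

def qk_drift(W_before, W_after):
    """Relative Frobenius distance between pre- and post-fine-tuning weights.
    A large value flags that prompt-reading (query/key) weights have moved."""
    return np.linalg.norm(W_after - W_before) / np.linalg.norm(W_before)

rng = np.random.default_rng(2)
W_Q_pre = rng.normal(size=(8, 8))
W_Q_post = W_Q_pre + 0.01 * rng.normal(size=(8, 8))  # hypothetical small update
drift = qk_drift(W_Q_pre, W_Q_post)
```

In practice one would compute this per attention layer after each fine‑tuning run and alert when query/key drift exceeds a tuned threshold, since value‑matrix drift is expected and benign under the paper's analysis.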
Limitations & Future Work
- Linear attention simplification – Real‑world LLMs use full softmax attention; while the authors argue the insights transfer, empirical confirmation on full‑scale Transformers remains an open step.
- Scope of tasks – Experiments focus on classification‑style benchmarks; generative tasks (code synthesis, story continuation) may exhibit different dynamics.
- Optimization dynamics – The analysis assumes standard gradient descent; alternative optimizers (Adam, LoRA adapters) could interact with the query/key/value separation in non‑trivial ways.
- Scalability of value‑only fine‑tuning – Freezing large query/key layers may limit the capacity to adapt to tasks that truly require representation changes (e.g., domain shift). Future work could explore hybrid schemes (partial fine‑tuning, low‑rank adapters).
By exposing the hidden trade‑off between zero‑shot gains and prompt‑based flexibility, this work equips developers with a principled roadmap for fine‑tuning LLMs that retain their hallmark in‑context learning ability.
Authors
- Chungpa Lee
- Jy‑yong Sohn
- Kangwook Lee
Paper Information
- arXiv ID: 2602.23197v1
- Categories: cs.CL, cs.LG, stat.ML
- Published: February 26, 2026