[Paper] Attention Amnesia in Hybrid LLMs: When CoT Fine-Tuning Breaks Long-Range Recall, and How to Fix It

Published: June 9, 2026 at 12:17 PM EDT

Source: arXiv - 2606.11052v1

Overview

Chain-of-thought (CoT) supervised fine-tuning (SFT) is widely adopted to improve reasoning ability, yet we find that it systematically degrades long-context recall in hybrid linear-attention models. Across architectures including HypeNet and Jet-Nemotron, retrieval performance on Needle-In-A-Haystack (NIAH) deteriorates substantially after CoT-SFT, and the degradation becomes more severe under harder retrieval settings and longer context windows. For example, HypeNet-9B on NIAH-S2@256K decreases from $67.2\%$ to $9.4\%$. We attribute this to CoT-SFT biasing attention gradients toward short-range patterns, disrupting the query-key projections ($W_Q$, $W_K$) that are responsible for long-range routing. Motivated by this observation, we propose QK-Restore, a training-free method that restores only $W_Q$ and $W_K$ from the pre-SFT checkpoint while preserving all other post-SFT parameters. We further introduce a Procrustes variant to balance routing preservation and reasoning adaptation. Across architectures, QK-Restore consistently restores long-context capability at zero training cost while preserving reasoning performance; for instance, on HypeNet-5B it improves S3@256K from $65.4\%$ to $76.4\%$ while maintaining strong reasoning performance.

Key Contributions

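According to the abstract, the paper (arXiv category cs.CL):

Shows that CoT-SFT systematically degrades long-context recall in hybrid linear-attention models such as HypeNet and Jet-Nemotron, with larger losses under harder retrieval settings and longer context windows.
Attributes the degradation to CoT-SFT biasing attention gradients toward short-range patterns, which disrupts the query-key projections ($W_Q$, $W_K$) responsible for long-range routing.
Proposes QK-Restore, a training-free method that restores only $W_Q$ and $W_K$ from the pre-SFT checkpoint and keeps all other post-SFT parameters.
Introduces a Procrustes variant that balances routing preservation against reasoning adaptation.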

Methodology

Please refer to the full paper for detailed methodology.
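
The overview describes the core operation of QK-Restore as restoring only $W_Q$ and $W_K$ from the pre-SFT checkpoint while keeping every other post-SFT parameter. The following is a minimal sketch of that merge over two parameter dictionaries, not the authors' implementation. The function name `qk_restore`, the parameter names `q_proj` and `k_proj`, and the toy values are assumptions made for illustration; this summary does not say how the checkpoints name their projections, or whether the paper restores them in every layer or only in some.

```python
from typing import Any, Dict, Tuple


def qk_restore(
    pre_sft: Dict[str, Any],
    post_sft: Dict[str, Any],
    qk_names: Tuple[str, ...] = ("q_proj", "k_proj"),  # assumed naming, not from the paper
) -> Dict[str, Any]:
    """Return the post-SFT parameters with W_Q / W_K taken from the pre-SFT checkpoint."""
    merged = dict(post_sft)  # shallow copy, so the post-SFT dict is left untouched
    for name in post_sft:
        # Compare whole dotted components so that, for example, a gate called
        # "gk_proj" is not mistaken for the key projection "k_proj".
        if any(part in qk_names for part in name.split(".")):
            if name not in pre_sft:
                raise KeyError(f"{name!r} is missing from the pre-SFT checkpoint")
            merged[name] = pre_sft[name]
    return merged


# Toy checkpoints: plain lists stand in for weight tensors.
pre_sft = {
    "layers.0.attn.q_proj.weight": [1.0, 2.0],
    "layers.0.attn.k_proj.weight": [3.0, 4.0],
    "layers.0.attn.v_proj.weight": [5.0, 6.0],
    "layers.0.mlp.up_proj.weight": [7.0, 8.0],
}
post_sft = {name: [v + 0.5 for v in values] for name, values in pre_sft.items()}

restored = qk_restore(pre_sft, post_sft)
for name, values in restored.items():
    origin = "pre-SFT" if values == pre_sft[name] else "post-SFT"
    print(f"{name}: {origin}")
```

With PyTorch models, the same function could in principle be applied to the two `state_dict()` results and the merged dictionary passed to `load_state_dict`, provided both checkpoints share parameter names and shapes. No gradient step is involved, which is consistent with the method being described as training-free.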

Practical Implications

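For practitioners who fine-tune hybrid linear-attention models on CoT data, the results suggest re-checking long-context retrieval after SFT and keeping the pre-SFT checkpoint, since QK-Restore needs only that checkpoint's $W_Q$ and $W_K$ and no further training.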

Authors

Xinyu Zhou
Boyu Zhu
Yi Xu
Zhiwei Li
Yingfa Chen
Huiming Wang
Zhijiang Guo

Paper Information

arXiv ID: 2606.11052v1
Categories: cs.CL
Published: June 9, 2026
