[Paper] You Only Index Once: Cross-Layer Sparse Attention with Shared Routing

Published: June 4, 2026 at 01:54 PM EDT
Source: arXiv - 2606.06467v1

Overview

Long-context inference in modern LLMs is increasingly constrained by decoding efficiency, especially in reasoning‑heavy settings where models generate long intermediate chains of thought. Existing sparse attention methods often face a practical efficiency‑quality trade‑off. Structured block sparse methods typically provide stronger acceleration but incur noticeable quality loss, while token sparse methods are usually more accurate yet deliver limited end‑to‑end speedup because top‑k routing over the full cache remains expensive.

In this work, we propose cross‑layer sparse attention (CLSA), which is built on top of KV‑sharing architectures such as YOCO. The core idea is to share not only the KV cache across cross‑decoder layers, but also the routing index. A single indexer computes token‑level top‑k selection once and reuses the resulting index across layers, thereby preserving the fine‑grained selectivity of token sparse attention while amortizing the routing overhead. The resulting architecture addresses the major inference bottlenecks jointly: pre‑filling, KV‑cache storage, and long‑context decoding.
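The shared-routing idea can be illustrated with a small standard-library Python sketch. It is not the authors' implementation: the function names (`top_k_indices`, `sparse_attention`, `decode_step`), the dot-product indexer, and the toy data are all assumptions made for illustration, since the abstract does not specify how the indexer scores tokens. The sketch only shows the structure described above: one top‑k selection per decoding step, reused by every layer that reads the shared KV cache.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def top_k_indices(index_query, index_keys, k):
    """Score every cached token once and return the positions of the k best.

    Assumption: a plain dot-product score stands in for the paper's indexer.
    """
    scores = [dot(index_query, key) for key in index_keys]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def sparse_attention(query, keys, values, selected):
    """Softmax attention restricted to the selected cache positions."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, keys[i]) * scale for i in selected]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(value - peak) for value in logits]
    total = sum(weights)
    dim = len(values[0])
    return [
        sum(w * values[i][d] for w, i in zip(weights, selected)) / total
        for d in range(dim)
    ]


def decode_step(layer_queries, index_query, index_keys, keys, values, k):
    """One decoding step: route once, then reuse the index in every layer."""
    selected = top_k_indices(index_query, index_keys, k)  # computed once
    outputs = [sparse_attention(q, keys, values, selected) for q in layer_queries]
    return selected, outputs


# Toy shared KV cache with six tokens (hypothetical data).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0], [0.0, -1.0], [2.0, 2.0]]
values = [[float(i), 1.0] for i in range(6)]
layer_queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # one query per layer

selected, outputs = decode_step(
    layer_queries, index_query=[1.0, 1.0], index_keys=keys, keys=keys, values=values, k=2
)
print(selected)
print(len(outputs))
```

In this sketch the full-cache scan inside `top_k_indices` runs once per step regardless of how many layers consume the result, whereas a per-layer token sparse design would repeat that scan in every layer. Reusing the attention keys as `index_keys` is a simplification for the demo only.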

Experiments across short‑context and long‑context benchmarks show that CLSA is both accurate and efficient, achieving up to 7.6× decoding speedup and 17.1× overall throughput improvement at 128K context. These results suggest a more complete architectural solution for long‑context LLMs that jointly advances model quality and inference efficiency.

Key Contributions

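  • Cross‑layer sparse attention (CLSA), built on KV‑sharing architectures such as YOCO, which shares the routing index as well as the KV cache across cross‑decoder layers.
  • A single indexer that computes token‑level top‑k selection once and reuses it across layers, amortizing routing overhead while keeping token‑level selectivity.
  • Reported gains of up to 7.6× decoding speedup and 17.1× overall throughput improvement at 128K context.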

Methodology

Please refer to the full paper for detailed methodology.

Practical Implications

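For long‑context LLM serving, particularly reasoning‑heavy workloads that generate long chains of thought, the abstract suggests that sharing one routing index across layers can reduce decoding cost without giving up token‑level sparse attention.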

Authors

  • Yutao Sun
  • Yanqi Zhang
  • Li Dong
  • Jianyong Wang
  • Furu Wei

Paper Information

  • arXiv ID: 2606.06467v1
  • Categories: cs.CL, cs.AI, cs.LG
  • Published: June 4, 2026