[Paper] Every Token Counts: Generalizing 16M Ultra-Long Context in Large Language Models
Source: arXiv - 2511.23319v1
Overview
The paper “Every Token Counts: Generalizing 16M Ultra-Long Context in Large Language Models” tackles a fundamental bottleneck in today’s LLMs: the inability to retain and reason over extremely long sequences of text. By introducing a hierarchical sparse attention (HSA) mechanism, the authors build an 8-billion-parameter mixture-of-experts (MoE) model that can efficiently process contexts of up to 16 million tokens (roughly a hundred full-length books’ worth of text) while still delivering strong performance on standard benchmarks.
Key Contributions
- Hierarchical Sparse Attention (HSA): A novel attention design that combines sparsity, random‑access flexibility, and length‑generalization, enabling efficient scaling to ultra‑long contexts.
- HSA-UltraLong model: An 8B-parameter MoE transformer trained on >8 trillion tokens, capable of handling context windows of up to 16M tokens.
- Comprehensive evaluation: Demonstrates parity with full-attention baselines on in-domain lengths and >90% accuracy on diverse in-context retrieval tasks across both in-domain and out-of-domain ultra-long sequences.
- Open‑source insights: Provides detailed experimental analysis and a roadmap of open problems for future ultra‑long context research.
Methodology
- Problem framing: The authors frame “memory” in LLMs as requiring three properties:
  - Sparsity: Only a small subset of tokens should attend to each other, reducing the quadratic cost of dense attention.
  - Random-access flexibility: The model must be able to retrieve any token on demand, not just within a fixed sliding window.
  - Length generalization: Training on one context length should transfer to much longer sequences at inference.
- Hierarchical Sparse Attention (HSA) (see the illustrative sketch after this list):
  - Local layer: Standard dense attention within short windows (e.g., 1k tokens) captures fine-grained relationships.
  - Global layer: Sparse “summary” tokens are generated for each local block; these summaries attend to each other, allowing information to propagate across the entire sequence with only O(N) cost.
  - Random access: Any token can be retrieved by traversing the hierarchy, akin to a tree lookup, preserving flexibility.
- Model architecture: HSA replaces the vanilla self-attention blocks in a Transformer-MoE backbone. The MoE routing further scales capacity without a proportional increase in compute.
- Training regime:
  - Data: >8 trillion tokens from diverse web corpora, ensuring exposure to long documents.
  - Curriculum: Training starts with shorter contexts and progressively increases the context length, encouraging length generalization.
  - Optimization: Standard AdamW with mixture-of-experts balancing, plus regularization to keep the sparse attention stable.
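The paper is summarized here without code, so the snippet below is only a minimal sketch of the general “local window plus retrieved chunks” attention pattern described under Hierarchical Sparse Attention above. The mean-pooled chunk summaries, single attention head, top-k chunk selection, and all constants are illustrative assumptions, not details confirmed by the paper.

```python
# Illustrative sketch of a hierarchical sparse attention pattern.
# NOT the paper's exact HSA; pooling, head count, and top-k selection are assumptions.
import torch
import torch.nn.functional as F


def hierarchical_sparse_attention(q, k, v, chunk_size=128, window=256, top_k=4):
    """q, k, v: [seq_len, dim] single-head tensors for one sequence.

    Each query attends to (1) a causal local window of recent tokens and
    (2) all earlier tokens inside the top_k chunks whose mean-pooled summary
    is most similar to the query. Everything else is masked out, so the number
    of attended positions per query stays roughly constant as seq_len grows.
    """
    seq_len, dim = q.shape
    n_chunks = (seq_len + chunk_size - 1) // chunk_size

    # Chunk-level summaries (here: simple mean pooling over the keys of each chunk).
    pad = n_chunks * chunk_size - seq_len
    k_padded = F.pad(k, (0, 0, 0, pad))
    summaries = k_padded.view(n_chunks, chunk_size, dim).mean(dim=1)  # [n_chunks, dim]

    # Score each query against the chunk summaries and keep the top_k chunks.
    # (For brevity, chunk selection ignores causality; the token-level mask below restores it.)
    chunk_scores = q @ summaries.T                                     # [seq_len, n_chunks]
    top_chunks = chunk_scores.topk(min(top_k, n_chunks), dim=-1).indices  # [seq_len, top_k]

    # Build the sparse attention mask: causal local window + tokens in selected chunks.
    pos = torch.arange(seq_len)
    causal = pos[None, :] <= pos[:, None]
    local = causal & (pos[:, None] - pos[None, :] < window)
    token_chunk = pos // chunk_size                                    # chunk id of every key position
    in_selected = (token_chunk[None, None, :] == top_chunks[:, :, None]).any(dim=1)
    allowed = local | (in_selected & causal)

    # Masked softmax attention over the allowed positions only. For clarity this toy
    # version still materializes the full score matrix; a real implementation would
    # compute scores only for the allowed (sparse) positions.
    scores = (q @ k.T) / dim ** 0.5
    scores = scores.masked_fill(~allowed, float("-inf"))
    return F.softmax(scores, dim=-1) @ v


if __name__ == "__main__":
    torch.manual_seed(0)
    x = torch.randn(1024, 64)
    out = hierarchical_sparse_attention(x, x, x)
    print(out.shape)  # torch.Size([1024, 64])
```

The point of the sketch is the access pattern, not efficiency: every token remains reachable through its chunk summary (random access), yet each query touches only a bounded set of positions (sparsity), which is what allows the budget to stay fixed as the context grows.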
Results & Findings
| Evaluation | Context Length | Full‑Attention Baseline | HSA‑UltraLong |
|---|---|---|---|
| Language Modeling (perplexity) | 2k–8k | Comparable | Comparable |
| In-context Retrieval (accuracy) | 1M–16M | Degrades sharply | >90% across the board |
| Zero-shot QA (long documents) | 4M | Fails (out of memory) | Successful, near-baseline quality |
| Out-of-domain (legal contracts, codebases) | 8M–16M | Unusable | Robust, retains >85% performance |
Takeaway: HSA‑UltraLong matches dense attention on short‑range tasks while dramatically outperforming it on ultra‑long contexts, confirming that sparsity plus hierarchical routing preserves essential information without the quadratic blow‑up.
Practical Implications
- Enterprise document processing: Companies can feed entire policy manuals, legal contracts, or code repositories (tens of MB) into a single forward pass, enabling accurate retrieval, summarization, or QA without chunking.
- Long‑form content generation: Authors and developers can prompt the model with a full draft (e.g., a novel manuscript) and receive coherent continuation or editing suggestions that respect earlier chapters.
- Tooling for developers: IDEs could integrate a single LLM that indexes an entire codebase (millions of lines) for context‑aware autocomplete, refactoring, or bug‑explanation, reducing the need for external vector stores.
- Cost‑effective scaling: Because HSA keeps compute roughly linear in token count, cloud providers can offer “ultra‑long context” endpoints at a fraction of the cost of naive dense‑attention models.
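To make the “roughly linear” cost claim concrete, the back-of-the-envelope arithmetic below compares dense attention at a 16M-token context against a fixed per-token attention budget. The budget of 1,024 attended positions per token is an illustrative assumption, not a figure from the paper.

```python
# Back-of-the-envelope comparison of query-key score counts (illustrative numbers).
seq_len = 16_000_000               # 16M-token context
dense_scores = seq_len ** 2        # dense attention: every token scores every other token
budget = 1_024                     # assumed fixed positions attended per token (window + retrieved chunks)
sparse_scores = seq_len * budget   # sparse attention with a constant per-token budget

print(f"dense : {dense_scores:.2e} query-key scores")   # ~2.56e14
print(f"sparse: {sparse_scores:.2e} query-key scores")  # ~1.64e10
print(f"ratio : {dense_scores / sparse_scores:,.0f}x fewer scores with a fixed budget")
```

Under these assumptions the sparse pattern computes on the order of 15,000 times fewer scores at 16M tokens, which is the gap that makes ultra-long-context endpoints economically plausible.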
Limitations & Future Work
- Memory footprint: Although memory scales linearly, the hierarchical representation still requires several GB of GPU memory at 16M tokens, limiting deployment on commodity hardware.
- Latency: The two‑stage (local + global) attention adds a modest overhead compared to pure dense attention on short sequences; optimizing the hierarchy for low‑latency inference remains an open challenge.
- Generalization to multimodal data: The current work focuses on pure text; extending HSA to vision‑language or audio streams will need additional research.
- Robustness to adversarial prompts: Sparse attention may miss rare long‑range dependencies; future work should explore hybrid schemes that dynamically densify attention when needed.
Bottom line: By proving that “every token counts” even at the 16‑million‑token scale, this paper opens the door for LLMs that truly remember—turning the dream of a single, unified model for books, codebases, and massive logs into a practical reality.
Authors
- Xiang Hu
- Zhanchao Zhou
- Ruiqi Liang
- Zehuan Li
- Wei Wu
- Jianguo Li
Paper Information
- arXiv ID: 2511.23319v1
- Categories: cs.CL, cs.AI
- Published: November 28, 2025