[Paper] Every Token Counts: Generalizing 16M Ultra-Long Context in Large Language Models
Source: arXiv - 2511.23319v1
Overview
The paper “Every Token Counts: Generalizing 16M Ultra-Long Context in Large Language Models” tackles a fundamental bottleneck in today’s LLMs: the inability to retain and reason over extremely long sequences of text. By introducing a hierarchical sparse attention (HSA) mechanism, the authors build an 8-billion-parameter mixture-of-experts (MoE) model that can efficiently process contexts of up to 16 million tokens (roughly a hundred full-length books’ worth of text) while still delivering strong performance on standard benchmarks.
Key Contributions
- Hierarchical Sparse Attention (HSA): A novel attention design that combines sparsity, random‑access flexibility, and length‑generalization, enabling efficient scaling to ultra‑long contexts.
- HSA-UltraLong model: An 8B-parameter MoE transformer trained on >8 trillion tokens, capable of handling context windows of up to 16M tokens.
- Comprehensive evaluation: Demonstrates parity with full-attention baselines on in-domain lengths and >90% accuracy on diverse in-context retrieval tasks across both in-domain and out-of-domain ultra-long sequences.
- Open‑source insights: Provides detailed experimental analysis and a roadmap of open problems for future ultra‑long context research.
Methodology
- Problem framing: The authors frame “memory” in LLMs as requiring three properties:
  - Sparsity: Only a small subset of tokens should attend to each other, reducing the quadratic cost of dense attention.
  - Random-access flexibility: The model must be able to retrieve any token on demand, not just within a fixed sliding window.
  - Length generalization: Training on one context length should transfer to much longer sequences at inference.
- Hierarchical Sparse Attention (HSA) (see the illustrative sketch after this list):
  - Local layer: Standard dense attention within short windows (e.g., 1k tokens) captures fine-grained relationships.
  - Global layer: Sparse “summary” tokens are generated for each local block; these summaries attend to each other, allowing information to propagate across the entire sequence with only O(N) cost.
  - Random access: Any token can be retrieved by traversing the hierarchy, akin to a tree lookup, preserving flexibility.
- Model architecture: HSA replaces the vanilla self-attention blocks in a Transformer-MoE backbone. The MoE routing further scales capacity without a proportional increase in compute.
- Training regime:
  - Data: >8 trillion tokens from diverse web corpora, ensuring exposure to long documents.
  - Curriculum: Training starts with shorter contexts and progressively increases the context length, encouraging length generalization.
  - Optimization: Standard AdamW with mixture-of-experts balancing, plus regularization to keep the sparse attention stable.
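The paper is summarized here without code, so the snippet below is only a minimal sketch of the general “local window plus retrieved chunks” attention pattern described under Hierarchical Sparse Attention above. The mean-pooled chunk summaries, single attention head, top-k chunk selection, and all constants are illustrative assumptions, not details confirmed by the paper.

```python
# Illustrative sketch of a hierarchical sparse attention pattern.
# NOT the paper's exact HSA; pooling, head count, and top-k selection are assumptions.
import torch
import torch.nn.functional as F


def hierarchical_sparse_attention(q, k, v, chunk_size=128, window=256, top_k=4):
    """q, k, v: [seq_len, dim] single-head tensors for one sequence.

    Each query attends to (1) a causal local window of recent tokens and
    (2) all earlier tokens inside the top_k chunks whose mean-pooled summary
    is most similar to the query. Everything else is masked out, so the number
    of attended positions per query stays roughly constant as seq_len grows.
    """
    seq_len, dim = q.shape
    n_chunks = (seq_len + chunk_size - 1) // chunk_size

    # Chunk-level summaries (here: simple mean pooling over the keys of each chunk).
    pad = n_chunks * chunk_size - seq_len
    k_padded = F.pad(k, (0, 0, 0, pad))
    summaries = k_padded.view(n_chunks, chunk_size, dim).mean(dim=1)  # [n_chunks, dim]

    # Score each query against the chunk summaries and keep the top_k chunks.
    # (For brevity, chunk selection ignores causality; the token-level mask below restores it.)
    chunk_scores = q @ summaries.T                                     # [seq_len, n_chunks]
    top_chunks = chunk_scores.topk(min(top_k, n_chunks), dim=-1).indices  # [seq_len, top_k]

    # Build the sparse attention mask: causal local window + tokens in selected chunks.
    pos = torch.arange(seq_len)
    causal = pos[None, :] <= pos[:, None]
    local = causal & (pos[:, None] - pos[None, :] < window)
    token_chunk = pos // chunk_size                                    # chunk id of every key position
    in_selected = (token_chunk[None, None, :] == top_chunks[:, :, None]).any(dim=1)
    allowed = local | (in_selected & causal)

    # Masked softmax attention over the allowed positions only. For clarity this toy
    # version still materializes the full score matrix; a real implementation would
    # compute scores only for the allowed (sparse) positions.
    scores = (q @ k.T) / dim ** 0.5
    scores = scores.masked_fill(~allowed, float("-inf"))
    return F.softmax(scores, dim=-1) @ v


if __name__ == "__main__":
    torch.manual_seed(0)
    x = torch.randn(1024, 64)
    out = hierarchical_sparse_attention(x, x, x)
    print(out.shape)  # torch.Size([1024, 64])
```

The point of the sketch is the access pattern, not efficiency: every token remains reachable through its chunk summary (random access), yet each query touches only a bounded set of positions (sparsity), which is what allows the budget to stay fixed as the context grows.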
Results & Findings
| Evaluation | Context Length | Full‑Attention Baseline | HSA‑UltraLong |
|---|---|---|---|
| Language Modeling (perplexity) | 2k–8k | Comparable | Comparable |
| In-context Retrieval (accuracy) | 1M–16M | Degrades sharply | >90% across the board |
| Zero-shot QA (long documents) | 4M | Fails (out of memory) | Successful, near-baseline quality |
| Out-of-domain (legal contracts, codebases) | 8M–16M | Unusable | Robust, retains >85% performance |
Takeaway: HSA‑UltraLong matches dense attention on short‑range tasks while dramatically outperforming it on ultra‑long contexts, confirming that sparsity plus hierarchical routing preserves essential information without the quadratic blow‑up.
Practical Implications
- Enterprise document processing: Companies can feed entire policy manuals, legal contracts, or code repositories (tens of MB) into a single forward pass, enabling accurate retrieval, summarization, or QA without chunking.
- Long‑form content generation: Authors and developers can prompt the model with a full draft (e.g., a novel manuscript) and receive coherent continuation or editing suggestions that respect earlier chapters.
- Tooling for developers: IDEs could integrate a single LLM that indexes an entire codebase (millions of lines) for context‑aware autocomplete, refactoring, or bug‑explanation, reducing the need for external vector stores.
- Cost‑effective scaling: Because HSA keeps compute roughly linear in token count, cloud providers can offer “ultra‑long context” endpoints at a fraction of the cost of naive dense‑attention models.
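To make the “roughly linear” cost claim concrete, the back-of-the-envelope arithmetic below compares dense attention at a 16M-token context against a fixed per-token attention budget. The budget of 1,024 attended positions per token is an illustrative assumption, not a figure from the paper.

```python
# Back-of-the-envelope comparison of query-key score counts (illustrative numbers).
seq_len = 16_000_000               # 16M-token context
dense_scores = seq_len ** 2        # dense attention: every token scores every other token
budget = 1_024                     # assumed fixed positions attended per token (window + retrieved chunks)
sparse_scores = seq_len * budget   # sparse attention with a constant per-token budget

print(f"dense : {dense_scores:.2e} query-key scores")   # ~2.56e14
print(f"sparse: {sparse_scores:.2e} query-key scores")  # ~1.64e10
print(f"ratio : {dense_scores / sparse_scores:,.0f}x fewer scores with a fixed budget")
```

Under these assumptions the sparse pattern computes on the order of 15,000 times fewer scores at 16M tokens, which is the gap that makes ultra-long-context endpoints economically plausible.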
Limitations & Future Work
- Memory footprint: Although memory scales linearly, the hierarchical representation still requires several GB of GPU memory at 16M tokens, limiting deployment on commodity hardware.
- Latency: The two‑stage (local + global) attention adds a modest overhead compared to pure dense attention on short sequences; optimizing the hierarchy for low‑latency inference remains an open challenge.
- Generalization to multimodal data: The current work focuses on pure text; extending HSA to vision‑language or audio streams will need additional research.
- Robustness to adversarial prompts: Sparse attention may miss rare long‑range dependencies; future work should explore hybrid schemes that dynamically densify attention when needed.
Bottom line: By proving that “every token counts” even at the 16‑million‑token scale, this paper opens the door for LLMs that truly remember—turning the dream of a single, unified model for books, codebases, and massive logs into a practical reality.
Authors
- Xiang Hu
- Zhanchao Zhou
- Ruiqi Liang
- Zehuan Li
- Wei Wu
- Jianguo Li
Paper Information
- arXiv ID: 2511.23319v1
- Categories: cs.CL, cs.AI
- Published: November 28, 2025