[Paper] Every Token Counts: Generalizing 16M Ultra-Long Context in Large Language Models

Published: November 28, 2025 at 11:17 AM EST
4 min read
Source: arXiv - 2511.23319v1

Overview

The paper “Every Token Counts: Generalizing 16M Ultra‑Long Context in Large Language Models” tackles a fundamental bottleneck in today’s LLMs: the inability to retain and reason over extremely long sequences of text. By introducing a hierarchical sparse attention (HSA) mechanism, the authors build an 8‑billion‑parameter mixture‑of‑experts (MoE) model that can efficiently process up to 16 million tokens (more than a hundred novels’ worth of text) while still delivering strong performance on standard benchmarks.

Key Contributions

  • Hierarchical Sparse Attention (HSA): A novel attention design that combines sparsity, random‑access flexibility, and length‑generalization, enabling efficient scaling to ultra‑long contexts.
  • HSA‑UltraLong model: An 8B‑parameter MoE transformer trained on >8 trillion tokens, capable of handling context windows up to 16M tokens.
  • Comprehensive evaluation: Demonstrates parity with full‑attention baselines on in‑domain lengths and >90% accuracy on diverse in‑context retrieval tasks across both in‑domain and out‑of‑domain ultra‑long sequences.
  • Open‑source insights: Provides detailed experimental analysis and a roadmap of open problems for future ultra‑long context research.

Methodology

  1. Problem framing: The authors frame long‑context “memory” in LLMs as requiring three properties:

    • Sparsity: Only a small subset of tokens should attend to each other, reducing quadratic cost.
    • Random‑access flexibility: The model must retrieve any token on demand, not just in a fixed sliding‑window fashion.
    • Length generalization: Training on one context length should transfer to much longer sequences at inference.
  2. Hierarchical Sparse Attention (HSA):

    • Local layer: Standard dense attention within short windows (e.g., 1k tokens) to capture fine‑grained relationships.
    • Global layer: Sparse “summary” tokens are generated for each local block; these summaries attend to each other, allowing information to propagate across the entire sequence with only O(N) cost.
    • Random access: Any token can be retrieved by traversing the hierarchy, akin to a tree lookup, preserving flexibility. A minimal code sketch of this local/global split appears after this list.
  3. Model architecture: HSA replaces the vanilla self‑attention blocks in a Transformer‑MoE backbone. The MoE routing further scales capacity without a proportional increase in compute.

  4. Training regime:

    • Data: >8 trillion tokens from diverse web corpora, ensuring exposure to long documents.
    • Curriculum: Starts with shorter contexts and progressively increases length, encouraging length generalization (an illustrative schedule is sketched after this list).
    • Optimization: Standard AdamW with mixture‑of‑experts balancing, plus regularization to keep the sparse attention stable.
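
To make the local/global split described in step 2 concrete, here is a minimal PyTorch sketch. It is an illustration under assumptions, not the paper's HSA implementation: the mean‑pooled block summaries, the shared tensor used as query, key, and value, and the block size are choices made for readability, and it omits causal masking, learned projections, MoE routing, and the random‑access (tree‑lookup) retrieval.

```python
import torch
import torch.nn.functional as F

def local_global_attention(x: torch.Tensor, block_size: int = 1024) -> torch.Tensor:
    """Toy local+global attention. x: (seq_len, d_model); seq_len must be a
    multiple of block_size (pad in practice)."""
    seq_len, d_model = x.shape
    assert seq_len % block_size == 0, "pad the sequence to a multiple of block_size"
    blocks = x.reshape(seq_len // block_size, block_size, d_model)

    # Local layer: dense attention restricted to each block, so the cost grows
    # as O(seq_len * block_size) instead of O(seq_len^2).
    local = F.scaled_dot_product_attention(blocks, blocks, blocks)

    # Global layer: one "summary" token per block (here a simple mean pool);
    # summaries attend to each other, letting information cross blocks at a
    # cost tied to the number of blocks rather than the number of tokens.
    summaries = blocks.mean(dim=1).unsqueeze(0)      # (1, n_blocks, d_model)
    mixed = F.scaled_dot_product_attention(summaries, summaries, summaries)

    # Broadcast each block's globally mixed summary back onto its own tokens.
    out = local + mixed.squeeze(0).unsqueeze(1)      # (n_blocks, block_size, d_model)
    return out.reshape(seq_len, d_model)

# Example: an 8k-token sequence split into 1k-token blocks.
y = local_global_attention(torch.randn(8192, 64), block_size=1024)
print(y.shape)  # torch.Size([8192, 64])
```

Note that the sketch's global step is still quadratic in the number of blocks; the paper's hierarchical design is what pushes the overall cost toward linear. In the real model these blocks replace the vanilla self‑attention layers of the Transformer‑MoE backbone.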

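The length curriculum in step 4 can be pictured as a staged schedule that widens the context window as training progresses. The stage boundaries and token‑budget fractions below are hypothetical placeholders, not the paper's actual recipe.

```python
# Hypothetical length-curriculum schedule: (max context length, share of the
# total training-token budget). All numbers are illustrative assumptions.
CURRICULUM = [
    (4_096,      0.50),
    (32_768,     0.25),
    (262_144,    0.15),
    (2_097_152,  0.07),
    (16_777_216, 0.03),
]

def max_context_for(tokens_seen: float, total_tokens: float) -> int:
    """Pick the context length for the current point in training."""
    progress, cumulative = tokens_seen / total_tokens, 0.0
    for max_len, fraction in CURRICULUM:
        cumulative += fraction
        if progress < cumulative:
            return max_len
    return CURRICULUM[-1][0]

print(max_context_for(tokens_seen=6e12, total_tokens=8e12))  # 262144
```
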
Results & Findings

| Evaluation | Context Length | Full‑Attention Baseline | HSA‑UltraLong |
| --- | --- | --- | --- |
| Language modeling (perplexity) | 2k–8k | Comparable | Comparable |
| In‑context retrieval (accuracy) | 1M–16M | Degrades sharply | >90% across the board |
| Zero‑shot QA (long documents) | 4M | Fails (out of memory) | Successful, near‑baseline quality |
| Out‑of‑domain (legal contracts, codebases) | 8M–16M | Unusable | Robust, retains >85% performance |

Takeaway: HSA‑UltraLong matches dense attention on short‑range tasks while dramatically outperforming it on ultra‑long contexts, confirming that sparsity plus hierarchical routing preserves essential information without the quadratic blow‑up.

Practical Implications

  • Enterprise document processing: Companies can feed entire policy manuals, legal contracts, or code repositories (tens of MB) into a single forward pass, enabling accurate retrieval, summarization, or QA without chunking.
  • Long‑form content generation: Authors and developers can prompt the model with a full draft (e.g., a novel manuscript) and receive coherent continuation or editing suggestions that respect earlier chapters.
  • Tooling for developers: IDEs could integrate a single LLM that indexes an entire codebase (millions of lines) for context‑aware autocomplete, refactoring, or bug explanation, reducing the need for external vector stores (see the illustrative sketch after this list).
  • Cost‑effective scaling: Because HSA keeps compute roughly linear in token count, cloud providers can offer “ultra‑long context” endpoints at a fraction of the cost of naive dense‑attention models.
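
As a purely illustrative picture of the chunk‑free workflow described above, the sketch below packs an entire repository into a single prompt. `UltraLongModel` and its `generate` call are hypothetical stand‑ins, not a published API, and a real deployment would still need tokenization, truncation safeguards, and file filtering.

```python
# Chunk-free codebase QA sketch; the model interface here is hypothetical.
from pathlib import Path

def build_repo_prompt(repo_root: str, question: str) -> str:
    """Concatenate every Python file into one prompt: with a 16M-token window
    there is no need to chunk files or maintain an external vector store."""
    parts = []
    for path in sorted(Path(repo_root).rglob("*.py")):
        parts.append(f"# FILE: {path}\n{path.read_text(errors='ignore')}")
    return "\n\n".join(parts) + f"\n\nQuestion: {question}\nAnswer:"

# prompt = build_repo_prompt("./my_project", "Where is the retry logic configured?")
# answer = UltraLongModel.generate(prompt, max_new_tokens=256)  # hypothetical API
```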

Limitations & Future Work

  • Memory footprint: Although the cost scales linearly, the hierarchical representation still requires several GB of GPU memory for 16M tokens, limiting deployment on commodity hardware (a back‑of‑envelope estimate follows this list).
  • Latency: The two‑stage (local + global) attention adds a modest overhead compared to pure dense attention on short sequences; optimizing the hierarchy for low‑latency inference remains an open challenge.
  • Generalization to multimodal data: The current work focuses on pure text; extending HSA to vision‑language or audio streams will need additional research.
  • Robustness to adversarial prompts: Sparse attention may miss rare long‑range dependencies; future work should explore hybrid schemes that dynamically densify attention when needed.
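
To see why even a compressed hierarchy lands in the multi‑gigabyte range at 16M tokens, here is a back‑of‑envelope estimate. Every architectural number below (layer count, KV heads, head dimension, block size, fp16 storage) is an assumption chosen for illustration and is not taken from the paper.

```python
# Rough KV-cache arithmetic: dense attention vs. one summary vector per block.
TOKENS   = 16_000_000   # target context length
LAYERS   = 32           # assumed transformer depth
KV_HEADS = 8            # assumed grouped-query KV heads
HEAD_DIM = 128          # assumed head dimension
BYTES    = 2            # fp16 storage
BLOCK    = 1024         # assumed tokens summarized per hierarchy node

per_token     = 2 * KV_HEADS * HEAD_DIM * BYTES * LAYERS   # keys + values
dense_cache   = TOKENS * per_token                         # full attention
summary_cache = (TOKENS // BLOCK) * per_token              # one summary per block

print(f"dense KV cache  : {dense_cache / 1e12:.1f} TB")    # ~2.1 TB
print(f"summary cache   : {summary_cache / 1e9:.2f} GB")   # ~2.05 GB
```

Under these assumptions the summaries alone already occupy a couple of gigabytes, before counting local‑window KV states, activations, or the 8B parameters themselves, which is consistent with the several‑GB figure above.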

Bottom line: By proving that “every token counts” even at the 16‑million‑token scale, this paper opens the door for LLMs that truly remember—turning the dream of a single, unified model for books, codebases, and massive logs into a practical reality.

Authors

  • Xiang Hu
  • Zhanchao Zhou
  • Ruiqi Liang
  • Zehuan Li
  • Wei Wu
  • Jianguo Li

Paper Information

  • arXiv ID: 2511.23319v1
  • Categories: cs.CL, cs.AI
  • Published: November 28, 2025
  • PDF: https://arxiv.org/pdf/2511.23319v1