[Paper] End-to-End Test-Time Training for Long Context

Published: December 29, 2025 at 01:30 PM EST
4 min read
Source: arXiv - 2512.23675v1

Overview

The paper reframes long‑context language modeling as a continual‑learning problem instead of relying on ever‑larger attention mechanisms. By letting a standard Transformer with sliding‑window attention keep learning at inference time—predicting the next token on the fly—it compresses the massive context directly into its weights. A meta‑learning step during training prepares the model for this test‑time adaptation, yielding an End‑to‑End Test‑Time Training (TTT‑E2E) approach that scales like full‑attention Transformers while keeping inference latency constant.

Key Contributions

  • Continual‑learning formulation for long‑context LM: treats the incoming context as a stream that the model continuously updates on.
  • Test‑time training loop that performs next‑token prediction on the current context, effectively writing the context into the model’s parameters.
  • Meta‑learning pre‑training that optimizes the model’s initial weights for rapid adaptation during inference.
  • Empirical scaling study up to 3 B‑parameter models trained on 164 B tokens, showing TTT‑E2E matches full‑attention scaling but with constant inference cost.
  • Speed advantage: 2.7× faster than a full‑attention Transformer for a 128 K token window, matching the latency profile of RNN‑style models.
  • Open‑source release of code and training recipes, enabling reproducibility and community extensions.

Methodology

  1. Base Architecture – A vanilla Transformer equipped only with sliding‑window attention (e.g., a 4 K‑token window). This keeps the per‑step compute bounded regardless of total context length.

  2. Meta‑learning pre‑training – Before standard language‑model training, the authors run a meta‑learning phase (similar to MAML). The objective is to find an initialization that can be quickly fine‑tuned with a few gradient steps on a new chunk of text.

  3. Test‑time training loop – At inference, the model processes the input sequentially. For each new token, it:

    • Performs a forward pass to predict the next token (standard LM loss).
    • Takes a gradient step on that loss, updating its own weights in‑place.
    • Slides the attention window forward, discarding the oldest tokens.

    In effect, the model “writes” the long context into its parameters as it moves forward, so later predictions benefit from the entire history without ever attending to it directly (a minimal sketch of this loop appears after the list).

  4. End‑to‑End (E2E) training – The same gradient‑based update used at test time is also part of the training objective, ensuring the model learns to improve itself while generating text; a conceptual sketch of this outer loop also follows the list.
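
A minimal PyTorch-style sketch of the test-time training loop described in step 3. Everything here is illustrative rather than the authors' released code: the model is assumed to be a generic causal LM that maps token ids to logits, plain SGD stands in for whatever update rule and learning-rate schedule the paper uses, and a real implementation would cache activations and batch updates over chunks instead of recomputing the window for every token.

```python
import torch
import torch.nn.functional as F

def test_time_training(model, token_ids, window=4096, lr=1e-4, device="cpu"):
    """Stream a long input through a sliding-window LM, taking one gradient
    step per incoming token so earlier context is absorbed into the weights
    instead of being re-attended to.  `model` is assumed to map (batch, seq)
    token ids to (batch, seq, vocab) logits."""
    model.to(device).train()
    # Plain SGD keeps per-parameter optimizer state minimal; the paper's
    # actual update rule may differ.
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    buf = []  # sliding attention window of recent token ids

    for tok in token_ids:
        buf.append(int(tok))
        if len(buf) > window:
            buf.pop(0)        # slide the window: discard the oldest token
        if len(buf) < 2:
            continue          # need at least one context token to predict from

        ids = torch.tensor(buf, dtype=torch.long, device=device).unsqueeze(0)
        logits = model(ids[:, :-1])                           # forward pass over the window
        loss = F.cross_entropy(logits[:, -1, :], ids[:, -1])  # next-token LM loss
        opt.zero_grad()
        loss.backward()       # one gradient step: "write" the new token into the weights
        opt.step()
    return model
```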
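
The end-to-end objective in steps 2 and 4 can be pictured as a MAML-style outer loop: simulate the test-time update on one chunk of text, score the adapted weights on the following chunk, and backpropagate that outer loss into the shared initialization. The sketch below is a conceptual rendering of that idea with `torch.func`; the single inner step, the chunking, and the inner learning rate are assumptions, not the paper's exact recipe.

```python
import torch.nn.functional as F
from torch.func import functional_call, grad

def lm_loss(model, params, ids):
    """Next-token cross-entropy computed with an explicit parameter dict."""
    logits = functional_call(model, params, (ids[:, :-1],))
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           ids[:, 1:].reshape(-1))

def e2e_outer_loss(model, params, chunk_a, chunk_b, inner_lr=1e-4):
    """Adapt on chunk_a with one simulated test-time SGD step, then evaluate
    the adapted weights on chunk_b.  Differentiating this outer loss w.r.t.
    `params` trains the initialization to adapt well at inference time."""
    inner_grads = grad(lambda p: lm_loss(model, p, chunk_a))(params)
    adapted = {k: v - inner_lr * inner_grads[k] for k, v in params.items()}
    return lm_loss(model, adapted, chunk_b)

# Conceptual usage: take params = {k: v.detach() for k, v in model.named_parameters()},
# compute outer_grads = grad(lambda p: e2e_outer_loss(model, p, a, b))(params),
# and apply outer_grads to the initialization with any standard optimizer.
```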

Results & Findings

| Model | Context length | Per‑token latency | Scaling trend (performance vs. length) |
|---|---|---|---|
| Full‑attention Transformer (baseline) | up to 128 K | grows linearly with length | Performance improves with length, but latency explodes |
| Mamba‑2 / Gated‑DeltaNet | up to 128 K | ~constant | Performance plateaus early; unable to exploit very long context |
| TTT‑E2E (this work) | up to 128 K | constant (RNN‑like) | Matches full‑attention scaling; perplexity keeps dropping as context grows |

  • For a 3 B‑parameter model trained on 164 B tokens, TTT‑E2E achieved the same perplexity reduction as a full‑attention Transformer when increasing context from 8 K to 128 K tokens.
  • Inference speed: at 128 K context, TTT‑E2E was 2.7× faster than the full‑attention baseline while delivering comparable quality.
  • Ablation studies confirmed that both the meta‑learning initialization and the test‑time gradient updates are essential; removing either degrades scaling behavior.

Practical Implications

  • Cost‑effective long‑context LMs – Developers can deploy a modest‑size Transformer (e.g., 3 B parameters) and still reap the benefits of 100 K‑token contexts without the memory blow‑up of full attention.
  • Real‑time applications – Chatbots, code assistants, or document‑analysis tools that need to ingest large bodies of text can maintain low latency, making them suitable for interactive settings.
  • Edge deployment – Since the per‑step compute stays bounded, the approach is amenable to hardware with limited memory (e.g., GPUs with 16 GB VRAM or even specialized inference chips).
  • Continual learning pipelines – The test‑time training loop can be extended to adapt to domain‑specific vocabularies or user‑specific data on the fly, opening doors for personalized LMs without full fine‑tuning.
  • Compatibility – No exotic architectures are required; existing Transformer codebases can be retro‑fitted with the meta‑learning and test‑time update hooks.

Limitations & Future Work

  • Gradient overhead – Although latency stays constant, each token still incurs a backward pass, which can be heavier on GPUs lacking efficient mixed‑precision autograd pipelines.
  • Stability of online updates – The method relies on careful learning‑rate scheduling at inference; mis‑tuned rates can cause drift or degrade performance on noisy inputs.
  • Memory for optimizer states – Storing per‑parameter optimizer moments (e.g., Adam) during test‑time training adds a modest memory footprint.
  • Scaling beyond 3 B – The paper focuses on models up to 3 B parameters; it remains to be seen how the approach behaves for 10 B+ models, where optimizer state size becomes a bottleneck (see the back-of-the-envelope estimate after this list).
  • Future directions suggested by the authors include: exploring more lightweight update rules (e.g., SGD or low‑rank adapters), integrating retrieval‑augmented mechanisms to further boost long‑range reasoning, and applying the framework to multimodal sequences (audio/video) where context length is even more critical.
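
As a back-of-the-envelope illustration of the optimizer-state concern above (the figures are my own arithmetic, not numbers from the paper): Adam keeps two fp32 moment tensors per parameter, so the extra memory for test-time training grows linearly with model size, which is why lighter update rules such as plain SGD or low-rank adapters are attractive.

```python
def adam_state_gib(n_params: float, bytes_per_moment: int = 4, n_moments: int = 2) -> float:
    """Rough extra memory (GiB) for Adam's exp_avg / exp_avg_sq buffers alone."""
    return n_params * bytes_per_moment * n_moments / 2**30

# Illustrative sizes, not figures reported in the paper:
print(f"3 B-parameter model:  ~{adam_state_gib(3e9):.0f} GiB of optimizer state")   # ~22 GiB
print(f"10 B-parameter model: ~{adam_state_gib(10e9):.0f} GiB of optimizer state")  # ~75 GiB
```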

Authors

  • Arnuv Tandon
  • Karan Dalal
  • Xinhao Li
  • Daniel Koceja
  • Marcel Rød
  • Sam Buchanan
  • Xiaolong Wang
  • Jure Leskovec
  • Sanmi Koyejo
  • Tatsunori Hashimoto
  • Carlos Guestrin
  • Jed McCaleb
  • Yejin Choi
  • Yu Sun

Paper Information

  • arXiv ID: 2512.23675v1
  • Categories: cs.LG
  • Published: December 29, 2025