[Paper] Emergent temporal abstractions in autoregressive models enable hierarchical reinforcement learning

Published: December 23, 2025 at 01:51 PM EST
4 min read

Source: arXiv - 2512.20605v1

Overview

This paper shows that large autoregressive models (think GPT‑style language models) can learn temporal abstractions—high‑level “macro‑actions” that span many low‑level steps—by training a secondary, non‑causal controller that directly manipulates the model’s internal activations. By doing reinforcement learning inside the network rather than only on the token outputs, the authors achieve far more efficient exploration on tasks with sparse rewards, opening a path toward hierarchical RL built on top of foundation models.

Key Contributions

  • Internal‑RL framework: Introduces “internal reinforcement learning,” where a higher‑order controller directly influences the residual‑stream activations of a pretrained autoregressive model.
  • Temporal abstraction discovery: Demonstrates that the controller learns to compress long sequences of low‑level actions into compact latent controllers (sub‑policies) with learned termination conditions.
  • Hierarchical composition: Shows that chaining these latent controllers yields efficient exploration and rapid adaptation on new tasks.
  • Empirical validation: Provides experiments on grid‑world navigation and MuJoCo locomotion benchmarks that exhibit hierarchical structure, where standard token‑by‑token RL fails but internal‑RL succeeds.
  • Scalable design: The approach works with existing large‑scale pretrained models, requiring only a modest additional controller network and RL fine‑tuning.

Methodology

  1. Base model: Start with a large autoregressive transformer pretrained on next‑token prediction (e.g., a language model). Its hidden‑state “residual stream” is the target for manipulation.
  2. Higher‑order controller: A non‑causal sequence model (e.g., a bidirectional transformer) receives the current state and outputs a control vector at each timestep. This vector is added to the residual stream of the base model, effectively steering its internal dynamics.
  3. Latent actions: The controller’s outputs are interpreted as latent actions (or sub‑policies). Each latent action runs for a variable number of base‑model steps until a learned termination signal fires.
  4. Internal RL loop: The RL algorithm (e.g., PPO or SAC) operates on the latent actions, receiving rewards from the environment only after the latent action terminates. Gradients flow back through the controller into the base model, allowing the whole system to be fine‑tuned end‑to‑end.
  5. Training regime:
    • Pretrain the base model on large corpora (standard).
    • Freeze or lightly fine‑tune the base model while training the controller with RL on the target task.
    • Optionally unfreeze the base model later for joint optimization.

The key insight is that the controller can plan over longer horizons because it directly manipulates the internal representation, bypassing the need to generate each low‑level token/action sequentially.
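To make this loop concrete, here is a minimal PyTorch sketch of the kind of mechanism described above: a small controller emits a steering vector that is added to the base model’s residual stream and held until a learned termination head fires. The `base_model.step(obs, residual_offset=...)` hook, the network sizes, and the Gym-style environment interface are illustrative assumptions, not the paper’s actual implementation.

```python
import torch
import torch.nn as nn


class LatentController(nn.Module):
    """Illustrative higher-order controller: maps an observation encoding to
    (a) a steering vector added to the base model's residual stream and
    (b) a termination probability for the current latent action."""

    def __init__(self, obs_dim: int, d_model: int, hidden: int = 256):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.steer_head = nn.Linear(hidden, d_model)  # control vector
        self.term_head = nn.Linear(hidden, 1)         # termination logit

    def forward(self, obs):
        h = self.trunk(obs)
        return self.steer_head(h), torch.sigmoid(self.term_head(h))


def rollout_latent_action(base_model, controller, env, obs, max_steps=30):
    """Execute one latent action: keep injecting the same control vector into
    the frozen base model until the termination head fires or the episode ends."""
    steer, _ = controller(obs)
    total_reward, done = 0.0, False
    for _ in range(max_steps):
        # Assumed hook: the pretrained model decodes one low-level action while
        # `steer` is added to its residual stream (not a real API).
        low_level_action = base_model.step(obs, residual_offset=steer)
        next_obs, reward, done, _ = env.step(low_level_action)  # classic Gym 4-tuple
        obs = torch.as_tensor(next_obs, dtype=torch.float32)
        total_reward += reward
        _, p_term = controller(obs)
        if done or torch.bernoulli(p_term).item():
            break
    return obs, total_reward, done
```

An RL algorithm such as PPO would then treat `steer` as the action and `total_reward` as the return for that latent action, so credit assignment happens over macro-actions rather than individual low-level tokens.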

Results & Findings

  • 2‑D grid world (sparse goal): standard token‑by‑token RL fails to converge within 1M steps, while internal‑RL with latent controllers solves >90% of episodes in <200k steps, learning macro‑moves such as “go to corridor”.
  • MuJoCo Ant‑Maze (hierarchical navigation): standard RL stalls on the sparse reward; internal‑RL reaches the goal consistently, learning “walk straight”, “turn”, and “climb” sub‑policies whose controllers terminate after variable lengths (≈10–30 low‑level steps).
  • Transfer to an unseen maze layout: standard RL generalizes poorly; internal‑RL re‑uses the learned controllers and adapts quickly, demonstrating the compositionality of latent actions.

Overall, the internal‑RL agents achieve 2–5× faster learning on sparse‑reward tasks and exhibit interpretable sub‑behaviors that align with human‑designed primitives.

Practical Implications

  • Faster RL fine‑tuning of foundation models: Developers can adapt large language or multimodal models to RL tasks (e.g., robotics, game AI) without the costly token‑by‑token exploration overhead.
  • Hierarchical skill libraries: The latent controllers act as reusable “skills” that can be stored, shared, and composed across projects, reducing the need to train from scratch for each new environment.
  • Improved sample efficiency for sparse‑reward problems: Industries like autonomous navigation, warehouse robotics, or dialogue systems (where success signals are rare) can benefit from quicker convergence.
  • Interpretability & debugging: Since each controller corresponds to a semantically meaningful chunk of behavior, engineers can inspect, edit, or replace specific sub‑policies without retraining the whole model.
  • Compatibility with existing pipelines: The method plugs into standard RL libraries (e.g., RLlib, Stable‑Baselines) and works with any pretrained autoregressive transformer, making adoption relatively low‑friction.
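As a rough illustration of that last point, the sketch below wraps an environment and a frozen base model so that an off‑the‑shelf agent (e.g., Stable‑Baselines3 PPO) sees the control vector itself as the action. The `base_model.step` hook, the action bounds, and the fixed inner‑step budget (standing in for the learned termination head) are assumptions made for illustration, not code from the paper.

```python
import gymnasium as gym
import numpy as np


class LatentActionEnv(gym.Wrapper):
    """Illustrative wrapper: exposes the control vector as the action space,
    so a standard RL agent optimizes latent actions while the frozen base
    model decodes the low-level actions inside step()."""

    def __init__(self, env, base_model, d_model, max_inner_steps=30):
        super().__init__(env)
        self.base_model = base_model
        self.max_inner_steps = max_inner_steps  # stand-in for a learned termination head
        self.action_space = gym.spaces.Box(
            low=-1.0, high=1.0, shape=(d_model,), dtype=np.float32
        )

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        self._last_obs = obs
        return obs, info

    def step(self, control_vector):
        total_reward, terminated, truncated, info = 0.0, False, False, {}
        obs = self._last_obs
        for _ in range(self.max_inner_steps):
            # Assumed hook: decode one low-level action with the control vector
            # added to the base model's residual stream (not a real API).
            low_level = self.base_model.step(obs, residual_offset=control_vector)
            obs, reward, terminated, truncated, info = self.env.step(low_level)
            total_reward += reward
            if terminated or truncated:
                break
        self._last_obs = obs
        return obs, total_reward, terminated, truncated, info
```

From the RL library’s perspective this is an ordinary continuous‑control environment; only the meaning of an “action” has shifted from a single low‑level step to a multi‑step latent controller.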

Limitations & Future Work

  • Controller size & training cost: Adding a non‑causal sequence model introduces extra parameters and memory overhead, which may be prohibitive for very large base models.
  • Non‑causal assumption: The higher‑order controller relies on future context (bidirectional attention), limiting its use in strictly online settings where future observations aren’t available.
  • Task specificity: Experiments focus on environments with clear hierarchical structure; performance on highly stochastic or non‑hierarchical tasks remains unclear.
  • Scalability to multimodal foundations: Extending internal‑RL to vision‑language or audio‑language models will require careful handling of heterogeneous latent spaces.

Future research directions include lightweight controller architectures, online‑compatible variants, and scaling the approach to real‑world robotics platforms where safety and latency are critical.

Authors

  • Seijin Kobayashi
  • Yanick Schimpf
  • Maximilian Schlegel
  • Angelika Steger
  • Maciej Wolczyk
  • Johannes von Oswald
  • Nino Scherre
  • Kaitlin Maile
  • Guillaume Lajoie
  • Blake A. Richards
  • Rif A. Saurous
  • James Manyika
  • Blaise Agüera y Arcas
  • Alexander Meulemans
  • João Sacramento

Paper Information

  • arXiv ID: 2512.20605v1
  • Categories: cs.LG, cs.AI
  • Published: December 23, 2025
