[Paper] Emergent temporal abstractions in autoregressive models enable hierarchical reinforcement learning

Published: December 23, 2025 at 01:51 PM EST
4 min read

Source: arXiv - 2512.20605v1

Overview

This paper shows that large autoregressive models (think GPT‑style language models) can learn temporal abstractions—high‑level “macro‑actions” that span many low‑level steps—by training a secondary, non‑causal controller that directly manipulates the model’s internal activations. By doing reinforcement learning inside the network rather than only on the token outputs, the authors achieve far more efficient exploration on tasks with sparse rewards, opening a path toward hierarchical RL built on top of foundation models.

Key Contributions

  • Internal‑RL framework: Introduces “internal reinforcement learning,” where a higher‑order controller directly influences the residual‑stream activations of a pretrained autoregressive model.
  • Temporal abstraction discovery: Demonstrates that the controller learns to compress long sequences of low‑level actions into compact latent controllers (sub‑policies) with learned termination conditions.
  • Hierarchical composition: Shows that chaining these latent controllers yields efficient exploration and rapid adaptation on new tasks.
  • Empirical validation: Provides experiments on grid‑world navigation and MuJoCo locomotion benchmarks that exhibit hierarchical structure, where standard token‑by‑token RL fails but internal‑RL succeeds.
  • Scalable design: The approach works with existing large‑scale pretrained models, requiring only a modest additional controller network and RL fine‑tuning.

Methodology

  1. Base model: Start with a large autoregressive transformer pretrained on next‑token prediction (e.g., a language model). Its hidden‑state “residual stream” is the target for manipulation.
  2. Higher‑order controller: A non‑causal sequence model (e.g., a bidirectional transformer) receives the current state and outputs a control vector at each timestep. This vector is added to the residual stream of the base model, effectively steering its internal dynamics.
  3. Latent actions: The controller’s outputs are interpreted as latent actions (or sub‑policies). Each latent action runs for a variable number of base‑model steps until a learned termination signal fires.
  4. Internal RL loop: The RL algorithm (e.g., PPO or SAC) operates on the latent actions, receiving rewards from the environment only after the latent action terminates. Gradients flow back through the controller into the base model, allowing the whole system to be fine‑tuned end‑to‑end.
  5. Training regime:
    • Pretrain the base model on large corpora (standard).
    • Freeze or lightly fine‑tune the base model while training the controller with RL on the target task.
    • Optionally unfreeze the base model later for joint optimization.

The key insight is that the controller can plan over longer horizons because it directly manipulates the internal representation, bypassing the need to generate each low‑level token/action sequentially.
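To make this loop concrete, here is a minimal PyTorch sketch of the kind of mechanism described above: a small controller emits a steering vector that is added to the base model’s residual stream and held until a learned termination head fires. The `base_model.step(obs, residual_offset=...)` hook, the network sizes, and the Gym-style environment interface are illustrative assumptions, not the paper’s actual implementation.

```python
import torch
import torch.nn as nn


class LatentController(nn.Module):
    """Illustrative higher-order controller: maps an observation encoding to
    (a) a steering vector added to the base model's residual stream and
    (b) a termination probability for the current latent action."""

    def __init__(self, obs_dim: int, d_model: int, hidden: int = 256):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.steer_head = nn.Linear(hidden, d_model)  # control vector
        self.term_head = nn.Linear(hidden, 1)         # termination logit

    def forward(self, obs):
        h = self.trunk(obs)
        return self.steer_head(h), torch.sigmoid(self.term_head(h))


def rollout_latent_action(base_model, controller, env, obs, max_steps=30):
    """Execute one latent action: keep injecting the same control vector into
    the frozen base model until the termination head fires or the episode ends."""
    steer, _ = controller(obs)
    total_reward, done = 0.0, False
    for _ in range(max_steps):
        # Assumed hook: the pretrained model decodes one low-level action while
        # `steer` is added to its residual stream (not a real API).
        low_level_action = base_model.step(obs, residual_offset=steer)
        next_obs, reward, done, _ = env.step(low_level_action)  # classic Gym 4-tuple
        obs = torch.as_tensor(next_obs, dtype=torch.float32)
        total_reward += reward
        _, p_term = controller(obs)
        if done or torch.bernoulli(p_term).item():
            break
    return obs, total_reward, done
```

An RL algorithm such as PPO would then treat `steer` as the action and `total_reward` as the return for that latent action, so credit assignment happens over macro-actions rather than individual low-level tokens.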

Results & Findings

  • 2‑D grid world (sparse goal): standard token‑by‑token RL fails to converge within 1M steps, while internal‑RL with latent controllers solves >90% of episodes in <200k steps, learning macro‑moves such as “go to corridor”.
  • MuJoCo Ant‑Maze (hierarchical navigation): standard RL stalls on the sparse reward; internal‑RL reaches the goal consistently, learning “walk straight”, “turn”, and “climb” sub‑policies whose controllers terminate after variable lengths (≈10–30 low‑level steps).
  • Transfer to an unseen maze layout: standard RL generalizes poorly; internal‑RL re‑uses the learned controllers and adapts quickly, demonstrating the compositionality of latent actions.

Overall, the internal‑RL agents achieve 2–5× faster learning on sparse‑reward tasks and exhibit interpretable sub‑behaviors that align with human‑designed primitives.

Practical Implications

  • Faster RL fine‑tuning of foundation models: Developers can adapt large language or multimodal models to RL tasks (e.g., robotics, game AI) without the costly token‑by‑token exploration overhead.
  • Hierarchical skill libraries: The latent controllers act as reusable “skills” that can be stored, shared, and composed across projects, reducing the need to train from scratch for each new environment.
  • Improved sample efficiency for sparse‑reward problems: Industries like autonomous navigation, warehouse robotics, or dialogue systems (where success signals are rare) can benefit from quicker convergence.
  • Interpretability & debugging: Since each controller corresponds to a semantically meaningful chunk of behavior, engineers can inspect, edit, or replace specific sub‑policies without retraining the whole model.
  • Compatibility with existing pipelines: The method plugs into standard RL libraries (e.g., RLlib, Stable‑Baselines) and works with any pretrained autoregressive transformer, making adoption relatively low‑friction.
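As a rough illustration of that last point, the sketch below wraps an environment and a frozen base model so that an off‑the‑shelf agent (e.g., Stable‑Baselines3 PPO) sees the control vector itself as the action. The `base_model.step` hook, the action bounds, and the fixed inner‑step budget (standing in for the learned termination head) are assumptions made for illustration, not code from the paper.

```python
import gymnasium as gym
import numpy as np


class LatentActionEnv(gym.Wrapper):
    """Illustrative wrapper: exposes the control vector as the action space,
    so a standard RL agent optimizes latent actions while the frozen base
    model decodes the low-level actions inside step()."""

    def __init__(self, env, base_model, d_model, max_inner_steps=30):
        super().__init__(env)
        self.base_model = base_model
        self.max_inner_steps = max_inner_steps  # stand-in for a learned termination head
        self.action_space = gym.spaces.Box(
            low=-1.0, high=1.0, shape=(d_model,), dtype=np.float32
        )

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        self._last_obs = obs
        return obs, info

    def step(self, control_vector):
        total_reward, terminated, truncated, info = 0.0, False, False, {}
        obs = self._last_obs
        for _ in range(self.max_inner_steps):
            # Assumed hook: decode one low-level action with the control vector
            # added to the base model's residual stream (not a real API).
            low_level = self.base_model.step(obs, residual_offset=control_vector)
            obs, reward, terminated, truncated, info = self.env.step(low_level)
            total_reward += reward
            if terminated or truncated:
                break
        self._last_obs = obs
        return obs, total_reward, terminated, truncated, info
```

From the RL library’s perspective this is an ordinary continuous‑control environment; only the meaning of an “action” has shifted from a single low‑level step to a multi‑step latent controller.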

Limitations & Future Work

  • Controller size & training cost: Adding a non‑causal sequence model introduces extra parameters and memory overhead, which may be prohibitive for very large base models.
  • Non‑causal assumption: The higher‑order controller relies on future context (bidirectional attention), limiting its use in strictly online settings where future observations aren’t available.
  • Task specificity: Experiments focus on environments with clear hierarchical structure; performance on highly stochastic or non‑hierarchical tasks remains unclear.
  • Scalability to multimodal foundations: Extending internal‑RL to vision‑language or audio‑language models will require careful handling of heterogeneous latent spaces.

Future research directions include lightweight controller architectures, online‑compatible variants, and scaling the approach to real‑world robotics platforms where safety and latency are critical.

Authors

  • Seijin Kobayashi
  • Yanick Schimpf
  • Maximilian Schlegel
  • Angelika Steger
  • Maciej Wolczyk
  • Johannes von Oswald
  • Nino Scherre
  • Kaitlin Maile
  • Guillaume Lajoie
  • Blake A. Richards
  • Rif A. Saurous
  • James Manyika
  • Blaise Agüera y Arcas
  • Alexander Meulemans
  • João Sacramento

Paper Information

  • arXiv ID: 2512.20605v1
  • Categories: cs.LG, cs.AI
  • Published: December 23, 2025
