[Paper] Learning Latent Action World Models In The Wild

Published: January 8, 2026 at 01:55 PM EST
4 min read

Source: arXiv - 2601.05230v1

Overview

The paper “Learning Latent Action World Models In The Wild” tackles a core obstacle for autonomous agents: how to predict the outcome of actions when no explicit action labels are available. By training world models directly on diverse, real‑world video footage, the authors demonstrate that it’s possible to infer a compact “latent action” space that can be used for planning—without ever seeing a human‑annotated action tag.

Key Contributions

  • Latent‑action world modeling on in‑the‑wild video – Extends prior work that was limited to simulations or tightly controlled datasets.
  • Continuous, constrained latent action representation – Shows that a bounded continuous space captures complex real‑world motions better than discrete vector‑quantized codes (a contrast is sketched right after this list).
  • Cross‑video action transfer – Learned latent actions can be applied to different videos (e.g., moving a person into a room) despite differing camera viewpoints and backgrounds.
  • Spatially localized action embeddings – In the absence of a shared embodiment, the model automatically grounds actions relative to the camera.
  • Controller that maps known actions to latent actions – Provides a universal interface that lets downstream planners use the latent space as if it were a conventional action set, achieving performance on par with fully supervised baselines.
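
To make that contrast concrete, here is a minimal, illustrative sketch (not the authors' code) of the two representational choices: a bounded continuous latent‑action head versus a discrete vector‑quantized one. The feature dimension, action dimension, and codebook size are arbitrary assumptions.

```python
# Illustrative contrast (not the paper's code): a bounded continuous latent
# action head versus a discrete vector-quantized (VQ) one. All sizes are
# assumptions made up for this example.
import torch
import torch.nn as nn

class ContinuousLatentAction(nn.Module):
    """Maps features to a bounded continuous action a in [-1, 1]^n via tanh."""
    def __init__(self, feat_dim=256, action_dim=8):
        super().__init__()
        self.proj = nn.Linear(feat_dim, action_dim)

    def forward(self, feats):
        return torch.tanh(self.proj(feats))  # continuous, bounded

class VQLatentAction(nn.Module):
    """Maps features to the nearest entry of a small discrete codebook."""
    def __init__(self, feat_dim=256, action_dim=8, num_codes=64):
        super().__init__()
        self.proj = nn.Linear(feat_dim, action_dim)
        self.codebook = nn.Embedding(num_codes, action_dim)

    def forward(self, feats):
        z = self.proj(feats)                                   # (B, action_dim)
        idx = torch.cdist(z, self.codebook.weight).argmin(dim=-1)
        return self.codebook(idx)                              # snapped to one of 64 codes

feats = torch.randn(4, 256)
print(ContinuousLatentAction()(feats).shape)  # torch.Size([4, 8]), continuous values
print(VQLatentAction()(feats).shape)          # torch.Size([4, 8]), drawn from 64 codes
```

The paper's reported finding, per this summary, is that the continuous bounded variant captures fine‑grained real‑world motion better than a fixed codebook.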

Methodology

  1. Data collection – Large, uncurated video corpora (e.g., YouTube clips, egocentric recordings) are used, deliberately avoiding any action annotations.
  2. World‑model backbone – A video‑prediction network (e.g., a convolutional‑LSTM or transformer‑based encoder‑decoder) learns to forecast future frames given a latent action vector.
  3. Latent action encoder – Instead of being fed ground‑truth actions, the model infers a low‑dimensional continuous vector a ∈ ℝⁿ from the video itself and conditions frame prediction on it. The vector is constrained (e.g., via a bounded tanh activation) to keep it interpretable and stable (a minimal sketch of steps 2–4 follows this list).
  4. Training objectives
    • Reconstruction loss (pixel‑wise or perceptual) to ensure accurate frame prediction.
    • Temporal consistency to encourage smooth action trajectories.
    • Action regularization (e.g., KL‑divergence toward a prior) to keep the latent space compact.
  5. Controller learning – A separate lightweight network learns a deterministic mapping π(s, a_known) → a_latent, allowing a developer‑specified action (e.g., “move forward 0.5 m”) to be translated into the latent code the world model understands (a controller sketch also follows this list).
  6. Evaluation – The authors compare continuous latent actions against vector‑quantized (discrete) alternatives, and benchmark planning performance against fully supervised, action‑conditioned baselines.
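
A minimal, self‑contained sketch of how steps 2–4 could fit together is shown below. The tiny CNN/MLP stand‑ins, dimensions, and loss weights are assumptions for illustration, not the paper's video‑prediction backbone, and a simple L2 penalty stands in for the KL‑style action regularizer.

```python
# Minimal sketch of the training loop described in steps 2-4. The CNN/MLP
# stand-ins, dimensions, and loss weights are illustrative assumptions; the
# paper's actual backbone is a full video-prediction network, not this toy.
import torch
import torch.nn as nn
import torch.nn.functional as F

ACTION_DIM = 8

class FrameEncoder(nn.Module):
    """Tiny CNN that maps a 64x64 RGB frame to a feature vector."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(), nn.Linear(64 * 16 * 16, feat_dim),
        )
    def forward(self, x):
        return self.net(x)

class LatentActionEncoder(nn.Module):
    """Infers a bounded latent action from a pair of consecutive frames."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.head = nn.Linear(2 * feat_dim, ACTION_DIM)
    def forward(self, f_t, f_next):
        return torch.tanh(self.head(torch.cat([f_t, f_next], dim=-1)))

class Predictor(nn.Module):
    """Predicts the next frame from current-frame features and a latent action."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + ACTION_DIM, 64 * 16 * 16), nn.ReLU(),
            nn.Unflatten(1, (64, 16, 16)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),
        )
    def forward(self, f_t, a):
        return self.net(torch.cat([f_t, a], dim=-1))

enc, act_enc, pred = FrameEncoder(), LatentActionEncoder(), Predictor()
opt = torch.optim.Adam(
    [*enc.parameters(), *act_enc.parameters(), *pred.parameters()], lr=1e-4
)

def training_step(frames):                                   # frames: (B, T, 3, 64, 64)
    B, T = frames.shape[:2]
    feats = enc(frames.flatten(0, 1)).view(B, T, -1)
    actions = act_enc(feats[:, :-1], feats[:, 1:])           # (B, T-1, ACTION_DIM)
    preds = pred(feats[:, :-1].flatten(0, 1), actions.flatten(0, 1))
    recon = F.mse_loss(preds, frames[:, 1:].flatten(0, 1))   # reconstruction loss
    smooth = (actions[:, 1:] - actions[:, :-1]).pow(2).mean()  # temporal consistency
    reg = actions.pow(2).mean()        # simple stand-in for the KL-style regularizer
    loss = recon + 0.1 * smooth + 0.01 * reg
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

print(training_step(torch.randn(2, 4, 3, 64, 64)))
```

The control flow this sketch aims to convey is the one described above: infer a latent action between consecutive frames, condition the next‑frame prediction on it, and regularize the action space so it stays compact.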
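
Step 5 can be pictured as a small supervised regression problem. The sketch below is hypothetical: it assumes that latent actions inferred by the frozen latent‑action encoder on a handful of clips with known actions serve as regression targets, which is one plausible recipe rather than the paper's exact procedure.

```python
# Hypothetical sketch of step 5: a lightweight controller that translates a
# developer-specified action into the latent code the world model understands.
# Training it by regressing onto latent actions inferred from a small labeled
# clip set is an assumption for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Controller(nn.Module):
    """pi(state_features, known_action) -> latent action in [-1, 1]^n."""
    def __init__(self, feat_dim=128, known_dim=3, latent_dim=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim + known_dim, 64), nn.ReLU(),
            nn.Linear(64, latent_dim),
        )
    def forward(self, state_feats, known_action):
        return torch.tanh(self.net(torch.cat([state_feats, known_action], dim=-1)))

controller = Controller()
opt = torch.optim.Adam(controller.parameters(), lr=1e-3)

# Toy supervision: (state features, known action, latent action inferred by the
# frozen latent-action encoder for the same transition). All tensors are random
# placeholders here.
state_feats = torch.randn(32, 128)
known_action = torch.randn(32, 3)            # e.g., (dx, dy, dtheta) commands
target_latent = torch.tanh(torch.randn(32, 8))

for _ in range(10):
    pred = controller(state_feats, known_action)
    loss = F.mse_loss(pred, target_latent)
    opt.zero_grad(); loss.backward(); opt.step()
```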

Results & Findings

| Metric | Latent‑action model (continuous) | Vector‑quantized version | Fully supervised baseline |
| --- | --- | --- | --- |
| Frame‑prediction PSNR (on wild videos) | +3.2 dB over VQ | −0.8 dB vs. continuous | Comparable |
| Action‑transfer success (e.g., inserting a person) | 78% correct placement | 45% | 82% |
| Planning success rate (reach target state) | 71% | 58% | 73% |
| Sample efficiency (episodes to converge) | 1.4× fewer than VQ; similar to supervised | — | — |

Takeaway: Continuous, bounded latent actions capture the nuance of real‑world motion far better than discrete codes, and they enable cross‑video transfer and planning performance that rivals models trained with explicit action labels.

Practical Implications

  • Data‑efficient robotics & AR – Companies can bootstrap world models from existing video archives (e.g., dash‑cam footage, user‑generated content) without costly annotation pipelines.
  • Universal action interface – The controller that maps human‑readable commands to latent codes acts like an “API layer,” letting developers plug in any high‑level planner (MPC, RL, symbolic) without re‑training the world model (see the planning sketch after this list).
  • Cross‑domain simulation‑to‑real transfer – Since the latent actions are learned from real footage, policies trained in simulation can be transferred more seamlessly by aligning their action embeddings with the learned latent space.
  • Content‑aware video editing – The ability to “move” agents across videos suggests new tools for automated video compositing, virtual cinematography, or synthetic data generation for training perception models.
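
To illustrate the “universal action interface” point, the hypothetical sketch below runs a simple random‑shooting MPC loop directly in the learned latent action space. Here `predict_features` and `goal_feats` are toy stand‑ins for the trained world model's feature‑space rollout and a target state; neither is a component of the paper.

```python
# Hypothetical usage sketch: random-shooting MPC over the learned latent action
# space. `predict_features` stands in for one step of the frozen world model in
# feature space; the toy dynamics below are purely illustrative.
import torch

LATENT_DIM, HORIZON, CANDIDATES = 8, 5, 256

def predict_features(feats, action):
    """Stand-in for one world-model step in feature space."""
    return feats + 0.1 * action.sum(dim=-1, keepdim=True)     # toy dynamics

def plan(current_feats, goal_feats):
    # Sample candidate latent-action sequences from the bounded space [-1, 1).
    actions = torch.rand(CANDIDATES, HORIZON, LATENT_DIM) * 2 - 1
    feats = current_feats.expand(CANDIDATES, -1).clone()
    for t in range(HORIZON):
        feats = predict_features(feats, actions[:, t])
    # Score rollouts by distance to the goal and return the best first action.
    scores = -(feats - goal_feats).pow(2).sum(dim=-1)
    return actions[scores.argmax(), 0]

best_first_action = plan(torch.randn(1, 128), torch.randn(1, 128))
print(best_first_action.shape)  # torch.Size([8])
```

Because the latent space is bounded, candidate actions can be sampled uniformly without any action labels, which is what lets an off‑the‑shelf planner operate on top of the world model.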

Limitations & Future Work

  • Camera‑centric grounding – Without a shared embodiment, actions are only localized relative to the camera, limiting applicability to tasks that require absolute world coordinates (e.g., navigation in a global map).
  • Noise & occlusion – In‑the‑wild videos contain lighting changes, motion blur, and unrelated actors, which can still confuse the latent action encoder.
  • Scalability of the controller – Mapping a large repertoire of high‑level commands to latent vectors may require hierarchical or compositional structures.
  • Evaluation breadth – The paper focuses on planning benchmarks; broader downstream tasks (e.g., language‑guided manipulation) remain to be explored.

Future directions include integrating explicit geometry (e.g., depth sensors) to achieve embodiment‑agnostic grounding, extending the latent space to hierarchical actions, and testing the framework on large‑scale industry video streams (surveillance, sports analytics, autonomous driving).

Authors

  • Quentin Garrido
  • Tushar Nagarajan
  • Basile Terver
  • Nicolas Ballas
  • Yann LeCun
  • Michael Rabbat

Paper Information

  • arXiv ID: 2601.05230v1
  • Categories: cs.AI, cs.CV
  • Published: January 8, 2026