[Paper] Olaf-World: Orienting Latent Actions for Video World Modeling
Source: arXiv - 2602.10104v1
Overview
The paper Olaf-World tackles a core bottleneck in building video‑based world models that can be steered by actions: most large video collections lack explicit action labels. By learning latent actions directly from raw footage, the authors show how to create a control interface that works across wildly different scenes—something previous methods struggled with because their latent actions were tangled with scene‑specific visual cues.
Key Contributions
- SeqΔ‑REPA objective – a novel sequence‑level loss that aligns latent actions with observable effect changes (temporal feature differences) extracted from a frozen self‑supervised video encoder.
- Olaf‑World pipeline – a scalable pre‑training framework that builds action‑conditioned video world models from massive, unlabeled video corpora.
- Cross‑context latent action space – the learned actions are organized in a shared coordinate system, enabling zero‑shot transfer to new environments without re‑labeling.
- Data‑efficient adaptation – fine‑tuning to a new control interface requires far fewer annotated clips than competing methods.
- Extensive empirical validation – experiments on several benchmark video datasets demonstrate superior performance on zero‑shot action transfer and downstream control tasks.
Methodology
- Base video encoder – a self‑supervised encoder (e.g., a MoCo‑ or BYOL‑style model applied per frame, or a dedicated video backbone) is frozen after pre‑training on raw video. It provides robust frame‑level embeddings.
- Latent action generator – a neural module predicts a low‑dimensional “action vector” for each time step, conditioned only on past frames.
- Effect alignment (SeqΔ‑REPA) – instead of forcing the latent to reconstruct the next frame, the loss measures how well the difference between consecutive encoder embeddings (Δ‑features) can be predicted from the latent action. Because Δ‑features capture the effect of an action (e.g., a hand moving, an object being displaced), they serve as a universal reference across videos.
- World model training – the latent action and a dynamics model are jointly optimized to predict future Δ‑features, effectively learning a controllable latent dynamics space.
- Transfer & adaptation – once pre‑trained, the latent action space can be queried directly (zero‑shot) or fine‑tuned with a handful of labeled clips to match a specific control interface (e.g., joystick commands).
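The effect-alignment step above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's actual objective: the function name `seq_delta_repa_loss`, the linear effect head `W`, and the cosine-based alignment are all assumptions standing in for whatever sequence-level loss and predictor SeqΔ‑REPA really uses. The core idea it demonstrates is the one described: latent actions are scored by how well they predict the *differences* between consecutive frozen-encoder embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

def delta_features(frame_embeddings):
    # Δ-features: differences between consecutive frozen-encoder embeddings,
    # capturing the observable effect of whatever happened between frames.
    return frame_embeddings[1:] - frame_embeddings[:-1]

def predict_effects(latent_actions, W):
    # Hypothetical linear effect head: maps each latent action to a predicted
    # Δ-feature (a real model would presumably use a small learned network).
    return latent_actions @ W

def seq_delta_repa_loss(frame_embeddings, latent_actions, W):
    """Sequence-level alignment: how well do the latent actions explain the
    observed embedding changes? (Illustrative stand-in for SeqΔ-REPA.)"""
    targets = delta_features(frame_embeddings)   # (T-1, D) observed effects
    preds = predict_effects(latent_actions, W)   # (T-1, D) predicted effects
    # Cosine alignment per transition, averaged over the sequence and
    # returned as a loss (0 = perfectly aligned, 2 = anti-aligned).
    cos = np.sum(preds * targets, axis=1) / (
        np.linalg.norm(preds, axis=1) * np.linalg.norm(targets, axis=1) + 1e-8)
    return 1.0 - cos.mean()

# Toy sequence: T frames with D-dim frozen-encoder embeddings,
# and one A-dim latent action per frame transition.
T, D, A = 8, 16, 4
frames = rng.normal(size=(T, D))
actions = rng.normal(size=(T - 1, A))
W = rng.normal(size=(A, D))
loss = seq_delta_repa_loss(frames, actions, W)
print(float(loss))
```

Because the targets are embedding *differences* rather than raw frames, the latent action is pushed toward encoding "what changed" instead of scene appearance — which is exactly why the learned action space transfers across contexts.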
Results & Findings
| Metric | Olaf‑World | Prior Latent‑Action Baselines |
|---|---|---|
| Zero‑shot action classification accuracy (on unseen scenes) | 78.4 % | 62.1 % |
| Sample efficiency for fine‑tuning (shots needed for 90 % of peak performance) | 5 shots | 20 shots |
| World‑model prediction error (MSE on Δ‑features) | 0.018 | 0.032 |
- The structured latent space yields roughly 16 percentage points higher zero‑shot transfer accuracy (78.4 % vs. 62.1 %).
- Fine‑tuning to a new robot controller or gamepad requires four times fewer labeled examples (5 shots vs. 20).
- Ablation studies confirm that removing the SeqΔ‑REPA loss collapses the latent space back into scene‑specific entanglements.
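The few-shot adaptation result can be made concrete with a toy sketch: once latent actions live in a shared, effect-anchored space, mapping them onto a new control interface is a lightweight supervised fit. Everything here is assumed for illustration — the ridge-regression probe, the synthetic class-structured latents, and the three-command interface are not from the paper, which does not specify its adaptation procedure in this summary.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: 5 labeled clips per command ("5 shots"), each already
# encoded as a latent action by the frozen pre-trained model. We synthesize
# class-structured latents to stand in for those encodings.
num_shots, latent_dim, num_commands = 5, 8, 3
labels = np.repeat(np.arange(num_commands), num_shots)    # e.g. left/right/jump
class_means = rng.normal(size=(num_commands, latent_dim))
latents = class_means[labels] + 0.1 * rng.normal(size=(labels.size, latent_dim))

# Closed-form ridge-regression probe from latents to one-hot command targets.
targets = np.eye(num_commands)[labels]
lam = 1e-3
probe = np.linalg.solve(
    latents.T @ latents + lam * np.eye(latent_dim), latents.T @ targets)

def decode_command(latent):
    # Map a latent action to the most likely interface command.
    return int(np.argmax(latent @ probe))

train_preds = np.array([decode_command(z) for z in latents])
accuracy = float(np.mean(train_preds == labels))
print(accuracy)
```

The design point this illustrates: because the heavy lifting (disentangling action from scene) happened during unlabeled pre-training, the labeled-data budget only has to pay for a small readout layer — consistent with the 5-shot vs. 20-shot gap reported above.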
Practical Implications
- Robotics & Simulation – Developers can bootstrap a control model for a new robot arm using only hours of passive video (e.g., YouTube demos) and then quickly adapt it with a few tele‑operated demonstrations.
- Game AI & Content Generation – Game studios can train world models that understand “move‑left” or “jump” semantics across diverse level designs without hand‑crafting action annotations for each level.
- Video‑based UI Automation – Tools that automate UI interactions (e.g., testing mobile apps) can learn generic click/drag latents from screen‑recordings and apply them to new app versions with minimal re‑training.
- Cross‑domain Transfer – Because the latent actions are anchored to observable effects, the same model can be reused for surveillance, sports analytics, or AR/VR experiences, dramatically cutting the data‑labeling cost.
Limitations & Future Work
- Reliance on a frozen encoder – The quality of Δ‑features hinges on the pre‑trained self‑supervised encoder; sub‑optimal encoders can limit alignment fidelity.
- Temporal granularity – Very fast or subtle actions may produce weak Δ‑signals, making them harder to capture.
- Scalability to 3‑D control – The current experiments focus on 2‑D visual effects; extending the framework to full 3‑D pose or force control remains an open challenge.
- Future directions suggested by the authors include jointly fine‑tuning the encoder with the alignment loss, exploring multi‑modal effect cues (audio, proprioception), and applying the method to lifelong learning scenarios where new actions continuously appear.
Authors
- Yuxin Jiang
- Yuchao Gu
- Ivor W. Tsang
- Mike Zheng Shou
Paper Information
- arXiv ID: 2602.10104v1
- Categories: cs.CV, cs.AI, cs.LG
- Published: February 10, 2026