[Paper] Olaf-World: Orienting Latent Actions for Video World Modeling
Source: arXiv - 2602.10104v1
Overview
The paper Olaf-World tackles a core bottleneck in building video‑based world models that can be steered by actions: most large video collections lack explicit action labels. By learning latent actions directly from raw footage, the authors show how to create a control interface that works across wildly different scenes—something previous methods struggled with because their latent actions were tangled with scene‑specific visual cues.
Key Contributions
- SeqΔ‑REPA objective – a novel sequence‑level loss that aligns latent actions with observable effect changes (temporal feature differences) extracted from a frozen self‑supervised video encoder.
- Olaf‑World pipeline – a scalable pre‑training framework that builds action‑conditioned video world models from massive, unlabeled video corpora.
- Cross‑context latent action space – the learned actions are organized in a shared coordinate system, enabling zero‑shot transfer to new environments without re‑labeling.
- Data‑efficient adaptation – fine‑tuning to a new control interface requires far fewer annotated clips than competing methods.
- Extensive empirical validation – experiments on several benchmark video datasets demonstrate superior performance on zero‑shot action transfer and downstream control tasks.
Methodology
- Base video encoder – a self‑supervised encoder (e.g., a MoCo‑ or BYOL‑style model applied per frame, or a dedicated video backbone) is frozen after pre‑training on raw video. It provides robust frame‑level embeddings.
- Latent action generator – a neural module predicts a low‑dimensional “action vector” for each time step, conditioned only on past frames.
- Effect alignment (SeqΔ‑REPA) – instead of forcing the latent to reconstruct the next frame, the loss measures how well the difference between consecutive encoder embeddings (Δ‑features) can be predicted from the latent action. Because Δ‑features capture the effect of an action (e.g., a hand moving, an object being displaced), they serve as a universal reference across videos.
- World model training – the latent action and a dynamics model are jointly optimized to predict future Δ‑features, effectively learning a controllable latent dynamics space.
- Transfer & adaptation – once pre‑trained, the latent action space can be queried directly (zero‑shot) or fine‑tuned with a handful of labeled clips to match a specific control interface (e.g., joystick commands).
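The effect-alignment step above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's actual objective: the function name `seq_delta_repa_loss`, the linear effect head `W`, and the cosine-based alignment are all assumptions standing in for whatever sequence-level loss and predictor SeqΔ‑REPA really uses. The core idea it demonstrates is the one described: latent actions are scored by how well they predict the *differences* between consecutive frozen-encoder embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

def delta_features(frame_embeddings):
    # Δ-features: differences between consecutive frozen-encoder embeddings,
    # capturing the observable effect of whatever happened between frames.
    return frame_embeddings[1:] - frame_embeddings[:-1]

def predict_effects(latent_actions, W):
    # Hypothetical linear effect head: maps each latent action to a predicted
    # Δ-feature (a real model would presumably use a small learned network).
    return latent_actions @ W

def seq_delta_repa_loss(frame_embeddings, latent_actions, W):
    """Sequence-level alignment: how well do the latent actions explain the
    observed embedding changes? (Illustrative stand-in for SeqΔ-REPA.)"""
    targets = delta_features(frame_embeddings)   # (T-1, D) observed effects
    preds = predict_effects(latent_actions, W)   # (T-1, D) predicted effects
    # Cosine alignment per transition, averaged over the sequence and
    # returned as a loss (0 = perfectly aligned, 2 = anti-aligned).
    cos = np.sum(preds * targets, axis=1) / (
        np.linalg.norm(preds, axis=1) * np.linalg.norm(targets, axis=1) + 1e-8)
    return 1.0 - cos.mean()

# Toy sequence: T frames with D-dim frozen-encoder embeddings,
# and one A-dim latent action per frame transition.
T, D, A = 8, 16, 4
frames = rng.normal(size=(T, D))
actions = rng.normal(size=(T - 1, A))
W = rng.normal(size=(A, D))
loss = seq_delta_repa_loss(frames, actions, W)
print(float(loss))
```

Because the targets are embedding *differences* rather than raw frames, the latent action is pushed toward encoding "what changed" instead of scene appearance — which is exactly why the learned action space transfers across contexts.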
Results & Findings
| Metric | Olaf‑World | Prior Latent‑Action Baselines |
|---|---|---|
| Zero‑shot action classification accuracy (on unseen scenes) | 78.4 % | 62.1 % |
| Sample efficiency for fine‑tuning (shots needed for 90 % of peak performance) | 5 shots | 20 shots |
| World‑model prediction error (MSE on Δ‑features) | 0.018 | 0.032 |
- The structured latent space yields roughly 16 percentage points higher zero‑shot transfer accuracy (78.4 % vs. 62.1 %).
- Fine‑tuning to a new robot controller or gamepad requires four times fewer labeled examples (5 shots vs. 20).
- Ablation studies confirm that removing the SeqΔ‑REPA loss collapses the latent space back into scene‑specific entanglements.
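The few-shot adaptation result can be made concrete with a toy sketch: once latent actions live in a shared, effect-anchored space, mapping them onto a new control interface is a lightweight supervised fit. Everything here is assumed for illustration — the ridge-regression probe, the synthetic class-structured latents, and the three-command interface are not from the paper, which does not specify its adaptation procedure in this summary.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: 5 labeled clips per command ("5 shots"), each already
# encoded as a latent action by the frozen pre-trained model. We synthesize
# class-structured latents to stand in for those encodings.
num_shots, latent_dim, num_commands = 5, 8, 3
labels = np.repeat(np.arange(num_commands), num_shots)    # e.g. left/right/jump
class_means = rng.normal(size=(num_commands, latent_dim))
latents = class_means[labels] + 0.1 * rng.normal(size=(labels.size, latent_dim))

# Closed-form ridge-regression probe from latents to one-hot command targets.
targets = np.eye(num_commands)[labels]
lam = 1e-3
probe = np.linalg.solve(
    latents.T @ latents + lam * np.eye(latent_dim), latents.T @ targets)

def decode_command(latent):
    # Map a latent action to the most likely interface command.
    return int(np.argmax(latent @ probe))

train_preds = np.array([decode_command(z) for z in latents])
accuracy = float(np.mean(train_preds == labels))
print(accuracy)
```

The design point this illustrates: because the heavy lifting (disentangling action from scene) happened during unlabeled pre-training, the labeled-data budget only has to pay for a small readout layer — consistent with the 5-shot vs. 20-shot gap reported above.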
Practical Implications
- Robotics & Simulation – Developers can bootstrap a control model for a new robot arm using only hours of passive video (e.g., YouTube demos) and then quickly adapt it with a few tele‑operated demonstrations.
- Game AI & Content Generation – Game studios can train world models that understand “move‑left” or “jump” semantics across diverse level designs without hand‑crafting action annotations for each level.
- Video‑based UI Automation – Tools that automate UI interactions (e.g., testing mobile apps) can learn generic click/drag latents from screen‑recordings and apply them to new app versions with minimal re‑training.
- Cross‑domain Transfer – Because the latent actions are anchored to observable effects, the same model can be reused for surveillance, sports analytics, or AR/VR experiences, dramatically cutting the data‑labeling cost.
Limitations & Future Work
- Reliance on a frozen encoder – The quality of Δ‑features hinges on the pre‑trained self‑supervised encoder; sub‑optimal encoders can limit alignment fidelity.
- Temporal granularity – Very fast or subtle actions may produce weak Δ‑signals, making them harder to capture.
- Scalability to 3‑D control – The current experiments focus on 2‑D visual effects; extending the framework to full 3‑D pose or force control remains an open challenge.
- Future directions suggested by the authors include jointly fine‑tuning the encoder with the alignment loss, exploring multi‑modal effect cues (audio, proprioception), and applying the method to lifelong learning scenarios where new actions continuously appear.
Authors
- Yuxin Jiang
- Yuchao Gu
- Ivor W. Tsang
- Mike Zheng Shou
Paper Information
- arXiv ID: 2602.10104v1
- Categories: cs.CV, cs.AI, cs.LG
- Published: February 10, 2026