[Paper] DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos
Source: arXiv - 2602.06949v1
Overview
DreamDojo is a foundation world model for robots that learns how objects and environments behave by watching 44,000 hours of egocentric human video. By turning raw video into a predictive “simulation” engine, the model can anticipate the physical consequences of robot actions, opening the door to more flexible, data‑efficient robot learning across many tasks.
Key Contributions
- Largest video pre‑training corpus for robot world models – 44,000 hours of diverse, everyday‑scene footage covering thousands of objects and interaction styles.
- Continuous latent‑action representation – a unified proxy for “action” that can be inferred from unlabeled video, sidestepping the need for costly hand‑annotated action labels.
- Two‑stage training pipeline – (1) massive self‑supervised pre‑training on human video, then (2) lightweight fine‑tuning on a few minutes of robot‑specific data.
- Real‑time distilled model – a knowledge‑distillation step compresses the massive model to run at ≈10.8 FPS, preserving fidelity while enabling live use.
- Demonstrated downstream utilities – live tele‑operation assistance, zero‑shot policy evaluation, and model‑based planning for contact‑rich manipulation.
Methodology
- Data Collection & Curation – The authors aggregate publicly available egocentric video streams (e.g., EPIC‑Kitchens, Ego4D) and filter them for clear hand‑object interactions, yielding a heterogeneous dataset of daily‑life tasks.
- Self‑Supervised World Modeling – A transformer‑based video encoder‑decoder predicts future RGB‑D frames conditioned on a latent action vector. The latent action is learned jointly with the visual dynamics via a contrastive objective that aligns similar interaction sequences.
- Latent Action Proxy – Because most videos lack explicit robot commands, the system treats the latent vector as a continuous “action code” that captures the intent behind the observed hand motion. This code is later mapped to robot joint commands during fine‑tuning.
- Fine‑Tuning on Robot Data – A small, task‑specific robot dataset (minutes of tele‑operated demonstrations) aligns the latent action space with the robot’s actual control space (e.g., joint torques or end‑effector velocities).
- Distillation for Speed – A smaller student network is trained to mimic the teacher’s predictions, achieving real‑time inference while a teacher‑student consistency loss improves temporal coherence.
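To make the latent‑action idea concrete, the snippet below is a minimal, illustrative sketch, not the paper's architecture: random linear maps stand in for the transformer encoder‑decoder, and the dimensions (`FRAME_DIM`, `LATENT_DIM`, `ACTION_DIM`) and the `encode`/`infer_latent_action`/`predict_next` functions are hypothetical stand‑ins. It shows the three pieces the method couples together: encoding frames to latents, inferring a continuous action code from a state transition, and rolling the latent state forward under that code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; the paper's actual model dimensions are not given here.
FRAME_DIM, LATENT_DIM, ACTION_DIM = 64, 16, 4

# Random linear "networks" stand in for the learned transformer modules.
W_enc = rng.normal(0, 0.1, (FRAME_DIM, LATENT_DIM))                 # frame -> latent state
W_act = rng.normal(0, 0.1, (2 * LATENT_DIM, ACTION_DIM))            # (z_t, z_{t+1}) -> latent action
W_dyn = rng.normal(0, 0.1, (LATENT_DIM + ACTION_DIM, LATENT_DIM))   # forward dynamics

def encode(frame):
    """Map a raw frame to a latent state."""
    return np.tanh(frame @ W_enc)

def infer_latent_action(z_t, z_next):
    """Inverse model: infer the continuous action code from a state transition."""
    return np.tanh(np.concatenate([z_t, z_next]) @ W_act)

def predict_next(z_t, a_t):
    """Forward model: roll the latent state forward under a latent action."""
    return np.tanh(np.concatenate([z_t, a_t]) @ W_dyn)

# One self-supervised training signal: the forward model's prediction should
# match the encoding of the actually observed next frame.
frame_t, frame_next = rng.normal(size=FRAME_DIM), rng.normal(size=FRAME_DIM)
z_t, z_next = encode(frame_t), encode(frame_next)
a_t = infer_latent_action(z_t, z_next)
z_pred = predict_next(z_t, a_t)
prediction_loss = float(np.mean((z_pred - z_next) ** 2))
print(f"latent prediction loss: {prediction_loss:.4f}")
```

During fine‑tuning, a small mapping from this action code to real robot commands would be learned from the tele‑operated demonstrations; the contrastive alignment across interaction sequences is omitted here for brevity.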
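The distillation step can likewise be sketched in miniature. Here a one‑parameter "student" is regressed onto a frozen toy "teacher" with a mean‑squared consistency loss; the functions, data shapes, and learning rate are invented stand‑ins, assuming only the general teacher‑student recipe described above.

```python
import numpy as np

rng = np.random.default_rng(1)

def teacher_predict(z):
    """Slow, high-capacity teacher (frozen). Toy stand-in for the full model."""
    return np.tanh(1.5 * z)

def student_predict(z, w):
    """Fast student with a single trainable scale parameter w."""
    return np.tanh(w * z)

states = rng.normal(size=(256, 8))  # unlabeled latent states to distill on
w, lr = 0.0, 0.3

loss_before = float(np.mean((student_predict(states, w) - teacher_predict(states)) ** 2))
for _ in range(500):
    pred, target = student_predict(states, w), teacher_predict(states)
    # Gradient of the mean-squared teacher-student consistency loss w.r.t. w.
    grad = float(np.mean(2 * (pred - target) * (1 - pred ** 2) * states))
    w -= lr * grad
loss_after = float(np.mean((student_predict(states, w) - teacher_predict(states)) ** 2))
print(f"consistency loss: {loss_before:.4f} -> {loss_after:.4f} (w ~ {w:.2f})")
```

The student never sees ground‑truth labels, only the teacher's outputs, which is what lets distillation preserve fidelity while cutting inference cost.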
Results & Findings
| Evaluation | Setting | Metric | Outcome |
|---|---|---|---|
| Physics Prediction | OOD tabletop scenes (unseen objects) | Mean Squared Error on object pose | 30% lower error vs. prior video‑prediction baselines |
| Action Controllability | Fine‑tuned robot arm on pick‑place | Success rate | 87% success (vs. 62% for a model trained from scratch) |
| Real‑Time Distillation | Inference on a commodity GPU | FPS | 10.81 FPS (≈5× faster than teacher) |
| Downstream Tasks | Tele‑operation assistance (predictive overlay) | User task completion time | 22% faster than raw tele‑op without prediction |
The model consistently generalized to out‑of‑distribution scenarios—new objects, novel kitchen layouts, and different lighting—showing that the massive human video pre‑training endowed it with a robust physics intuition.
Practical Implications
- Accelerated Robot Development – Teams can bootstrap a robot’s world model with only a few minutes of task‑specific data, dramatically cutting the data‑collection bottleneck.
- Live Tele‑Operation Aid – Predictive visual overlays can warn operators of imminent collisions or suggest corrective motions, improving safety and efficiency in remote manipulation (e.g., warehouse picking, surgical assistance).
- Model‑Based Planning at Scale – Because DreamDojo can simulate contact‑rich interactions, planners can evaluate thousands of candidate trajectories offline before executing the best one on the hardware.
- Cross‑Domain Transfer – The latent‑action abstraction enables re‑using the same pretrained model across different robot morphologies (arms, mobile manipulators) with minimal adaptation.
- Foundation Model Ecosystem – DreamDojo positions itself as a “GPT‑for‑robotics” analogue, inviting community‑driven fine‑tuning for niche domains (e.g., household chores, assembly lines).
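The model‑based planning use case above amounts to sampling candidate action sequences, rolling each out inside the world model, and executing the cheapest. Below is a random‑shooting sketch in which a toy 2‑D point‑mass transition stands in for the learned world model; the dynamics, cost, and all names are hypothetical illustrations, not the paper's planner.

```python
import numpy as np

rng = np.random.default_rng(2)

def world_model_step(state, action):
    """Toy transition: the state drifts by a scaled action. A learned world
    model like DreamDojo would replace this function."""
    return state + 0.1 * action

def rollout_cost(state, actions, goal):
    """Simulate an action sequence in the model; cost = final distance to goal."""
    for a in actions:
        state = world_model_step(state, a)
    return float(np.linalg.norm(state - goal))

# Random-shooting planner: sample candidate trajectories offline in the model,
# then keep only the cheapest one to execute on hardware.
start, goal = np.zeros(2), np.array([1.0, 0.5])
candidates = rng.uniform(-1, 1, size=(1000, 20, 2))  # 1000 plans, 20 steps each
costs = [rollout_cost(start, plan, goal) for plan in candidates]
best_plan = candidates[int(np.argmin(costs))]
print(f"best simulated cost: {min(costs):.3f}")
```

In practice the rollouts are batched through the distilled model, and more sample‑efficient optimizers (e.g., cross‑entropy method) can replace uniform random shooting.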
Limitations & Future Work
- Domain Gap – Although fine‑tuning bridges much of the gap, extreme differences (e.g., underwater robotics, high‑speed industrial machining) still challenge the model’s transferability.
- Action Granularity – The latent action proxy captures coarse intent; very fine‑grained force control (e.g., delicate threading) may require additional supervision.
- Compute‑Heavy Pre‑Training – The 44 k h pre‑training demands large GPU clusters, which may limit reproducibility for smaller labs.
- Safety Guarantees – Predictive errors in safety‑critical contexts (human‑robot collaboration) need formal verification, an area the authors earmark for future research.
Future directions include expanding the video corpus to non‑egocentric viewpoints, integrating multimodal cues (audio, tactile), and exploring continual learning pipelines that let robots keep updating their world model as they operate in the field.
Authors
- Shenyuan Gao
- William Liang
- Kaiyuan Zheng
- Ayaan Malik
- Seonghyeon Ye
- Sihyun Yu
- Wei-Cheng Tseng
- Yuzhu Dong
- Kaichun Mo
- Chen-Hsuan Lin
- Qianli Ma
- Seungjun Nah
- Loic Magne
- Jiannan Xiang
- Yuqi Xie
- Ruijie Zheng
- Dantong Niu
- You Liang Tan
- K. R. Zentner
- George Kurian
- Suneel Indupuru
- Pooya Jannaty
- Jinwei Gu
- Jun Zhang
- Jitendra Malik
- Pieter Abbeel
- Ming-Yu Liu
- Yuke Zhu
- Joel Jang
- Linxi “Jim” Fan
Paper Information
- arXiv ID: 2602.06949v1
- Categories: cs.RO, cs.AI, cs.CV, cs.LG
- Published: February 6, 2026