[Paper] DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos
Source: arXiv - 2602.06949v1
Overview
DreamDojo is a foundation world model for robots that learns how objects and environments behave by watching 44,000 hours of egocentric human video. By turning raw video into a predictive “simulation” engine, the model can anticipate the physical consequences of robot actions, opening the door to more flexible, data‑efficient robot learning across many tasks.
Key Contributions
- Largest video pre‑training corpus for robot world models – 44,000 hours of diverse, everyday‑scene footage covering thousands of objects and interaction styles.
- Continuous latent‑action representation – a unified proxy for “action” that can be inferred from unlabeled video, sidestepping the need for costly hand‑annotated action labels.
- Two‑stage training pipeline – (1) massive self‑supervised pre‑training on human video, then (2) lightweight fine‑tuning on a few minutes of robot‑specific data.
- Real‑time distilled model – a knowledge‑distillation step compresses the massive model to run at ≈10.8 FPS, preserving fidelity while enabling live use.
- Demonstrated downstream utilities – live tele‑operation assistance, zero‑shot policy evaluation, and model‑based planning for contact‑rich manipulation.
Methodology
- Data Collection & Curation – The authors aggregate publicly available egocentric video streams (e.g., EPIC‑Kitchens, Ego4D) and filter them for clear hand‑object interactions, yielding a heterogeneous dataset of daily‑life tasks.
- Self‑Supervised World Modeling – A transformer‑based video encoder‑decoder predicts future RGB‑D frames conditioned on a latent action vector. The latent action is learned jointly with the visual dynamics via a contrastive objective that aligns similar interaction sequences.
- Latent Action Proxy – Because most videos lack explicit robot commands, the system treats the latent vector as a continuous “action code” that captures the intent behind the observed hand motion. This code is later mapped to robot joint commands during fine‑tuning.
- Fine‑Tuning on Robot Data – A small, task‑specific robot dataset (minutes of tele‑operated demonstrations) aligns the latent action space with the robot’s actual control space (e.g., joint torques or end‑effector velocities).
- Distillation for Speed – A smaller student network is trained to mimic the teacher’s predictions, achieving real‑time inference while a teacher‑student consistency loss improves temporal coherence.
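To make the latent‑action idea concrete, the snippet below is a minimal, illustrative sketch, not the paper's architecture: random linear maps stand in for the transformer encoder‑decoder, and the dimensions (`FRAME_DIM`, `LATENT_DIM`, `ACTION_DIM`) and the `encode`/`infer_latent_action`/`predict_next` functions are hypothetical stand‑ins. It shows the three pieces the method couples together: encoding frames to latents, inferring a continuous action code from a state transition, and rolling the latent state forward under that code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; the paper's actual model dimensions are not given here.
FRAME_DIM, LATENT_DIM, ACTION_DIM = 64, 16, 4

# Random linear "networks" stand in for the learned transformer modules.
W_enc = rng.normal(0, 0.1, (FRAME_DIM, LATENT_DIM))                 # frame -> latent state
W_act = rng.normal(0, 0.1, (2 * LATENT_DIM, ACTION_DIM))            # (z_t, z_{t+1}) -> latent action
W_dyn = rng.normal(0, 0.1, (LATENT_DIM + ACTION_DIM, LATENT_DIM))   # forward dynamics

def encode(frame):
    """Map a raw frame to a latent state."""
    return np.tanh(frame @ W_enc)

def infer_latent_action(z_t, z_next):
    """Inverse model: infer the continuous action code from a state transition."""
    return np.tanh(np.concatenate([z_t, z_next]) @ W_act)

def predict_next(z_t, a_t):
    """Forward model: roll the latent state forward under a latent action."""
    return np.tanh(np.concatenate([z_t, a_t]) @ W_dyn)

# One self-supervised training signal: the forward model's prediction should
# match the encoding of the actually observed next frame.
frame_t, frame_next = rng.normal(size=FRAME_DIM), rng.normal(size=FRAME_DIM)
z_t, z_next = encode(frame_t), encode(frame_next)
a_t = infer_latent_action(z_t, z_next)
z_pred = predict_next(z_t, a_t)
prediction_loss = float(np.mean((z_pred - z_next) ** 2))
print(f"latent prediction loss: {prediction_loss:.4f}")
```

During fine‑tuning, a small mapping from this action code to real robot commands would be learned from the tele‑operated demonstrations; the contrastive alignment across interaction sequences is omitted here for brevity.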
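The distillation step can likewise be sketched in miniature. Here a one‑parameter "student" is regressed onto a frozen toy "teacher" with a mean‑squared consistency loss; the functions, data shapes, and learning rate are invented stand‑ins, assuming only the general teacher‑student recipe described above.

```python
import numpy as np

rng = np.random.default_rng(1)

def teacher_predict(z):
    """Slow, high-capacity teacher (frozen). Toy stand-in for the full model."""
    return np.tanh(1.5 * z)

def student_predict(z, w):
    """Fast student with a single trainable scale parameter w."""
    return np.tanh(w * z)

states = rng.normal(size=(256, 8))  # unlabeled latent states to distill on
w, lr = 0.0, 0.3

loss_before = float(np.mean((student_predict(states, w) - teacher_predict(states)) ** 2))
for _ in range(500):
    pred, target = student_predict(states, w), teacher_predict(states)
    # Gradient of the mean-squared teacher-student consistency loss w.r.t. w.
    grad = float(np.mean(2 * (pred - target) * (1 - pred ** 2) * states))
    w -= lr * grad
loss_after = float(np.mean((student_predict(states, w) - teacher_predict(states)) ** 2))
print(f"consistency loss: {loss_before:.4f} -> {loss_after:.4f} (w ~ {w:.2f})")
```

The student never sees ground‑truth labels, only the teacher's outputs, which is what lets distillation preserve fidelity while cutting inference cost.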
Results & Findings
| Evaluation | Setting | Metric | Outcome |
|---|---|---|---|
| Physics Prediction | OOD tabletop scenes (unseen objects) | Mean Squared Error on object pose | 30% lower error vs. prior video‑prediction baselines |
| Action Controllability | Fine‑tuned robot arm on pick‑place | Success rate | 87% success (vs. 62% for a model trained from scratch) |
| Real‑Time Distillation | Inference on a commodity GPU | FPS | 10.81 FPS (≈5× faster than teacher) |
| Downstream Tasks | Tele‑operation assistance (predictive overlay) | User task completion time | 22% faster than raw tele‑op without prediction |
The model consistently generalized to out‑of‑distribution scenarios—new objects, novel kitchen layouts, and different lighting—showing that the massive human video pre‑training endowed it with a robust physics intuition.
Practical Implications
- Accelerated Robot Development – Teams can bootstrap a robot’s world model with only a few minutes of task‑specific data, dramatically cutting the data‑collection bottleneck.
- Live Tele‑Operation Aid – Predictive visual overlays can warn operators of imminent collisions or suggest corrective motions, improving safety and efficiency in remote manipulation (e.g., warehouse picking, surgical assistance).
- Model‑Based Planning at Scale – Because DreamDojo can simulate contact‑rich interactions, planners can evaluate thousands of candidate trajectories offline before executing the best one on the hardware.
- Cross‑Domain Transfer – The latent‑action abstraction enables re‑using the same pretrained model across different robot morphologies (arms, mobile manipulators) with minimal adaptation.
- Foundation Model Ecosystem – DreamDojo positions itself as a “GPT‑for‑robotics” analogue, inviting community‑driven fine‑tuning for niche domains (e.g., household chores, assembly lines).
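The model‑based planning use case above amounts to sampling candidate action sequences, rolling each out inside the world model, and executing the cheapest. Below is a random‑shooting sketch in which a toy 2‑D point‑mass transition stands in for the learned world model; the dynamics, cost, and all names are hypothetical illustrations, not the paper's planner.

```python
import numpy as np

rng = np.random.default_rng(2)

def world_model_step(state, action):
    """Toy transition: the state drifts by a scaled action. A learned world
    model like DreamDojo would replace this function."""
    return state + 0.1 * action

def rollout_cost(state, actions, goal):
    """Simulate an action sequence in the model; cost = final distance to goal."""
    for a in actions:
        state = world_model_step(state, a)
    return float(np.linalg.norm(state - goal))

# Random-shooting planner: sample candidate trajectories offline in the model,
# then keep only the cheapest one to execute on hardware.
start, goal = np.zeros(2), np.array([1.0, 0.5])
candidates = rng.uniform(-1, 1, size=(1000, 20, 2))  # 1000 plans, 20 steps each
costs = [rollout_cost(start, plan, goal) for plan in candidates]
best_plan = candidates[int(np.argmin(costs))]
print(f"best simulated cost: {min(costs):.3f}")
```

In practice the rollouts are batched through the distilled model, and more sample‑efficient optimizers (e.g., cross‑entropy method) can replace uniform random shooting.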
Limitations & Future Work
- Domain Gap – Although fine‑tuning bridges much of the gap, extreme differences (e.g., underwater robotics, high‑speed industrial machining) still challenge the model’s transferability.
- Action Granularity – The latent action proxy captures coarse intent; very fine‑grained force control (e.g., delicate threading) may require additional supervision.
- Compute‑Heavy Pre‑Training – The 44 k h pre‑training demands large GPU clusters, which may limit reproducibility for smaller labs.
- Safety Guarantees – Predictive errors in safety‑critical contexts (human‑robot collaboration) need formal verification, an area the authors earmark for future research.
Future directions include expanding the video corpus to non‑egocentric viewpoints, integrating multimodal cues (audio, tactile), and exploring continual learning pipelines that let robots keep updating their world model as they operate in the field.
Authors
- Shenyuan Gao
- William Liang
- Kaiyuan Zheng
- Ayaan Malik
- Seonghyeon Ye
- Sihyun Yu
- Wei-Cheng Tseng
- Yuzhu Dong
- Kaichun Mo
- Chen-Hsuan Lin
- Qianli Ma
- Seungjun Nah
- Loic Magne
- Jiannan Xiang
- Yuqi Xie
- Ruijie Zheng
- Dantong Niu
- You Liang Tan
- K. R. Zentner
- George Kurian
- Suneel Indupuru
- Pooya Jannaty
- Jinwei Gu
- Jun Zhang
- Jitendra Malik
- Pieter Abbeel
- Ming-Yu Liu
- Yuke Zhu
- Joel Jang
- Linxi “Jim” Fan
Paper Information
- arXiv ID: 2602.06949v1
- Categories: cs.RO, cs.AI, cs.CV, cs.LG
- Published: February 6, 2026