[Paper] World Models Can Leverage Human Videos for Dexterous Manipulation

Published: December 15, 2025 at 01:37 PM EST
4 min read
Source: arXiv - 2512.13644v1

Overview

DexWM (Dexterous Manipulation World Model) is a new AI system that learns to predict how a hand (or a robot gripper) will interact with objects, using massive amounts of publicly available video. By training on 900+ hours of human and non-dexterous robot footage, the model can anticipate the consequences of fine-grained finger motions, enabling zero-shot robot manipulation that outperforms state-of-the-art policies on real-world tasks such as grasping, placing, and reaching.

Key Contributions

  • Cross‑domain video pre‑training: Leveraged a large, heterogeneous video corpus (human hands + simple robot videos) to overcome the scarcity of dexterous manipulation datasets.
  • Latent‑space world model for hands: Introduced DexWM, which predicts the next latent state of the scene conditioned on past latent states and detailed finger‑level actions.
  • Hand‑consistency auxiliary loss: Added a novel loss that explicitly enforces accurate hand pose reconstruction, boosting prediction fidelity for subtle finger motions.
  • Zero‑shot transfer to a real robot: Demonstrated that a model trained only on video can be deployed on a Franka Panda arm with an Allegro hand, achieving >50 % improvement over Diffusion Policy on a suite of manipulation benchmarks.
  • Benchmarking against multimodal world models: Showed superior performance compared to prior models conditioned on text, navigation commands, or full‑body actions.

Methodology

  1. Data collection & preprocessing – The authors aggregated ~900 h of video from two sources: (a) human hand‑centric clips (e.g., YouTube tutorials) and (b) robot videos that involve coarse manipulation but no dexterous fingers. Frames were cropped, normalized, and paired with any available action metadata (e.g., joint angles for robot clips).
  2. Latent representation – A convolutional encoder maps each frame to a compact latent vector. This latent space is shared across human and robot domains, allowing the model to learn a unified notion of “hand‑object interaction.”
  3. World‑model dynamics – A recurrent network (e.g., GRU/LSTM) takes a sequence of past latent states and the current dexterous action (30‑DOF finger joint commands) and predicts the next latent state (a minimal sketch of this encoder‑plus‑dynamics pipeline and its losses follows the list).
  4. Auxiliary hand‑consistency loss – In parallel, a decoder reconstructs the hand pose from the predicted latent state. The loss penalizes discrepancies between the reconstructed pose and the ground‑truth pose (when available) or a kinematic prior, ensuring the latent dynamics retain fine‑grained finger information.
  5. Training regime – The model is trained end‑to‑end with a weighted sum of (i) latent prediction loss, (ii) hand‑consistency loss, and (iii) a regularization term encouraging smooth dynamics. No task‑specific supervision (e.g., “pick‑and‑place”) is used.
  6. Zero‑shot deployment – At inference time, the robot’s controller samples actions, feeds them to DexWM, and uses the predicted future latent states to select actions that lead to a desired goal (e.g., object at target pose). This is done without any additional fine‑tuning on the robot; a simple sampling‑based planner in this spirit is sketched after the list.
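
The components in steps 2–5 can be pictured concretely with a small amount of code. The following is a minimal PyTorch-style sketch, not the authors' implementation: the dimensions (LATENT_DIM, HAND_POSE_DIM), the architectures, and the loss weights are illustrative assumptions. It wires together a shared frame encoder, a GRU dynamics model conditioned on finger-joint actions, an auxiliary hand-pose decoder, and the weighted training objective.

```python
# Minimal sketch of a DexWM-style latent world model (illustrative, not the
# authors' code). Dimensions, architectures, and loss weights are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

LATENT_DIM = 256
ACTION_DIM = 30      # finger-joint command vector, as described in the summary
HAND_POSE_DIM = 45   # assumed hand-pose parameterization (e.g., joint angles)

class FrameEncoder(nn.Module):
    """Maps an RGB frame to a compact latent vector shared across domains."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(128, LATENT_DIM)

    def forward(self, frame):                  # frame: (B, 3, H, W)
        return self.fc(self.conv(frame).flatten(1))

class LatentDynamics(nn.Module):
    """Recurrent model: past latents + current action -> next latent."""
    def __init__(self):
        super().__init__()
        self.gru = nn.GRU(LATENT_DIM + ACTION_DIM, LATENT_DIM, batch_first=True)
        self.head = nn.Linear(LATENT_DIM, LATENT_DIM)

    def forward(self, latents, actions):       # (B, T, LATENT_DIM), (B, T, ACTION_DIM)
        out, _ = self.gru(torch.cat([latents, actions], dim=-1))
        return self.head(out)                  # predicted latent at each next step

class HandPoseDecoder(nn.Module):
    """Auxiliary decoder enforcing that latents keep finger-level detail."""
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(LATENT_DIM, 128), nn.ReLU(), nn.Linear(128, HAND_POSE_DIM))

    def forward(self, latent):
        return self.mlp(latent)

def training_loss(encoder, dynamics, pose_decoder, frames, actions, hand_poses,
                  w_pred=1.0, w_hand=0.5, w_smooth=0.01):
    """Weighted sum of latent-prediction, hand-consistency, and smoothness terms."""
    B, T = frames.shape[:2]
    latents = encoder(frames.flatten(0, 1)).view(B, T, LATENT_DIM)
    pred_next = dynamics(latents[:, :-1], actions[:, :-1])
    loss_pred = F.mse_loss(pred_next, latents[:, 1:].detach())
    loss_hand = F.mse_loss(pose_decoder(pred_next), hand_poses[:, 1:])
    loss_smooth = (pred_next[:, 1:] - pred_next[:, :-1]).pow(2).mean()
    return w_pred * loss_pred + w_hand * loss_hand + w_smooth * loss_smooth
```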
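
Step 6 describes choosing actions by imagining their consequences in latent space. One simple way to realize that idea is random-shooting planning against a goal image; the sketch below reuses the hypothetical FrameEncoder and LatentDynamics modules above, and the sampling range, horizon, and candidate count are assumptions rather than details from the paper.

```python
# Illustrative zero-shot action selection by random shooting in latent space.
# Assumes the hypothetical FrameEncoder / LatentDynamics modules sketched above.
import torch

@torch.no_grad()
def select_action(encoder, dynamics, current_frame, goal_frame,
                  horizon=10, num_candidates=256, action_dim=30):
    """Sample candidate action sequences, roll them through the world model,
    and return the first action of the sequence whose predicted final latent
    is closest to the latent of the goal image."""
    z0 = encoder(current_frame.unsqueeze(0))               # (1, LATENT_DIM)
    z_goal = encoder(goal_frame.unsqueeze(0))               # (1, LATENT_DIM)

    # Candidate finger-joint command sequences (here: uniform in [-1, 1]).
    candidates = torch.rand(num_candidates, horizon, action_dim) * 2 - 1

    z = z0.repeat(num_candidates, 1)                        # (N, LATENT_DIM)
    for t in range(horizon):
        # Feed the current latent and action as a length-1 sequence.
        z = dynamics(z.unsqueeze(1), candidates[:, t:t + 1]).squeeze(1)

    # Score candidates by distance between predicted and goal latents.
    scores = (z - z_goal).pow(2).sum(dim=-1)                 # (N,)
    best = scores.argmin()
    return candidates[best, 0]                               # first action of best sequence
```

The same rollout-and-score pattern is what lets the predictions serve as a "look-ahead" inside a model-based control loop, as noted under Practical Implications below.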

Results & Findings

| Task (Franka + Allegro) | Diffusion Policy success (baseline) | DexWM success (zero-shot) | Relative ↑ |
| --- | --- | --- | --- |
| Grasping | 38% | 62% | +64% |
| Placing | 34% | 58% | +71% |
| Reaching | 45% | 71% | +58% |
| Average | 39% | 63% | +62% |

  • Prediction accuracy: On held‑out video sequences, DexWM reduced latent prediction error by ~30 % compared with prior world models that only predict visual features.
  • Generalization: The model successfully handled unseen objects, novel hand‑object contacts, and tasks it never saw during training, confirming the strength of the learned latent dynamics.
  • Ablation: Removing the hand‑consistency loss caused a drop of ~15 % in manipulation success, highlighting its importance for fine‑grained control.

Practical Implications

  • Rapid prototyping of robot skills: Developers can now train a manipulation model on publicly available video, sidestepping the costly data‑collection pipelines that traditionally require instrumented robot runs.
  • Cross‑platform transfer: Because the latent space is agnostic to the underlying hardware, the same DexWM model can be reused across different robot arms or hand designs with minimal adaptation.
  • Improved simulation‑to‑real transfer: The world‑model approach predicts future states directly in latent space, which can be integrated into model‑based RL loops or used as a safety “look‑ahead” in real‑time controllers.
  • Potential for mixed‑reality teleoperation: Human operators could demonstrate a task on video; DexWM would infer the underlying finger motions and generate robot commands, enabling intuitive skill sharing.

Limitations & Future Work

  • Reliance on pose annotations: The hand‑consistency loss benefits from accurate hand pose data, which is not always available in wild videos; scaling to completely unannotated footage may require self‑supervised pose estimation.
  • Latency in real‑time control: Running the encoder‑recurrent‑decoder pipeline at high frequency (>30 Hz) on embedded hardware remains a challenge; optimizing inference speed is an open engineering problem.
  • Generalization to highly dynamic contacts: Extremely fast or impact‑heavy interactions (e.g., hammering) were not covered; extending the model to handle high‑frequency contact dynamics is a promising direction.
  • Multi‑object scenes: Current experiments focus on single‑object manipulation; scaling to cluttered environments with occlusions will likely need richer scene representations or attention mechanisms.

DexWM shows that massive, off‑the‑shelf video can be turned into a powerful world model for dexterous robot hands, opening a practical pathway for developers to endow robots with fine‑grained manipulation abilities without the traditional data‑collection bottleneck.

Authors

  • Raktim Gautam Goswami
  • Amir Bar
  • David Fan
  • Tsung-Yen Yang
  • Gaoyue Zhou
  • Prashanth Krishnamurthy
  • Michael Rabbat
  • Farshad Khorrami
  • Yann LeCun

Paper Information

  • arXiv ID: 2512.13644v1
  • Categories: cs.RO, cs.AI, cs.CV
  • Published: December 15, 2025