[Paper] Dexterous Manipulation Policies from RGB Human Videos via 4D Hand-Object Trajectory Reconstruction

Published: February 9, 2026 at 01:56 PM EST
4 min read
Source: arXiv - 2602.09013v1

Overview

The paper introduces VIDEOMANIP, a novel framework that teaches dexterous multi‑finger robot hands to manipulate objects using only ordinary RGB videos of humans performing the task. By reconstructing 4‑D (3‑D space + time) hand‑object trajectories from monocular footage, the authors bypass the need for expensive motion‑capture suits or specialized sensors, opening the door to scalable, vision‑only robot learning.

Key Contributions

  • Device‑free data collection – learns manipulation policies directly from off‑the‑shelf RGB videos, eliminating wearables and custom rigs.
  • 4‑D hand‑object trajectory reconstruction – combines state‑of‑the‑art human hand pose estimation and object mesh recovery to produce temporally coherent robot‑ready trajectories.
  • Contact‑aware retargeting – optimizes the reconstructed motions for robot hands by enforcing realistic hand‑object contact and interaction‑centric grasp modeling.
  • Demonstration synthesis – generates a diverse set of training trajectories from a single video, dramatically expanding the data without extra human effort.
  • Real‑world validation – achieves >60 % success on seven real‑world tasks with the LEAP Hand, outperforming prior retargeting pipelines by ~16 %.

Methodology

  1. Video Ingestion – A standard monocular video of a person manipulating an object is fed into a vision pipeline.
  2. Human Hand Pose & Object Mesh Recovery – Off‑the‑shelf CV models (e.g., MANO‑based hand pose estimators, neural implicit object reconstruction) predict a 3‑D hand skeleton and a dense mesh of the object for every frame.
  3. 4‑D Trajectory Assembly – The per‑frame estimates are temporally smoothed to produce a continuous hand‑object trajectory (position, orientation, and shape over time).
  4. Retargeting to Robot Hand – The human hand trajectory is mapped onto the robot’s kinematic chain. A contact‑optimization step adjusts joint angles so that the robot fingertips make plausible contact with the object, respecting the robot’s geometry and torque limits.
  5. Demonstration Synthesis – Small perturbations (e.g., varying grasp offsets, object pose jitter) are applied to the retargeted trajectory, yielding many synthetic demonstrations from the original video.
  6. Policy Learning – The synthesized dataset trains a reinforcement‑learning or imitation‑learning policy (e.g., PPO with a vision‑based encoder) that outputs joint commands for the robot hand.
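Step 3's temporal smoothing can be sketched as a simple moving-average filter over the per-frame keypoint estimates. This is a minimal stand-in for whatever smoother the authors actually use; the array shapes (21 MANO-style joints) and window size are illustrative assumptions:

```python
import numpy as np

def smooth_trajectory(poses: np.ndarray, window: int = 5) -> np.ndarray:
    """Moving-average smoothing of per-frame 3-D keypoint estimates.

    poses: (T, K, 3) array of T frames and K keypoints (e.g., 21 MANO joints).
    Returns an array of the same shape with reduced frame-to-frame jitter.
    """
    T = poses.shape[0]
    half = window // 2
    out = np.empty_like(poses)
    for t in range(T):
        # Average over a clamped window around frame t.
        lo, hi = max(0, t - half), min(T, t + half + 1)
        out[t] = poses[lo:hi].mean(axis=0)
    return out
```

The same idea extends to orientations (e.g., averaging in quaternion or rotation-vector space) to produce the temporally coherent trajectories the pipeline needs.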

The whole pipeline runs without any physical robot data; the only “real” input is the RGB video.
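The contact-optimization idea in step 4 can be illustrated with a toy planar two-link finger: numerically descend the joint angles so the fingertip reaches a desired contact point while staying inside joint limits. The link lengths, limits, and solver here are hypothetical stand-ins, not the paper's actual retargeting formulation:

```python
import numpy as np

def fingertip(theta, links=(0.04, 0.03)):
    """Forward kinematics of a planar two-link finger (hypothetical geometry)."""
    x = links[0] * np.cos(theta[0]) + links[1] * np.cos(theta[0] + theta[1])
    y = links[0] * np.sin(theta[0]) + links[1] * np.sin(theta[0] + theta[1])
    return np.array([x, y])

def optimize_contact(target, theta0, limits=(0.0, 2.0), steps=500, lr=50.0):
    """Descend joint angles so the fingertip reaches the contact point,
    clamping to joint limits as a stand-in for the robot's constraints."""
    theta = np.array(theta0, dtype=float)
    eps = 1e-5
    for _ in range(steps):
        # Finite-difference gradient of the squared fingertip-to-target distance.
        base = np.sum((fingertip(theta) - target) ** 2)
        grad = np.zeros(2)
        for i in range(2):
            tp = theta.copy()
            tp[i] += eps
            grad[i] = (np.sum((fingertip(tp) - target) ** 2) - base) / eps
        theta = np.clip(theta - lr * grad, limits[0], limits[1])
    return theta
```

The real system optimizes over a full multi-finger kinematic chain with contact and torque constraints, but the structure is the same: a per-frame objective pulling fingertips onto the object, projected back into the feasible joint space.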

Results & Findings

| Setting | Result | Notes |
|---|---|---|
| Simulation (Inspire Hand) | 70.25 % success across 20 objects | Demonstrates that the reconstructed trajectories are sufficient for learning robust grasps in a physics engine. |
| Real‑world (LEAP Hand) | 62.86 % average success over 7 tasks | Outperforms a baseline that directly retargets human motion by 15.87 %. |
| Ablation | Removing contact optimization drops success by ~12 % | Highlights the importance of interaction‑centric grasp modeling. |
| Data efficiency | One video → ~50 synthetic demos, comparable to 10× real robot demos | Shows the power of the synthesis step. |

These numbers indicate that a single, casually recorded video can bootstrap a functional dexterous manipulation policy.
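The data-efficiency result reflects the demonstration-synthesis step: one retargeted trajectory is perturbed into many. A minimal sketch, assuming constant per-demo grasp offsets and object-pose translation jitter (the parameter values are illustrative, not from the paper):

```python
import numpy as np

def synthesize_demos(traj, n_demos=50, grasp_std=0.005, pose_std=0.01, seed=0):
    """Expand one retargeted trajectory into many perturbed demonstrations.

    traj: (T, 3) array of positions over T frames (illustrative shape).
    Each synthetic demo adds one constant grasp offset and one constant
    object-pose translation, so the motion itself stays smooth.
    """
    rng = np.random.default_rng(seed)
    demos = []
    for _ in range(n_demos):
        grasp_offset = rng.normal(0.0, grasp_std, size=3)
        pose_jitter = rng.normal(0.0, pose_std, size=3)
        demos.append(traj + grasp_offset + pose_jitter)
    return demos
```

Because the perturbations are rigid shifts rather than per-frame noise, each synthetic demo remains a physically plausible variant of the original motion.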

Practical Implications

  • Scalable Dataset Creation – Companies can harvest existing YouTube or internal footage to build large, diverse manipulation datasets without costly data‑collection rigs.
  • Rapid Prototyping – Engineers can test new hand designs or control algorithms by simply recording a few demonstration videos, cutting weeks off the development cycle.
  • Cross‑Domain Transfer – Because the pipeline works on generic RGB footage, policies can be trained on objects that are hard to instrument (e.g., fragile items) and then transferred to real robots.
  • Lower Barrier for Startups – Small robotics teams without access to motion‑capture labs can still train high‑DOF hands, democratizing dexterous manipulation research.

Limitations & Future Work

  • Vision Accuracy Dependency – The quality of hand pose and object mesh estimates directly limits policy performance; occlusions or fast motions still cause errors.
  • Simulation‑to‑Real Gap – While real‑world tests are promising, the approach relies on accurate physics simulation for policy pre‑training; further domain‑randomization may be needed for broader robustness.
  • Hand Model Generalization – The current retargeting assumes a specific robot hand geometry (e.g., LEAP, Inspire). Extending to heterogeneous hand designs will require more flexible kinematic mapping.
  • Complex Tasks – The evaluated tasks involve relatively simple pick‑and‑place or re‑orientation. Future work could explore in‑hand manipulation, tool use, or multi‑object interactions.

If you’re curious to see the system in action, check out the project videos at videomanip.github.io. The code and datasets are slated for open‑source release, which could make vision‑only dexterous learning a new standard in robotics.

Authors

  • Hongyi Chen
  • Tony Dong
  • Tiancheng Wu
  • Liquan Wang
  • Yash Jangir
  • Yaru Niu
  • Yufei Ye
  • Homanga Bharadhwaj
  • Zackory Erickson
  • Jeffrey Ichnowski

Paper Information

  • arXiv ID: 2602.09013v1
  • Categories: cs.RO, cs.CV
  • Published: February 9, 2026