[Paper] Dexterous Manipulation Policies from RGB Human Videos via 4D Hand-Object Trajectory Reconstruction

Published: February 9, 2026 at 01:56 PM EST
4 min read
Source: arXiv - 2602.09013v1

Overview

The paper introduces VIDEOMANIP, a novel framework that teaches dexterous multi‑finger robot hands to manipulate objects using only ordinary RGB videos of humans performing the task. By reconstructing 4‑D (3‑D space + time) hand‑object trajectories from monocular footage, the authors bypass the need for expensive motion‑capture suits or specialized sensors, opening the door to scalable, vision‑only robot learning.

Key Contributions

  • Device‑free data collection – learns manipulation policies directly from off‑the‑shelf RGB videos, eliminating wearables and custom rigs.
  • 4‑D hand‑object trajectory reconstruction – combines state‑of‑the‑art human hand pose estimation and object mesh recovery to produce temporally coherent robot‑ready trajectories.
  • Contact‑aware retargeting – optimizes the reconstructed motions for robot hands by enforcing realistic hand‑object contact and interaction‑centric grasp modeling.
  • Demonstration synthesis – generates a diverse set of training trajectories from a single video, dramatically expanding the data without extra human effort.
  • Real‑world validation – achieves >60 % success on seven real‑world tasks with the LEAP Hand, outperforming prior retargeting pipelines by ~16 %.

Methodology

  1. Video Ingestion – A standard monocular video of a person manipulating an object is fed into a vision pipeline.
  2. Human Hand Pose & Object Mesh Recovery – Off‑the‑shelf CV models (e.g., MANO‑based hand pose estimators, neural implicit object reconstruction) predict a 3‑D hand skeleton and a dense mesh of the object for every frame.
  3. 4‑D Trajectory Assembly – The per‑frame estimates are temporally smoothed to produce a continuous hand‑object trajectory (position, orientation, and shape over time).
  4. Retargeting to Robot Hand – The human hand trajectory is mapped onto the robot’s kinematic chain. A contact‑optimization step adjusts joint angles so that the robot fingertips make plausible contact with the object, respecting the robot’s geometry and torque limits.
  5. Demonstration Synthesis – Small perturbations (e.g., varying grasp offsets, object pose jitter) are applied to the retargeted trajectory, yielding many synthetic demonstrations from the original video.
  6. Policy Learning – The synthesized dataset trains a reinforcement‑learning or imitation‑learning policy (e.g., PPO with a vision‑based encoder) that outputs joint commands for the robot hand.
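Step 3's temporal smoothing can be sketched as a simple moving-average filter over the per-frame keypoint estimates. This is a minimal stand-in for whatever smoother the authors actually use; the array shapes (21 MANO-style joints) and window size are illustrative assumptions:

```python
import numpy as np

def smooth_trajectory(poses: np.ndarray, window: int = 5) -> np.ndarray:
    """Moving-average smoothing of per-frame 3-D keypoint estimates.

    poses: (T, K, 3) array of T frames and K keypoints (e.g., 21 MANO joints).
    Returns an array of the same shape with reduced frame-to-frame jitter.
    """
    T = poses.shape[0]
    half = window // 2
    out = np.empty_like(poses)
    for t in range(T):
        # Average over a clamped window around frame t.
        lo, hi = max(0, t - half), min(T, t + half + 1)
        out[t] = poses[lo:hi].mean(axis=0)
    return out
```

The same idea extends to orientations (e.g., averaging in quaternion or rotation-vector space) to produce the temporally coherent trajectories the pipeline needs.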

The whole pipeline runs without any physical robot data; the only “real” input is the RGB video.
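The contact-optimization idea in step 4 can be illustrated with a toy planar two-link finger: numerically descend the joint angles so the fingertip reaches a desired contact point while staying inside joint limits. The link lengths, limits, and solver here are hypothetical stand-ins, not the paper's actual retargeting formulation:

```python
import numpy as np

def fingertip(theta, links=(0.04, 0.03)):
    """Forward kinematics of a planar two-link finger (hypothetical geometry)."""
    x = links[0] * np.cos(theta[0]) + links[1] * np.cos(theta[0] + theta[1])
    y = links[0] * np.sin(theta[0]) + links[1] * np.sin(theta[0] + theta[1])
    return np.array([x, y])

def optimize_contact(target, theta0, limits=(0.0, 2.0), steps=500, lr=50.0):
    """Descend joint angles so the fingertip reaches the contact point,
    clamping to joint limits as a stand-in for the robot's constraints."""
    theta = np.array(theta0, dtype=float)
    eps = 1e-5
    for _ in range(steps):
        # Finite-difference gradient of the squared fingertip-to-target distance.
        base = np.sum((fingertip(theta) - target) ** 2)
        grad = np.zeros(2)
        for i in range(2):
            tp = theta.copy()
            tp[i] += eps
            grad[i] = (np.sum((fingertip(tp) - target) ** 2) - base) / eps
        theta = np.clip(theta - lr * grad, limits[0], limits[1])
    return theta
```

The real system optimizes over a full multi-finger kinematic chain with contact and torque constraints, but the structure is the same: a per-frame objective pulling fingertips onto the object, projected back into the feasible joint space.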

Results & Findings

| Setting | Result | Notes |
|---|---|---|
| Simulation (Inspire Hand) | 70.25 % success across 20 objects | Demonstrates that the reconstructed trajectories are sufficient for learning robust grasps in a physics engine. |
| Real‑world (LEAP Hand) | 62.86 % average success over 7 tasks | Outperforms a baseline that directly retargets human motion by 15.87 %. |
| Ablation | Removing contact optimization drops success by ~12 % | Highlights the importance of interaction‑centric grasp modeling. |
| Data efficiency | One video → ~50 synthetic demos, comparable to 10× real robot demos | Shows the power of the synthesis step. |

These numbers indicate that a single, casually recorded video can bootstrap a functional dexterous manipulation policy.
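The data-efficiency result reflects the demonstration-synthesis step: one retargeted trajectory is perturbed into many. A minimal sketch, assuming constant per-demo grasp offsets and object-pose translation jitter (the parameter values are illustrative, not from the paper):

```python
import numpy as np

def synthesize_demos(traj, n_demos=50, grasp_std=0.005, pose_std=0.01, seed=0):
    """Expand one retargeted trajectory into many perturbed demonstrations.

    traj: (T, 3) array of positions over T frames (illustrative shape).
    Each synthetic demo adds one constant grasp offset and one constant
    object-pose translation, so the motion itself stays smooth.
    """
    rng = np.random.default_rng(seed)
    demos = []
    for _ in range(n_demos):
        grasp_offset = rng.normal(0.0, grasp_std, size=3)
        pose_jitter = rng.normal(0.0, pose_std, size=3)
        demos.append(traj + grasp_offset + pose_jitter)
    return demos
```

Because the perturbations are rigid shifts rather than per-frame noise, each synthetic demo remains a physically plausible variant of the original motion.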

Practical Implications

  • Scalable Dataset Creation – Companies can harvest existing YouTube or internal footage to build large, diverse manipulation datasets without costly data‑collection rigs.
  • Rapid Prototyping – Engineers can test new hand designs or control algorithms by simply recording a few demonstration videos, cutting weeks off the development cycle.
  • Cross‑Domain Transfer – Because the pipeline works on generic RGB footage, policies can be trained on objects that are hard to instrument (e.g., fragile items) and then transferred to real robots.
  • Lower Barrier for Startups – Small robotics teams without access to motion‑capture labs can still train high‑DOF hands, democratizing dexterous manipulation research.

Limitations & Future Work

  • Vision Accuracy Dependency – The quality of hand pose and object mesh estimates directly limits policy performance; occlusions or fast motions still cause errors.
  • Simulation‑to‑Real Gap – While real‑world tests are promising, the approach relies on accurate physics simulation for policy pre‑training; further domain‑randomization may be needed for broader robustness.
  • Hand Model Generalization – The current retargeting assumes a specific robot hand geometry (e.g., LEAP, Inspire). Extending to heterogeneous hand designs will require more flexible kinematic mapping.
  • Complex Tasks – The evaluated tasks involve relatively simple pick‑and‑place or re‑orientation. Future work could explore in‑hand manipulation, tool use, or multi‑object interactions.

If you’re curious to see the system in action, check out the project videos at videomanip.github.io. The code and datasets are slated for open‑source release, which could make vision‑only dexterous learning a new standard in robotics.

Authors

  • Hongyi Chen
  • Tony Dong
  • Tiancheng Wu
  • Liquan Wang
  • Yash Jangir
  • Yaru Niu
  • Yufei Ye
  • Homanga Bharadhwaj
  • Zackory Erickson
  • Jeffrey Ichnowski

Paper Information

  • arXiv ID: 2602.09013v1
  • Categories: cs.RO, cs.CV
  • Published: February 9, 2026