[Paper] Do as I Do: Dexterous Manipulation Data from Everyday Human Videos

Published: June 17, 2026 at 01:57 PM EDT

Source: arXiv - 2606.19333v1

Overview

How can we scalably generate data for robotic manipulation, especially on human-like platforms such as dexterous multi-fingered hands? Learning from human videos has recently emerged as a likely answer to this question. However, difficulties in estimating hand-object interaction and crossing the human-to-robot embodiment gap have hindered the adoption of abundant monocular RGB-only human videos as the primary source of robot manipulation data. In this work, we present DO AS I DO, an algorithm to reconstruct and retarget monocular RGB human videos to multi-fingered dexterous robotic hands. DO AS I DO reconstructs hand-object interactions from various egocentric and exocentric in-the-wild video sources. The algorithm then retargets these hand-object interaction estimates into a sequence of actions executable in the real world, yielding robot-complete manipulation data from disparate human videos. Overall, DO AS I DO outperforms the previous state of the art in estimating hand-object interactions and extracting dexterous manipulation trajectories from RGB videos, as we show in experiments on datasets with ground truth and on a dataset of video clips collected online. Our experiments enable us to propose an efficacy playbook for practitioners collecting human data for manipulation.

Key Contributions

The paper is listed under the following arXiv subject areas:

cs.RO (Robotics)
cs.CV (Computer Vision and Pattern Recognition)

Methodology

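At a high level, the abstract describes two stages: DO AS I DO first reconstructs hand-object interactions from monocular RGB video (egocentric or exocentric), then retargets those estimates into a sequence of actions executable on a multi-fingered dexterous robotic hand. Please refer to the full paper for the detailed methodology.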
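
To make the idea of retargeting concrete, here is a minimal generic sketch of keypoint-based retargeting for a single planar two-link finger. It is not the paper's method, which this summary does not detail. Every name and number here (`scale_fingertip`, `two_link_ik`, `retarget_trajectory`, the link lengths) is a hypothetical chosen for illustration. The sketch assumes that fingertip and wrist positions have already been estimated from video, rescales the wrist-relative fingertip offset to the robot's finger length, and solves analytic inverse kinematics for the joint angles.

```python
import math

def scale_fingertip(wrist, fingertip, human_len, robot_len):
    """Rescale a wrist-relative 2D fingertip offset from human to robot finger length."""
    s = robot_len / human_len
    return ((fingertip[0] - wrist[0]) * s, (fingertip[1] - wrist[1]) * s)

def two_link_ik(target, l1, l2):
    """Analytic IK for a planar two-link finger; returns (q1, q2) in radians.

    Targets outside the reachable annulus are clamped to the nearest
    reachable distance along the same direction.
    """
    x, y = target
    d = math.hypot(x, y)
    d = min(max(d, abs(l1 - l2)), l1 + l2)
    cos_q2 = (d * d - l1 * l1 - l2 * l2) / (2 * l1 * l2)
    q2 = math.acos(min(1.0, max(-1.0, cos_q2)))
    q1 = math.atan2(y, x) - math.atan2(l2 * math.sin(q2), l1 + l2 * math.cos(q2))
    return q1, q2

def forward_kinematics(q1, q2, l1, l2):
    """Fingertip position of the two-link finger for joint angles (q1, q2)."""
    x = l1 * math.cos(q1) + l2 * math.cos(q1 + q2)
    y = l1 * math.sin(q1) + l2 * math.sin(q1 + q2)
    return x, y

def retarget_trajectory(wrists, tips, human_len, robot_len, l1, l2):
    """Map per-frame human (wrist, fingertip) estimates to robot joint angles."""
    joints = []
    for wrist, tip in zip(wrists, tips):
        target = scale_fingertip(wrist, tip, human_len, robot_len)
        joints.append(two_link_ik(target, l1, l2))
    return joints

# Hypothetical per-frame estimates in metres (illustrative values only).
wrists = [(0.0, 0.0), (0.01, 0.0)]
tips = [(0.04, 0.03), (0.05, 0.02)]
angles = retarget_trajectory(wrists, tips, human_len=0.08, robot_len=0.10, l1=0.06, l2=0.04)
```

A real dexterous hand has many more degrees of freedom and must also account for object contact, so a practical system would typically replace the analytic solve with a constrained optimization over the full hand; the sketch only shows the shape of the human-to-robot mapping.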

Practical Implications

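Per the abstract, the work aims to make abundant monocular RGB human videos usable as a source of manipulation data for multi-fingered dexterous robotic hands, and it proposes an efficacy playbook for practitioners collecting human data for manipulation.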

Authors

Bhawna Paliwal
Haritheja Etukuru
William Liang
Pieter Abbeel
Nur Muhammad Mahi Shafiullah
Jitendra Malik

Paper Information

arXiv ID: 2606.19333v1
Categories: cs.RO, cs.CV
Published: June 17, 2026
