[Paper] DeVI: Physics-based Dexterous Human-Object Interaction via Synthetic Video Imitation
Source: arXiv - 2604.20841v1
Overview
The paper introduces DeVI (Dexterous Video Imitation), a framework that turns text‑conditioned synthetic videos of human‑object interactions into physically plausible control policies for dexterous robotic hands. By bridging the gap between 2‑D generative video cues and 3‑D physics simulation, DeVI enables zero‑shot imitation of complex manipulations that traditional motion‑capture pipelines struggle to record.
Key Contributions
- Video‑first imitation pipeline: Uses only synthetic videos (no 3‑D motion capture) as demonstration data for learning dexterous hand‑object control.
- Hybrid tracking reward: Combines 3‑D human pose tracking with robust 2‑D object tracking to compensate for the limited physical fidelity of generated videos.
- Zero‑shot generalization: Handles previously unseen objects and interaction types purely from text prompts, eliminating the need for task‑specific demonstration collection.
- Empirical superiority: Outperforms state‑of‑the‑art methods that rely on high‑quality 3‑D demonstrations, especially in fine‑grained hand‑object contact modeling.
- Scalable to multi‑object scenes & diverse actions: Demonstrates that a single video‑driven planner can orchestrate complex sequences involving several objects and varied manipulation verbs.
Methodology
- Synthetic video generation – A text‑to‑video diffusion model (e.g., Stable Video Diffusion) is prompted with a natural‑language description of the desired manipulation (e.g., “pick up a red mug”). The model outputs a short, realistic‑looking clip of a human hand interacting with the target object.
- 3‑D human pose extraction – An off‑the‑shelf pose estimator (e.g., VIBE) fits a parametric body model such as SMPL‑X to each video frame, recovering a coarse 3‑D skeleton. This provides a rough trajectory for the hand’s joints.
- 2‑D object tracking – A dedicated object tracker (e.g., SiamMask) follows the target object’s pixel mask throughout the clip, yielding a dense 2‑D trajectory that is less sensitive to depth errors.
- Hybrid tracking reward – During reinforcement learning in a physics simulator, the agent receives a reward that penalizes deviation from both the 3‑D joint trajectory and the 2‑D object mask trajectory. The 2‑D term acts as a corrective signal when the 3‑D pose is noisy.
- Policy learning – A model‑free RL algorithm (e.g., PPO) optimizes a dexterous hand policy to maximize the hybrid reward while respecting physics constraints (contact forces, joint limits). No explicit inverse kinematics or trajectory smoothing is required.
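The hybrid tracking reward described above can be sketched as a weighted sum of exponentiated tracking errors, a common shaping choice in motion‑imitation RL. The function below is a minimal illustration, not the paper's exact formulation; the weights, length scales, and the use of the object centroid (rather than a full mask) are assumptions.

```python
import numpy as np

def hybrid_tracking_reward(
    sim_joints_3d,       # (J, 3) simulated hand joint positions
    ref_joints_3d,       # (J, 3) joint positions extracted from the video
    sim_obj_center_2d,   # (2,) object centroid projected into the sim camera
    ref_obj_center_2d,   # (2,) object centroid from the 2-D video tracker
    w_3d=1.0, w_2d=0.5,  # hypothetical weights on the two tracking terms
    sigma_3d=0.1,        # length scale for 3-D joint error (meters)
    sigma_2d=20.0,       # length scale for 2-D centroid error (pixels)
):
    """Reward in [0, w_3d + w_2d]; each term decays as its error grows."""
    # Mean Euclidean deviation of the simulated joints from the reference.
    err_3d = np.linalg.norm(sim_joints_3d - ref_joints_3d, axis=-1).mean()
    # Pixel-space deviation of the object centroid; this term corrects the
    # policy when the 3-D pose estimate is noisy or depth-ambiguous.
    err_2d = np.linalg.norm(sim_obj_center_2d - ref_obj_center_2d)
    r_3d = np.exp(-((err_3d / sigma_3d) ** 2))
    r_2d = np.exp(-((err_2d / sigma_2d) ** 2))
    return w_3d * r_3d + w_2d * r_2d
```

Perfect tracking yields the maximum reward (here 1.5); either term degrades smoothly as its own error grows, so the 2‑D signal keeps pulling the object toward the tracked trajectory even when the 3‑D pose term is unreliable.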
The overall pipeline is fully automated: a user writes a textual command, the system generates a video, extracts tracking cues, and trains a control policy that can be deployed on a simulated or real robotic hand.
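The end‑to‑end flow can be summarized as four chained stages. The sketch below uses stand‑in stubs for each stage; all function and class names are illustrative, not the paper's API. In a real system the stubs would call a text‑to‑video model, a pose estimator, an object tracker, and an RL trainer.

```python
from dataclasses import dataclass

def generate_video(prompt: str) -> list:
    # Stub for a text-to-video diffusion model: returns placeholder frames.
    return [f"{prompt}-frame-{i}" for i in range(8)]

def extract_3d_pose(frames: list) -> list:
    # Stub for a 3-D pose estimator: one joint vector per frame.
    return [[0.0, 0.0, 0.0] for _ in frames]

def track_object_2d(frames: list) -> list:
    # Stub for a 2-D object tracker: one centroid per frame.
    return [[0.0, 0.0] for _ in frames]

@dataclass
class Policy:
    # Placeholder for a trained control policy; stores its reference cues.
    ref_joints: list
    ref_centroids: list

def train_policy(joints: list, centroids: list) -> Policy:
    # Stands in for model-free RL (e.g., PPO) against the hybrid reward.
    return Policy(joints, centroids)

def devi_pipeline(prompt: str) -> Policy:
    frames = generate_video(prompt)      # 1. synthetic demonstration video
    joints = extract_3d_pose(frames)     # 2. coarse 3-D hand trajectory
    centroids = track_object_2d(frames)  # 3. robust 2-D object trajectory
    return train_policy(joints, centroids)  # 4. physics-constrained RL
```

Because every stage consumes only the previous stage's output, the pipeline needs no human intervention between the text prompt and the trained policy.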
Results & Findings
| Metric | DeVI vs. 3‑D‑Demo Baselines | Observation |
|---|---|---|
| Success rate on unseen objects (e.g., novel mugs, tools) | +23 % absolute improvement | Video‑driven cues capture subtle hand‑object contact patterns that 3‑D demos miss. |
| Contact fidelity (average penetration depth) | ‑0.4 mm (lower) | Hybrid reward reduces interpenetration, leading to more realistic grasps. |
| Multi‑object task completion (pick‑place‑stack) | +18 % success | The 2‑D object tracker helps maintain consistency across object switches. |
| Training efficiency (wall‑clock hours) | Comparable to baselines | No extra data collection overhead; video generation is cheap and parallelizable. |
Qualitatively, policies learned with DeVI exhibit smooth finger articulation, proper wrist orientation, and adaptive grasp forces that mirror the motions seen in the synthetic videos, even when the target objects differ in shape or texture from the training set.
Practical Implications
- Rapid prototyping of manipulation skills – Engineers can specify a new task with a single sentence and obtain a ready‑to‑run policy without labor‑intensive motion‑capture sessions.
- Scalable dataset creation – Synthetic video generators can produce virtually unlimited diverse HOI (human‑object interaction) clips, feeding continuous improvement loops for robotic dexterity.
- Cross‑domain transfer – Because the policy is learned in simulation with physics constraints, the resulting controller can be fine‑tuned on real hardware with minimal domain randomization, accelerating deployment on commercial robotic hands (e.g., Shadow Dexterous Hand, Allegro).
- Enhanced human‑robot collaboration – Systems that need to anticipate or mirror human actions (e.g., collaborative assembly, tele‑operation assistance) can leverage the same video‑based pipeline to infer plausible hand trajectories from visual cues.
- Cost reduction – Eliminates the need for expensive mocap rigs, high‑speed cameras, and manual annotation pipelines, making advanced dexterous manipulation accessible to startups and research labs with limited budgets.
Limitations & Future Work
- Physical realism of generated videos – Current diffusion models do not guarantee accurate depth or contact physics, which can still introduce bias in the hybrid reward.
- Sim‑to‑real gap – While the authors report promising simulation results, transferring the learned policies to real hardware may require additional calibration and safety checks.
- Object diversity bound by training data – The video generator’s object catalog is limited to what it has seen during pre‑training; truly novel categories may produce unrealistic clips.
- Computational cost of RL training – Although data collection is cheap, policy optimization still demands substantial GPU/CPU resources for each new task.
Future directions include integrating physics‑aware video generation (e.g., conditioning on simulated dynamics), leveraging few‑shot real‑world fine‑tuning, and extending the framework to whole‑body manipulation scenarios (e.g., using both hands or incorporating torso motion).
Authors
- Hyeonwoo Kim
- Jeonghwan Kim
- Kyungwon Cho
- Hanbyul Joo
Paper Information
- arXiv ID: 2604.20841v1
- Categories: cs.CV
- Published: April 22, 2026