[Paper] Imitating What Works: Simulation-Filtered Modular Policy Learning from Human Videos

Published: February 13, 2026
Source: arXiv (2602.13197v1)

Overview

The paper introduces Perceive‑Simulate‑Imitate (PSI), a new pipeline that lets robots learn complex pick‑and‑place skills by watching ordinary human‑hand videos—without any robot‑collected data. By pairing human motion trajectories with simulated grasp‑feasibility checks, PSI trains a modular policy that first picks a task‑compatible grasp and then imitates the observed post‑grasp motions, dramatically boosting real‑world success rates.

Key Contributions

  • Simulation‑filtered grasp labeling: Uses a physics simulator to annotate human‑derived trajectories with binary “grasp‑suitable” flags, turning raw video data into supervised learning signals for task‑aware grasping.
  • Modular policy architecture: Separates grasp generation (a learned grasp selector) from trajectory imitation (a motion‑imitator), enabling each component to be optimized independently.
  • Zero‑robot‑data training: Demonstrates that the entire system can be trained solely on publicly available human videos plus simulated grasps, eliminating costly robot data collection.
  • Real‑world validation: Shows on a physical robot that PSI achieves higher success rates on diverse prehensile tasks (e.g., object reorientation, tool use) compared with naïve grasp generators.
  • Scalable data pipeline: Leverages existing video datasets (e.g., EPIC‑Kitchens, YouTube) as a virtually unlimited source of manipulation demonstrations.
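The first bullet's labeling step can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: `simulate_grasp` stands in for the physics simulator, and the toy geometric check (object width vs. gripper opening) together with the `GRIPPER_MAX_OPENING_M` constant are invented for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class HandTrajectory:
    """A human-derived demonstration: contact pose plus post-grasp waypoints."""
    grasp_pose: Tuple[float, float, float]       # simplified 3-D hand pose at contact
    waypoints: List[Tuple[float, float, float]]  # post-grasp motion
    object_width_m: float                        # width of the grasped object

# Hypothetical parallel-jaw gripper opening limit (illustrative value only).
GRIPPER_MAX_OPENING_M = 0.085

def simulate_grasp(traj: HandTrajectory) -> int:
    """Stand-in for the physics simulator's grasp check.

    Returns 1 ("stable & task-compatible") or 0 ("fails"). The real
    pipeline rolls out a simulated gripper; here a toy geometric test
    (does the object fit between the jaws?) serves as a placeholder."""
    return int(traj.object_width_m <= GRIPPER_MAX_OPENING_M)

def build_labeled_dataset(trajs: List[HandTrajectory]):
    """Attach a binary suitability label to each raw trajectory,
    producing the supervised signal used to train the grasp selector."""
    return [(t, simulate_grasp(t)) for t in trajs]
```

Under this toy check, a trajectory grasping a 5 cm object is labeled 1 and one grasping a 12 cm object is labeled 0.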

Methodology

  1. Perceive: Extract 3‑D hand trajectories from human videos using off‑the‑shelf pose estimation and depth reconstruction tools.
  2. Simulate: For each trajectory, run a fast physics simulation where a robot gripper attempts to grasp the target object at the recorded hand pose. The simulator returns a grasp suitability label (1 = stable & task‑compatible, 0 = fails).
  3. Imitate:
    • Grasp Selector: A lightweight neural network learns to predict the suitability label from object geometry and scene context, effectively becoming a task‑aware grasp generator.
    • Trajectory Imitator: A separate network (e.g., a conditional diffusion model) learns to reproduce the post‑grasp motion conditioned on the selected grasp pose.
  4. Execution: At runtime, the robot first queries the grasp selector for a feasible grasp, then feeds that pose to the trajectory imitator, which outputs a joint‑space trajectory that mimics the human motion.
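The runtime flow in step 4 can be sketched as a small modular pipeline. The `suitability` scorer and `imitator` below stand in for the learned grasp selector and trajectory imitator; their names, signatures, and the 0.5 feasibility threshold are assumptions made for illustration, not the paper's API.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Pose = Tuple[float, float, float]  # simplified end-effector pose
Trajectory = List[Pose]

def select_grasp(candidates: Sequence[Pose],
                 suitability: Callable[[Pose], float],
                 threshold: float = 0.5) -> Optional[Pose]:
    """Grasp selector: score every candidate grasp and return the
    highest-scoring one, or None if nothing clears the threshold."""
    if not candidates:
        return None
    best_score, best_pose = max((suitability(p), p) for p in candidates)
    return best_pose if best_score >= threshold else None

def execute(candidates: Sequence[Pose],
            suitability: Callable[[Pose], float],
            imitator: Callable[[Pose], Trajectory]) -> Optional[Trajectory]:
    """Runtime flow: query the grasp selector, then condition the
    trajectory imitator on the chosen grasp pose."""
    grasp = select_grasp(candidates, suitability)
    if grasp is None:
        return None  # no feasible grasp; skip execution rather than fail
    return imitator(grasp)
```

Because the two components communicate only through a grasp pose, either one can be swapped out independently, which is the modularity the architecture relies on.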

All components are trained with standard supervised losses (cross‑entropy for grasp suitability, L2 for trajectory regression), requiring no reinforcement learning or on‑policy rollouts.
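The two supervised losses can be written out directly; this is a minimal NumPy sketch rather than a deep-learning-framework training loop, and the loss weight `w` is an assumed hyperparameter not specified in the summary.

```python
import numpy as np

def grasp_bce(logits: np.ndarray, labels: np.ndarray) -> float:
    """Binary cross-entropy on grasp-suitability logits."""
    probs = 1.0 / (1.0 + np.exp(-logits))  # sigmoid
    eps = 1e-7                             # guard against log(0)
    return float(-np.mean(labels * np.log(probs + eps)
                          + (1.0 - labels) * np.log(1.0 - probs + eps)))

def trajectory_l2(pred: np.ndarray, target: np.ndarray) -> float:
    """Mean squared error between predicted and demonstrated waypoints."""
    return float(np.mean((pred - target) ** 2))

def total_loss(logits, labels, pred_traj, target_traj, w: float = 1.0) -> float:
    """Combined objective: grasp classification plus trajectory regression."""
    return grasp_bce(logits, labels) + w * trajectory_l2(pred_traj, target_traj)
```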

Results & Findings

  • Success rate boost: On a 6‑DOF manipulator, PSI achieved ≈85 % task completion across 5 benchmark tasks, versus ≈55 % when using a generic grasp generator followed by the same imitation module.
  • Data efficiency: Only ~2 k filtered trajectories were needed to reach peak performance, highlighting the value of the simulation filter.
  • Generalization: The learned grasp selector transferred to unseen objects (different shapes, textures) with only a 7 % drop in success, indicating that the model captures task‑relevant grasp features rather than memorizing specific instances.
  • Ablation: Removing the simulation filter (i.e., training the grasp selector on all raw trajectories) caused a 20 % drop in overall success, confirming that task‑oriented grasp labeling is crucial.

Practical Implications

  • Rapid skill onboarding: Companies can bootstrap new manipulation capabilities by simply feeding in publicly available videos of humans performing the desired task—no need to hand‑craft demonstrations on the robot itself.
  • Reduced data collection cost: Eliminates the expensive “robot‑in‑the‑loop” data gathering phase, freeing up engineering resources for higher‑level system integration.
  • Modular deployment: Because grasp selection and motion imitation are decoupled, developers can swap in a better grasp planner (e.g., analytic methods) or a more expressive imitator (e.g., transformer‑based policies) without retraining the whole stack.
  • Safety and reliability: The simulation filter acts as a sanity check, preventing the robot from attempting grasps that are physically impossible or unsafe, which is especially valuable in unstructured environments like warehouses or homes.
  • Scalable continuous learning: As new human videos become available (e.g., from user‑generated content), the pipeline can ingest them automatically, continuously expanding the robot’s skill repertoire.

Limitations & Future Work

  • Simulation fidelity: The grasp suitability labels rely on the accuracy of the physics simulator; mismatches (e.g., friction modeling) could lead to occasional false positives/negatives.
  • Hand‑to‑gripper transfer: The approach assumes a relatively simple mapping from human hand poses to the robot’s end‑effector; highly dexterous tasks may still suffer from kinematic gaps.
  • Limited to prehensile tasks: Non‑grasping manipulations (e.g., pushing, deformable object handling) are outside the current scope.
  • Future directions: The authors suggest integrating domain‑randomized simulation to improve robustness, extending the framework to multi‑object scenes, and exploring self‑supervised refinement on the robot after initial deployment.

Authors

  • Albert J. Zhai
  • Kuo-Hao Zeng
  • Jiasen Lu
  • Ali Farhadi
  • Shenlong Wang
  • Wei-Chiu Ma

Paper Information

  • arXiv ID: 2602.13197v1
  • Categories: cs.RO, cs.CV, cs.LG
  • Published: February 13, 2026