[Paper] RoboMirror: Understand Before You Imitate for Video to Humanoid Locomotion

Published: December 29, 2025 at 12:59 PM EST
4 min read
Source: arXiv - 2512.23649v1

Overview

RoboMirror is a pioneering framework that lets a humanoid robot learn to walk and act directly from raw video—whether it’s a first‑person (egocentric) clip or a third‑person recording—without the traditional intermediate steps of motion‑capture retargeting or text‑to‑action translation. By combining large‑scale vision‑language models (VLMs) with a diffusion‑based control policy, the system first extracts a visual motion intent from the video and then generates physically plausible, semantically aligned locomotion. This bridges a long‑standing gap between visual understanding and robot control, opening new avenues for telepresence, remote supervision, and intuitive robot programming.

Key Contributions

  • First “understand‑before‑imitate” video‑to‑humanoid pipeline that bypasses explicit pose reconstruction and retargeting.
  • Vision‑language intent distillation: uses pretrained VLMs to convert raw video streams into compact motion‑intent embeddings.
  • Diffusion‑based locomotion policy: conditioned on intent embeddings, it produces continuous, physically consistent joint commands for a full‑body humanoid.
  • Real‑time telepresence demo: egocentric video from a wearable camera drives a remote humanoid with ~80 % lower control latency compared to conventional third‑person pipelines.
  • Quantitative gains: a 3.7‑percentage‑point higher task‑success rate on benchmark navigation and obstacle‑avoidance scenarios versus state‑of‑the‑art baselines.
  • Open‑source implementation (code, pretrained models, and a ROS‑compatible driver) released for reproducibility and community extension.

Methodology

  1. Video Ingestion – The system accepts either egocentric or third‑person RGB video at 30 fps. No depth or skeleton data is required.
  2. Intent Extraction – A large‑scale vision‑language model (e.g., CLIP‑ViT or BLIP) processes short video clips (≈1 s) and outputs a motion‑intent vector that captures high‑level semantics such as “walk forward”, “turn left”, or “step over obstacle” (a sketch of this step follows the list).
  3. Diffusion Policy – A conditional diffusion model, trained on a large corpus of simulated humanoid trajectories, takes the intent vector and iteratively denoises a latent action sequence into joint torque commands that respect the robot’s dynamics and balance constraints (see the sampling sketch after the list).
  4. Control Loop – The generated torque commands are streamed to the robot’s low‑level controller at 100 Hz. A lightweight feedback filter corrects minor drift, but the core behavior is driven entirely by the video‑derived intent.
  5. Training Regime – The diffusion policy is trained offline using reinforcement‑learning‑style reward shaping (stability, foot‑placement accuracy, task completion) on a physics simulator (MuJoCo/IsaacGym); a schematic reward is sketched after the list. The intent encoder is frozen, leveraging the generalization power of pretrained VLMs.
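
For concreteness, here is a minimal sketch of Step 2 (intent extraction), assuming a CLIP‑ViT image encoder from Hugging Face transformers and simple mean‑pooling over a ~1 s window of frames; the paper’s actual encoder, pooling scheme, and embedding size may differ.

```python
# Hypothetical sketch of Step 2 (intent extraction): encode ~1 s of frames with a
# pretrained CLIP image encoder and mean-pool into a single intent embedding.
# The model choice and pooling are assumptions; the post only says "a pretrained VLM".
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def extract_intent(frames):
    """frames: list of ~30 RGB frames (PIL Images or HxWx3 arrays) covering ~1 s."""
    inputs = processor(images=frames, return_tensors="pt")
    feats = model.get_image_features(**inputs)        # (T, 512) per-frame embeddings
    feats = feats / feats.norm(dim=-1, keepdim=True)  # unit-normalise each frame
    return feats.mean(dim=0)                          # (512,) clip-level intent vector
```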
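
Steps 3–4 can be pictured as a standard DDPM‑style reverse‑diffusion loop that denoises a latent action sequence conditioned on the intent vector. The denoiser network, noise schedule, horizon, and action dimensionality below are placeholders, not the paper’s architecture.

```python
# Schematic of the conditional diffusion policy: DDPM-style ancestral sampling over a
# latent action sequence, conditioned on the intent vector. All dimensions and the
# noise schedule are illustrative assumptions.
import torch

def sample_actions(denoiser, intent, horizon=50, action_dim=23, steps=100):
    """denoiser(x_t, t, intent) -> predicted noise; returns a (horizon, action_dim) plan."""
    betas = torch.linspace(1e-4, 2e-2, steps)             # linear noise schedule (assumed)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(horizon, action_dim)                  # start from pure noise
    for t in reversed(range(steps)):
        eps = denoiser(x, torch.tensor([t]), intent)      # conditional noise prediction
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise           # ancestral sampling step
    return x                                              # denoised joint-command sequence
```

In a real control loop, typically only the first few commands of such a plan would be executed before re‑planning from fresh video, a common receding‑horizon pattern for diffusion policies.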
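
Step 5’s reward shaping can be sketched from the three stated criteria. The weights and term formulations here are illustrative guesses, not the paper’s reward function.

```python
# Schematic reward shaping for offline policy training (Step 5). Terms mirror the stated
# criteria (stability, foot placement, task completion); everything else is assumed.
import numpy as np

def shaped_reward(state, target_footstep, task_done,
                  w_stab=1.0, w_foot=0.5, w_task=5.0):
    # Stability: penalise torso tilt and angular velocity.
    stability = -np.abs(state["torso_tilt"]) - 0.1 * np.abs(state["torso_ang_vel"])
    # Foot placement: negative distance between planned and actual footstep.
    foot_err = -np.linalg.norm(state["foot_pos"] - target_footstep)
    # Task completion: sparse bonus when the commanded behaviour is achieved.
    task_bonus = 1.0 if task_done else 0.0
    return w_stab * stability + w_foot * foot_err + w_task * task_bonus
```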

Results & Findings

| Metric | RoboMirror | Baseline (text‑to‑motion) | Baseline (pose‑mimic) |
| --- | --- | --- | --- |
| Control latency (ms) | 120 | 600 | 540 |
| Task success rate (%) | 87.3 | 83.6 | 79.2 |
| Average energy consumption (J) | 1.12 | 1.28 | 1.31 |
| Qualitative realism (user study) | 4.6 / 5 | 3.9 / 5 | 3.5 / 5 |

  • Latency reduction: By eliminating pose extraction and retargeting, the end‑to‑end pipeline runs in ~120 ms, enabling near‑real‑time teleoperation.
  • Higher success: The intent‑driven policy better respects scene semantics (e.g., “step over” vs. “walk through”), leading to a 3.7‑percentage‑point gain in task completion.
  • Energy efficiency: More natural motions reduce unnecessary joint torques, saving power—critical for battery‑operated humanoids.
  • User perception: Participants rated RoboMirror’s motions as more “human‑like” and “intuitive to control” than the baselines.

Practical Implications

  • Telepresence & Remote Work: Workers equipped with a head‑mounted camera can control a humanoid in hazardous or inaccessible environments (e.g., nuclear plants, disaster zones) with minimal lag and no need for specialized motion‑capture rigs.
  • Rapid Prototyping of Behaviors: Developers can demonstrate a desired locomotion style via a short video clip, and the robot will replicate it, dramatically shortening the iteration cycle for service‑robot applications.
  • Cross‑Domain Transfer: Because the intent encoder is agnostic to the robot’s embodiment, the same pipeline can be reused for different morphologies (e.g., bipedal vs. quadrupedal) by swapping the diffusion policy’s dynamics model.
  • Reduced Engineering Overhead: No need to hand‑craft pose‑retargeting pipelines or maintain large text‑action vocabularies; the system leverages off‑the‑shelf VLMs that are continuously improving.
  • Integration with Existing Stacks: The released ROS node subscribes to /camera/image_raw and publishes joint commands on /humanoid_controller/command, making it drop‑in compatible with most research‑grade humanoid platforms (e.g., NASA Valkyrie, Boston Dynamics Atlas); a minimal node sketch follows this list.
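
To illustrate the drop‑in integration above, a minimal rospy node wired to the stated topics might look like the following. The command message type (Float64MultiArray) and the inference stub are assumptions; the released driver defines the actual interface.

```python
#!/usr/bin/env python3
# Minimal sketch of the described ROS integration. Topic names come from the post;
# the message type and the inference stub are placeholders, not the released driver's API.
import rospy
from cv_bridge import CvBridge
from sensor_msgs.msg import Image
from std_msgs.msg import Float64MultiArray

NUM_JOINTS = 23          # placeholder joint count; set to match the target humanoid
bridge = CvBridge()
frames = []              # rolling ~1 s buffer of RGB frames

def infer_joint_command(clip):
    # Stub: this is where the VLM intent encoder and diffusion policy
    # (see the Methodology sketches) would produce the next joint command.
    return [0.0] * NUM_JOINTS

def on_image(msg, pub):
    frames.append(bridge.imgmsg_to_cv2(msg, desired_encoding="rgb8"))
    if len(frames) >= 30:                        # ~1 s of video at 30 fps
        pub.publish(Float64MultiArray(data=infer_joint_command(frames)))
        frames.clear()

if __name__ == "__main__":
    rospy.init_node("robomirror_driver")
    pub = rospy.Publisher("/humanoid_controller/command", Float64MultiArray, queue_size=1)
    rospy.Subscriber("/camera/image_raw", Image, on_image, callback_args=pub)
    rospy.spin()
```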

Limitations & Future Work

  • Reliance on VLM Generalization: The intent extraction quality degrades when the video contains unusual viewpoints or heavy occlusions; fine‑tuning on domain‑specific data could help.
  • Simulation‑to‑Real Gap: While the diffusion policy is trained in simulation, transferring to hardware still requires careful calibration of dynamics parameters and safety checks.
  • Limited Temporal Horizon: Current intent vectors summarize ~1 s of video; longer‑term planning (e.g., navigating complex mazes) needs hierarchical intent modeling.
  • Scalability to Multi‑Agent Scenarios: Extending the framework to coordinate multiple robots from a single video feed remains an open challenge.

RoboMirror demonstrates that visual understanding can be the primary driver of humanoid locomotion, shifting the paradigm from “copy the pose” to “interpret the intent”. As VLMs continue to improve, we can expect even richer, more reliable robot behaviors driven directly by the videos we already capture every day.

Authors

  • Zhe Li
  • Cheng Chi
  • Yangyang Wei
  • Boan Zhu
  • Tao Huang
  • Zhenguo Sun
  • Yibo Peng
  • Pengwei Wang
  • Zhongyuan Wang
  • Fangzhou Liu
  • Chang Xu
  • Shanghang Zhang

Paper Information

  • arXiv ID: 2512.23649v1
  • Categories: cs.RO, cs.CV
  • Published: December 29, 2025