[Paper] BridgeV2W: Bridging Video Generation Models to Embodied World Models via Embodiment Masks

Published: February 3, 2026 at 12:56 PM EST
4 min read
Source: arXiv:2602.03793v1

Overview

The paper introduces BridgeV2W, a framework that tightly couples pretrained video‑generation models with embodied world models used in robotics. By translating robot actions into pixel‑aligned “embodiment masks” and feeding them into a video generator via a ControlNet‑style adapter, the authors close the long‑standing gap between coordinate‑space control signals and pixel‑space video predictions, achieving more reliable, view‑robust simulations of robot behavior.

Key Contributions

  • Embodiment masks – Render robot joint motions as pixel‑level masks using the robot’s URDF and camera intrinsics, creating a direct visual bridge between action commands and video generation.
  • ControlNet‑style conditioning – Integrate the masks into a frozen pretrained video diffusion model, aligning action control signals with generated frames without retraining the entire generator.
  • View‑specific conditioning – Automatically handle arbitrary camera viewpoints, removing the need for viewpoint‑specific models.
  • Unified architecture – One model works across different robot morphologies (single‑arm, dual‑arm) and datasets, simplifying deployment pipelines.
  • Flow‑based motion loss – An auxiliary loss that emphasizes dynamic, task‑relevant regions and prevents over‑fitting to static backgrounds.
  • Real‑world downstream validation – Demonstrate improvements in policy evaluation and goal‑conditioned planning tasks on physical robots.
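
Conceptually, rendering an embodiment mask amounts to projecting points on the robot's body through the camera model onto the pixel grid. The sketch below is a deliberately simplified, point-based stand-in (the paper rasterizes the full URDF kinematic model); all function names and the toy camera values are invented for illustration:

```python
import numpy as np

def render_embodiment_mask(link_points, K, T_cam, hw=(64, 64), radius=2):
    """Rasterize 3D points on the robot into a binary pixel mask.

    link_points: (N, 3) points on the robot body, in the world frame.
    K:           (3, 3) camera intrinsics.
    T_cam:       (4, 4) world-to-camera extrinsics.
    """
    H, W = hw
    pts_h = np.hstack([link_points, np.ones((len(link_points), 1))])
    cam = (T_cam @ pts_h.T).T[:, :3]        # points in the camera frame
    cam = cam[cam[:, 2] > 1e-6]             # keep points in front of the camera
    uv = (K @ cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]             # perspective divide
    mask = np.zeros((H, W), dtype=np.uint8)
    for u, v in uv:
        u, v = int(round(u)), int(round(v))
        if 0 <= u < W and 0 <= v < H:       # splat a small square per point
            mask[max(0, v - radius):v + radius + 1,
                 max(0, u - radius):u + radius + 1] = 1
    return mask

# Toy example: two points on a "link" one metre in front of the camera.
K = np.array([[50.0, 0, 32], [0, 50.0, 32], [0, 0, 1]])
T = np.eye(4)                               # camera at the world origin
points = np.array([[0.0, 0.0, 1.0], [0.1, 0.0, 1.0]])
mask = render_embodiment_mask(points, K, T)
print(mask.sum())                           # → 50 (two 5x5 splats)
```

A production renderer would rasterize the link meshes at the commanded future pose rather than splatting sample points, but the geometry (forward kinematics, then pinhole projection) is the same.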

Methodology

  1. Action → Mask Rendering

    • The robot’s kinematic description (URDF) and the current camera parameters are used to rasterize the robot’s pose into a binary mask image (the embodiment mask).
    • Each mask aligns perfectly with the pixel grid of the video frames, encoding where the robot will appear in the next timestep.
  2. Mask Injection via ControlNet Adapter

    • A lightweight ControlNet‑style network takes the mask as an additional conditioning input and injects it into intermediate layers of a pretrained video diffusion model (e.g., Stable Video Diffusion).
    • The base diffusion model remains frozen, preserving its rich visual priors while the adapter learns to map masks to appropriate motion cues.
  3. View‑Specific Conditioning

    • Camera extrinsics are concatenated to the mask embedding, allowing the same model to generate correct videos from any viewpoint without retraining.
  4. Flow‑Based Motion Loss

    • Optical flow between consecutive generated frames is computed.
    • The loss penalizes discrepancies only in regions with significant motion (i.e., where the robot or objects move), encouraging the model to focus on dynamic content.
  5. Training & Inference

    • The adapter is trained on paired action–mask–video data from large robot manipulation datasets (e.g., DROID, AgiBot‑G1).
    • At inference time, given a new action sequence and camera pose, the system renders masks on‑the‑fly and produces a predicted video rollout.
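
The mask-injection step (2) can be illustrated with the zero-initialized residual trick used by ControlNet-style adapters: the adapter's output projection starts at zero, so the frozen base model's features pass through unchanged at initialization, and training only gradually adds mask-conditioned cues. This NumPy toy is an assumed stand-in for the real diffusion architecture, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

class ZeroInitAdapter:
    """ControlNet-style residual adapter around a frozen feature block."""

    def __init__(self, feat_dim, cond_dim):
        self.W_in = rng.normal(0, 0.02, (cond_dim, feat_dim))  # trainable
        self.W_out = np.zeros((feat_dim, feat_dim))            # zero-initialized

    def __call__(self, features, mask_embedding):
        hidden = np.tanh(mask_embedding @ self.W_in)
        return features + hidden @ self.W_out  # residual injection

features = rng.normal(size=(1, 8))   # frozen base-model activations
cond = rng.normal(size=(1, 4))       # flattened embodiment-mask embedding
adapter = ZeroInitAdapter(feat_dim=8, cond_dim=4)
out = adapter(features, cond)
print(np.allclose(out, features))    # True: exact identity at initialization
```

Because the residual branch is exactly zero at the start, the pretrained generator's visual priors are preserved while the adapter learns the mask-to-motion mapping.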

Results & Findings

| Dataset | FVD ↓ | BridgeV2W vs. prior SOTA |
| --- | --- | --- |
| DROID (single‑arm) | 45.2 | ‑12.3 (significant improvement) |
| AgiBot‑G1 (dual‑arm) | 38.7 | ‑9.8 |
| Unseen viewpoint test | 52.1 | ‑15.4 |
  • Higher visual fidelity – Generated videos preserve robot geometry and maintain motion consistency across novel camera angles.
  • Robustness to background changes – The flow loss reduces background artifacts, allowing the model to focus on task‑relevant dynamics.
  • Downstream impact – When used for policy evaluation, the predicted videos lead to a 7–10 % reduction in policy‑ranking error. For goal‑conditioned planning, success rates improve by ≈ 8 % compared to baselines without mask conditioning.
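
The flow-based motion loss behind these robustness gains can be sketched as a motion-weighted reconstruction error. The summary does not give the exact formulation, so the thresholding and weighting below are assumptions for illustration:

```python
import numpy as np

def flow_weighted_loss(pred, target, flow_mag, thresh=0.5):
    """MSE weighted toward pixels with significant motion.

    pred, target: (T, H, W) frame sequences.
    flow_mag:     (T, H, W) precomputed optical-flow magnitude per pixel.
    Only regions whose flow exceeds `thresh` contribute, pushing the model
    to match dynamic content rather than static background.
    """
    weight = (flow_mag > thresh).astype(float)
    err = weight * (pred - target) ** 2
    return err.sum() / max(weight.sum(), 1.0)

# Toy example: an error confined to a static region is ignored by the loss.
target = np.zeros((1, 4, 4))
pred = target.copy()
pred[0, 0, 0] = 1.0                  # reconstruction error in a static pixel
flow = np.zeros((1, 4, 4))
flow[0, 2:, 2:] = 1.0                # motion only in the lower-right block
print(flow_weighted_loss(pred, target, flow))  # → 0.0: static errors ignored
```

A plain per-pixel MSE would penalize the static-pixel error equally, which is exactly the background over-fitting the auxiliary loss is designed to avoid.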

Practical Implications

  • Plug‑and‑play world models – Developers can attach BridgeV2W to any existing video diffusion model, instantly gaining a robot‑aware predictive simulator without massive retraining.
  • Cross‑robot reuse – The same trained adapter works for different robot configurations, cutting down engineering effort when scaling to new hardware.
  • Simulation‑to‑real transfer – Accurate video rollouts enable better offline policy evaluation, reducing the number of costly real‑world trials.
  • Vision‑based planning – Goal‑conditioned planners can query the model for “what‑if” visualizations, improving interpretability for operators and facilitating human‑in‑the‑loop debugging.
  • Rapid prototyping – Teams can experiment with new camera placements or robot mounts and immediately see realistic video predictions, accelerating perception‑control integration cycles.

Limitations & Future Work

  • Dependence on accurate URDF & camera calibration: Errors in the robot model or pose estimation propagate to the masks, degrading video quality.
  • Static background bias: Although the flow loss mitigates it, highly cluttered or dynamic environments (e.g., moving humans) still challenge the model.
  • Scalability to high‑resolution, real‑time generation: Current diffusion inference remains computationally heavy; future work may explore distillation or latent‑space acceleration.
  • Extension beyond visual prediction: Incorporating tactile or force feedback into the mask conditioning could broaden applicability to contact‑rich tasks.

BridgeV2W opens a practical pathway for developers to harness the expressive power of large video generation models as faithful, view‑aware world simulators for robotics, bridging the gap between abstract action commands and concrete visual outcomes.

Authors

  • Yixiang Chen
  • Peiyan Li
  • Jiabing Yang
  • Keji He
  • Xiangnan Wu
  • Yuan Xu
  • Kai Wang
  • Jing Liu
  • Nianfeng Liu
  • Yan Huang
  • Liang Wang

Paper Information

| Item | Details |
| --- | --- |
| arXiv ID | 2602.03793v1 |
| Categories | cs.RO, cs.CV |
| Published | February 3, 2026 |