[Paper] AnchorDream: Repurposing Video Diffusion for Embodiment-Aware Robot Data Synthesis

Published: December 12, 2025 at 01:59 PM EST
4 min read
Source: arXiv - 2512.11797v1

Overview

The paper AnchorDream proposes a new way to generate massive, high‑quality robot demonstration data by re‑using off‑the‑shelf video diffusion models. By “anchoring” the diffusion process to a robot’s actual motion renderings, the method produces videos that respect the robot’s physical embodiment, enabling developers to train imitation‑learning policies without the usual bottleneck of costly real‑world data collection.

Key Contributions

  • Embodiment‑aware diffusion: Introduces a conditioning scheme that ties video diffusion to robot kinematics, preventing unrealistic poses or motions.
  • Data amplification from few demos: Turns a handful of human‑teleoperated trajectories into thousands of diverse, photorealistic robot‑environment videos.
  • Zero‑explicit environment modeling: Leverages a pretrained video diffusion backbone to synthesize realistic backgrounds, objects, and lighting without hand‑crafted simulators.
  • Empirical gains: Demonstrates up to a 36.4 % improvement on simulated benchmarks and a nearly doubled success rate on real‑world robot tasks when training policies on the generated data.
  • Open‑source pipeline: Provides a modular implementation that can be plugged into existing imitation‑learning stacks (e.g., DAgger, BC, Diffusion‑Policy).

Methodology

  1. Collect a seed dataset – A small set (≈ 10–50) of human‑teleoperated robot trajectories, each paired with a rendered video of the robot’s arm/end‑effector motion.
  2. Render motion anchors – For every time step, the robot’s joint angles are visualized as a simple 3‑D mesh overlay (the “anchor”). This anchor is kept unchanged throughout diffusion.
  3. Condition diffusion – A pretrained text‑to‑video diffusion model receives two inputs:
    • The motion anchor frames (serving as a spatial‑temporal mask).
    • Optional textual prompts describing desired scene variations (e.g., “kitchen counter”, “cluttered desk”).
      The diffusion process fills in the background, objects, and lighting while preserving the anchor’s geometry and motion (a minimal sketch of this step follows the list).
  4. Sample and filter – Thousands of videos are generated, then filtered using a lightweight classifier that checks for kinematic consistency (e.g., no self‑collisions).
  5. Policy training – The filtered synthetic dataset is combined with the original demonstrations to train standard imitation‑learning algorithms (behavior cloning, offline RL).
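
To make steps 2–3 concrete, here is a minimal sketch of an anchor‑conditioned generation loop. The `pipeline` callable, the `render_anchor_frames` helper, and the prompt handling are hypothetical placeholders for illustration, not the paper’s released API.

```python
# Minimal sketch of anchor-conditioned video generation (methodology steps 2-3).
# The `pipeline` interface and `render_anchor_frames` are hypothetical placeholders.
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class Trajectory:
    joint_angles: np.ndarray  # (T, num_joints) teleoperated joint positions


def render_anchor_frames(traj: Trajectory) -> np.ndarray:
    """Rasterize the robot mesh at each time step into RGB anchor frames."""
    # In practice this would call a URDF-based renderer; zeros stand in here.
    T = traj.joint_angles.shape[0]
    return np.zeros((T, 256, 256, 3), dtype=np.uint8)


def generate_videos(pipeline, traj: Trajectory, prompts: List[str], n_per_prompt: int):
    """Fill in background, objects, and lighting around the fixed motion anchor."""
    anchors = render_anchor_frames(traj)  # spatial-temporal mask, kept unchanged
    videos = []
    for prompt in prompts:
        for _ in range(n_per_prompt):
            video = pipeline(
                anchor_frames=anchors,   # hard geometric/motion constraint
                prompt=prompt,           # scene variation, e.g. "kitchen counter"
                num_inference_steps=50,
            )
            videos.append(video)
    return videos
```

Keeping the anchor frames fixed while varying only the text prompt is what lets one seed trajectory fan out into many visually distinct scenes.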

The key insight is that the diffusion model treats the robot’s rendered skeleton as a hard constraint, so it never “hallucinates” impossible joint configurations—a common failure mode in prior generative approaches.
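
For step 4, the lightweight consistency filter might look like the sketch below: a joint‑limit and self‑collision check over poses estimated from each generated video. The joint limits, `estimate_joint_angles`, and `check_self_collision` arguments are assumptions for illustration; the paper’s actual classifier may differ.

```python
import numpy as np

# Hypothetical per-robot joint limits (radians); replace with the real URDF limits.
JOINT_LIMITS = np.array([[-2.9, 2.9]] * 7)


def kinematically_consistent(video, estimate_joint_angles, check_self_collision,
                             tol: float = 0.05) -> bool:
    """Return True if the video's apparent robot motion stays physically valid."""
    angles = estimate_joint_angles(video)  # (T, num_joints), estimated from frames
    low, high = JOINT_LIMITS[:, 0], JOINT_LIMITS[:, 1]
    within_limits = np.all((angles >= low - tol) & (angles <= high + tol))
    collision_free = not any(check_self_collision(q) for q in angles)
    return bool(within_limits) and collision_free


def filter_videos(videos, estimate_joint_angles, check_self_collision):
    """Keep only samples that pass the lightweight consistency check."""
    return [v for v in videos
            if kinematically_consistent(v, estimate_joint_angles, check_self_collision)]
```

A check of this kind is what keeps joint violations in the filtered set to the low single digits reported in the results.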

Results & Findings

| Setting | Baseline (real demos only) | + AnchorDream synthetic data | Relative gain |
| --- | --- | --- | --- |
| Simulated pick‑and‑place (30 k steps) | 0.62 success rate | 0.85 success rate | +36.4 % |
| Real‑world tabletop rearrangement (5 k steps) | 0.41 success rate | 0.78 success rate | ~+90 % |
| Generalization to unseen objects | 0.35 | 0.66 | +89 % |
  • Visual fidelity: Human evaluators rated the generated videos as “plausible” 93 % of the time.
  • Embodiment consistency: < 2 % of filtered samples exhibited joint violations, confirming the anchor’s effectiveness.
  • Training efficiency: Adding synthetic data reduced the number of real‑world rollouts needed to reach a target performance by ~60 %.

Practical Implications

  • Rapid dataset scaling: Teams can bootstrap a robot learning pipeline with a few teleoperated demos and instantly expand to a rich, varied dataset—cutting data‑collection costs by orders of magnitude.
  • Sim‑to‑real bridge: Because the synthetic videos are photorealistic and respect robot kinematics, policies trained on them transfer more smoothly to physical hardware, reducing the need for expensive domain‑randomization tricks.
  • Plug‑and‑play augmentation: The AnchorDream pipeline can be inserted before any imitation‑learning trainer, making it compatible with popular frameworks like PyTorch Lightning, RLlib, or ROS‑based pipelines (a minimal integration sketch follows this list).
  • Custom scenario generation: By tweaking textual prompts, developers can synthesize edge‑case environments (e.g., low lighting, clutter) to stress‑test policies before deployment.
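
As a rough illustration of the plug‑and‑play idea, the sketch below merges real and synthetic demonstrations into a single PyTorch dataset that any behavior‑cloning trainer can consume. The `DemoDataset` wrapper and loader settings are illustrative assumptions, not code from the paper.

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, Dataset


class DemoDataset(Dataset):
    """Wraps (observation, action) pairs extracted from demonstration videos."""

    def __init__(self, observations: torch.Tensor, actions: torch.Tensor):
        self.observations = observations  # (N, C, H, W) frames
        self.actions = actions            # (N, action_dim) targets

    def __len__(self):
        return len(self.actions)

    def __getitem__(self, idx):
        return self.observations[idx], self.actions[idx]


def make_training_loader(real: DemoDataset, synthetic: DemoDataset,
                         batch_size: int = 64) -> DataLoader:
    """Combine real demos with AnchorDream-style synthetic demos for BC training."""
    mixed = ConcatDataset([real, synthetic])
    return DataLoader(mixed, batch_size=batch_size, shuffle=True, num_workers=4)
```

Because ConcatDataset simply chains the two sources, the real‑to‑synthetic ratio can be tuned by subsampling either dataset before mixing.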

Limitations & Future Work

  • Dependence on a good anchor renderer – The method assumes an accurate 3‑D mesh of the robot; mismatches can propagate errors into the diffusion output.
  • Computational cost – Generating thousands of high‑resolution videos still requires GPU‑heavy diffusion inference, which may be a bottleneck for very large‑scale projects.
  • Limited to visual modalities – Current implementation does not synthesize tactile or force feedback data, which are important for many manipulation tasks.
  • Future work – The authors propose extending the conditioning to multimodal diffusion (audio, haptics), integrating closed‑loop policy feedback to iteratively refine generated data, and exploring lightweight diffusion alternatives for on‑device synthesis.

Authors

  • Junjie Ye
  • Rong Xue
  • Basile Van Hoorick
  • Pavel Tokmakov
  • Muhammad Zubair Irshad
  • Yue Wang
  • Vitor Guizilini

Paper Information

  • arXiv ID: 2512.11797v1
  • Categories: cs.RO, cs.CV
  • Published: December 12, 2025