[Paper] AnchorDream: Repurposing Video Diffusion for Embodiment-Aware Robot Data Synthesis

Published: December 12, 2025 at 01:59 PM EST
4 min read
Source: arXiv - 2512.11797v1

Overview

The paper AnchorDream proposes a new way to generate massive, high‑quality robot demonstration data by re‑using off‑the‑shelf video diffusion models. By “anchoring” the diffusion process to a robot’s actual motion renderings, the method produces videos that respect the robot’s physical embodiment, enabling developers to train imitation‑learning policies without the usual bottleneck of costly real‑world data collection.

Key Contributions

  • Embodiment‑aware diffusion: Introduces a conditioning scheme that ties video diffusion to robot kinematics, preventing unrealistic poses or motions.
  • Data amplification from few demos: Turns a handful of human‑teleoperated trajectories into thousands of diverse, photorealistic robot‑environment videos.
  • Zero‑explicit environment modeling: Leverages a pretrained video diffusion backbone to synthesize realistic backgrounds, objects, and lighting without hand‑crafted simulators.
  • Empirical gains: Demonstrates up to a 36.4 % improvement on simulated benchmarks and a nearly doubled success rate on real‑world robot tasks when training policies on the generated data.
  • Open‑source pipeline: Provides a modular implementation that can be plugged into existing imitation‑learning stacks (e.g., DAgger, BC, Diffusion‑Policy).

Methodology

  1. Collect a seed dataset – A small set (≈ 10–50) of human‑teleoperated robot trajectories, each paired with a rendered video of the robot’s arm/end‑effector motion.
  2. Render motion anchors – For every time step, the robot’s joint angles are visualized as a simple 3‑D mesh overlay (the “anchor”). This anchor is kept unchanged throughout diffusion.
  3. Condition diffusion – A pretrained text‑to‑video diffusion model receives two inputs:
    • The motion anchor frames (serving as a spatial‑temporal mask).
    • Optional textual prompts describing desired scene variations (e.g., “kitchen counter”, “cluttered desk”).
      The diffusion process fills in the background, objects, and lighting while preserving the anchor’s geometry and motion (a minimal sketch of this step follows the list).
  4. Sample and filter – Thousands of videos are generated, then filtered using a lightweight classifier that checks for kinematic consistency (e.g., no self‑collisions).
  5. Policy training – The filtered synthetic dataset is combined with the original demonstrations to train standard imitation‑learning algorithms (behavior cloning, offline RL).
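
To make steps 2–3 concrete, here is a minimal sketch of an anchor‑conditioned generation loop. The `pipeline` callable, the `render_anchor_frames` helper, and the prompt handling are hypothetical placeholders for illustration, not the paper’s released API.

```python
# Minimal sketch of anchor-conditioned video generation (methodology steps 2-3).
# The `pipeline` interface and `render_anchor_frames` are hypothetical placeholders.
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class Trajectory:
    joint_angles: np.ndarray  # (T, num_joints) teleoperated joint positions


def render_anchor_frames(traj: Trajectory) -> np.ndarray:
    """Rasterize the robot mesh at each time step into RGB anchor frames."""
    # In practice this would call a URDF-based renderer; zeros stand in here.
    T = traj.joint_angles.shape[0]
    return np.zeros((T, 256, 256, 3), dtype=np.uint8)


def generate_videos(pipeline, traj: Trajectory, prompts: List[str], n_per_prompt: int):
    """Fill in background, objects, and lighting around the fixed motion anchor."""
    anchors = render_anchor_frames(traj)  # spatial-temporal mask, kept unchanged
    videos = []
    for prompt in prompts:
        for _ in range(n_per_prompt):
            video = pipeline(
                anchor_frames=anchors,   # hard geometric/motion constraint
                prompt=prompt,           # scene variation, e.g. "kitchen counter"
                num_inference_steps=50,
            )
            videos.append(video)
    return videos
```

Keeping the anchor frames fixed while varying only the text prompt is what lets one seed trajectory fan out into many visually distinct scenes.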

The key insight is that the diffusion model treats the robot’s rendered skeleton as a hard constraint, so it never “hallucinates” impossible joint configurations—a common failure mode in prior generative approaches.
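
For step 4, the lightweight consistency filter might look like the sketch below: a joint‑limit and self‑collision check over poses estimated from each generated video. The joint limits, `estimate_joint_angles`, and `check_self_collision` arguments are assumptions for illustration; the paper’s actual classifier may differ.

```python
import numpy as np

# Hypothetical per-robot joint limits (radians); replace with the real URDF limits.
JOINT_LIMITS = np.array([[-2.9, 2.9]] * 7)


def kinematically_consistent(video, estimate_joint_angles, check_self_collision,
                             tol: float = 0.05) -> bool:
    """Return True if the video's apparent robot motion stays physically valid."""
    angles = estimate_joint_angles(video)  # (T, num_joints), estimated from frames
    low, high = JOINT_LIMITS[:, 0], JOINT_LIMITS[:, 1]
    within_limits = np.all((angles >= low - tol) & (angles <= high + tol))
    collision_free = not any(check_self_collision(q) for q in angles)
    return bool(within_limits) and collision_free


def filter_videos(videos, estimate_joint_angles, check_self_collision):
    """Keep only samples that pass the lightweight consistency check."""
    return [v for v in videos
            if kinematically_consistent(v, estimate_joint_angles, check_self_collision)]
```

A check of this kind is what keeps joint violations in the filtered set to the low single digits reported in the results.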

Results & Findings

| Setting | Baseline (real demos only) | + AnchorDream synthetic data | Relative gain |
| --- | --- | --- | --- |
| Simulated pick‑and‑place (30 k steps) | 0.62 success rate | 0.85 success rate | +36.4 % |
| Real‑world tabletop rearrangement (5 k steps) | 0.41 success rate | 0.78 success rate | ~+90 % |
| Generalization to unseen objects | 0.35 | 0.66 | +89 % |
  • Visual fidelity: Human evaluators rated the generated videos as “plausible” 93 % of the time.
  • Embodiment consistency: < 2 % of filtered samples exhibited joint violations, confirming the anchor’s effectiveness.
  • Training efficiency: Adding synthetic data reduced the number of real‑world rollouts needed to reach a target performance by ~60 %.

Practical Implications

  • Rapid dataset scaling: Teams can bootstrap a robot learning pipeline with a few teleoperated demos and instantly expand to a rich, varied dataset—cutting data‑collection costs by orders of magnitude.
  • Sim‑to‑real bridge: Because the synthetic videos are photorealistic and respect robot kinematics, policies trained on them transfer more smoothly to physical hardware, reducing the need for expensive domain‑randomization tricks.
  • Plug‑and‑play augmentation: The AnchorDream pipeline can be inserted before any imitation‑learning trainer, making it compatible with popular frameworks like PyTorch Lightning, RLlib, or ROS‑based pipelines (a minimal integration sketch follows this list).
  • Custom scenario generation: By tweaking textual prompts, developers can synthesize edge‑case environments (e.g., low lighting, clutter) to stress‑test policies before deployment.
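
As a rough illustration of the plug‑and‑play idea, the sketch below merges real and synthetic demonstrations into a single PyTorch dataset that any behavior‑cloning trainer can consume. The `DemoDataset` wrapper and loader settings are illustrative assumptions, not code from the paper.

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, Dataset


class DemoDataset(Dataset):
    """Wraps (observation, action) pairs extracted from demonstration videos."""

    def __init__(self, observations: torch.Tensor, actions: torch.Tensor):
        self.observations = observations  # (N, C, H, W) frames
        self.actions = actions            # (N, action_dim) targets

    def __len__(self):
        return len(self.actions)

    def __getitem__(self, idx):
        return self.observations[idx], self.actions[idx]


def make_training_loader(real: DemoDataset, synthetic: DemoDataset,
                         batch_size: int = 64) -> DataLoader:
    """Combine real demos with AnchorDream-style synthetic demos for BC training."""
    mixed = ConcatDataset([real, synthetic])
    return DataLoader(mixed, batch_size=batch_size, shuffle=True, num_workers=4)
```

Because ConcatDataset simply chains the two sources, the real‑to‑synthetic ratio can be tuned by subsampling either dataset before mixing.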

Limitations & Future Work

  • Dependence on a good anchor renderer – The method assumes an accurate 3‑D mesh of the robot; mismatches can propagate errors into the diffusion output.
  • Computational cost – Generating thousands of high‑resolution videos still requires GPU‑heavy diffusion inference, which may be a bottleneck for very large‑scale projects.
  • Limited to visual modalities – Current implementation does not synthesize tactile or force feedback data, which are important for many manipulation tasks.
  • Future work – The authors propose extending the conditioning to multimodal diffusion (audio, haptics), integrating closed‑loop policy feedback to iteratively refine generated data, and exploring lightweight diffusion alternatives for on‑device synthesis.

Authors

  • Junjie Ye
  • Rong Xue
  • Basile Van Hoorick
  • Pavel Tokmakov
  • Muhammad Zubair Irshad
  • Yue Wang
  • Vitor Guizilini

Paper Information

  • arXiv ID: 2512.11797v1
  • Categories: cs.RO, cs.CV
  • Published: December 12, 2025