[Paper] mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs
Source: arXiv - 2512.15692v1
Overview
The paper introduces mimic‑video, a new class of Video‑Action Models (VAMs) that replace the static vision‑language backbones used in most robot manipulation systems with a large‑scale video foundation model. By learning from video clips that already contain both semantic cues and visual dynamics, the approach lets a lightweight inverse‑dynamics decoder translate those latent video representations into concrete robot actions. The result is a robot controller that learns faster, needs far less expert demonstration data, and generalizes better to new tasks.
Key Contributions
- Video‑first pretraining: Leverages an Internet‑scale video model (e.g., pretrained on YouTube‑8M) to capture both semantics and physical motion, addressing the “physics‑blind” limitation of Vision‑Language‑Action (VLA) models.
- Flow‑matching action decoder: Introduces a flow‑matching based inverse dynamics model (IDM) that maps video‑space latent plans directly to low‑level robot joint commands.
- Sample‑efficiency boost: Demonstrates ~10× reduction in required demonstration data and ~2× faster convergence compared with state‑of‑the‑art VLA pipelines.
- Cross‑domain validation: Provides extensive experiments on both simulated benchmarks (e.g., Meta‑World, RLBench) and real‑world tabletop manipulation setups, achieving new SOTA performance.
- Modular architecture: Decouples high‑level planning (handled by the frozen video encoder) from low‑level control (handled by the trainable IDM), making it easy to swap components or integrate with existing robot stacks.
Methodology
1. Pretrained Video Encoder
- The authors start with a publicly available video foundation model (e.g., a Vision Transformer trained on billions of video clips).
- The encoder outputs a compact latent vector that implicitly encodes what is happening and how objects move over time.
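To make the interface concrete, below is a minimal sketch of how a frozen video encoder could be wrapped for this pipeline; the `FrozenVideoEncoder` class, the `(B, T, C, H, W)` clip layout, the 768-dimensional latent, and the stand-in backbone are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class FrozenVideoEncoder(nn.Module):
    """Wraps a pretrained video backbone and exposes one latent vector per clip.

    The backbone is a placeholder here; in the paper's setting it would be a
    large pretrained video foundation model whose weights stay frozen.
    """

    def __init__(self, backbone: nn.Module, latent_dim: int = 768):
        super().__init__()
        self.backbone = backbone
        self.latent_dim = latent_dim
        # Freeze every backbone parameter: only the action decoder is trained.
        for p in self.backbone.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        """clip: (B, T, C, H, W) video tensor -> (B, latent_dim) clip latent."""
        self.backbone.eval()
        feats = self.backbone(clip)   # assumed to return per-frame features (B, T, latent_dim)
        return feats.mean(dim=1)      # temporal pooling into a single clip-level latent


# Usage with a stand-in backbone that maps flattened frames to per-frame features.
backbone = nn.Sequential(nn.Flatten(start_dim=2), nn.Linear(3 * 128 * 128, 768))
encoder = FrozenVideoEncoder(backbone)
latent = encoder(torch.randn(2, 16, 3, 128, 128))   # -> shape (2, 768)
```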
2. Action Decoder as Inverse Dynamics Model
- A lightweight neural network is trained to predict the robot’s next joint velocities (or torques) given two consecutive video latents.
- Training uses a flow‑matching objective: rather than regressing raw actions directly, the decoder learns a velocity field that transports noise samples toward the demonstrated actions, conditioned on the transition between consecutive video latents, a formulation that aligns naturally with physical dynamics (see the sketch below).
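The sketch below shows what such a conditional flow-matching objective can look like when the decoder is conditioned on a pair of consecutive video latents, as described above. The MLP architecture, the linear noise-to-action interpolation path, and the dimensions (768-d latents, 7-DoF actions) are assumptions for illustration, not the paper's exact recipe.

```python
import torch
import torch.nn as nn

class FlowMatchingIDM(nn.Module):
    """Velocity-field network over actions, conditioned on two consecutive
    video latents (z_t, z_t1) and a flow time tau in [0, 1]."""

    def __init__(self, latent_dim: int = 768, action_dim: int = 7, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * latent_dim + action_dim + 1, hidden),
            nn.SiLU(),
            nn.Linear(hidden, hidden),
            nn.SiLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, a_tau, tau, z_t, z_t1):
        return self.net(torch.cat([a_tau, tau, z_t, z_t1], dim=-1))


def flow_matching_loss(idm, actions, z_t, z_t1):
    """Conditional flow matching with a linear noise-to-action path.

    actions: (B, action_dim) demonstrated action; z_t, z_t1: (B, latent_dim).
    """
    noise = torch.randn_like(actions)                             # source sample
    tau = torch.rand(actions.shape[0], 1, device=actions.device)  # flow time in [0, 1]
    a_tau = (1.0 - tau) * noise + tau * actions                   # point on the linear path
    target_velocity = actions - noise                             # constant velocity of that path
    pred_velocity = idm(a_tau, tau, z_t, z_t1)
    return ((pred_velocity - target_velocity) ** 2).mean()
```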
3. Training Pipeline
- Collect a modest set of teleoperated demonstrations (≈ 1–2 hours of robot time).
- For each demonstration, extract the corresponding video clip, feed it through the frozen encoder, and train the IDM to reproduce the recorded actions.
- No additional language supervision is required; the video encoder already carries semantic knowledge from its pretraining.
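Putting the pieces together, the following sketch shows the training loop this recipe implies, reusing the `FrozenVideoEncoder` and `flow_matching_loss` sketches above. The batch layout (two consecutive clips plus the action recorded between them), the AdamW optimizer, and the hyperparameters are assumptions.

```python
import torch

def train_idm(encoder, idm, demo_loader, epochs: int = 50, lr: float = 3e-4):
    """Train only the inverse-dynamics decoder; the video encoder stays frozen.

    Each batch is assumed to provide:
      clip_t, clip_t1 : consecutive video clips, (B, T, C, H, W)
      action          : the teleoperated action recorded between them, (B, action_dim)
    """
    optimizer = torch.optim.AdamW(idm.parameters(), lr=lr)
    for _ in range(epochs):
        for clip_t, clip_t1, action in demo_loader:
            with torch.no_grad():          # frozen encoder: no gradients needed
                z_t = encoder(clip_t)
                z_t1 = encoder(clip_t1)
            loss = flow_matching_loss(idm, action, z_t, z_t1)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return idm
```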
4. Inference
- At test time, a high‑level goal (e.g., “pick up the red block”) is translated into a goal video (either via a generative video model or a short example clip).
- The encoder produces the goal latent; the IDM rolls out actions that drive the robot’s current latent toward the goal latent, effectively “following” the visual plan.
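A sketch of this rollout under the same assumptions: the goal clip is encoded once, and at each control step the decoder integrates its learned velocity field from noise to an action with a few Euler steps. The fixed step count, the hypothetical `env` interface (`current_clip()`, `step()`), and conditioning on the (current, goal) latent pair in place of consecutive-frame latents are illustrative choices.

```python
import torch

@torch.no_grad()
def sample_action(idm, z_current, z_goal, action_dim: int = 7, num_steps: int = 10):
    """Integrate the decoder's velocity field from noise to an action (Euler steps)."""
    batch = z_current.shape[0]
    a = torch.randn(batch, action_dim, device=z_current.device)   # start from noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        tau = torch.full((batch, 1), i * dt, device=z_current.device)
        a = a + dt * idm(a, tau, z_current, z_goal)
    return a


@torch.no_grad()
def rollout(encoder, idm, env, goal_clip, horizon: int = 100):
    """Step the robot so its current visual latent moves toward the goal latent."""
    z_goal = encoder(goal_clip)                    # encode the goal video once
    for _ in range(horizon):
        z_current = encoder(env.current_clip())    # latest camera clip -> latent
        action = sample_action(idm, z_current, z_goal)
        env.step(action)                           # hypothetical robot/env interface
```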
Results & Findings
| Setting | Metric | mimic-video | Prior VLA Baseline |
|---|---|---|---|
| Simulated pick-and-place (Meta-World) | Success rate (higher is better) | 92% | 71% |
| Real-world block stacking (4-step) | Success rate (higher is better) | 84% | 58% |
| Demonstrations needed for 80% success | Episodes (lower is better) | ≈ 30 | ≈ 300 |
| Training wall-clock time to convergence | Hours (lower is better) | 4 | 8 |
- Sample efficiency: Mimic‑video reaches target performance with roughly one‑tenth the amount of expert data.
- Speed of learning: Convergence is twice as fast, thanks to the strong priors baked into the video encoder.
- Generalization: The model successfully transfers to unseen object shapes and lighting conditions without additional fine‑tuning, indicating that the video latent captures robust physical cues.
Practical Implications
- Lower data collection costs: Companies can bootstrap robot learning pipelines with a few hours of teleoperation instead of weeks of data gathering.
- Plug‑and‑play control stacks: Because the video encoder is frozen, developers can swap in any off‑the‑shelf video foundation model (e.g., CLIP‑Video, Flamingo‑Video) without retraining the whole system.
- Rapid prototyping of new tasks: Providing a short goal video (or a synthetic clip) is enough to define a new manipulation behavior, enabling “program‑by‑example” workflows for non‑experts.
- Better safety and predictability: The IDM learns an explicit inverse dynamics mapping, which can be inspected, regularized, or combined with classic model-based controllers for tighter safety guarantees (a minimal sketch follows this list).
- Cross‑modal extensions: The same latent space can be used for language‑to‑video retrieval, opening doors to multimodal instruction following where a user simply describes a task and the system fetches a matching video plan.
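To illustrate the kind of safety wrapping mentioned above, here is a minimal sketch of regularizing the decoder's output before it reaches the robot; the blending with a fallback controller, the per-joint limits, and the `safe_action` function itself are hypothetical and not part of the paper.

```python
import torch

def safe_action(idm_action: torch.Tensor,
                fallback_action: torch.Tensor,
                limits: torch.Tensor,
                blend: float = 0.8) -> torch.Tensor:
    """Blend the learned IDM action with a classic controller's command and
    clamp the result to per-joint limits before sending it to the robot.

    All tensors are (action_dim,); `limits` holds symmetric per-joint bounds.
    The blend weight and the clamp are illustrative regularizers only.
    """
    blended = blend * idm_action + (1.0 - blend) * fallback_action
    return torch.clamp(blended, min=-limits, max=limits)


# Usage: a 7-DoF joint-velocity command limited to ±0.5 rad/s per joint.
cmd = safe_action(torch.randn(7), torch.zeros(7), torch.full((7,), 0.5))
```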
Limitations & Future Work
- Dependence on video encoder quality: If the pretrained video model lacks coverage of certain domains (e.g., industrial tooling), the latent representation may miss critical dynamics.
- Goal video acquisition: The current pipeline assumes a goal video is available; generating or retrieving appropriate clips in the wild remains an open challenge.
- Real‑time latency: Running a large video encoder on‑board a robot can introduce inference lag; future work should explore efficient distillation or edge‑optimized encoders.
- Complex multi‑object interactions: While the method handles single‑object manipulation well, scaling to densely cluttered scenes with many interacting bodies will require richer latent dynamics or hierarchical planning.
Overall, mimic‑video demonstrates that a video‑centric pretraining strategy can dramatically cut the data and time barriers for robot learning, offering a practical path toward more adaptable, data‑efficient manipulation systems.
Authors
- Jonas Pai
- Liam Achenbach
- Victoriano Montesinos
- Benedek Forrai
- Oier Mees
- Elvis Nava
Paper Information
- arXiv ID: 2512.15692v1
- Categories: cs.RO, cs.AI, cs.CV, cs.LG
- Published: December 17, 2025