[Paper] mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs
Source: arXiv - 2512.15692v1
Overview
The paper introduces mimic‑video, a new class of Video‑Action Models (VAMs) that replace the static vision‑language backbones used in most robot manipulation systems with a large‑scale video foundation model. By learning from video clips that already contain both semantic cues and visual dynamics, the approach lets a lightweight inverse‑dynamics decoder translate those latent video representations into concrete robot actions. The result is a robot controller that learns faster, needs far less expert demonstration data, and generalizes better to new tasks.
Key Contributions
- Video‑first pretraining: Leverages an Internet‑scale video model (e.g., pretrained on YouTube‑8M) to capture both semantics and physical motion, addressing the “physics‑blind” limitation of Vision‑Language‑Action (VLA) models.
- Flow‑matching action decoder: Introduces a flow‑matching based inverse dynamics model (IDM) that maps video‑space latent plans directly to low‑level robot joint commands.
- Sample‑efficiency boost: Demonstrates ~10× reduction in required demonstration data and ~2× faster convergence compared with state‑of‑the‑art VLA pipelines.
- Cross‑domain validation: Provides extensive experiments on both simulated benchmarks (e.g., Meta‑World, RLBench) and real‑world tabletop manipulation setups, achieving new SOTA performance.
- Modular architecture: Decouples high‑level planning (handled by the frozen video encoder) from low‑level control (handled by the trainable IDM), making it easy to swap components or integrate with existing robot stacks.
Methodology
1. Pretrained Video Encoder
- The authors start with a publicly available video foundation model (e.g., a Vision Transformer trained on billions of video clips).
- The encoder outputs a compact latent vector that implicitly encodes what is happening and how objects move over time.
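To make the interface concrete, below is a minimal sketch of how a frozen video encoder could be wrapped for this pipeline; the `FrozenVideoEncoder` class, the `(B, T, C, H, W)` clip layout, the 768-dimensional latent, and the stand-in backbone are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class FrozenVideoEncoder(nn.Module):
    """Wraps a pretrained video backbone and exposes one latent vector per clip.

    The backbone is a placeholder here; in the paper's setting it would be a
    large pretrained video foundation model whose weights stay frozen.
    """

    def __init__(self, backbone: nn.Module, latent_dim: int = 768):
        super().__init__()
        self.backbone = backbone
        self.latent_dim = latent_dim
        # Freeze every backbone parameter: only the action decoder is trained.
        for p in self.backbone.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        """clip: (B, T, C, H, W) video tensor -> (B, latent_dim) clip latent."""
        self.backbone.eval()
        feats = self.backbone(clip)   # assumed to return per-frame features (B, T, latent_dim)
        return feats.mean(dim=1)      # temporal pooling into a single clip-level latent


# Usage with a stand-in backbone that maps flattened frames to per-frame features.
backbone = nn.Sequential(nn.Flatten(start_dim=2), nn.Linear(3 * 128 * 128, 768))
encoder = FrozenVideoEncoder(backbone)
latent = encoder(torch.randn(2, 16, 3, 128, 128))   # -> shape (2, 768)
```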
2. Action Decoder as Inverse Dynamics Model
- A lightweight neural network is trained to predict the robot’s next joint velocities (or torques) given two consecutive video latents.
- Training uses a flow‑matching objective: rather than regressing raw actions directly, the decoder learns a velocity field that transports noise samples toward the demonstrated actions, conditioned on the transition between consecutive video latents, a formulation that aligns naturally with physical dynamics (see the sketch below).
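The sketch below shows what such a conditional flow-matching objective can look like when the decoder is conditioned on a pair of consecutive video latents, as described above. The MLP architecture, the linear noise-to-action interpolation path, and the dimensions (768-d latents, 7-DoF actions) are assumptions for illustration, not the paper's exact recipe.

```python
import torch
import torch.nn as nn

class FlowMatchingIDM(nn.Module):
    """Velocity-field network over actions, conditioned on two consecutive
    video latents (z_t, z_t1) and a flow time tau in [0, 1]."""

    def __init__(self, latent_dim: int = 768, action_dim: int = 7, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * latent_dim + action_dim + 1, hidden),
            nn.SiLU(),
            nn.Linear(hidden, hidden),
            nn.SiLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, a_tau, tau, z_t, z_t1):
        return self.net(torch.cat([a_tau, tau, z_t, z_t1], dim=-1))


def flow_matching_loss(idm, actions, z_t, z_t1):
    """Conditional flow matching with a linear noise-to-action path.

    actions: (B, action_dim) demonstrated action; z_t, z_t1: (B, latent_dim).
    """
    noise = torch.randn_like(actions)                             # source sample
    tau = torch.rand(actions.shape[0], 1, device=actions.device)  # flow time in [0, 1]
    a_tau = (1.0 - tau) * noise + tau * actions                   # point on the linear path
    target_velocity = actions - noise                             # constant velocity of that path
    pred_velocity = idm(a_tau, tau, z_t, z_t1)
    return ((pred_velocity - target_velocity) ** 2).mean()
```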
3. Training Pipeline
- Collect a modest set of teleoperated demonstrations (≈ 1–2 hours of robot time).
- For each demonstration, extract the corresponding video clip, feed it through the frozen encoder, and train the IDM to reproduce the recorded actions.
- No additional language supervision is required; the video encoder already carries semantic knowledge from its pretraining.
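Putting the pieces together, the following sketch shows the training loop this recipe implies, reusing the `FrozenVideoEncoder` and `flow_matching_loss` sketches above. The batch layout (two consecutive clips plus the action recorded between them), the AdamW optimizer, and the hyperparameters are assumptions.

```python
import torch

def train_idm(encoder, idm, demo_loader, epochs: int = 50, lr: float = 3e-4):
    """Train only the inverse-dynamics decoder; the video encoder stays frozen.

    Each batch is assumed to provide:
      clip_t, clip_t1 : consecutive video clips, (B, T, C, H, W)
      action          : the teleoperated action recorded between them, (B, action_dim)
    """
    optimizer = torch.optim.AdamW(idm.parameters(), lr=lr)
    for _ in range(epochs):
        for clip_t, clip_t1, action in demo_loader:
            with torch.no_grad():          # frozen encoder: no gradients needed
                z_t = encoder(clip_t)
                z_t1 = encoder(clip_t1)
            loss = flow_matching_loss(idm, action, z_t, z_t1)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return idm
```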
4. Inference
- At test time, a high‑level goal (e.g., “pick up the red block”) is translated into a goal video (either via a generative video model or a short example clip).
- The encoder produces the goal latent; the IDM rolls out actions that drive the robot’s current latent toward the goal latent, effectively “following” the visual plan.
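A sketch of this rollout under the same assumptions: the goal clip is encoded once, and at each control step the decoder integrates its learned velocity field from noise to an action with a few Euler steps. The fixed step count, the hypothetical `env` interface (`current_clip()`, `step()`), and conditioning on the (current, goal) latent pair in place of consecutive-frame latents are illustrative choices.

```python
import torch

@torch.no_grad()
def sample_action(idm, z_current, z_goal, action_dim: int = 7, num_steps: int = 10):
    """Integrate the decoder's velocity field from noise to an action (Euler steps)."""
    batch = z_current.shape[0]
    a = torch.randn(batch, action_dim, device=z_current.device)   # start from noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        tau = torch.full((batch, 1), i * dt, device=z_current.device)
        a = a + dt * idm(a, tau, z_current, z_goal)
    return a


@torch.no_grad()
def rollout(encoder, idm, env, goal_clip, horizon: int = 100):
    """Step the robot so its current visual latent moves toward the goal latent."""
    z_goal = encoder(goal_clip)                    # encode the goal video once
    for _ in range(horizon):
        z_current = encoder(env.current_clip())    # latest camera clip -> latent
        action = sample_action(idm, z_current, z_goal)
        env.step(action)                           # hypothetical robot/env interface
```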
Results & Findings
| Setting | Metric | mimic-video | Prior VLA Baseline |
|---|---|---|---|
| Simulated pick-and-place (Meta-World) | Success rate (higher is better) | 92% | 71% |
| Real-world block stacking (4-step) | Success rate (higher is better) | 84% | 58% |
| Demonstrations needed for 80% success | Episodes (lower is better) | ≈ 30 | ≈ 300 |
| Training wall-clock time to convergence | Hours (lower is better) | 4 | 8 |
- Sample efficiency: Mimic‑video reaches target performance with roughly one‑tenth the amount of expert data.
- Speed of learning: Convergence is twice as fast, thanks to the strong priors baked into the video encoder.
- Generalization: The model successfully transfers to unseen object shapes and lighting conditions without additional fine‑tuning, indicating that the video latent captures robust physical cues.
Practical Implications
- Lower data collection costs: Companies can bootstrap robot learning pipelines with a few hours of teleoperation instead of weeks of data gathering.
- Plug‑and‑play control stacks: Because the video encoder is frozen, developers can swap in any off‑the‑shelf video foundation model (e.g., CLIP‑Video, Flamingo‑Video) without retraining the whole system.
- Rapid prototyping of new tasks: Providing a short goal video (or a synthetic clip) is enough to define a new manipulation behavior, enabling “program‑by‑example” workflows for non‑experts.
- Better safety and predictability: The IDM learns an explicit inverse dynamics mapping, which can be inspected, regularized, or combined with classic model-based controllers for tighter safety guarantees (a minimal sketch follows this list).
- Cross‑modal extensions: The same latent space can be used for language‑to‑video retrieval, opening doors to multimodal instruction following where a user simply describes a task and the system fetches a matching video plan.
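To illustrate the kind of safety wrapping mentioned above, here is a minimal sketch of regularizing the decoder's output before it reaches the robot; the blending with a fallback controller, the per-joint limits, and the `safe_action` function itself are hypothetical and not part of the paper.

```python
import torch

def safe_action(idm_action: torch.Tensor,
                fallback_action: torch.Tensor,
                limits: torch.Tensor,
                blend: float = 0.8) -> torch.Tensor:
    """Blend the learned IDM action with a classic controller's command and
    clamp the result to per-joint limits before sending it to the robot.

    All tensors are (action_dim,); `limits` holds symmetric per-joint bounds.
    The blend weight and the clamp are illustrative regularizers only.
    """
    blended = blend * idm_action + (1.0 - blend) * fallback_action
    return torch.clamp(blended, min=-limits, max=limits)


# Usage: a 7-DoF joint-velocity command limited to ±0.5 rad/s per joint.
cmd = safe_action(torch.randn(7), torch.zeros(7), torch.full((7,), 0.5))
```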
Limitations & Future Work
- Dependence on video encoder quality: If the pretrained video model lacks coverage of certain domains (e.g., industrial tooling), the latent representation may miss critical dynamics.
- Goal video acquisition: The current pipeline assumes a goal video is available; generating or retrieving appropriate clips in the wild remains an open challenge.
- Real‑time latency: Running a large video encoder on‑board a robot can introduce inference lag; future work should explore efficient distillation or edge‑optimized encoders.
- Complex multi‑object interactions: While the method handles single‑object manipulation well, scaling to densely cluttered scenes with many interacting bodies will require richer latent dynamics or hierarchical planning.
Overall, mimic‑video demonstrates that a video‑centric pretraining strategy can dramatically cut the data and time barriers for robot learning, offering a practical path toward more adaptable, data‑efficient manipulation systems.
Authors
- Jonas Pai
- Liam Achenbach
- Victoriano Montesinos
- Benedek Forrai
- Oier Mees
- Elvis Nava
Paper Information
- arXiv ID: 2512.15692v1
- Categories: cs.RO, cs.AI, cs.CV, cs.LG
- Published: December 17, 2025