[Paper] Motion Attribution for Video Generation
Source: arXiv - 2601.08828v1
Overview
The paper introduces Motive (MOTIon attribution for Video gEneration), a gradient‑based framework that pinpoints which training clips most influence a video model’s motion rather than its static appearance. By isolating temporal dynamics, Motive lets researchers and engineers understand, evaluate, and curate data that directly improves motion quality in modern text‑to‑video generators.
Key Contributions
- Motion‑centric attribution: First method to attribute influence on video generation at the level of motion, separating it from appearance.
- Scalable gradient‑based pipeline: Works with large, high‑resolution video datasets and state‑of‑the‑art diffusion models.
- Motion‑weighted loss masks: Efficiently focus gradients on temporal changes, enabling fast influence computation (a minimal sketch follows this list).
- Data‑driven fine‑tuning: Demonstrates that selecting high‑influence clips for fine‑tuning yields measurable gains in temporal consistency and physical plausibility.
- Human‑validated improvement: Achieves a 74.1 % human preference win rate over the baseline, alongside gains on the VBench benchmark.
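The paper describes motion‑weighted loss masks only at a high level, so the PyTorch sketch below is an assumption about one simple way to realize them: weight each pixel by how much it changes between consecutive frames, so that gradients concentrate on moving regions. The function names (motion_weight_mask, motion_weighted_mse) and the frame‑difference weighting are illustrative, not the paper's implementation.

```python
import torch

def motion_weight_mask(clip: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """clip: (T, C, H, W) video in [0, 1]; returns per-frame pixel weights (T, 1, H, W)."""
    # Absolute temporal difference between consecutive frames, averaged over channels.
    diff = (clip[1:] - clip[:-1]).abs().mean(dim=1, keepdim=True)  # (T-1, 1, H, W)
    # Repeat the first difference so every frame receives a weight map.
    diff = torch.cat([diff[:1], diff], dim=0)                      # (T, 1, H, W)
    # Normalize so the average weight is 1, keeping the overall loss scale stable.
    return diff / (diff.mean() + eps)

def motion_weighted_mse(pred: torch.Tensor, target: torch.Tensor, clip: torch.Tensor) -> torch.Tensor:
    """Reconstruction-style loss re-weighted toward regions that move across frames."""
    mask = motion_weight_mask(clip)             # (T, 1, H, W), broadcasts over channels
    return (mask * (pred - target) ** 2).mean()
```

Because static pixels receive near‑zero weight under such a mask, the gradients used for attribution reflect temporal dynamics rather than appearance, which is the separation the Methodology section below relies on.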
Methodology
- Baseline video generator – The authors start with a pretrained text‑to‑video diffusion model (e.g., Imagen Video, Make‑A‑Video).
- Gradient‑based influence scoring – For each training clip, they back‑propagate a motion‑weighted loss that up‑weights pixels which change across frames. The magnitude of the resulting gradient serves as the clip's motion influence score (a simplified scoring sketch follows this list).
- Isolation of motion – By masking out static regions, the loss focuses solely on temporal dynamics, ensuring that the attribution reflects motion impact, not just texture or color.
- Data selection – Clips are ranked by their influence scores. The top‑k high‑influence clips are used for fine‑tuning, while low‑influence or negative‑impact clips can be filtered out.
- Evaluation – The fine‑tuned model is assessed on VBench (a video generation benchmark) and via human preference studies, measuring smoothness, dynamic range, and physical realism.
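To make the scoring and selection steps above concrete, the sketch below ranks clips by a gradient‑based motion influence score and keeps the top‑k. It reuses the motion_weighted_mse helper from the earlier sketch; model.denoise_step is a hypothetical stand‑in for one diffusion training step, and the choice between gradient magnitude and a dot product with a query gradient is an assumption rather than the paper's exact estimator.

```python
import torch

def clip_influence_score(model, clip, text_emb, query_grad=None):
    """Scalar motion-influence score for a single training clip."""
    model.zero_grad(set_to_none=True)
    # Hypothetical call returning the model's prediction and its diffusion target.
    pred, target = model.denoise_step(clip, text_emb)
    loss = motion_weighted_mse(pred, target, clip)   # motion-weighted loss from the sketch above
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params)
    flat = torch.cat([g.flatten() for g in grads])
    if query_grad is None:
        return flat.norm().item()                    # magnitude-only variant
    return torch.dot(flat, query_grad).item()        # similarity to a query (e.g., validation) gradient

def select_top_k(model, dataset, k, query_grad=None):
    """Rank (clip, text_emb) pairs by influence; return indices of the k highest scores."""
    scores = [clip_influence_score(model, clip, emb, query_grad) for clip, emb in dataset]
    return sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]
```

The same scores can be read in the opposite direction for filtering: clips with very low or strongly negative scores are the candidates for removal mentioned in the data‑selection step.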
Results & Findings
- Influence distribution: A small subset (~10 % of the dataset) accounts for the majority of motion improvement potential.
- Temporal consistency boost: Fine‑tuning with Motive‑selected clips raises the VBench motion smoothness score by +0.18 (relative improvement).
- Dynamic degree: The model generates more varied and physically plausible motions (e.g., realistic object trajectories, fluid dynamics).
- Human study: 74.1 % of participants preferred videos from the Motive‑fine‑tuned model over the original baseline.
- Efficiency: Motion‑weighted masks reduce computation time by ~40 % compared to naïve full‑frame gradient attribution, making the approach feasible for datasets with millions of clips.
Practical Implications
- Targeted data curation – Teams can automatically surface the most “motion‑rich” clips for fine‑tuning, saving annotation and compute resources.
- Improved product quality – Applications like AI‑driven video ads, virtual avatars, or game cutscenes can achieve smoother, more believable motion without retraining on the entire dataset.
- Debugging generative models – When a model produces jittery or physically impossible motion, Motive can identify offending training samples, enabling rapid fixes.
- Dataset design – Curators of large video corpora (e.g., stock footage libraries) can prioritize collecting or annotating clips with high motion influence, leading to better downstream generative performance.
- Cross‑modal extensions – The motion‑centric attribution idea could be adapted to audio‑driven video synthesis or multimodal storytelling pipelines, where temporal alignment is critical.
Limitations & Future Work
- Scope to diffusion models – Experiments focus on diffusion‑based text‑to‑video generators; applicability to autoregressive or GAN‑based video models remains to be validated.
- Granularity of masks – The current motion‑weighted mask is a simple per‑pixel temporal gradient; more sophisticated motion representations (optical flow, 3D pose) could yield finer attribution.
- Dataset bias – Influence scores may reflect dataset composition (e.g., over‑representation of certain actions) rather than intrinsic model capacity, requiring careful interpretation.
- Scalability ceiling – While efficient, computing gradients for billions of clips still demands substantial GPU resources; future work could explore approximation or sampling strategies.
- User‑controlled trade‑offs – Integrating Motive into an interactive data‑curation UI, where developers can balance motion improvement against visual fidelity, is an open direction.
Authors
- Xindi Wu
- Despoina Paschalidou
- Jun Gao
- Antonio Torralba
- Laura Leal‑Taixé
- Olga Russakovsky
- Sanja Fidler
- Jonathan Lorraine
Paper Information
- arXiv ID: 2601.08828v1
- Categories: cs.CV, cs.AI, cs.LG, cs.MM, cs.RO
- Published: January 13, 2026