[Paper] Structure From Tracking: Distilling Structure-Preserving Motion for Video Generation

Published: December 12, 2025 at 01:56 PM EST
4 min read
Source: arXiv - 2512.11792v1

Overview

The paper presents SAM2VideoX, a new video‑generation model that learns to keep the underlying structure of moving objects—especially articulated bodies like humans and animals—while still producing realistic, high‑fidelity motion. By distilling motion priors from a state‑of‑the‑art autoregressive tracker (SAM2) into a bidirectional diffusion model (CogVideoX), the authors achieve a noticeable jump in both objective metrics and human preference scores.

Key Contributions

  • Structure‑preserving motion distillation: Introduces a pipeline that extracts global motion priors from a tracking model (SAM2) and injects them into a diffusion‑based video generator.
  • Bidirectional feature‑fusion module: A lightweight architecture that merges forward‑ and backward‑time features from the tracker, giving the diffusion model a coherent sense of object layout across the whole clip.
  • Local Gram Flow loss: A novel regularizer that aligns the relative movement of local feature patches, encouraging consistent deformation without explicit optical‑flow supervision.
  • State‑of‑the‑art results: Sets new scores on the VBench benchmark (95.51 % overall, +2.60 % over the previous best) and reduces Fréchet Video Distance (FVD) by >20 % compared to strong baselines.
  • Human‑centric evaluation: Shows a 71.4 % preference rate in user studies, indicating that the generated videos feel more natural to everyday viewers.

Methodology

  1. Teacher model – SAM2 tracking

    • SAM2 is an autoregressive video tracker that predicts object masks frame‑by‑frame, preserving the geometry of rigid and deformable parts.
    • Its hidden states encode rich motion cues (e.g., how a limb rotates or a tail wiggles) but are not directly usable for generation.
  2. Student model – CogVideoX diffusion

    • CogVideoX is a bidirectional video diffusion model that synthesizes frames from noise while conditioning on textual prompts.
    • The authors augment it with a bidirectional feature‑fusion module that ingests SAM2’s forward and backward hidden representations, effectively giving the diffusion model a “motion roadmap” (a sketch of such a module follows this list).
  3. Training with Structure‑aware losses

    • In addition to the standard diffusion loss, they add the Local Gram Flow loss, which computes Gram matrices (inner products of local feature vectors) for neighboring patches across time. Matching how these matrices evolve forces the generator to keep local texture and shape moving together, mimicking the coherent motion observed by the tracker (a sketch follows this list).
  4. Distillation pipeline

    • The tracker runs on the same video data used to train the diffusion model, producing motion priors.
    • These priors are treated as soft targets; the diffusion model learns to reproduce them while still being guided by the text prompt.
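
To make the fusion idea concrete, here is a minimal sketch of a bidirectional feature‑fusion block in PyTorch. The class name, layer sizes, and projection scheme are illustrative assumptions, not the authors’ released architecture; the point is simply that forward‑ and backward‑time tracker features are merged into one conditioning signal at the diffusion model’s width.

```python
import torch
import torch.nn as nn

class BidirectionalFusion(nn.Module):
    """Hypothetical fusion block: merges forward- and backward-time tracker
    features into a single conditioning signal for the diffusion backbone."""

    def __init__(self, tracker_dim: int, diffusion_dim: int):
        super().__init__()
        # Project concatenated (forward, backward) features to the diffusion width.
        self.proj = nn.Linear(2 * tracker_dim, diffusion_dim)
        self.norm = nn.LayerNorm(diffusion_dim)

    def forward(self, feats_fwd: torch.Tensor, feats_bwd: torch.Tensor) -> torch.Tensor:
        # feats_fwd, feats_bwd: (batch, time, tokens, tracker_dim)
        fused = torch.cat([feats_fwd, feats_bwd], dim=-1)
        return self.norm(self.proj(fused))  # (batch, time, tokens, diffusion_dim)
```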
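
The Local Gram Flow loss can likewise be sketched from its description: per‑frame Gram matrices over local patch features, with the loss matching how those matrices change between consecutive frames. The function below is a minimal interpretation under assumed tensor shapes (batch, time, patches, channels), not the paper’s exact formulation.

```python
import torch
import torch.nn.functional as F

def local_gram_flow_loss(gen_feats: torch.Tensor, ref_feats: torch.Tensor) -> torch.Tensor:
    """Illustrative Local Gram Flow-style loss (assumed formulation).
    Inputs: (batch, time, patches, channels) feature tensors."""

    def gram(x: torch.Tensor) -> torch.Tensor:
        # Per-frame Gram matrices over patch features: (B, T, P, C) -> (B, T, P, P)
        return torch.einsum("btpc,btqc->btpq", x, x) / x.shape[-1]

    g_gen, g_ref = gram(gen_feats), gram(ref_feats)
    # "Flow" of structure: change in Gram matrices across adjacent frames.
    flow_gen = g_gen[:, 1:] - g_gen[:, :-1]
    flow_ref = g_ref[:, 1:] - g_ref[:, :-1]
    return F.mse_loss(flow_gen, flow_ref)
```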

The overall training loop is straightforward: sample a video, run SAM2 to collect motion features, feed those into the fusion module, and back‑propagate both diffusion and Gram‑flow losses.
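
A rough sketch of that loop is below. The names `sam2_tracker`, `fusion`, `diffusion_model`, and `diffusion_loss` stand in for the real components, and the interfaces (what each call returns) are assumptions for illustration; it reuses the `local_gram_flow_loss` sketch from above.

```python
import torch

def training_step(video, prompt, sam2_tracker, fusion, diffusion_model,
                  diffusion_loss, lambda_gram=0.1):
    with torch.no_grad():
        # 1. Run the frozen tracker in both temporal directions.
        feats_fwd = sam2_tracker(video)
        feats_bwd = sam2_tracker(torch.flip(video, dims=[1])).flip(dims=[1])

    # 2. Fuse tracker features into the conditioning signal ("motion roadmap").
    motion_cond = fusion(feats_fwd, feats_bwd)

    # 3. Standard diffusion objective, conditioned on the text prompt and motion.
    noise_pred, target, gen_feats = diffusion_model(video, prompt, motion_cond)
    loss = diffusion_loss(noise_pred, target)

    # 4. Structure-aware regularizer against the tracker's motion priors.
    loss = loss + lambda_gram * local_gram_flow_loss(gen_feats, feats_fwd)
    return loss
```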

Results & Findings

| Metric | SAM2VideoX | REPA (prev. SOTA) | LoRA‑finetuned CogVideoX |
| --- | --- | --- | --- |
| VBench overall score | 95.51 % (+2.60 %) | 92.91 % | — |
| FVD (lower is better) | 360.57 (‑21 % vs REPA, ‑22 % vs LoRA) | ~458 | ~464 |
| Human preference (pairwise) | 71.4 % | 28.6 % | — |

  • Consistent gains across categories: The model excels on both rigid‑object videos (e.g., vehicles) and highly deformable subjects (e.g., dancing humans, animals).
  • Qualitative improvements: Visual examples show smoother limb articulation, fewer “ghosting” artifacts, and better preservation of object silhouettes during fast motion.
  • Ablation studies: Removing the bidirectional fusion drops VBench by ~1.4 %, while omitting the Local Gram Flow loss reduces human preference by ~9 %, confirming each component’s impact.

Practical Implications

  • Content creation pipelines: Studios and indie developers can generate higher‑quality animated assets (e.g., character motion clips) with fewer manual keyframes, saving time on motion‑capture cleanup.
  • AR/VR and gaming: Real‑time avatars or NPCs can be driven by text prompts while retaining physically plausible limb movement, reducing the need for handcrafted animation rigs.
  • Synthetic data for training: Better‑structured video synthesis can feed downstream computer‑vision models (e.g., pose estimation, action recognition) with more realistic training data, potentially improving robustness.
  • Cross‑modal storytelling: Combining SAM2VideoX with existing text‑to‑video tools enables creators to script complex scenes (e.g., “a cat leaps onto a moving train”) without worrying about implausible deformations.

Limitations & Future Work

  • Dependency on tracker quality: SAM2’s performance still degrades on extreme occlusions or very fast motion, which can propagate errors into the diffusion model.
  • Computational cost: The bidirectional fusion and Gram‑flow loss add overhead, making training slower than vanilla diffusion models.
  • Generalization to unseen domains: While the model handles humans and animals well, performance on highly non‑articulated or abstract visual domains (e.g., fluid simulations) remains untested.
  • Future directions: The authors suggest integrating more robust multi‑object trackers, exploring lightweight fusion alternatives for real‑time inference, and extending the framework to 3‑D video generation or controllable style transfer.

Authors

  • Yang Fei
  • George Stoica
  • Jingyuan Liu
  • Qifeng Chen
  • Ranjay Krishna
  • Xiaojuan Wang
  • Benlin Liu

Paper Information

  • arXiv ID: 2512.11792v1
  • Categories: cs.CV
  • Published: December 12, 2025