[Paper] MoGAN: Improving Motion Quality in Video Diffusion via Few-Step Motion Adversarial Post-Training

Published: November 26, 2025 at 12:09 PM EST
3 min read
Source: arXiv - 2511.21592v1

Overview

The paper introduces MoGAN, a lightweight post‑training add‑on that dramatically sharpens the motion realism of fast video diffusion models. By attaching a motion‑focused adversarial discriminator to a distilled 3‑step video diffusion backbone, the authors achieve smoother, more coherent dynamics without sacrificing image quality or inference speed.

Key Contributions

  • Motion‑centric adversarial post‑training: A DiT‑based optical‑flow discriminator is trained to spot unrealistic motion, providing direct temporal supervision that the standard MSE denoising loss lacks.
  • Distribution‑matching regularizer: Keeps the visual fidelity of the original diffusion model intact while the discriminator pushes for better motion (a rough objective sketch follows this list).
  • Few‑step efficiency: Works on top of a 3‑step distilled video diffusion model, preserving the speed advantage of recent fast samplers.
  • Strong empirical gains: Improves motion scores by +7–13 % on VBench and VideoJAM‑Bench compared to both the original 50‑step teacher and the 3‑step distilled model, with comparable or better aesthetic scores.
  • Human validation: Preference studies show a clear win for MoGAN on motion quality (52 % vs. 38 % over the teacher; 56 % vs. 29 % over the distilled model).
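Taken together, the first two contributions amount to a single post‑training objective for the generator. A rough sketch of its shape, in which the weighting λ and the exact form of each term are illustrative assumptions rather than the paper's notation:

```latex
% Illustrative generator objective: motion-adversarial term plus
% distribution-matching regularizer (lambda and exact term forms assumed).
\mathcal{L}_{G}
  = \underbrace{\mathbb{E}_{x \sim p_{G}}\!\big[-D_{\mathrm{flow}}(x)\big]}_{\text{motion adversarial loss}}
  + \lambda\,\underbrace{\mathcal{L}_{\mathrm{DM}}\!\big(p_{G}\,\|\,p_{\mathrm{teacher}}\big)}_{\text{distribution-matching regularizer}}
```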

Methodology

  1. Base model – Start with a 3‑step distilled video diffusion model (e.g., Wan2.1‑T2V‑1.3B) that already produces high‑quality frames quickly.
  2. Optical‑flow discriminator – A DiT (Diffusion Transformer) network receives short video clips, computes optical flow between adjacent frames, and learns to classify whether the motion comes from real video or the diffusion generator (a sketch follows this list).
  3. Adversarial loss – The generator is fine‑tuned to fool the discriminator, directly encouraging temporally consistent motion.
  4. Distribution‑matching regularizer – An additional loss term (e.g., KL or feature‑matching) ensures the fine‑tuned generator does not drift away from the original image‑level distribution, preserving sharpness and color fidelity.
  5. Few‑step post‑training – Only a few epochs of this adversarial fine‑tuning are needed; the underlying diffusion weights remain largely unchanged, keeping inference at 3 steps.
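Below is a minimal sketch of what such a flow‑based motion discriminator could look like, assuming PyTorch: a small transformer encoder stands in for the paper's DiT, torchvision's RAFT supplies the optical flow, and all module names and sizes are illustrative.

```python
# Sketch of a flow-based motion discriminator (illustrative, not the paper's).
import torch
import torch.nn as nn
from torchvision.models.optical_flow import raft_small, Raft_Small_Weights


class FlowMotionDiscriminator(nn.Module):
    """Scores whether the motion in a short clip looks real or generated."""

    def __init__(self, patch: int = 16, dim: int = 384, depth: int = 6):
        super().__init__()
        # Frozen flow estimator; kept differentiable so generator gradients
        # can later flow back through it to the synthesized frames.
        self.flow_net = raft_small(weights=Raft_Small_Weights.DEFAULT)
        for p in self.flow_net.parameters():
            p.requires_grad_(False)
        self.patchify = nn.Conv2d(2, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=6, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, 1)  # real/fake logit

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (B, T, 3, H, W) with pixel values in [-1, 1]
        B, T, C, H, W = clip.shape
        prev = clip[:, :-1].reshape(-1, C, H, W)
        nxt = clip[:, 1:].reshape(-1, C, H, W)
        flow = self.flow_net(prev, nxt)[-1]              # (B*(T-1), 2, H, W)
        tokens = self.patchify(flow).flatten(2).transpose(1, 2)
        feats = self.encoder(tokens).mean(dim=1)         # pool over flow patches
        return self.head(feats).view(B, T - 1).mean(1)   # one score per clip
```

Keeping the flow estimator frozen but differentiable matters: during the generator update, gradients from the motion critic must pass through the flow estimate back into the generated frames.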

The whole pipeline is a post‑training step, meaning it can be applied to any existing video diffusion model without re‑training from scratch.
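To make the adversarial loss and the regularizer concrete, here is a rough sketch of one post‑training iteration. It assumes a hinge GAN loss and takes two hypothetical helpers as arguments: `sample_fn`, which draws clips from the 3‑step generator for a batch of prompts, and `dm_reg_fn`, which computes the distribution‑matching regularizer against the frozen teacher; the paper's exact losses and update schedule may differ.

```python
# Sketch of one motion-adversarial post-training step (hinge loss assumed).
import torch
import torch.nn.functional as F


def motion_adversarial_step(generator, disc, real_clips, prompts,
                            sample_fn, dm_reg_fn, g_opt, d_opt,
                            lambda_dm: float = 1.0):
    # --- Discriminator update: real motion vs. generated motion ---
    with torch.no_grad():
        fake_clips = sample_fn(generator, prompts)        # (B, T, 3, H, W)
    d_loss = (F.relu(1.0 - disc(real_clips)).mean()
              + F.relu(1.0 + disc(fake_clips)).mean())    # hinge GAN loss
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # --- Generator update: fool the motion critic, stay close to the teacher ---
    fake_clips = sample_fn(generator, prompts)            # re-sample with gradients
    adv_loss = -disc(fake_clips).mean()                   # motion adversarial term
    reg_loss = dm_reg_fn(generator, fake_clips, prompts)  # preserves visual fidelity
    g_loss = adv_loss + lambda_dm * reg_loss
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()
```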

Results & Findings

| Benchmark | Teacher (50‑step) | Distilled (3‑step) | MoGAN (3‑step) |
| --- | --- | --- | --- |
| VBench – Motion Score | baseline | baseline | +7.3 % over teacher / +13.3 % over distilled |
| VideoJAM‑Bench – Motion Score | baseline | baseline | +7.4 % over teacher / +8.8 % over distilled |
| Aesthetic / Image Quality | Baseline | Comparable / slightly better | Comparable / sometimes better |
| Human Preference (motion) | 38 % | 29 % | 52 % (vs. teacher) / 56 % (vs. distilled) |

Key takeaways

  • MoGAN adds significant motion coherence while keeping the same 3‑step runtime.
  • Visual fidelity (sharpness, color, texture) is not degraded; in some cases it even improves due to the regularizer.
  • The approach works without any reward model, reinforcement learning, or human preference data, simplifying deployment.

Practical Implications

  • Fast video generation pipelines (e.g., content creation tools, game asset pipelines) can adopt MoGAN to get smoother motion without paying the cost of 50‑step diffusion.
  • Real‑time or near‑real‑time applications such as AI‑driven video avatars, virtual production, or interactive storytelling benefit from the low latency while avoiding jittery outputs.
  • Since MoGAN is a post‑training plug‑in, existing diffusion‑based services can upgrade motion quality with a few extra fine‑tuning hours rather than a full model rebuild.
  • The optical‑flow discriminator can be swapped for domain‑specific motion critics (e.g., sports, medical imaging), opening a path to custom motion realism for specialized industries.

Limitations & Future Work

  • The method still relies on optical flow as a proxy for motion; extremely fast motion or occlusions, where flow estimation fails, may limit the gains.
  • MoGAN is evaluated on a single backbone (Wan2.1‑T2V‑1.3B); broader validation across other diffusion architectures would strengthen claims.
  • The adversarial fine‑tuning introduces training instability typical of GANs; careful hyper‑parameter tuning is required.
  • Future directions include exploring multi‑scale discriminators, integrating text‑conditioned motion cues, and extending the approach to higher‑resolution video generation.

Authors

  • Haotian Xue
  • Qi Chen
  • Zhonghao Wang
  • Xun Huang
  • Eli Shechtman
  • Jinrong Xie
  • Yongxin Chen

Paper Information

  • arXiv ID: 2511.21592v1
  • Categories: cs.CV
  • Published: November 26, 2025