[Paper] MoGAN: Improving Motion Quality in Video Diffusion via Few-Step Motion Adversarial Post-Training
Source: arXiv - 2511.21592v1
Overview
The paper introduces MoGAN, a lightweight post‑training add‑on that dramatically sharpens the motion realism of fast video diffusion models. By attaching a motion‑focused adversarial discriminator to a distilled 3‑step video diffusion backbone, the authors achieve smoother, more coherent dynamics without sacrificing image quality or inference speed.
Key Contributions
- Motion‑centric adversarial post‑training: A DiT‑based optical‑flow discriminator is trained to spot unrealistic motion, providing direct temporal supervision that standard MSE denoising lacks.
- Distribution‑matching regularizer: Keeps the visual fidelity of the original diffusion model intact while the discriminator pushes for better motion.
- Few‑step efficiency: Works on top of a 3‑step distilled video diffusion model, preserving the speed advantage of recent fast samplers.
- Strong empirical gains: Improves motion scores by +7–13 % on VBench and VideoJAM‑Bench compared to both the original 50‑step teacher and the 3‑step distilled model, with comparable or better aesthetic scores.
- Human validation: Preference studies show a clear win for MoGAN on motion quality (52 % vs. 38 % over the teacher; 56 % vs. 29 % over the distilled model).
Methodology
- Base model – Start with a 3‑step distilled video diffusion model (e.g., Wan2.1‑T2V‑1.3B) that already produces high‑quality frames quickly.
- Optical‑flow discriminator – A DiT‑based (Diffusion Transformer) discriminator receives optical‑flow fields extracted from short video clips and learns to classify whether the motion comes from real video or from the diffusion generator (a minimal sketch follows this list).
- Adversarial loss – The generator is fine‑tuned to fool the discriminator, directly encouraging temporally consistent motion.
- Distribution‑matching regularizer – An additional loss term (e.g., KL or feature‑matching) ensures the fine‑tuned generator does not drift away from the original image‑level distribution, preserving sharpness and color fidelity.
- Few‑step post‑training – Only a few epochs of this adversarial fine‑tuning are needed; the underlying diffusion weights remain largely unchanged, keeping inference at 3 steps.
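To make the discriminator bullet concrete, here is a minimal PyTorch sketch of a flow‑based motion critic. The patch tokenization, transformer width, and the placeholder `estimate_flow` function are illustrative assumptions rather than the paper's exact DiT architecture; in practice the flow would come from a frozen off‑the‑shelf estimator such as RAFT.

```python
# Minimal sketch of a flow-based motion discriminator (PyTorch).
# Assumptions, not from the paper: the patch size, width, and generic
# TransformerEncoder stand in for the actual DiT design, and `estimate_flow`
# is a placeholder for a frozen off-the-shelf flow estimator (e.g., RAFT).
import torch
import torch.nn as nn


def estimate_flow(frames: torch.Tensor) -> torch.Tensor:
    """Placeholder: (B, T, 3, H, W) frames -> (B, T-1, 2, H, W) flow.
    A real pipeline would call a frozen optical-flow network here."""
    return frames[:, 1:, :2] - frames[:, :-1, :2]  # dummy stand-in


class FlowDiscriminator(nn.Module):
    """Transformer critic over flow patches: real motion should score high,
    generated motion low."""

    def __init__(self, patch: int = 16, dim: int = 384, depth: int = 6, heads: int = 6):
        super().__init__()
        self.patch = patch
        self.embed = nn.Linear(2 * patch * patch, dim)  # 2 flow channels per patch
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, 1)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        flow = estimate_flow(frames)                    # (B, T-1, 2, H, W)
        b, _, c, _, _ = flow.shape
        p = self.patch
        # split every flow field into non-overlapping p x p patches -> tokens
        patches = flow.unfold(3, p, p).unfold(4, p, p)  # (B, T-1, 2, H/p, W/p, p, p)
        patches = patches.permute(0, 1, 3, 4, 2, 5, 6).reshape(b, -1, c * p * p)
        tokens = self.embed(patches)                    # (B, N, dim)
        return self.head(self.encoder(tokens)).mean(dim=1)  # one realism score per clip
```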
The whole pipeline is a post‑training step, meaning it can be applied to an existing video diffusion model without re‑training from scratch; a sketch of the fine‑tuning loop follows.
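The adversarial loss, the distribution‑matching regularizer, and the few‑step fine‑tuning could be combined roughly as in the sketch below. The non‑saturating GAN losses, the MSE‑based distribution‑matching stand‑in, and names such as `few_step_generator` and `frozen_base` are illustrative assumptions, not the paper's exact objectives.

```python
# Minimal sketch of one adversarial post-training step (PyTorch).
# Assumptions, not from the paper: non-saturating GAN losses, an output-space
# MSE term standing in for the distribution-matching regularizer, and the
# names `few_step_generator`, `frozen_base`, `real_clips` are illustrative.
import torch
import torch.nn.functional as F


def post_train_step(few_step_generator, frozen_base, discriminator,
                    g_opt, d_opt, prompts, real_clips, noise, lam_dm=1.0):
    # --- discriminator update: separate real motion from generated motion ---
    with torch.no_grad():
        fake_clips = few_step_generator(prompts, noise)   # 3-step sampling
    d_real = discriminator(real_clips)
    d_fake = discriminator(fake_clips)
    d_loss = F.softplus(-d_real).mean() + F.softplus(d_fake).mean()
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # --- generator update: fool the motion critic, stay close to the base model ---
    fake_clips = few_step_generator(prompts, noise)
    adv_loss = F.softplus(-discriminator(fake_clips)).mean()
    with torch.no_grad():
        base_clips = frozen_base(prompts, noise)          # original distilled model
    dm_loss = F.mse_loss(fake_clips, base_clips)          # distribution-matching stand-in
    g_loss = adv_loss + lam_dm * dm_loss
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return d_loss.item(), g_loss.item()
```

In the paper the regularizer more plausibly operates on score or feature statistics rather than raw pixels; the MSE term here only conveys the idea of anchoring the fine‑tuned generator to the original distribution.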
Results & Findings
| Benchmark | Teacher (50‑step) | Distilled (3‑step) | MoGAN (3‑step) |
|---|---|---|---|
| VBench – Motion Score | baseline | baseline | +7.3 % over teacher; +13.3 % over distilled |
| VideoJAM‑Bench – Motion Score | baseline | baseline | +7.4 % over teacher; +8.8 % over distilled |
| Aesthetic / Image Quality | baseline | comparable | comparable, sometimes better |
| Human preference (motion) | 38 % (head‑to‑head vs. MoGAN) | 29 % (head‑to‑head vs. MoGAN) | 52 % vs. teacher; 56 % vs. distilled |
Key Takeaways
- MoGAN adds significant motion coherence while keeping the same 3‑step runtime.
- Visual fidelity (sharpness, color, texture) is not degraded; in some cases it even improves due to the regularizer.
- The approach works without any reward model, reinforcement learning, or human preference data, simplifying deployment.
Practical Implications
- Fast video generation pipelines (e.g., content creation tools, game asset pipelines) can adopt MoGAN to get smoother motion without paying the cost of 50‑step diffusion.
- Real‑time or near‑real‑time applications such as AI‑driven video avatars, virtual production, or interactive storytelling benefit from the low latency while avoiding jittery outputs.
- Since MoGAN is a post‑training plug‑in, existing diffusion‑based services can upgrade motion quality with a few extra fine‑tuning hours rather than a full model rebuild.
- The optical‑flow discriminator can be swapped for domain‑specific motion critics (e.g., sports, medical imaging), opening a path to custom motion realism for specialized industries.
Limitations & Future Work
- The method still relies on optical flow as a proxy for motion; very fast motion or occlusions, where flow estimation fails, may limit the gains.
- MoGAN is evaluated on a single backbone (Wan2.1‑T2V‑1.3B); broader validation across other diffusion architectures would strengthen claims.
- The adversarial fine‑tuning introduces training instability typical of GANs; careful hyper‑parameter tuning is required.
- Future directions include exploring multi‑scale discriminators, integrating text‑conditioned motion cues, and extending the approach to higher‑resolution video generation.
Authors
- Haotian Xue
- Qi Chen
- Zhonghao Wang
- Xun Huang
- Eli Shechtman
- Jinrong Xie
- Yongxin Chen
Paper Information
- arXiv ID: 2511.21592v1
- Categories: cs.CV
- Published: November 26, 2025
- PDF: https://arxiv.org/pdf/2511.21592v1