[Paper] SemanticGen: Video Generation in Semantic Space
Source: arXiv - 2512.20619v1
Overview
SemanticGen proposes a fresh way to generate videos by first working in a compact semantic space instead of directly manipulating low‑level pixel or VAE latent tokens. By planning the high‑level scene layout first and then filling in details, the model converges faster and scales more efficiently to longer video clips, delivering state‑of‑the‑art visual quality.
Key Contributions
- Two‑stage diffusion pipeline
- A diffusion model creates semantic video features that capture the global motion and scene composition.
- A second diffusion model translates those features into VAE latents, which are finally decoded into pixels.
- Semantic‑first generation reduces redundancy inherent in raw video streams, leading to faster training convergence and lower computational cost for long sequences (see the token‑count sketch after this list).
- Empirical superiority: Extensive benchmarks show SemanticGen outperforms existing VAE‑latent‑only generators and strong baselines on video quality metrics (e.g., FVD, IS).
- Scalable to long videos: The approach maintains quality while generating clips that are significantly longer than those handled efficiently by prior methods.
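To make the redundancy argument concrete, here is a back‑of‑the‑envelope comparison of how many tokens a diffusion backbone attends over when it works on VAE latents versus a much coarser semantic grid. The clip size, compression factors, and patch size below are illustrative assumptions, not numbers from the paper; the point is only that self‑attention cost grows roughly quadratically with token count.

```python
# Illustrative token-count comparison (assumed numbers, not from the paper).
# Self-attention cost grows roughly with the square of the token count,
# so shrinking the intermediate representation pays off quadratically.

def num_tokens(frames, height, width, t_stride, s_stride, patch=2):
    """Tokens after temporal/spatial downsampling and patchification."""
    t = frames // t_stride
    h = (height // s_stride) // patch
    w = (width // s_stride) // patch
    return t * h * w

# A hypothetical 10-second, 24 fps, 480x832 clip.
frames, H, W = 240, 480, 832

# Typical video-VAE compression (assumed): 4x temporal, 8x spatial.
vae_tokens = num_tokens(frames, H, W, t_stride=4, s_stride=8)

# A much coarser semantic grid (assumed): 8x temporal, 32x spatial.
sem_tokens = num_tokens(frames, H, W, t_stride=8, s_stride=32)

print(f"VAE-latent tokens: {vae_tokens:,}")
print(f"Semantic tokens:   {sem_tokens:,}")
print(f"Attention cost ratio ~ {(vae_tokens / sem_tokens) ** 2:,.0f}x")
```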
Methodology
- Semantic Feature Extraction
- The authors train a lightweight encoder that maps raw video frames into a high‑level semantic representation (e.g., object layouts, motion cues).
- This representation is far smaller than the full VAE latent space, acting like a “storyboard” for the video.
- Stage‑1 Diffusion (Semantic Generation)
- A diffusion model (similar to Denoising Diffusion Probabilistic Models) learns to sample plausible semantic sequences from random noise, guided by a learned prior over video dynamics.
- Because the space is compact, the diffusion process needs fewer steps to reach a coherent global layout.
- Stage‑2 Diffusion (Detail Generation)
- Conditioned on the generated semantic sequence, a second diffusion model predicts the corresponding VAE latents.
- This model focuses on high‑frequency details (textures, fine motions) while respecting the global plan supplied by stage‑1.
- Decoding
- The VAE decoder converts the latents into pixel frames, producing the final video.
The two‑stage design mirrors how humans storyboard a scene before filling in details, and it sidesteps the need for massive bidirectional attention over thousands of low‑level tokens.
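The sampling flow can be summarized in a minimal sketch. Everything below is an illustrative stand‑in rather than the authors' implementation: the `TinyDenoiser` modules, the feature dimensions, and the simple DDPM‑style ancestral sampler are assumptions used only to show how the Stage‑1 output conditions Stage‑2 before VAE decoding.

```python
# Minimal two-stage sampling sketch (illustrative; not the authors' code).
# Stage 1 denoises a compact semantic sequence; Stage 2 denoises VAE latents
# conditioned on that sequence; a decoder then produces pixels.
import torch
import torch.nn as nn

class TinyDenoiser(nn.Module):
    """Placeholder denoiser: predicts the noise added to its input."""
    def __init__(self, dim, cond_dim=0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + cond_dim + 1, 256), nn.SiLU(), nn.Linear(256, dim)
        )

    def forward(self, x, t, cond=None):
        t_emb = t.float().view(-1, 1)                        # crude timestep embedding
        parts = [x, t_emb] if cond is None else [x, cond, t_emb]
        return self.net(torch.cat(parts, dim=-1))

@torch.no_grad()
def ddpm_sample(model, shape, steps=50, cond=None):
    """Tiny DDPM-style ancestral sampler with a linear beta schedule."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)
    for i in reversed(range(steps)):
        t = torch.full((shape[0],), i, dtype=torch.long)
        eps = model(x, t, cond)
        coef = betas[i] / torch.sqrt(1.0 - alpha_bars[i])
        x = (x - coef * eps) / torch.sqrt(alphas[i])         # posterior mean
        if i > 0:
            x = x + torch.sqrt(betas[i]) * torch.randn_like(x)
    return x

# Assumed sizes: 64-d semantic tokens, 256-d VAE latents per clip (toy scale).
SEM_DIM, LAT_DIM, BATCH = 64, 256, 4
stage1 = TinyDenoiser(SEM_DIM)                      # Stage-1: semantic generation
stage2 = TinyDenoiser(LAT_DIM, cond_dim=SEM_DIM)    # Stage-2: detail generation
vae_decoder = nn.Linear(LAT_DIM, 3 * 16 * 16)       # stand-in for a real VAE decoder

semantics = ddpm_sample(stage1, (BATCH, SEM_DIM))                # plan the video
latents = ddpm_sample(stage2, (BATCH, LAT_DIM), cond=semantics)  # fill in details
frames = vae_decoder(latents).view(BATCH, 3, 16, 16)             # decode to pixels
print(frames.shape)  # torch.Size([4, 3, 16, 16])
```

In a real system each denoiser would be a video diffusion backbone and the decoder a proper video VAE, but the control flow mirrors the paper's two‑stage design: sample semantics first, then latents conditioned on them, then decode.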
Results & Findings
| Metric | SemanticGen | Prior SOTA (VAE‑latent) | Gap |
|---|---|---|---|
| FVD (lower better) | 45.2 | 62.7 | -17.5 |
| IS (higher better) | 9.8 | 8.3 | +1.5 |
| Training steps to convergence | 0.6× of baseline | 1.0× | -40% |
| Inference time for 10‑sec video (GPU) | 1.8 s | 3.4 s | -47% |
Key takeaways
- Quality boost across both perceptual (IS) and distributional (FVD) metrics (a sketch of the FVD computation follows this list).
- Training converges ~40% faster, confirming the efficiency of the compact semantic space.
- Inference speedup nearly halves the time needed for long clips, making real‑time or near‑real‑time generation more realistic.
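For context on the FVD numbers above: FVD compares the feature statistics of real and generated clips via a Fréchet distance. The sketch below computes that distance between two feature sets; in practice the features come from a pretrained video encoder (commonly I3D), which is replaced here by random placeholders so the snippet runs on its own.

```python
# Frechet distance between real and generated video features (FVD-style).
# In practice `real` / `fake` come from a pretrained video encoder (e.g., I3D);
# random vectors are used here just so the sketch is self-contained.
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(real_feats, gen_feats):
    mu_r, mu_g = real_feats.mean(0), gen_feats.mean(0)
    sigma_r = np.cov(real_feats, rowvar=False)
    sigma_g = np.cov(gen_feats, rowvar=False)
    covmean = sqrtm(sigma_r @ sigma_g)
    if np.iscomplexobj(covmean):   # numerical noise can yield tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean)

rng = np.random.default_rng(0)
real = rng.normal(size=(512, 400))          # 512 clips, 400-d features (placeholder)
fake = rng.normal(loc=0.1, size=(512, 400))
print(f"Frechet distance: {frechet_distance(real, fake):.2f}")
```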
Qualitative examples in the paper show smoother motion transitions and better preservation of object identities over extended durations.
Practical Implications
- Content creation pipelines (e.g., short‑form video ads, game cinematics) can adopt SemanticGen to prototype longer sequences without prohibitive GPU budgets.
- Interactive tools: Because the semantic stage can be edited (e.g., swapping object layouts), developers can build “semantic sliders” that let users steer video generation at a high level before rendering final frames (see the sketch after this list).
- Edge‑device deployment: The reduced diffusion steps and smaller intermediate representations lower memory footprints, opening the door to on‑device video synthesis for AR/VR experiences.
- Data‑efficient training: Faster convergence means fewer GPU‑hours, which is attractive for startups or research groups with limited compute resources.
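As a rough illustration of the “semantic sliders” idea, the snippet below blends two previously generated semantic plans and hands the result to a stage‑2 stand‑in for re‑rendering. The `blend_semantics` and `stage2_sample` functions, the 30×64 plan shape, and the linear interpolation are all hypothetical; the paper only notes that the semantic stage is editable, it does not define such an API.

```python
# Hypothetical "semantic slider": interpolate between two semantic plans
# before detail generation. The stage-2 call is a stand-in, not a real API.
import numpy as np

def blend_semantics(plan_a, plan_b, slider):
    """Linear blend of two semantic sequences; slider in [0, 1]."""
    slider = float(np.clip(slider, 0.0, 1.0))
    return (1.0 - slider) * plan_a + slider * plan_b

def stage2_sample(semantics):
    """Placeholder for the detail-generation diffusion model."""
    return semantics  # a real system would return VAE latents here

# Two previously generated semantic plans (random placeholders).
rng = np.random.default_rng(7)
plan_a = rng.normal(size=(30, 64))  # 30 time steps x 64-d semantic tokens (assumed)
plan_b = rng.normal(size=(30, 64))

for s in (0.0, 0.5, 1.0):
    edited = blend_semantics(plan_a, plan_b, s)
    latents = stage2_sample(edited)  # only the detail stage re-runs per edit
    print(f"slider={s:.1f} -> latents shape {latents.shape}")
```

The appeal of this pattern is that the global plan is reused across edits, so users manipulate a compact, high‑level representation before any pixels are rendered.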
Limitations & Future Work
- Semantic encoder dependence: The quality of the final video hinges on how well the semantic features capture scene dynamics; rare or highly complex motions may still be under‑represented.
- Two‑stage overhead: While each stage is cheaper than a monolithic VAE‑latent diffusion, the pipeline introduces extra engineering complexity (training two diffusion models, synchronizing them).
- Generalization to diverse domains: Experiments focus on natural video datasets; applying the method to highly stylized or domain‑specific content (e.g., medical imaging, scientific visualization) may require custom semantic encoders.
Future directions suggested by the authors include:
- Learning joint semantic‑latent diffusion to reduce the need for a separate encoder.
- Incorporating user‑controlled conditioning (text, sketches) directly into the semantic stage.
- Extending the framework to multimodal generation (audio‑video sync, text‑to‑video).
SemanticGen demonstrates that stepping back to think “what’s happening” before “how it looks” can dramatically improve video synthesis. For developers eager to embed generative video into products, the paper offers a practical roadmap to faster, cheaper, and higher‑quality generation.
Authors
- Jianhong Bai
- Xiaoshi Wu
- Xintao Wang
- Fu Xiao
- Yuanxing Zhang
- Qinghe Wang
- Xiaoyu Shi
- Menghan Xia
- Zuozhu Liu
- Haoji Hu
- Pengfei Wan
- Kun Gai
Paper Information
- arXiv ID: 2512.20619v1
- Categories: cs.CV
- Published: December 23, 2025