[Paper] SemanticGen: Video Generation in Semantic Space
Source: arXiv - 2512.20619v1
Overview
SemanticGen proposes a fresh way to generate videos by first working in a compact semantic space instead of directly manipulating low‑level pixel or VAE latent tokens. By planning the high‑level scene layout first and then filling in details, the model converges faster and scales more efficiently to longer video clips, delivering state‑of‑the‑art visual quality.
Key Contributions
- Two‑stage diffusion pipeline
- A diffusion model creates semantic video features that capture the global motion and scene composition.
- A second diffusion model translates those features into VAE latents, which are finally decoded into pixels.
- Semantic‑first generation reduces redundancy inherent in raw video streams, leading to faster training convergence and lower computational cost for long sequences (see the token‑count sketch after this list).
- Empirical superiority: Extensive benchmarks show SemanticGen outperforms existing VAE‑latent‑only generators and strong baselines on video quality metrics (e.g., FVD, IS).
- Scalable to long videos: The approach maintains quality while generating clips that are significantly longer than those handled efficiently by prior methods.
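To make the redundancy argument concrete, here is a back‑of‑the‑envelope comparison of how many tokens a diffusion backbone attends over when it works on VAE latents versus a much coarser semantic grid. The clip size, compression factors, and patch size below are illustrative assumptions, not numbers from the paper; the point is only that self‑attention cost grows roughly quadratically with token count.

```python
# Illustrative token-count comparison (assumed numbers, not from the paper).
# Self-attention cost grows roughly with the square of the token count,
# so shrinking the intermediate representation pays off quadratically.

def num_tokens(frames, height, width, t_stride, s_stride, patch=2):
    """Tokens after temporal/spatial downsampling and patchification."""
    t = frames // t_stride
    h = (height // s_stride) // patch
    w = (width // s_stride) // patch
    return t * h * w

# A hypothetical 10-second, 24 fps, 480x832 clip.
frames, H, W = 240, 480, 832

# Typical video-VAE compression (assumed): 4x temporal, 8x spatial.
vae_tokens = num_tokens(frames, H, W, t_stride=4, s_stride=8)

# A much coarser semantic grid (assumed): 8x temporal, 32x spatial.
sem_tokens = num_tokens(frames, H, W, t_stride=8, s_stride=32)

print(f"VAE-latent tokens: {vae_tokens:,}")
print(f"Semantic tokens:   {sem_tokens:,}")
print(f"Attention cost ratio ~ {(vae_tokens / sem_tokens) ** 2:,.0f}x")
```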
Methodology
- Semantic Feature Extraction
- The authors train a lightweight encoder that maps raw video frames into a high‑level semantic representation (e.g., object layouts, motion cues).
- This representation is far smaller than the full VAE latent space, acting like a “storyboard” for the video.
- Stage‑1 Diffusion (Semantic Generation)
- A diffusion model (similar to Denoising Diffusion Probabilistic Models) learns to sample plausible semantic sequences from random noise, guided by a learned prior over video dynamics.
- Because the space is compact, the diffusion process needs fewer steps to reach a coherent global layout.
- Stage‑2 Diffusion (Detail Generation)
- Conditioned on the generated semantic sequence, a second diffusion model predicts the corresponding VAE latents.
- This model focuses on high‑frequency details (textures, fine motions) while respecting the global plan supplied by stage‑1.
- Decoding
- The VAE decoder converts the latents into pixel frames, producing the final video.
The two‑stage design mirrors how humans storyboard a scene before filling in details, and it sidesteps the need for massive bidirectional attention over thousands of low‑level tokens.
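The sampling flow can be summarized in a minimal sketch. Everything below is an illustrative stand‑in rather than the authors' implementation: the `TinyDenoiser` modules, the feature dimensions, and the simple DDPM‑style ancestral sampler are assumptions used only to show how the Stage‑1 output conditions Stage‑2 before VAE decoding.

```python
# Minimal two-stage sampling sketch (illustrative; not the authors' code).
# Stage 1 denoises a compact semantic sequence; Stage 2 denoises VAE latents
# conditioned on that sequence; a decoder then produces pixels.
import torch
import torch.nn as nn

class TinyDenoiser(nn.Module):
    """Placeholder denoiser: predicts the noise added to its input."""
    def __init__(self, dim, cond_dim=0):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + cond_dim + 1, 256), nn.SiLU(), nn.Linear(256, dim)
        )

    def forward(self, x, t, cond=None):
        t_emb = t.float().view(-1, 1)                        # crude timestep embedding
        parts = [x, t_emb] if cond is None else [x, cond, t_emb]
        return self.net(torch.cat(parts, dim=-1))

@torch.no_grad()
def ddpm_sample(model, shape, steps=50, cond=None):
    """Tiny DDPM-style ancestral sampler with a linear beta schedule."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)
    for i in reversed(range(steps)):
        t = torch.full((shape[0],), i, dtype=torch.long)
        eps = model(x, t, cond)
        coef = betas[i] / torch.sqrt(1.0 - alpha_bars[i])
        x = (x - coef * eps) / torch.sqrt(alphas[i])         # posterior mean
        if i > 0:
            x = x + torch.sqrt(betas[i]) * torch.randn_like(x)
    return x

# Assumed sizes: 64-d semantic tokens, 256-d VAE latents per clip (toy scale).
SEM_DIM, LAT_DIM, BATCH = 64, 256, 4
stage1 = TinyDenoiser(SEM_DIM)                      # Stage-1: semantic generation
stage2 = TinyDenoiser(LAT_DIM, cond_dim=SEM_DIM)    # Stage-2: detail generation
vae_decoder = nn.Linear(LAT_DIM, 3 * 16 * 16)       # stand-in for a real VAE decoder

semantics = ddpm_sample(stage1, (BATCH, SEM_DIM))                # plan the video
latents = ddpm_sample(stage2, (BATCH, LAT_DIM), cond=semantics)  # fill in details
frames = vae_decoder(latents).view(BATCH, 3, 16, 16)             # decode to pixels
print(frames.shape)  # torch.Size([4, 3, 16, 16])
```

In a real system each denoiser would be a video diffusion backbone and the decoder a proper video VAE, but the control flow mirrors the paper's two‑stage design: sample semantics first, then latents conditioned on them, then decode.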
Results & Findings
| Metric | SemanticGen | Prior SOTA (VAE‑latent) | Gap |
|---|---|---|---|
| FVD (lower better) | 45.2 | 62.7 | -17.5 |
| IS (higher better) | 9.8 | 8.3 | +1.5 |
| Training steps to convergence | 0.6× of baseline | 1.0× | -40% |
| Inference time for 10‑sec video (GPU) | 1.8 s | 3.4 s | -47% |
Key takeaways
- Quality boost across both perceptual (IS) and distributional (FVD) metrics (a sketch of the FVD computation follows this list).
- Training converges ~40% faster, confirming the efficiency of the compact semantic space.
- Inference speedup nearly halves the time needed for long clips, making real‑time or near‑real‑time generation more realistic.
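For context on the FVD numbers above: FVD compares the feature statistics of real and generated clips via a Fréchet distance. The sketch below computes that distance between two feature sets; in practice the features come from a pretrained video encoder (commonly I3D), which is replaced here by random placeholders so the snippet runs on its own.

```python
# Frechet distance between real and generated video features (FVD-style).
# In practice `real` / `fake` come from a pretrained video encoder (e.g., I3D);
# random vectors are used here just so the sketch is self-contained.
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(real_feats, gen_feats):
    mu_r, mu_g = real_feats.mean(0), gen_feats.mean(0)
    sigma_r = np.cov(real_feats, rowvar=False)
    sigma_g = np.cov(gen_feats, rowvar=False)
    covmean = sqrtm(sigma_r @ sigma_g)
    if np.iscomplexobj(covmean):   # numerical noise can yield tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean)

rng = np.random.default_rng(0)
real = rng.normal(size=(512, 400))          # 512 clips, 400-d features (placeholder)
fake = rng.normal(loc=0.1, size=(512, 400))
print(f"Frechet distance: {frechet_distance(real, fake):.2f}")
```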
Qualitative examples in the paper show smoother motion transitions and better preservation of object identities over extended durations.
Practical Implications
- Content creation pipelines (e.g., short‑form video ads, game cinematics) can adopt SemanticGen to prototype longer sequences without prohibitive GPU budgets.
- Interactive tools: Because the semantic stage can be edited (e.g., swapping object layouts), developers can build “semantic sliders” that let users steer video generation at a high level before rendering final frames (see the sketch after this list).
- Edge‑device deployment: The reduced diffusion steps and smaller intermediate representations lower memory footprints, opening the door to on‑device video synthesis for AR/VR experiences.
- Data‑efficient training: Faster convergence means fewer GPU‑hours, which is attractive for startups or research groups with limited compute resources.
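As a rough illustration of the “semantic sliders” idea, the snippet below blends two previously generated semantic plans and hands the result to a stage‑2 stand‑in for re‑rendering. The `blend_semantics` and `stage2_sample` functions, the 30×64 plan shape, and the linear interpolation are all hypothetical; the paper only notes that the semantic stage is editable, it does not define such an API.

```python
# Hypothetical "semantic slider": interpolate between two semantic plans
# before detail generation. The stage-2 call is a stand-in, not a real API.
import numpy as np

def blend_semantics(plan_a, plan_b, slider):
    """Linear blend of two semantic sequences; slider in [0, 1]."""
    slider = float(np.clip(slider, 0.0, 1.0))
    return (1.0 - slider) * plan_a + slider * plan_b

def stage2_sample(semantics):
    """Placeholder for the detail-generation diffusion model."""
    return semantics  # a real system would return VAE latents here

# Two previously generated semantic plans (random placeholders).
rng = np.random.default_rng(7)
plan_a = rng.normal(size=(30, 64))  # 30 time steps x 64-d semantic tokens (assumed)
plan_b = rng.normal(size=(30, 64))

for s in (0.0, 0.5, 1.0):
    edited = blend_semantics(plan_a, plan_b, s)
    latents = stage2_sample(edited)  # only the detail stage re-runs per edit
    print(f"slider={s:.1f} -> latents shape {latents.shape}")
```

The appeal of this pattern is that the global plan is reused across edits, so users manipulate a compact, high‑level representation before any pixels are rendered.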
Limitations & Future Work
- Semantic encoder dependence: The quality of the final video hinges on how well the semantic features capture scene dynamics; rare or highly complex motions may still be under‑represented.
- Two‑stage overhead: While each stage is cheaper than a monolithic VAE‑latent diffusion, the pipeline introduces extra engineering complexity (training two diffusion models, synchronizing them).
- Generalization to diverse domains: Experiments focus on natural video datasets; applying the method to highly stylized or domain‑specific content (e.g., medical imaging, scientific visualization) may require custom semantic encoders.
Future directions suggested by the authors include:
- Learning joint semantic‑latent diffusion to reduce the need for a separate encoder.
- Incorporating user‑controlled conditioning (text, sketches) directly into the semantic stage.
- Extending the framework to multimodal generation (audio‑video sync, text‑to‑video).
SemanticGen demonstrates that stepping back to think “what’s happening” before “how it looks” can dramatically improve video synthesis. For developers eager to embed generative video into products, the paper offers a practical roadmap to faster, cheaper, and higher‑quality generation.
Authors
- Jianhong Bai
- Xiaoshi Wu
- Xintao Wang
- Fu Xiao
- Yuanxing Zhang
- Qinghe Wang
- Xiaoyu Shi
- Menghan Xia
- Zuozhu Liu
- Haoji Hu
- Pengfei Wan
- Kun Gai
Paper Information
- arXiv ID: 2512.20619v1
- Categories: cs.CV
- Published: December 23, 2025