[Paper] OmniView: An All-Seeing Diffusion Model for 3D and 4D View Synthesis
Source: arXiv - 2512.10940v1
Overview
OmniView is a single diffusion model that can generate consistent 3‑D scenes and 4‑D videos while giving developers fine‑grained control over camera motion, time, and visual prompts. By decoupling the representation of space, time, and view conditions, the authors show that one network can handle a wide spectrum of tasks (novel‑view synthesis from static or dynamic inputs, trajectory extrapolation, and text‑ or image‑driven video creation with arbitrary camera paths) without needing separate, task‑specific models.
Key Contributions
- Unified 4‑D diffusion framework that jointly learns spatial, temporal, and view conditioning, eliminating the need for multiple specialized models.
- Modular condition representation: separate embeddings for scene geometry, motion, and camera pose enable arbitrary combinations (e.g., static image + dynamic camera, video + new viewpoint).
- State‑of‑the‑art performance on several benchmark suites, surpassing dedicated baselines by up to 33 % (LLFF multiview NVS), 60 % (Neural 3D Video dynamic NVS), and 20 % (RE‑10K static camera control).
- Significant reduction in camera trajectory error (≈ 4×) for text‑conditioned video generation, demonstrating closer adherence to user‑specified camera motion.
- Open‑source release of code, pretrained weights, and an interactive demo, encouraging rapid adoption and further research.
Methodology
OmniView builds on the latent diffusion architecture but introduces three orthogonal conditioning streams, summarized below (a code sketch of the corresponding encoders follows the table):
| Conditioning | What it encodes | How it’s fed to the model |
|---|---|---|
| Space | 3‑D geometry or static scene layout (e.g., depth maps, point clouds) | Embedded as a spatial token sequence that aligns with the latent image grid. |
| Time | Temporal dynamics (frame index, motion vectors) | Injected via sinusoidal time embeddings, similar to video diffusion models. |
| View | Camera pose (position, orientation, focal length) | Represented as a 6‑DoF vector, projected into a learned view‑embedding space. |
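To make the three streams concrete, here is a minimal PyTorch‑style sketch of how such encoders could look. It is an illustration under stated assumptions, not the authors' implementation: the class name `ConditionEncoders`, the depth‑map input for the space stream, the 7‑number pose vector (translation, rotation, focal length), and all dimensions are hypothetical.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionEncoders(nn.Module):
    """Illustrative encoders for the space / time / view streams (not the paper's code)."""

    def __init__(self, dim=320, grid_hw=(32, 32)):
        super().__init__()
        self.dim, self.grid_hw = dim, grid_hw
        # Space: project a depth map (or other layout signal) onto the latent grid as tokens.
        self.space_proj = nn.Conv2d(1, dim, kernel_size=1)
        # View: 6-DoF pose plus focal length (7 numbers) -> learned view embedding.
        self.view_proj = nn.Sequential(nn.Linear(7, dim), nn.SiLU(), nn.Linear(dim, dim))

    def encode_space(self, depth):                        # depth: (B, 1, H, W)
        d = F.interpolate(depth, size=self.grid_hw, mode="bilinear", align_corners=False)
        return self.space_proj(d).flatten(2).transpose(1, 2)   # (B, H*W, dim), aligned with latents

    def encode_time(self, t_idx):                         # t_idx: (B,) frame indices
        half = self.dim // 2                              # standard sinusoidal embedding
        freqs = torch.exp(-math.log(10000.0) * torch.arange(half, device=t_idx.device) / half)
        ang = t_idx.float()[:, None] * freqs[None, :]
        return torch.cat([ang.sin(), ang.cos()], dim=-1)  # (B, dim)

    def encode_view(self, pose):                          # pose: (B, 7)
        return self.view_proj(pose)                       # (B, dim)
```

One natural design choice, assumed here, is that the space tokens live on the same grid as the latent image while time and view each collapse to a single token, so any subset can be concatenated without changing the backbone.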
During training, the model receives randomly sampled triples (space, time, view) from a heterogeneous 4‑D dataset that mixes static multiview captures, dynamic scenes, and text‑to‑video clips. The diffusion loss is computed as usual, but the conditioning tokens are concatenated with the latent image tokens, allowing the UNet to attend to any subset of conditions at inference time.
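The training recipe above (random space/time/view triples, conditioning tokens concatenated with the latent tokens, a standard diffusion loss) can be sketched roughly as follows. The dropout probabilities, the `denoiser(noisy, t, cond_tokens)` interface, and the DDPM noising details are assumptions for illustration rather than the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def training_step(denoiser, encoders, latents, depth, t_idx, pose, alphas_cumprod):
    """One illustrative diffusion training step; not the authors' implementation.

    latents:        (B, C, H, W) VAE latents of the target frame
    alphas_cumprod: (T,) precomputed DDPM noise schedule
    """
    B = latents.shape[0]
    noise = torch.randn_like(latents)
    t = torch.randint(0, alphas_cumprod.shape[0], (B,), device=latents.device)
    a = alphas_cumprod[t].view(B, 1, 1, 1)
    noisy = a.sqrt() * latents + (1 - a).sqrt() * noise        # standard DDPM forward process

    # Assemble the conditioning sequence; randomly dropping streams (probability assumed)
    # teaches the network to attend to any subset of conditions at inference time.
    cond = []
    if depth is not None and torch.rand(()) > 0.1:
        cond.append(encoders.encode_space(depth))               # (B, N, C) spatial tokens
    if t_idx is not None and torch.rand(()) > 0.1:
        cond.append(encoders.encode_time(t_idx)[:, None, :])    # (B, 1, C) time token
    if pose is not None and torch.rand(()) > 0.1:
        cond.append(encoders.encode_view(pose)[:, None, :])     # (B, 1, C) view token
    cond_tokens = torch.cat(cond, dim=1) if cond else None

    pred_noise = denoiser(noisy, t, cond_tokens)                # backbone attends to cond_tokens
    return F.mse_loss(pred_noise, noise)
```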
Because the three condition types are independent, the same network can be invoked with any combination (a short usage sketch follows this list):
- Static → New View: supply space + target view, leave time blank.
- Dynamic → New View: supply space + time + target view.
- Text → Video + Camera: supply text prompt + view trajectory, optionally a seed frame for style guidance.
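For concreteness, the three patterns might look like this against a hypothetical `omniview.sample(...)` entry point; the function name and its arguments are illustrative, not the released API.

```python
# Hypothetical sampling interface; names and arguments are illustrative, not the released API.

# Static -> new view: geometry plus a target pose, no time stream.
frames = omniview.sample(space=depth_map, view=target_pose)

# Dynamic -> new view: add frame indices so the motion is re-rendered from the new camera.
frames = omniview.sample(space=depth_maps, time=frame_indices, view=target_pose)

# Text -> video with camera control: prompt plus a full pose trajectory,
# optionally seeded with a reference frame for style.
frames = omniview.sample(prompt="a fox running through fresh snow",
                         view=pose_trajectory,
                         seed_frame=reference_image)
```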
The authors also introduce a trajectory consistency regularizer that penalizes deviations between the predicted camera pose embeddings and the ground‑truth trajectory, which is key to the observed reduction in trajectory error.
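The summary does not spell out the exact form of this regularizer; one plausible instantiation, assuming an L2 penalty on the pose embeddings along a trajectory of K frames with weight λ, would be:

```latex
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{\text{diffusion}}
  + \lambda \sum_{k=1}^{K} \left\lVert \hat{e}^{\,\text{view}}_{k} - e^{\,\text{view}}_{k} \right\rVert_2^2
```

where the first term is the standard diffusion loss, the hatted quantity is the predicted camera‑pose embedding at frame k, and the unhatted one is its ground‑truth counterpart.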
Results & Findings
| Benchmark | Task | Improvement over Best Specialized Model | Qualitative Effect |
|---|---|---|---|
| LLFF (multiview NVS) | Novel view synthesis from static multiview inputs | PSNR ↑ 33 % | Higher fidelity reconstructions, sharper edges |
| Neural 3D Video | Dynamic scene NVS (moving objects) | PSNR ↑ 60 % | Better handling of motion blur and occlusions |
| RE‑10K | Static camera control (single image → video) | PSNR ↑ 20 % | Smoother temporal coherence |
| Text‑to‑Video (camera‑controlled) | Follow user‑specified trajectory | Trajectory error ↓ 4× | Video follows the intended path much more faithfully |
Qualitatively, OmniView produces videos where camera motion feels natural, even when the underlying scene is generated from a single image or a short clip. The model also demonstrates zero‑shot generalization: it can synthesize a novel view of a scene it has never seen during training, simply by providing a depth estimate and a new pose.
Practical Implications
- Rapid prototyping of AR/VR content – developers can feed a handful of reference images or a short video and instantly generate immersive 360° experiences with custom camera paths.
- Automated video editing – integrate OmniView into pipelines to re‑frame existing footage, create smooth dolly‑in/out effects, or generate missing frames for stabilization.
- Game asset generation – generate consistent sprite sheets or cutscene videos from concept art, reducing manual animation effort.
- Content moderation & synthetic data – produce diverse, camera‑controlled synthetic datasets for training perception models (e.g., autonomous driving) without hand‑crafting multiple scene variants.
- Creative tools – plug into text‑to‑video editors (e.g., Runway, Adobe) to give artists precise control over camera choreography while keeping the diffusion‑generated visual quality.
Because OmniView is a single general‑purpose model that runs on standard GPU hardware (the authors report ~2 fps for 512×512 video generation on an RTX 3090), integrating it into existing workflows is far less cumbersome than maintaining a suite of specialized models.
Limitations & Future Work
- Training data bias – the model inherits the distribution of the mixed 4‑D dataset; exotic camera rigs or extreme lighting conditions may still produce artifacts.
- Resolution ceiling – current experiments top out at 512×512; scaling to 4K video will require memory‑efficient diffusion tricks or cascaded up‑sampling.
- Real‑time interactivity – while inference is fast for offline generation, true real‑time control (e.g., live AR) remains out of reach.
- Explicit geometry – OmniView treats depth as an auxiliary condition; future work could integrate a learned 3‑D representation (NeRF‑style) for tighter geometric consistency.
- Broader modality conditioning – extending the conditioning framework to audio, haptics, or semantic maps could unlock richer multimodal synthesis.
The authors plan to explore larger, more diverse training corpora, efficient diffusion samplers, and tight coupling with neural rendering pipelines to push the boundaries of generalist 4‑D generation.
Authors
- Xiang Fan
- Sharath Girish
- Vivek Ramanujan
- Chaoyang Wang
- Ashkan Mirzaei
- Petr Sushko
- Aliaksandr Siarohin
- Sergey Tulyakov
- Ranjay Krishna
Paper Information
- arXiv ID: 2512.10940v1
- Categories: cs.CV, cs.AI
- Published: December 11, 2025