[Paper] OmniView: An All-Seeing Diffusion Model for 3D and 4D View Synthesis
Source: arXiv - 2512.10940v1
Overview
OmniView is a single diffusion model that can generate consistent 3‑D scenes and 4‑D videos while giving developers fine‑grained control over camera motion, time, and visual prompts. By decoupling the representation of space, time, and view conditions, the authors show that one network can handle a wide spectrum of tasks (novel‑view synthesis from static or dynamic inputs, trajectory extrapolation, and text‑ or image‑driven video creation with arbitrary camera paths) without needing separate, task‑specific models.
Key Contributions
- Unified 4‑D diffusion framework that jointly learns spatial, temporal, and view conditioning, eliminating the need for multiple specialized models.
- Modular condition representation: separate embeddings for scene geometry, motion, and camera pose enable arbitrary combinations (e.g., static image + dynamic camera, video + new viewpoint).
- State‑of‑the‑art performance on several benchmark suites, surpassing dedicated baselines by up to 33 % (LLFF multiview NVS), 60 % (Neural 3D Video dynamic NVS), and 20 % (RE‑10K static camera control).
- Significant reduction in camera trajectory error (≈ 4×) for text‑conditioned video generation, demonstrating closer adherence to user‑specified camera motion.
- Open‑source release of code, pretrained weights, and an interactive demo, encouraging rapid adoption and further research.
Methodology
OmniView builds on the latent diffusion architecture but introduces three orthogonal conditioning streams, summarized below (a code sketch of the corresponding encoders follows the table):
| Conditioning | What it encodes | How it’s fed to the model |
|---|---|---|
| Space | 3‑D geometry or static scene layout (e.g., depth maps, point clouds) | Embedded as a spatial token sequence that aligns with the latent image grid. |
| Time | Temporal dynamics (frame index, motion vectors) | Injected via sinusoidal time embeddings, similar to video diffusion models. |
| View | Camera pose (position, orientation, focal length) | Represented as a 6‑DoF vector, projected into a learned view‑embedding space. |
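To make the three streams concrete, here is a minimal PyTorch‑style sketch of how such encoders could look. It is an illustration under stated assumptions, not the authors' implementation: the class name `ConditionEncoders`, the depth‑map input for the space stream, the 7‑number pose vector (translation, rotation, focal length), and all dimensions are hypothetical.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionEncoders(nn.Module):
    """Illustrative encoders for the space / time / view streams (not the paper's code)."""

    def __init__(self, dim=320, grid_hw=(32, 32)):
        super().__init__()
        self.dim, self.grid_hw = dim, grid_hw
        # Space: project a depth map (or other layout signal) onto the latent grid as tokens.
        self.space_proj = nn.Conv2d(1, dim, kernel_size=1)
        # View: 6-DoF pose plus focal length (7 numbers) -> learned view embedding.
        self.view_proj = nn.Sequential(nn.Linear(7, dim), nn.SiLU(), nn.Linear(dim, dim))

    def encode_space(self, depth):                        # depth: (B, 1, H, W)
        d = F.interpolate(depth, size=self.grid_hw, mode="bilinear", align_corners=False)
        return self.space_proj(d).flatten(2).transpose(1, 2)   # (B, H*W, dim), aligned with latents

    def encode_time(self, t_idx):                         # t_idx: (B,) frame indices
        half = self.dim // 2                              # standard sinusoidal embedding
        freqs = torch.exp(-math.log(10000.0) * torch.arange(half, device=t_idx.device) / half)
        ang = t_idx.float()[:, None] * freqs[None, :]
        return torch.cat([ang.sin(), ang.cos()], dim=-1)  # (B, dim)

    def encode_view(self, pose):                          # pose: (B, 7)
        return self.view_proj(pose)                       # (B, dim)
```

One natural design choice, assumed here, is that the space tokens live on the same grid as the latent image while time and view each collapse to a single token, so any subset can be concatenated without changing the backbone.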
During training, the model receives randomly sampled triples (space, time, view) from a heterogeneous 4‑D dataset that mixes static multiview captures, dynamic scenes, and text‑to‑video clips. The diffusion loss is computed as usual, but the conditioning tokens are concatenated with the latent image tokens, allowing the UNet to attend to any subset of conditions at inference time.
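The training recipe above (random space/time/view triples, conditioning tokens concatenated with the latent tokens, a standard diffusion loss) can be sketched roughly as follows. The dropout probabilities, the `denoiser(noisy, t, cond_tokens)` interface, and the DDPM noising details are assumptions for illustration rather than the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def training_step(denoiser, encoders, latents, depth, t_idx, pose, alphas_cumprod):
    """One illustrative diffusion training step; not the authors' implementation.

    latents:        (B, C, H, W) VAE latents of the target frame
    alphas_cumprod: (T,) precomputed DDPM noise schedule
    """
    B = latents.shape[0]
    noise = torch.randn_like(latents)
    t = torch.randint(0, alphas_cumprod.shape[0], (B,), device=latents.device)
    a = alphas_cumprod[t].view(B, 1, 1, 1)
    noisy = a.sqrt() * latents + (1 - a).sqrt() * noise        # standard DDPM forward process

    # Assemble the conditioning sequence; randomly dropping streams (probability assumed)
    # teaches the network to attend to any subset of conditions at inference time.
    cond = []
    if depth is not None and torch.rand(()) > 0.1:
        cond.append(encoders.encode_space(depth))               # (B, N, C) spatial tokens
    if t_idx is not None and torch.rand(()) > 0.1:
        cond.append(encoders.encode_time(t_idx)[:, None, :])    # (B, 1, C) time token
    if pose is not None and torch.rand(()) > 0.1:
        cond.append(encoders.encode_view(pose)[:, None, :])     # (B, 1, C) view token
    cond_tokens = torch.cat(cond, dim=1) if cond else None

    pred_noise = denoiser(noisy, t, cond_tokens)                # backbone attends to cond_tokens
    return F.mse_loss(pred_noise, noise)
```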
Because the three condition types are independent, the same network can be invoked with any combination (a short usage sketch follows this list):
- Static → New View: supply space + target view, leave time blank.
- Dynamic → New View: supply space + time + target view.
- Text → Video + Camera: supply text prompt + view trajectory, optionally a seed frame for style guidance.
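For concreteness, the three patterns might look like this against a hypothetical `omniview.sample(...)` entry point; the function name and its arguments are illustrative, not the released API.

```python
# Hypothetical sampling interface; names and arguments are illustrative, not the released API.

# Static -> new view: geometry plus a target pose, no time stream.
frames = omniview.sample(space=depth_map, view=target_pose)

# Dynamic -> new view: add frame indices so the motion is re-rendered from the new camera.
frames = omniview.sample(space=depth_maps, time=frame_indices, view=target_pose)

# Text -> video with camera control: prompt plus a full pose trajectory,
# optionally seeded with a reference frame for style.
frames = omniview.sample(prompt="a fox running through fresh snow",
                         view=pose_trajectory,
                         seed_frame=reference_image)
```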
The authors also introduce a trajectory consistency regularizer that penalizes deviations between the predicted camera pose embeddings and the ground‑truth trajectory, which is key to the observed reduction in trajectory error.
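The summary does not spell out the exact form of this regularizer; one plausible instantiation, assuming an L2 penalty on the pose embeddings along a trajectory of K frames with weight λ, would be:

```latex
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{\text{diffusion}}
  + \lambda \sum_{k=1}^{K} \left\lVert \hat{e}^{\,\text{view}}_{k} - e^{\,\text{view}}_{k} \right\rVert_2^2
```

where the first term is the standard diffusion loss, the hatted quantity is the predicted camera‑pose embedding at frame k, and the unhatted one is its ground‑truth counterpart.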
Results & Findings
| Benchmark | Task | Improvement over Best Specialized Model | Qualitative Effect |
|---|---|---|---|
| LLFF (multiview NVS) | Novel view synthesis from static multiview inputs | PSNR ↑ 33 % | Higher fidelity reconstructions, sharper edges |
| Neural 3D Video | Dynamic scene NVS (moving objects) | PSNR ↑ 60 % | Better handling of motion blur and occlusions |
| RE‑10K | Static camera control (single image → video) | PSNR ↑ 20 % | Smoother temporal coherence |
| Text‑to‑Video (camera‑controlled) | Follow user‑specified trajectory | Trajectory error ↓ 4× | Video follows the intended path much more faithfully |
Qualitatively, OmniView produces videos where camera motion feels natural, even when the underlying scene is generated from a single image or a short clip. The model also demonstrates zero‑shot generalization: it can synthesize a novel view of a scene it has never seen during training, simply by providing a depth estimate and a new pose.
Practical Implications
- Rapid prototyping of AR/VR content – developers can feed a handful of reference images or a short video and instantly generate immersive 360° experiences with custom camera paths.
- Automated video editing – integrate OmniView into pipelines to re‑frame existing footage, create smooth dolly‑in/out effects, or generate missing frames for stabilization.
- Game asset generation – generate consistent sprite sheets or cutscene videos from concept art, reducing manual animation effort.
- Content moderation & synthetic data – produce diverse, camera‑controlled synthetic datasets for training perception models (e.g., autonomous driving) without hand‑crafting multiple scene variants.
- Creative tools – plug into text‑to‑video editors (e.g., Runway, Adobe) to give artists precise control over camera choreography while keeping the diffusion‑generated visual quality.
Because OmniView is a single general‑purpose model that runs on standard GPU hardware (the authors report ~2 fps for 512×512 video generation on an RTX 3090), integrating it into existing workflows is far less cumbersome than maintaining a suite of specialized models.
Limitations & Future Work
- Training data bias – the model inherits the distribution of the mixed 4‑D dataset; exotic camera rigs or extreme lighting conditions may still produce artifacts.
- Resolution ceiling – current experiments top out at 512×512; scaling to 4K video will require memory‑efficient diffusion tricks or cascaded up‑sampling.
- Real‑time interactivity – while inference is fast for offline generation, true real‑time control (e.g., live AR) remains out of reach.
- Explicit geometry – OmniView treats depth as an auxiliary condition; future work could integrate a learned 3‑D representation (NeRF‑style) for tighter geometric consistency.
- Broader modality conditioning – extending the conditioning framework to audio, haptics, or semantic maps could unlock richer multimodal synthesis.
The authors plan to explore larger, more diverse training corpora, efficient diffusion samplers, and tight coupling with neural rendering pipelines to push the boundaries of generalist 4‑D generation.
Authors
- Xiang Fan
- Sharath Girish
- Vivek Ramanujan
- Chaoyang Wang
- Ashkan Mirzaei
- Petr Sushko
- Aliaksandr Siarohin
- Sergey Tulyakov
- Ranjay Krishna
Paper Information
- arXiv ID: 2512.10940v1
- Categories: cs.CV, cs.AI
- Published: December 11, 2025