[Paper] Efficient Camera-Controlled Video Generation of Static Scenes via Sparse Diffusion and 3D Rendering
Source: arXiv - 2601.09697v1
Overview
The paper presents SRENDER, a new pipeline that turns a handful of diffusion‑generated keyframes into a full‑length, camera‑controlled video of a static scene. By reconstructing a 3‑D representation from those keyframes and rendering the missing frames, the authors achieve a >40× speed‑up over pure diffusion video models while preserving visual quality and temporal coherence, an important step toward real‑time generative video for VR/AR, robotics, and interactive media.
Key Contributions
- Sparse‑keyframe generation: Uses a diffusion model only on a small, adaptive set of frames instead of every frame in the video.
- 3‑D lifting & rendering: Converts the keyframes into a unified 3‑D scene (NeRF‑style representation) and renders intermediate viewpoints to fill the video.
- Adaptive keyframe predictor: A lightweight network estimates how many keyframes are needed for a given camera trajectory, allocating compute where motion is complex.
- Speed‑efficiency breakthrough: Demonstrates >40× faster generation of 20‑second clips compared with state‑of‑the‑art diffusion video baselines, with comparable perceptual quality.
- Temporal consistency by design: Geometric reconstruction enforces scene‑wide consistency, eliminating flicker that often plagues frame‑by‑frame diffusion.
Methodology
- Input – A static scene description and a desired camera path (e.g., a 6‑DoF trajectory).
- Keyframe selection – The adaptive predictor estimates the minimal number of keyframes needed to capture the complexity of the camera motion (a toy version of this step is sketched after this list).
- Diffusion generation – A pretrained text‑to‑image diffusion model (e.g., Stable Diffusion) synthesizes those keyframes conditioned on the camera pose.
- 3‑D reconstruction – The keyframes are fed into a sparse neural radiance field (NeRF) that learns a compact 3‑D representation of the scene. Because only a few views are used, the training is fast and memory‑light.
- Rendering – The NeRF is queried at every intermediate camera pose to produce the missing frames, yielding a smooth video.
- Post‑processing – Optional refinement (e.g., depth‑aware upsampling) cleans up artifacts and aligns colors across frames.
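To make the keyframe‑selection step concrete, below is a minimal, hypothetical sketch in Python. It is not the paper's learned predictor (described only as a lightweight network); it simply proxies trajectory complexity by the accumulated rotation and translation along the camera path and maps that to a keyframe budget in the 3–12 range reported in the results. Every name and weighting here is an illustrative assumption.

```python
import numpy as np

def keyframe_budget(poses, min_k=3, max_k=12):
    """Toy stand-in for the paper's learned keyframe predictor.

    poses: (N, 4, 4) array of camera-to-world matrices sampled along the
    desired trajectory. Returns a keyframe count between min_k and max_k
    that grows with how much the camera rotates and translates.
    """
    positions = poses[:, :3, 3]      # camera centres
    rotations = poses[:, :3, :3]     # orientation matrices

    # Accumulated translation between consecutive poses (scene units).
    trans = np.linalg.norm(np.diff(positions, axis=0), axis=1).sum()

    # Accumulated rotation angle between consecutive poses (radians),
    # from the relative rotations R_t @ R_{t-1}^T.
    rel = np.einsum('nij,nkj->nik', rotations[1:], rotations[:-1])
    cos_theta = np.clip((np.trace(rel, axis1=1, axis2=2) - 1.0) / 2.0, -1.0, 1.0)
    rot = np.arccos(cos_theta).sum()

    # Map complexity to a budget; the weighting and scale are arbitrary.
    complexity = trans + 2.0 * rot
    k = min_k + int(round((max_k - min_k) * np.tanh(complexity / 10.0)))
    return int(np.clip(k, min_k, max_k))
```

A simple linear pan accumulates little rotation and lands near the lower bound, while an erratic trajectory pushes the count toward the upper bound, mirroring the adaptivity behaviour described in the results.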
The whole pipeline is modular: any diffusion model can be swapped in, and the 3‑D renderer can be replaced with other view‑synthesis techniques, making it developer‑friendly. A minimal sketch of this modular structure follows.
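Under assumed interfaces, the modular structure could look roughly like the skeleton below: the diffusion model, the 3‑D lifting step, and the renderer are injected as interchangeable callables, and only the keyframes ever touch the diffusion model. This is a sketch of the control flow, not the authors' code.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence

@dataclass
class SparseRenderPipeline:
    """Skeleton of the modular pipeline; every component is injected."""
    generate_keyframe: Callable[[str, Any], Any]       # (prompt, camera pose) -> image
    fit_scene: Callable[[List[Any], List[Any]], Any]   # (keyframe images, poses) -> 3-D scene
    render: Callable[[Any, Any], Any]                  # (scene, camera pose) -> rendered frame

    def run(self, prompt: str, trajectory: Sequence[Any],
            keyframe_ids: Sequence[int]) -> List[Any]:
        # 1. Run the expensive diffusion model only on the sparse keyframes.
        key_poses = [trajectory[i] for i in keyframe_ids]
        key_images = [self.generate_keyframe(prompt, pose) for pose in key_poses]

        # 2. Lift the keyframes into a single shared 3-D scene (e.g. a sparse NeRF).
        scene = self.fit_scene(key_images, key_poses)

        # 3. Render every pose along the trajectory from that shared geometry,
        #    which is what enforces temporal consistency across frames.
        return [self.render(scene, pose) for pose in trajectory]
```

Because `generate_keyframe` runs only once per keyframe while `render` runs once per frame, the cost profile matches the amortization argument in the results below.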
Results & Findings
| Metric | Diffusion‑only baseline | SRENDER (sparse keyframes) |
|---|---|---|
| Generation time (20 s video) | ~30 min (GPU) | ~45 s (GPU) |
| FVD (Fréchet Video Distance, lower is better) | 210 | 225 (≈7% higher, slightly worse) |
| Temporal stability (t‑LPIPS, lower is better; see sketch below) | 0.12 | 0.09 |
| User study (visual fidelity) | 84 % preferred | 81 % preferred |
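For context on the temporal‑stability row, t‑LPIPS is commonly computed as the mean LPIPS distance between consecutive frames (lower means less flicker). The sketch below assumes that definition and frames supplied as a float tensor in [-1, 1]; the paper may use a different variant.

```python
import torch
import lpips  # pip install lpips

def temporal_lpips(frames: torch.Tensor) -> float:
    """Mean LPIPS between consecutive frames.

    frames: (T, 3, H, W) tensor with values in [-1, 1].
    """
    loss_fn = lpips.LPIPS(net='alex')         # perceptual distance network
    with torch.no_grad():
        d = loss_fn(frames[:-1], frames[1:])  # distance for each adjacent pair
    return d.mean().item()
```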
- Speed: The 40× speed‑up comes from amortizing the heavy diffusion cost over hundreds of rendered frames (a back‑of‑envelope check follows this list).
- Quality: The slight increase in FVD is offset by a noticeable gain in temporal stability, thanks to the shared 3‑D geometry.
- Adaptivity: For simple linear pans, only 3–4 keyframes are enough; for erratic trajectories, the predictor raises the count to ~12, still far fewer than frame‑by‑frame diffusion.
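As a back‑of‑envelope check of the amortization argument (referenced in the Speed bullet above), the snippet below reproduces the ~40× figure from the table and counts diffusion calls per clip; the 24 fps frame rate is an assumption, not a number from the paper.

```python
# Reported wall-clock times from the results table.
baseline_time_s = 30 * 60              # ~30 min, diffusion-only baseline
srender_time_s = 45                    # ~45 s, SRENDER
print(baseline_time_s / srender_time_s)        # -> 40.0 (the ~40x speed-up)

# Diffusion calls per 20 s clip, assuming 24 fps (hypothetical frame rate).
frames = 24 * 20                       # 480 frames if every frame is diffused
keyframes = 12                         # upper end of the adaptive budget
print(frames, keyframes, frames / keyframes)   # -> 480 12 40.0
```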
Practical Implications
- Real‑time VR/AR content creation – Developers can generate on‑the‑fly video backdrops that follow a user’s head motion without pre‑rendering every angle.
- Embodied AI simulation – Robots can request scene visualizations for new viewpoints instantly, useful for planning and perception research.
- Interactive media & games – Procedural cutscenes or cinematic replays can be synthesized on demand, reducing storage footprints.
- Cost reduction – Lower GPU hours translate to cheaper cloud inference, making generative video services more economically viable.
- Plug‑and‑play – Since SRENDER builds on existing diffusion checkpoints, teams can adopt it without retraining massive video diffusion models.
Limitations & Future Work
- Static‑scene assumption: Moving objects or dynamic lighting are not handled; extending to dynamic scenes would require temporal 3‑D models.
- NeRF scalability: Very large or highly detailed environments may need more sophisticated grid‑based or hybrid scene representations to keep rendering fast.
- Keyframe predictor bias: The predictor is trained on a limited set of trajectories; exotic camera motions could still demand more keyframes than anticipated.
- Resolution ceiling: Current experiments focus on 256×256–512×512 outputs; scaling to 4K video will need optimized rendering pipelines.
Future research directions include integrating dynamic NeRFs, exploring diffusion‑guided mesh reconstruction, and building end‑to‑end trainable pipelines that jointly optimize keyframe selection and 3‑D representation for even tighter speed‑quality trade‑offs.
Authors
- Jieying Chen
- Jeffrey Hu
- Joan Lasenby
- Ayush Tewari
Paper Information
- arXiv ID: 2601.09697v1
- Categories: cs.CV
- Published: January 14, 2026