[Paper] Efficient Camera-Controlled Video Generation of Static Scenes via Sparse Diffusion and 3D Rendering
Source: arXiv - 2601.09697v1
Overview
The paper presents SRENDER, a new pipeline that turns a handful of diffusion‑generated keyframes into a full‑length, camera‑controlled video of a static scene. By reconstructing a 3‑D representation from those keyframes and rendering the missing frames, the authors achieve a >40× speed‑up over pure diffusion video models while preserving visual quality and temporal coherence, an important step toward real‑time generative video for VR/AR, robotics, and interactive media.
Key Contributions
- Sparse‑keyframe generation: Uses a diffusion model only on a small, adaptive set of frames instead of every frame in the video.
- 3‑D lifting & rendering: Converts the keyframes into a unified 3‑D scene (NeRF‑style representation) and renders intermediate viewpoints to fill the video.
- Adaptive keyframe predictor: A lightweight network estimates how many keyframes are needed for a given camera trajectory, allocating compute where motion is complex.
- Speed‑efficiency breakthrough: Demonstrates >40× faster generation of 20‑second clips compared with state‑of‑the‑art diffusion video baselines, with comparable perceptual quality.
- Temporal consistency by design: Geometric reconstruction enforces scene‑wide consistency, eliminating flicker that often plagues frame‑by‑frame diffusion.
Methodology
- Input – A static scene description and a desired camera path (e.g., a 6‑DoF trajectory).
- Keyframe selection – The adaptive predictor estimates the minimal number of keyframes needed to capture the complexity of the camera motion (a toy version of this step is sketched after this list).
- Diffusion generation – A pretrained text‑to‑image diffusion model (e.g., Stable Diffusion) synthesizes those keyframes conditioned on the camera pose.
- 3‑D reconstruction – The keyframes are fed into a sparse neural radiance field (NeRF) that learns a compact 3‑D representation of the scene. Because only a few views are used, the training is fast and memory‑light.
- Rendering – The NeRF is queried at every intermediate camera pose to produce the missing frames, yielding a smooth video.
- Post‑processing – Optional refinement (e.g., depth‑aware upsampling) cleans up artifacts and aligns colors across frames.
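To make the keyframe‑selection step concrete, below is a minimal, hypothetical sketch in Python. It is not the paper's learned predictor (described only as a lightweight network); it simply proxies trajectory complexity by the accumulated rotation and translation along the camera path and maps that to a keyframe budget in the 3–12 range reported in the results. Every name and weighting here is an illustrative assumption.

```python
import numpy as np

def keyframe_budget(poses, min_k=3, max_k=12):
    """Toy stand-in for the paper's learned keyframe predictor.

    poses: (N, 4, 4) array of camera-to-world matrices sampled along the
    desired trajectory. Returns a keyframe count between min_k and max_k
    that grows with how much the camera rotates and translates.
    """
    positions = poses[:, :3, 3]      # camera centres
    rotations = poses[:, :3, :3]     # orientation matrices

    # Accumulated translation between consecutive poses (scene units).
    trans = np.linalg.norm(np.diff(positions, axis=0), axis=1).sum()

    # Accumulated rotation angle between consecutive poses (radians),
    # from the relative rotations R_t @ R_{t-1}^T.
    rel = np.einsum('nij,nkj->nik', rotations[1:], rotations[:-1])
    cos_theta = np.clip((np.trace(rel, axis1=1, axis2=2) - 1.0) / 2.0, -1.0, 1.0)
    rot = np.arccos(cos_theta).sum()

    # Map complexity to a budget; the weighting and scale are arbitrary.
    complexity = trans + 2.0 * rot
    k = min_k + int(round((max_k - min_k) * np.tanh(complexity / 10.0)))
    return int(np.clip(k, min_k, max_k))
```

A simple linear pan accumulates little rotation and lands near the lower bound, while an erratic trajectory pushes the count toward the upper bound, mirroring the adaptivity behaviour described in the results.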
The whole pipeline is modular: any diffusion model can be swapped in, and the 3‑D renderer can be replaced with other view‑synthesis techniques, making it developer‑friendly. A minimal sketch of this modular structure follows.
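Under assumed interfaces, the modular structure could look roughly like the skeleton below: the diffusion model, the 3‑D lifting step, and the renderer are injected as interchangeable callables, and only the keyframes ever touch the diffusion model. This is a sketch of the control flow, not the authors' code.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence

@dataclass
class SparseRenderPipeline:
    """Skeleton of the modular pipeline; every component is injected."""
    generate_keyframe: Callable[[str, Any], Any]       # (prompt, camera pose) -> image
    fit_scene: Callable[[List[Any], List[Any]], Any]   # (keyframe images, poses) -> 3-D scene
    render: Callable[[Any, Any], Any]                  # (scene, camera pose) -> rendered frame

    def run(self, prompt: str, trajectory: Sequence[Any],
            keyframe_ids: Sequence[int]) -> List[Any]:
        # 1. Run the expensive diffusion model only on the sparse keyframes.
        key_poses = [trajectory[i] for i in keyframe_ids]
        key_images = [self.generate_keyframe(prompt, pose) for pose in key_poses]

        # 2. Lift the keyframes into a single shared 3-D scene (e.g. a sparse NeRF).
        scene = self.fit_scene(key_images, key_poses)

        # 3. Render every pose along the trajectory from that shared geometry,
        #    which is what enforces temporal consistency across frames.
        return [self.render(scene, pose) for pose in trajectory]
```

Because `generate_keyframe` runs only once per keyframe while `render` runs once per frame, the cost profile matches the amortization argument in the results below.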
Results & Findings
| Metric | Diffusion‑only baseline | SRENDER (sparse keyframes) |
|---|---|---|
| Generation time (20 s video) | ~30 min (GPU) | ~45 s (GPU) |
| FVD (Fréchet Video Distance, lower is better) | 210 | 225 (≈7% higher, slightly worse) |
| Temporal stability (t‑LPIPS, lower is better; see sketch below) | 0.12 | 0.09 |
| User study (visual fidelity) | 84 % preferred | 81 % preferred |
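For context on the temporal‑stability row, t‑LPIPS is commonly computed as the mean LPIPS distance between consecutive frames (lower means less flicker). The sketch below assumes that definition and frames supplied as a float tensor in [-1, 1]; the paper may use a different variant.

```python
import torch
import lpips  # pip install lpips

def temporal_lpips(frames: torch.Tensor) -> float:
    """Mean LPIPS between consecutive frames.

    frames: (T, 3, H, W) tensor with values in [-1, 1].
    """
    loss_fn = lpips.LPIPS(net='alex')         # perceptual distance network
    with torch.no_grad():
        d = loss_fn(frames[:-1], frames[1:])  # distance for each adjacent pair
    return d.mean().item()
```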
- Speed: The 40× speed‑up comes from amortizing the heavy diffusion cost over hundreds of rendered frames (a back‑of‑envelope check follows this list).
- Quality: The slight increase in FVD is offset by a noticeable gain in temporal stability, thanks to the shared 3‑D geometry.
- Adaptivity: For simple linear pans, only 3–4 keyframes are enough; for erratic trajectories, the predictor raises the count to ~12, still far fewer than frame‑by‑frame diffusion.
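As a back‑of‑envelope check of the amortization argument (referenced in the Speed bullet above), the snippet below reproduces the ~40× figure from the table and counts diffusion calls per clip; the 24 fps frame rate is an assumption, not a number from the paper.

```python
# Reported wall-clock times from the results table.
baseline_time_s = 30 * 60              # ~30 min, diffusion-only baseline
srender_time_s = 45                    # ~45 s, SRENDER
print(baseline_time_s / srender_time_s)        # -> 40.0 (the ~40x speed-up)

# Diffusion calls per 20 s clip, assuming 24 fps (hypothetical frame rate).
frames = 24 * 20                       # 480 frames if every frame is diffused
keyframes = 12                         # upper end of the adaptive budget
print(frames, keyframes, frames / keyframes)   # -> 480 12 40.0
```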
Practical Implications
- Real‑time VR/AR content creation – Developers can generate on‑the‑fly video backdrops that follow a user’s head motion without pre‑rendering every angle.
- Embodied AI simulation – Robots can request scene visualizations for new viewpoints instantly, useful for planning and perception research.
- Interactive media & games – Procedural cutscenes or cinematic replays can be synthesized on demand, reducing storage footprints.
- Cost reduction – Lower GPU hours translate to cheaper cloud inference, making generative video services more economically viable.
- Plug‑and‑play – Since SRENDER builds on existing diffusion checkpoints, teams can adopt it without retraining massive video diffusion models.
Limitations & Future Work
- Static‑scene assumption: Moving objects or dynamic lighting are not handled; extending to dynamic scenes would require temporal 3‑D models.
- NeRF scalability: Very large or highly detailed environments may need more sophisticated grid‑based or hybrid scene representations to keep rendering fast.
- Keyframe predictor bias: The predictor is trained on a limited set of trajectories; exotic camera motions could still demand more keyframes than anticipated.
- Resolution ceiling: Current experiments focus on 256×256–512×512 outputs; scaling to 4K video will need optimized rendering pipelines.
Future research directions include integrating dynamic NeRFs, exploring diffusion‑guided mesh reconstruction, and building end‑to‑end trainable pipelines that jointly optimize keyframe selection and 3‑D representation for even tighter speed‑quality trade‑offs.
Authors
- Jieying Chen
- Jeffrey Hu
- Joan Lasenby
- Ayush Tewari
Paper Information
- arXiv ID: 2601.09697v1
- Categories: cs.CV
- Published: January 14, 2026