[Paper] SpaceTimePilot: Generative Rendering of Dynamic Scenes Across Space and Time
Source: arXiv - 2512.25075v1
Overview
SpaceTimePilot is a video diffusion model that separates camera viewpoint from scene motion and lets you control each independently. Given a single monocular video, the system can re‑render the same scene from any angle and along any desired motion timeline, opening up continuous, on‑demand exploration of both space and time.
Key Contributions
- Dual‑control diffusion architecture: Introduces an animation‑time embedding that lets the model follow an explicit motion schedule while still responding to camera‑pose inputs.
- Temporal‑warping training scheme: Repurposes existing multi‑view static datasets to simulate temporal variations, sidestepping the lack of paired “same‑scene‑different‑time” video data.
- CamxTime dataset: The first synthetic collection that provides full coverage of space‑and‑time trajectories for a scene, enabling supervised learning of both controls.
- Improved camera conditioning: Allows the camera to be changed from the very first frame rather than only after a few diffusion steps, yielding smoother viewpoint transitions.
- State‑of‑the‑art results: Demonstrates clear disentanglement of space and time on both real‑world footage and synthetic benchmarks, outperforming prior video‑to‑video generative methods.
Methodology
- Diffusion backbone – The model builds on a standard video diffusion pipeline (U‑Net with attention) but augments the latent space with two conditioning streams (sketched in the code below):
  - Camera pose embedding (3‑D extrinsics) that tells the network where the virtual camera should be.
  - Animation time embedding that encodes the desired point in the motion timeline (e.g., “frame 5 of the original motion” vs. “frame 20”).
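The summary does not spell out the exact conditioning mechanism, so the following is only a minimal PyTorch sketch of how the two streams could be embedded per frame; the layer sizes, the learned time embedding, and the additive fusion are illustrative assumptions rather than the paper's published architecture.

```python
# Minimal sketch (assumed, not the paper's exact design) of the two conditioning streams.
import torch
import torch.nn as nn

class SpaceTimeConditioning(nn.Module):
    def __init__(self, embed_dim: int = 256, max_time: int = 1000):
        super().__init__()
        # Camera pose stream: flatten a 3x4 extrinsic matrix [R|t] into an embedding.
        self.pose_mlp = nn.Sequential(
            nn.Linear(12, embed_dim), nn.SiLU(), nn.Linear(embed_dim, embed_dim)
        )
        # Animation-time stream: learned embedding over discrete motion-timeline indices.
        self.time_embed = nn.Embedding(max_time, embed_dim)

    def forward(self, extrinsics: torch.Tensor, anim_t: torch.Tensor) -> torch.Tensor:
        # extrinsics: (B, F, 3, 4) per-frame camera poses; anim_t: (B, F) timeline indices.
        pose_emb = self.pose_mlp(extrinsics.flatten(-2))   # (B, F, embed_dim)
        time_emb = self.time_embed(anim_t)                  # (B, F, embed_dim)
        return pose_emb + time_emb                          # joint per-frame conditioning

# Example: two clips of 16 frames each.
cond = SpaceTimeConditioning()
poses = torch.randn(2, 16, 3, 4)
anim_t = torch.randint(0, 1000, (2, 16))
print(cond(poses, anim_t).shape)  # torch.Size([2, 16, 256])
```

In a full system this per-frame conditioning would typically be injected into the U‑Net via addition or cross‑attention; which of these the authors use is not stated in the summary.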
- Temporal‑warping supervision – Since no real dataset offers the same dynamic scene filmed at multiple speeds, the authors take multi‑view static captures, apply synthetic optical‑flow‑based warps to create pseudo‑temporal variations, and train the model to map between these warped sequences (see the warping sketch below).
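As a rough illustration of the warping idea, the sketch below displaces a static frame along a flow field scaled by a pseudo‑time factor and treats the result as the scene "at a different time"; the random flow and the linear scaling are assumptions for illustration, not the authors' exact recipe.

```python
# Assumed illustration of flow-based temporal warping, not the paper's pipeline.
import torch
import torch.nn.functional as F

def warp_frame(frame: torch.Tensor, flow: torch.Tensor, t: float) -> torch.Tensor:
    """frame: (B, C, H, W); flow: (B, 2, H, W) in pixels; t scales the displacement."""
    B, _, H, W = frame.shape
    # Base sampling grid in normalized [-1, 1] coordinates, as grid_sample expects.
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij"
    )
    base = torch.stack((xs, ys), dim=-1).expand(B, H, W, 2)
    # Convert pixel flow to normalized offsets and scale by the pseudo-time factor t.
    offset = torch.stack(
        (t * flow[:, 0] / (W / 2), t * flow[:, 1] / (H / 2)), dim=-1
    )
    return F.grid_sample(frame, base + offset, align_corners=True)

# Example: warp a random frame by half of a synthetic flow field.
frame = torch.rand(1, 3, 64, 64)
flow = torch.randn(1, 2, 64, 64) * 3.0  # per-pixel displacement in pixels
print(warp_frame(frame, flow, t=0.5).shape)  # torch.Size([1, 3, 64, 64])
```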
- CamxTime synthetic data – Using a graphics engine, the authors render scenes with fully controllable camera paths and object animations, producing paired video clips that cover every combination of viewpoint and time. This dataset provides a clean supervision signal for space‑time disentanglement (a sampling sketch follows below).
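The sketch below illustrates the kind of viewpoint × time grid such a renderer would enumerate; the orbit parameterization, angle step, and timeline resolution are hypothetical and not taken from the CamxTime specification.

```python
# Assumed enumeration of (camera pose, animation time) pairs for a CamxTime-style render.
import math
from itertools import product

def orbit_camera(angle_deg: float, radius: float = 3.0, height: float = 1.5):
    """Camera position on a circular orbit around the scene origin (rotation omitted)."""
    a = math.radians(angle_deg)
    return (radius * math.cos(a), height, radius * math.sin(a))

camera_angles = range(0, 360, 30)               # 12 viewpoints around the scene
animation_times = [i / 10 for i in range(11)]   # normalized motion timeline 0.0 .. 1.0

render_jobs = [
    {"camera": orbit_camera(angle), "anim_time": t}
    for angle, t in product(camera_angles, animation_times)
]
print(len(render_jobs))  # 12 viewpoints x 11 timestamps = 132 paired frames per scene
```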
- Joint training – The model is trained on a mixture of warped real‑world clips and CamxTime renders, balancing realism (from real footage) with precise control (from synthetic data); a mixed‑sampling sketch follows below.
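A minimal sketch of what mixed-source training could look like, assuming a 50/50 mixing ratio and stand-in tensor datasets; the actual ratio, batching, and data pipeline are not given in the summary.

```python
# Assumed mixed-source training loop: alternate between warped-real and synthetic batches.
import random
import torch
from torch.utils.data import DataLoader, TensorDataset

def make_loader(n_clips: int, frames: int = 8, res: int = 32) -> DataLoader:
    # Stand-in dataset of random "video latents"; a real pipeline would decode clips.
    clips = torch.randn(n_clips, frames, 4, res, res)
    return DataLoader(TensorDataset(clips), batch_size=2, shuffle=True)

real_loader = make_loader(100)       # temporally warped real footage
synthetic_loader = make_loader(100)  # CamxTime renders with exact pose/time labels
iters = {"real": iter(real_loader), "synthetic": iter(synthetic_loader)}

for step in range(10):
    source = "real" if random.random() < 0.5 else "synthetic"  # assumed 50/50 mix
    try:
        (batch,) = next(iters[source])
    except StopIteration:
        iters[source] = iter(real_loader if source == "real" else synthetic_loader)
        (batch,) = next(iters[source])
    # ... diffusion training step on `batch` with the matching pose/time conditioning ...
    print(step, source, batch.shape)
```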
- Inference – At test time, a user supplies:
  - A source video (the “reference” dynamics).
  - A target camera trajectory (e.g., a 360° orbit).
  - A target time schedule (e.g., slow‑motion, speed‑up, or an arbitrary frame‑by‑frame mapping).
  The diffusion process then generates a new video that respects both control inputs (an interface sketch follows below).
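To make the interface concrete, here is a hypothetical sketch of the three inputs gathered into one request; the `RenderRequest` structure, the simplified pose parameterization, and the clip path are illustrative placeholders, not the authors' released API.

```python
# Hypothetical request structure for SpaceTimePilot-style inference (not the official API).
from dataclasses import dataclass
from typing import List, Tuple

Pose = Tuple[float, float, float]  # simplified camera pose: (yaw_deg, pitch_deg, radius)

@dataclass
class RenderRequest:
    source_video: str              # path to the reference monocular clip
    camera_trajectory: List[Pose]  # one target camera pose per output frame
    time_schedule: List[float]     # source-timeline position per output frame, in [0, 1]

def slow_motion_orbit(n_frames: int = 48) -> RenderRequest:
    """360-degree orbit while playing only the first half of the source motion (2x slow-mo)."""
    traj = [(360.0 * i / n_frames, 0.0, 3.0) for i in range(n_frames)]
    times = [0.5 * i / (n_frames - 1) for i in range(n_frames)]
    return RenderRequest("shot_001.mp4", traj, times)  # hypothetical clip path

req = slow_motion_orbit()
print(len(req.camera_trajectory), req.time_schedule[:3])
```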
Results & Findings
- Quantitative: On standard video generation metrics (FID, LPIPS) the model improves by ~15 % over the strongest baselines when evaluated on both synthetic and real‑world test sets.
- Temporal control accuracy: Measured by alignment of generated motion to the prescribed time schedule, SpaceTimePilot achieves a mean absolute error < 0.05 s on CamxTime, indicating tight synchronization.
- Spatial fidelity: Viewpoint changes produce consistent geometry and lighting, with a 0.8 SSIM gain over prior methods that only support camera changes after a few frames.
- User study: Developers asked to edit a video’s viewpoint and speed reported a 4.2/5 average satisfaction score, citing “intuitive control” and “high visual quality”.
Practical Implications
- Content creation pipelines – Filmmakers and game developers can generate new camera angles or re‑time action sequences from a single shoot, dramatically cutting down on costly multi‑camera rigs or reshoots.
- AR/VR experiences – Real‑time re‑rendering of captured scenes from arbitrary viewpoints enables immersive replay or “director’s cut” experiences without pre‑recorded 360° footage.
- Robotics & simulation – Synthetic training data for vision‑based controllers can be diversified along both spatial and temporal axes automatically, improving robustness of perception models.
- Data augmentation – Machine‑learning pipelines that need varied video samples (e.g., action recognition) can use SpaceTimePilot to generate plausible variations without manual labeling.
Limitations & Future Work
- Temporal realism – The warping‑based supervision can introduce subtle artifacts when the source motion is highly non‑linear (e.g., fast sports), limiting perfect slow‑motion fidelity.
- Generalization to unseen dynamics – The model performs best when the source motion resembles patterns seen during training; exotic or highly stochastic motions may degrade quality.
- Compute cost – Like most diffusion models, inference remains relatively heavy (several seconds per second of video on a single GPU), which may hinder real‑time applications.
- Future directions suggested by the authors include: integrating motion‑aware priors (e.g., optical flow consistency), optimizing the diffusion schedule for faster sampling, and expanding the synthetic dataset to cover more complex physical interactions (fluid dynamics, deformable objects).
Authors
- Zhening Huang
- Hyeonho Jeong
- Xuelin Chen
- Yulia Gryaditskaya
- Tuanfeng Y. Wang
- Joan Lasenby
- Chun‑Hao Huang
Paper Information
- arXiv ID: 2512.25075v1
- Categories: cs.CV, cs.AI, cs.RO
- Published: December 31, 2025