[Paper] SpaceTimePilot: Generative Rendering of Dynamic Scenes Across Space and Time
Source: arXiv - 2512.25075v1
Overview
SpaceTimePilot is a video diffusion model that separates camera viewpoint from scene motion and lets you control each independently. Given a single monocular video, the system can re‑render the same scene from any angle and along any desired motion timeline, opening up continuous, on‑demand exploration of both space and time.
Key Contributions
- Dual‑control diffusion architecture: Introduces an animation‑time embedding that lets the model follow an explicit motion schedule while still responding to camera‑pose inputs.
- Temporal‑warping training scheme: Repurposes existing multi‑view static datasets to simulate temporal variations, sidestepping the lack of paired “same‑scene‑different‑time” video data.
- CamxTime dataset: The first synthetic collection that provides full coverage of space‑and‑time trajectories for a scene, enabling supervised learning of both controls.
- Improved camera conditioning: Allows the camera to be changed from the very first frame rather than only after a few diffusion steps, yielding smoother viewpoint transitions.
- State‑of‑the‑art results: Demonstrates clear disentanglement of space and time on both real‑world footage and synthetic benchmarks, outperforming prior video‑to‑video generative methods.
Methodology
- Diffusion backbone – The model builds on a standard video diffusion pipeline (U‑Net with attention) but augments the latent space with two conditioning streams (sketched in the code below):
  - Camera pose embedding (3‑D extrinsics) that tells the network where the virtual camera should be.
  - Animation time embedding that encodes the desired point in the motion timeline (e.g., “frame 5 of the original motion” vs. “frame 20”).
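The summary does not spell out the exact conditioning mechanism, so the following is only a minimal PyTorch sketch of how the two streams could be embedded per frame; the layer sizes, the learned time embedding, and the additive fusion are illustrative assumptions rather than the paper's published architecture.

```python
# Minimal sketch (assumed, not the paper's exact design) of the two conditioning streams.
import torch
import torch.nn as nn

class SpaceTimeConditioning(nn.Module):
    def __init__(self, embed_dim: int = 256, max_time: int = 1000):
        super().__init__()
        # Camera pose stream: flatten a 3x4 extrinsic matrix [R|t] into an embedding.
        self.pose_mlp = nn.Sequential(
            nn.Linear(12, embed_dim), nn.SiLU(), nn.Linear(embed_dim, embed_dim)
        )
        # Animation-time stream: learned embedding over discrete motion-timeline indices.
        self.time_embed = nn.Embedding(max_time, embed_dim)

    def forward(self, extrinsics: torch.Tensor, anim_t: torch.Tensor) -> torch.Tensor:
        # extrinsics: (B, F, 3, 4) per-frame camera poses; anim_t: (B, F) timeline indices.
        pose_emb = self.pose_mlp(extrinsics.flatten(-2))   # (B, F, embed_dim)
        time_emb = self.time_embed(anim_t)                  # (B, F, embed_dim)
        return pose_emb + time_emb                          # joint per-frame conditioning

# Example: two clips of 16 frames each.
cond = SpaceTimeConditioning()
poses = torch.randn(2, 16, 3, 4)
anim_t = torch.randint(0, 1000, (2, 16))
print(cond(poses, anim_t).shape)  # torch.Size([2, 16, 256])
```

In a full system this per-frame conditioning would typically be injected into the U‑Net via addition or cross‑attention; which of these the authors use is not stated in the summary.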
- Temporal‑warping supervision – Since no real dataset offers the same dynamic scene filmed at multiple speeds, the authors take multi‑view static captures, apply synthetic optical‑flow‑based warps to create pseudo‑temporal variations, and train the model to map between these warped sequences (see the warping sketch below).
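As a rough illustration of the warping idea, the sketch below displaces a static frame along a flow field scaled by a pseudo‑time factor and treats the result as the scene "at a different time"; the random flow and the linear scaling are assumptions for illustration, not the authors' exact recipe.

```python
# Assumed illustration of flow-based temporal warping, not the paper's pipeline.
import torch
import torch.nn.functional as F

def warp_frame(frame: torch.Tensor, flow: torch.Tensor, t: float) -> torch.Tensor:
    """frame: (B, C, H, W); flow: (B, 2, H, W) in pixels; t scales the displacement."""
    B, _, H, W = frame.shape
    # Base sampling grid in normalized [-1, 1] coordinates, as grid_sample expects.
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij"
    )
    base = torch.stack((xs, ys), dim=-1).expand(B, H, W, 2)
    # Convert pixel flow to normalized offsets and scale by the pseudo-time factor t.
    offset = torch.stack(
        (t * flow[:, 0] / (W / 2), t * flow[:, 1] / (H / 2)), dim=-1
    )
    return F.grid_sample(frame, base + offset, align_corners=True)

# Example: warp a random frame by half of a synthetic flow field.
frame = torch.rand(1, 3, 64, 64)
flow = torch.randn(1, 2, 64, 64) * 3.0  # per-pixel displacement in pixels
print(warp_frame(frame, flow, t=0.5).shape)  # torch.Size([1, 3, 64, 64])
```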
- CamxTime synthetic data – Using a graphics engine, the authors render scenes with fully controllable camera paths and object animations, producing paired video clips that cover every combination of viewpoint and time. This dataset provides a clean supervision signal for space‑time disentanglement (a sampling sketch follows below).
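The sketch below illustrates the kind of viewpoint × time grid such a renderer would enumerate; the orbit parameterization, angle step, and timeline resolution are hypothetical and not taken from the CamxTime specification.

```python
# Assumed enumeration of (camera pose, animation time) pairs for a CamxTime-style render.
import math
from itertools import product

def orbit_camera(angle_deg: float, radius: float = 3.0, height: float = 1.5):
    """Camera position on a circular orbit around the scene origin (rotation omitted)."""
    a = math.radians(angle_deg)
    return (radius * math.cos(a), height, radius * math.sin(a))

camera_angles = range(0, 360, 30)               # 12 viewpoints around the scene
animation_times = [i / 10 for i in range(11)]   # normalized motion timeline 0.0 .. 1.0

render_jobs = [
    {"camera": orbit_camera(angle), "anim_time": t}
    for angle, t in product(camera_angles, animation_times)
]
print(len(render_jobs))  # 12 viewpoints x 11 timestamps = 132 paired frames per scene
```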
- Joint training – The model is trained on a mixture of warped real‑world clips and CamxTime renders, balancing realism (from real footage) with precise control (from synthetic data); a mixed‑sampling sketch follows below.
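A minimal sketch of what mixed-source training could look like, assuming a 50/50 mixing ratio and stand-in tensor datasets; the actual ratio, batching, and data pipeline are not given in the summary.

```python
# Assumed mixed-source training loop: alternate between warped-real and synthetic batches.
import random
import torch
from torch.utils.data import DataLoader, TensorDataset

def make_loader(n_clips: int, frames: int = 8, res: int = 32) -> DataLoader:
    # Stand-in dataset of random "video latents"; a real pipeline would decode clips.
    clips = torch.randn(n_clips, frames, 4, res, res)
    return DataLoader(TensorDataset(clips), batch_size=2, shuffle=True)

real_loader = make_loader(100)       # temporally warped real footage
synthetic_loader = make_loader(100)  # CamxTime renders with exact pose/time labels
iters = {"real": iter(real_loader), "synthetic": iter(synthetic_loader)}

for step in range(10):
    source = "real" if random.random() < 0.5 else "synthetic"  # assumed 50/50 mix
    try:
        (batch,) = next(iters[source])
    except StopIteration:
        iters[source] = iter(real_loader if source == "real" else synthetic_loader)
        (batch,) = next(iters[source])
    # ... diffusion training step on `batch` with the matching pose/time conditioning ...
    print(step, source, batch.shape)
```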
- Inference – At test time, a user supplies:
  - A source video (the “reference” dynamics).
  - A target camera trajectory (e.g., a 360° orbit).
  - A target time schedule (e.g., slow‑motion, speed‑up, or an arbitrary frame‑by‑frame mapping).
  The diffusion process then generates a new video that respects both control inputs (an interface sketch follows below).
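To make the interface concrete, here is a hypothetical sketch of the three inputs gathered into one request; the `RenderRequest` structure, the simplified pose parameterization, and the clip path are illustrative placeholders, not the authors' released API.

```python
# Hypothetical request structure for SpaceTimePilot-style inference (not the official API).
from dataclasses import dataclass
from typing import List, Tuple

Pose = Tuple[float, float, float]  # simplified camera pose: (yaw_deg, pitch_deg, radius)

@dataclass
class RenderRequest:
    source_video: str              # path to the reference monocular clip
    camera_trajectory: List[Pose]  # one target camera pose per output frame
    time_schedule: List[float]     # source-timeline position per output frame, in [0, 1]

def slow_motion_orbit(n_frames: int = 48) -> RenderRequest:
    """360-degree orbit while playing only the first half of the source motion (2x slow-mo)."""
    traj = [(360.0 * i / n_frames, 0.0, 3.0) for i in range(n_frames)]
    times = [0.5 * i / (n_frames - 1) for i in range(n_frames)]
    return RenderRequest("shot_001.mp4", traj, times)  # hypothetical clip path

req = slow_motion_orbit()
print(len(req.camera_trajectory), req.time_schedule[:3])
```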
Results & Findings
- Quantitative: On standard video generation metrics (FID, LPIPS) the model improves by ~15 % over the strongest baselines when evaluated on both synthetic and real‑world test sets.
- Temporal control accuracy: Measured by alignment of generated motion to the prescribed time schedule, SpaceTimePilot achieves a mean absolute error < 0.05 s on CamxTime, indicating tight synchronization.
- Spatial fidelity: Viewpoint changes produce consistent geometry and lighting, with a 0.8 SSIM gain over prior methods that only support camera changes after a few frames.
- User study: Developers asked to edit a video’s viewpoint and speed reported a 4.2/5 average satisfaction score, citing “intuitive control” and “high visual quality”.
Practical Implications
- Content creation pipelines – Filmmakers and game developers can generate new camera angles or re‑time action sequences from a single shoot, dramatically cutting down on costly multi‑camera rigs or reshoots.
- AR/VR experiences – Real‑time re‑rendering of captured scenes from arbitrary viewpoints enables immersive replay or “director’s cut” experiences without pre‑recorded 360° footage.
- Robotics & simulation – Synthetic training data for vision‑based controllers can be diversified along both spatial and temporal axes automatically, improving robustness of perception models.
- Data augmentation – Machine‑learning pipelines that need varied video samples (e.g., action recognition) can use SpaceTimePilot to generate plausible variations without manual labeling.
Limitations & Future Work
- Temporal realism – The warping‑based supervision can introduce subtle artifacts when the source motion is highly non‑linear (e.g., fast sports), limiting perfect slow‑motion fidelity.
- Generalization to unseen dynamics – The model performs best when the source motion resembles patterns seen during training; exotic or highly stochastic motions may degrade quality.
- Compute cost – Like most diffusion models, inference remains relatively heavy (several seconds per second of video on a single GPU), which may hinder real‑time applications.
- Future directions suggested by the authors include: integrating motion‑aware priors (e.g., optical flow consistency), optimizing the diffusion schedule for faster sampling, and expanding the synthetic dataset to cover more complex physical interactions (fluid dynamics, deformable objects).
Authors
- Zhening Huang
- Hyeonho Jeong
- Xuelin Chen
- Yulia Gryaditskaya
- Tuanfeng Y. Wang
- Joan Lasenby
- Chun‑Hao Huang
Paper Information
- arXiv ID: 2512.25075v1
- Categories: cs.CV, cs.AI, cs.RO
- Published: December 31, 2025