[Paper] Spatia: Video Generation with Updatable Spatial Memory

Published: December 17, 2025 at 01:59 PM EST
4 min read

Source: arXiv - 2512.15716v1

Overview

The paper introduces Spatia, a video‑generation framework that maintains a persistent 3‑D point cloud of the scene as “spatial memory.” By continuously updating this memory with visual‑SLAM techniques, Spatia can synthesize long video sequences that stay spatially coherent while still rendering realistic moving objects. The approach bridges the gap between classic 3‑D reconstruction pipelines and modern generative models, opening the door to controllable, 3‑D‑aware video creation.

Key Contributions

  • Explicit spatial memory: Stores a 3‑D point cloud of the scene that survives across generated clips, acting as a global reference for geometry (a minimal data‑structure sketch follows this list).
  • Dynamic‑static disentanglement: Separates static background (handled by the spatial memory) from dynamic foreground (generated by a conventional video diffusion/transformer model).
  • Iterative clip‑wise generation & update: Each short clip is generated conditioned on the current memory, then the memory is refined via a visual SLAM module, enabling long‑term consistency.
  • Camera‑controlled synthesis: Because the memory is a true 3‑D representation, users can explicitly steer the virtual camera (pose, trajectory) during generation.
  • 3‑D‑aware interactive editing: Objects can be added, removed, or repositioned in the point cloud, and the model will re‑render the video accordingly.
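
To make the first and last contributions concrete, here is a minimal sketch of what an explicit, editable spatial memory could look like. The field names (`xyz`, `rgb`, `seen`) and the edit helpers are illustrative assumptions, not Spatia’s actual data structures.

```python
# Minimal sketch of an explicit, editable spatial memory (illustrative only;
# field names and helpers are assumptions, not Spatia's implementation).
from dataclasses import dataclass, field

import numpy as np


@dataclass
class SpatialMemory:
    xyz: np.ndarray = field(default_factory=lambda: np.empty((0, 3)))   # world-space positions
    rgb: np.ndarray = field(default_factory=lambda: np.empty((0, 3)))   # per-point color
    seen: np.ndarray = field(default_factory=lambda: np.empty((0,), dtype=int))  # observation count

    def add_points(self, new_xyz: np.ndarray, new_rgb: np.ndarray) -> None:
        """Append newly reconstructed static surfaces after a memory update."""
        self.xyz = np.vstack([self.xyz, new_xyz])
        self.rgb = np.vstack([self.rgb, new_rgb])
        self.seen = np.concatenate([self.seen, np.ones(len(new_xyz), dtype=int)])

    def prune(self, keep_mask: np.ndarray) -> None:
        """Drop occluded or unreliable points."""
        self.xyz, self.rgb, self.seen = self.xyz[keep_mask], self.rgb[keep_mask], self.seen[keep_mask]

    def translate_object(self, obj_mask: np.ndarray, offset: np.ndarray) -> None:
        """3-D-aware editing: shift a selected object's points before the next clip."""
        self.xyz[obj_mask] = self.xyz[obj_mask] + offset
```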

Methodology

  1. Spatial Memory Initialization – A short seed video (or a single frame) is processed by a SLAM engine to produce an initial sparse point cloud with per‑point color and depth.
  2. Clip‑wise Generation – A generative backbone (e.g., a video diffusion model) receives the current camera pose and the spatial memory as conditioning inputs. It predicts the next few frames, focusing on dynamic elements (people, cars, etc.).
  3. Memory Update – The newly generated frames are fed back into the SLAM module, which refines the point cloud: new static surfaces are added, occluded points are pruned, and colors are updated.
  4. Iterative Loop – Steps 2‑3 repeat for as many clips as needed, allowing the system to produce arbitrarily long videos while the memory accumulates a more complete 3‑D model of the scene.
  5. Control Interfaces – Since the memory is explicit, developers can inject custom camera trajectories or edit the point cloud directly (e.g., moving an object’s points), and the next generation step will respect those changes.

The pipeline is deliberately modular: any off‑the‑shelf SLAM system can be swapped in, and the generative component can be a diffusion model, transformer, or GAN, making it adaptable to existing video‑generation stacks.
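
As a rough illustration of the clip‑wise loop described above, the sketch below strings the pieces together. `video_model`, `slam`, and their call signatures are placeholders for whichever generative backbone and SLAM front‑end are plugged in; they are assumptions, not the paper’s API.

```python
# Rough sketch of the generate-then-update loop (steps 2-4 above).
# `video_model` and `slam` are placeholder objects, not the paper's API.
def generate_long_video(video_model, slam, memory, camera_path, num_clips, frames_per_clip=16):
    frames = []
    for i in range(num_clips):
        # Camera poses for this clip, supplied by the user or a trajectory planner.
        poses = camera_path[i * frames_per_clip:(i + 1) * frames_per_clip]

        # Step 2: generate the next clip conditioned on the current memory and poses.
        clip = video_model.sample(memory=memory, camera_poses=poses)

        # Step 3: run the generated clip through SLAM to refine the point cloud.
        new_xyz, new_rgb, keep_mask = slam.update(clip, poses, memory)
        memory.prune(keep_mask)
        memory.add_points(new_xyz, new_rgb)

        frames.extend(clip)

    # Step 4: the loop runs for as many clips as needed; the memory keeps accumulating.
    return frames
```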

Results & Findings

  • Spatial consistency: Quantitative metrics (e.g., PSNR/SSIM across long sequences, and a newly proposed “3‑D consistency score”) show a 15‑20 % improvement over baseline video diffusion models that lack memory.
  • Temporal stability: Flicker and jitter are dramatically reduced; user studies report a 30 % higher perceived smoothness.
  • Camera control fidelity: When users specify a novel camera path, the generated frames follow the intended geometry with sub‑pixel reprojection error, a level of geometric fidelity that prior memory‑free models struggle to reach (a minimal version of this check is sketched after the list).
  • Interactive editing: Experiments where objects are moved in the point cloud demonstrate that the model can seamlessly re‑render the scene without noticeable artifacts, confirming the dynamic‑static split works in practice.
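
For reference, reprojection error here measures how far the generated content drifts from the geometry implied by the requested camera path. Below is a minimal version of such a check, assuming a standard pinhole model with intrinsics K and a world‑to‑camera pose (R, t); it is a textbook formulation, not the paper’s evaluation code.

```python
import numpy as np


def reprojection_error(points_w, pixels, K, R, t):
    """Mean pixel distance between memory points projected with the requested
    pose and the locations the generated frame assigns them (pinhole model)."""
    cam = points_w @ R.T + t          # world -> camera coordinates
    proj = cam @ K.T                  # apply intrinsics
    uv = proj[:, :2] / proj[:, 2:3]   # perspective divide -> pixel coordinates
    return float(np.linalg.norm(uv - pixels, axis=1).mean())
```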

Practical Implications

  • Game & VR content pipelines – Developers can generate background video assets that stay geometrically consistent across long play sessions, reducing the need for hand‑crafted level geometry.
  • Synthetic data for perception – Autonomous‑driving and robotics teams can produce endless, photorealistic video streams with controllable camera motion and accurate 3‑D scene layout, improving training data diversity.
  • Film & VFX pre‑visualization – Directors can prototype camera moves and scene edits quickly, using the memory as a “digital set” that updates automatically as the story evolves.
  • AR/Live‑stream overlays – Real‑time applications could inject generated dynamic elements (e.g., virtual characters) into a live video while preserving the static environment’s geometry, thanks to the continuously updated point cloud.

Limitations & Future Work

  • Memory scalability – The point cloud grows with scene size; the reported experiments are limited to modest indoor and outdoor environments. Efficient pruning or hierarchical representations are needed for city‑scale scenes.
  • SLAM dependency – Errors in the visual‑SLAM front‑end (e.g., drift, poor depth in low‑texture areas) propagate to the generated video. Robustifying the SLAM component or learning a correction module is an open direction.
  • Dynamic object geometry – While dynamics are handled by the generative model, the system does not model deformable 3‑D shapes explicitly, limiting realism for complex motions (e.g., cloth).
  • Real‑time performance – The iterative generate‑update loop is still computationally heavy; future work could explore lightweight diffusion variants or GPU‑accelerated SLAM to approach interactive speeds.

Authors

  • Jinjing Zhao
  • Fangyun Wei
  • Zhening Liu
  • Hongyang Zhang
  • Chang Xu
  • Yan Lu

Paper Information

  • arXiv ID: 2512.15716v1
  • Categories: cs.CV, cs.AI
  • Published: December 17, 2025