[Paper] AnchorWeave: World-Consistent Video Generation with Retrieved Local Spatial Memories

Published: February 16, 2026
Source: arXiv (2602.14941v1)

Overview

The paper presents AnchorWeave, a new framework for camera‑controllable video generation that keeps the virtual world spatially consistent over long sequences. By swapping a noisy, globally reconstructed 3‑D memory for a set of clean, locally retrieved geometric “anchors,” the authors dramatically reduce the drift and artifacts that have plagued previous approaches.

Key Contributions

  • Local Spatial Memories: Introduces a coverage‑driven retrieval scheme that selects multiple, small‑scale 3‑D patches (anchors) aligned with the target camera trajectory, avoiding the mis‑alignment problems of a single global scene model.
  • Multi‑Anchor Weaving Controller: A novel controller that fuses information from several anchors on‑the‑fly, learning to weigh each patch according to relevance and confidence.
  • End‑to‑End Training Pipeline: Integrates the retrieval, weaving, and video synthesis modules into a single differentiable system that can be trained on existing video datasets without extra 3‑D supervision.
  • Comprehensive Evaluation: Shows consistent improvements in long‑term scene consistency and visual fidelity across multiple benchmarks, with detailed ablations that isolate the impact of each component.

Methodology

  1. Trajectory‑Guided Memory Retrieval

    • The target camera path is split into short segments.
    • For each segment, the system queries a database of pre‑computed local 3‑D patches (derived from earlier frames) that best cover the upcoming view.
    • This “coverage‑driven” selection ensures that the retrieved anchors collectively span the whole future trajectory.
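Coverage-driven selection can be viewed as a greedy set-cover problem over the upcoming viewpoints. The sketch below is a minimal illustration of that idea; the function and variable names are hypothetical, and the paper's actual scoring function may differ.

```python
def retrieve_anchors(segment_views, patch_coverage, max_anchors=4):
    """Greedily select local 3-D patches (anchors) whose combined
    coverage spans the viewpoints of an upcoming trajectory segment.

    segment_views:  set of viewpoint ids in the upcoming segment
    patch_coverage: dict mapping patch id -> set of viewpoint ids it covers
    """
    selected = []
    uncovered = set(segment_views)
    while uncovered and len(selected) < max_anchors:
        # Pick the patch covering the most still-uncovered viewpoints.
        best = max(patch_coverage, key=lambda p: len(patch_coverage[p] & uncovered))
        gain = patch_coverage[best] & uncovered
        if not gain:  # no remaining patch adds coverage; stop early
            break
        selected.append(best)
        uncovered -= gain
    return selected
```

A greedy strategy like this keeps retrieval cheap while guaranteeing that each added anchor contributes new coverage of the future trajectory.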
  2. Multi‑Anchor Weaving Controller

    • A lightweight transformer‑style module receives the set of anchors and the current latent video state.
    • It learns attention weights that decide how much each anchor should influence the next frame, effectively “weaving” them together.
    • The controller also predicts a confidence score for each anchor, allowing it to down‑weight noisy patches automatically.
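The attention-plus-confidence fusion described above can be sketched as a single scaled dot-product attention step whose logits are biased by per-anchor confidence. This is an illustrative simplification, not the paper's exact architecture; all names are assumptions.

```python
import numpy as np

def weave_anchors(anchor_feats, query, confidence):
    """Fuse anchor features via confidence-weighted attention.

    anchor_feats: (K, D) per-anchor feature vectors
    query:        (D,)   current latent video state
    confidence:   (K,)   predicted per-anchor confidence scores in (0, 1]
    Returns a single (D,) woven context vector.
    """
    scores = anchor_feats @ query / np.sqrt(query.shape[0])  # scaled dot-product
    scores = scores + np.log(confidence + 1e-8)              # down-weight noisy anchors
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                                  # softmax over anchors
    return weights @ anchor_feats
```

Adding log-confidence to the logits is equivalent to multiplying the softmax weights by the confidence, so an anchor with near-zero confidence is effectively ignored.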
  3. Video Synthesis Backbone

    • The woven geometric context is injected into a conditional video generator (e.g., a diffusion or GAN‑based model) that produces the next frame conditioned on the camera pose.
    • The whole pipeline is differentiable, so gradients flow back to improve both the retrieval scoring function and the weaving controller.
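One simple way to inject the woven context into a conditional generator is a learned residual update over the concatenated latent, context, and pose. The sketch below shows that pattern only; the projection matrix `W` and the residual form are illustrative assumptions, not the paper's backbone.

```python
import numpy as np

def conditioned_step(latent, woven_context, pose, W):
    """One illustrative generation step conditioned on woven geometry.

    latent:        (D,) current latent video state
    woven_context: (D,) fused anchor context
    pose:          (P,) flattened camera pose (e.g., rotation + translation)
    W:             (D, 2*D + P) learned projection of the conditioning signal
    """
    cond = np.concatenate([latent, woven_context, pose])
    # A residual update keeps the mapping differentiable end-to-end,
    # so gradients can reach both the retrieval scorer and the controller.
    return latent + W @ cond
```

In practice a diffusion backbone would apply this kind of conditioning inside every denoising step, typically via cross-attention rather than a single linear map.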
  4. Training & Losses

    • Standard reconstruction losses (L1, perceptual) plus a spatial consistency loss that penalizes mismatched geometry across frames.
    • An auxiliary contrastive loss encourages distinct anchors to stay decorrelated, further reducing redundancy.
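The loss terms above can be combined as a weighted sum. The sketch below is a toy version under stated assumptions: it uses L1 reconstruction, an L2 spatial-consistency term against a warped previous frame, and an off-diagonal cosine-similarity penalty as the anchor-decorrelation term; the perceptual loss is omitted and the weights `w_*` are made-up hyperparameters.

```python
import numpy as np

def training_loss(pred, target, warped_prev, anchor_feats,
                  w_rec=1.0, w_cons=0.5, w_ctr=0.1):
    """Illustrative composite loss for the AnchorWeave-style pipeline.

    pred, target: (H, W, C) generated and ground-truth frames
    warped_prev:  (H, W, C) previous frame warped into the current view
    anchor_feats: (K, D) per-anchor features for the decorrelation term
    """
    rec = np.abs(pred - target).mean()                     # L1 reconstruction
    cons = np.square(pred - warped_prev).mean()            # spatial consistency
    f = anchor_feats / (np.linalg.norm(anchor_feats, axis=1, keepdims=True) + 1e-8)
    sim = f @ f.T                                          # pairwise cosine similarity
    off_diag = sim - np.diag(np.diag(sim))
    ctr = np.square(off_diag).mean()                       # keep anchors decorrelated
    return w_rec * rec + w_cons * cons + w_ctr * ctr
```

Penalizing off-diagonal similarity pushes distinct anchors toward orthogonal features, which is one common way to realize the redundancy-reduction goal the auxiliary loss targets.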

Results & Findings

| Metric (long-term consistency) | Baseline (global memory) | AnchorWeave |
| --- | --- | --- |
| PSNR (average) | 24.1 dB | 27.3 dB |
| SSIM | 0.78 | 0.86 |
| Temporal Warping Error | 0.12 | 0.05 |
  • Visual Quality: Samples show far fewer drifting artifacts and more stable object placement over dozens of frames.
  • Ablation: Removing the multi‑anchor controller drops PSNR by ~2 dB, confirming its critical role.
  • Coverage‑Driven Retrieval vs. Random Retrieval: Randomly picking anchors degrades consistency by ~1.5 dB, highlighting the importance of trajectory‑aware selection.

Practical Implications

  • Game & VR Content Creation: Developers can generate long, camera‑controlled cutscenes or walkthroughs without manually building a perfect global 3‑D map, saving time and reducing pipeline complexity.
  • Synthetic Data for Training: AnchorWeave can produce high‑fidelity, spatially coherent video streams for training perception models (e.g., autonomous driving simulators) where long‑range consistency matters.
  • Film & Advertising: Rapid prototyping of dynamic backgrounds or “virtual sets” becomes feasible, as the system can stitch together locally accurate geometry on demand.
  • Edge Deployment: Because the memory is local and retrieval is lightweight, the approach can be adapted to run on GPUs with limited VRAM, making it suitable for on‑device content generation tools.

Limitations & Future Work

  • Dependence on Pre‑Collected Local Patches: The system still requires a repository of high‑quality local 3‑D memories; sparse or noisy source videos can limit performance.
  • Scalability to Very Large Scenes: While local anchors mitigate global mis‑alignment, extremely expansive environments may need hierarchical retrieval strategies.
  • Real‑Time Constraints: The multi‑anchor weaving controller adds overhead; optimizing for real‑time inference (e.g., via model pruning or distillation) is an open direction.
  • Generalization to Unseen Camera Motions: The current retrieval assumes trajectories similar to those seen during training; future work could explore more flexible pose‑conditioned retrieval or online memory construction.

AnchorWeave demonstrates that rethinking how we store and use spatial memory—favoring many clean, locally relevant patches over a single imperfect global model—can unlock far more stable and realistic video generation. For developers building next‑generation visual experiences, the paper offers a practical blueprint for marrying 3‑D geometry with generative video models without the heavy cost of perfect scene reconstruction.

Authors

  • Zun Wang
  • Han Lin
  • Jaehong Yoon
  • Jaemin Cho
  • Yue Zhang
  • Mohit Bansal

Paper Information

  • arXiv ID: 2602.14941v1
  • Categories: cs.CV, cs.AI
  • Published: February 16, 2026
