[Paper] AnchorWeave: World-Consistent Video Generation with Retrieved Local Spatial Memories
Source: arXiv - 2602.14941v1
Overview
The paper presents AnchorWeave, a new framework for camera‑controllable video generation that keeps the virtual world spatially consistent over long sequences. By swapping a noisy, globally reconstructed 3‑D memory for a set of clean, locally retrieved geometric “anchors,” the authors dramatically reduce the drift and artifacts that have plagued previous approaches.
Key Contributions
- Local Spatial Memories: Introduces a coverage‑driven retrieval scheme that selects multiple, small‑scale 3‑D patches (anchors) aligned with the target camera trajectory, avoiding the mis‑alignment problems of a single global scene model.
- Multi‑Anchor Weaving Controller: A novel controller that fuses information from several anchors on‑the‑fly, learning to weigh each patch according to relevance and confidence.
- End‑to‑End Training Pipeline: Integrates the retrieval, weaving, and video synthesis modules into a single differentiable system that can be trained on existing video datasets without extra 3‑D supervision.
- Comprehensive Evaluation: Shows consistent improvements in long‑term scene consistency and visual fidelity across multiple benchmarks, with detailed ablations that isolate the impact of each component.
Methodology
Trajectory‑Guided Memory Retrieval
- The target camera path is split into short segments.
- For each segment, the system queries a database of pre‑computed local 3‑D patches (derived from earlier frames) that best cover the upcoming view.
- This “coverage‑driven” selection ensures that the retrieved anchors collectively span the whole future trajectory.
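The coverage‑driven selection described above can be sketched as a greedy set‑cover over candidate patches. This is an illustrative reconstruction, not the paper's actual scoring function: the binary `coverage` matrix (patch i covers view sample j) and the budget `k` are assumptions for the sketch.

```python
import numpy as np

def retrieve_anchors(coverage, k=3):
    """Greedy coverage-driven selection. `coverage[i, j]` is True if local
    patch i covers view sample j along the upcoming trajectory segment.
    Repeatedly pick the patch that covers the most still-uncovered views."""
    n_patches, n_views = coverage.shape
    covered = np.zeros(n_views, dtype=bool)
    selected = []
    for _ in range(k):
        gains = (coverage & ~covered).sum(axis=1)  # new views each patch would add
        best = int(gains.argmax())
        if gains[best] == 0:  # every view is already covered
            break
        selected.append(best)
        covered |= coverage[best]
    return selected, covered

# Toy example: 4 candidate patches, 6 sampled views on the segment
cov = np.array([
    [1, 1, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [1, 0, 0, 0, 0, 1],
], dtype=bool)
anchors, covered = retrieve_anchors(cov, k=3)
```

With this toy coverage matrix, the greedy pass picks patches 1 and 2 first (three and two new views respectively) and then patch 0 to pick up the remaining view, so the selected anchors jointly span the whole segment.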
Multi‑Anchor Weaving Controller
- A lightweight transformer‑style module receives the set of anchors and the current latent video state.
- It learns attention weights that decide how much each anchor should influence the next frame, effectively “weaving” them together.
- The controller also predicts a confidence score for each anchor, allowing it to down‑weight noisy patches automatically.
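A minimal sketch of the fusion step, under assumptions: relevance is modeled as scaled dot‑product attention between a query (the latent video state) and each anchor's feature vector, and the predicted confidence is folded into the attention logits in log‑space so low‑confidence anchors are down‑weighted. The paper's controller is a learned transformer‑style module; the function names here are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

def weave_anchors(anchor_feats, query, confidence):
    """Fuse anchor features into one context vector. Attention logits come
    from query-anchor similarity, shifted by log-confidence so that noisy
    anchors are automatically down-weighted (an assumed mechanism)."""
    logits = anchor_feats @ query / np.sqrt(len(query))  # relevance score
    logits = logits + np.log(confidence + 1e-8)          # confidence gating
    w = softmax(logits)
    fused = w @ anchor_feats                             # weighted blend
    return fused, w

# Two anchors in a 2-D feature space; the query matches anchor 0
feats = np.array([[1.0, 0.0], [0.0, 1.0]])
fused, w = weave_anchors(feats, np.array([1.0, 0.0]), np.array([1.0, 1.0]))
```

With equal confidence the more relevant anchor receives the larger weight; dropping an anchor's confidence toward zero shifts nearly all weight to the others.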
Video Synthesis Backbone
- The woven geometric context is injected into a conditional video generator (e.g., a diffusion or GAN‑based model) that produces the next frame conditioned on the camera pose.
- The whole pipeline is differentiable, so gradients flow back to improve both the retrieval scoring function and the weaving controller.
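One common way to inject a context vector into a generator's intermediate features is FiLM‑style conditioning (per‑channel scale and shift). The paper does not specify the injection mechanism, so treat this as one plausible, fully differentiable option; the weight matrices here stand in for learned projection layers.

```python
import numpy as np

def film_condition(features, context_vec, gamma_w, beta_w):
    """FiLM-style injection (an assumed mechanism, not the paper's stated
    one): the woven geometric context predicts a per-channel scale (gamma)
    and shift (beta) applied to the generator's feature maps (C, H, W)."""
    gamma = gamma_w @ context_vec  # (C,) per-channel scale
    beta = beta_w @ context_vec    # (C,) per-channel shift
    return features * gamma[:, None, None] + beta[:, None, None]

# Toy check: 2 channels, 3x3 feature maps, 2-D context vector
feats = np.ones((2, 3, 3))
ctx = np.array([1.0, 0.0])
gw = np.array([[2.0, 0.0], [0.0, 0.0]])
bw = np.array([[0.0, 0.0], [1.0, 0.0]])
out = film_condition(feats, ctx, gw, bw)
```

Because the operation is a plain affine transform, gradients flow through `gamma_w` and `beta_w` back into both the weaving controller and the retrieval scoring function, matching the end‑to‑end training claim.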
Training &amp; Losses
- Standard reconstruction losses (L1, perceptual) plus a spatial consistency loss that penalizes mismatched geometry across frames.
- An auxiliary contrastive loss encourages distinct anchors to stay decorrelated, further reducing redundancy.
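The three terms above can be combined as a weighted sum. This sketch uses an L1 reconstruction term, a consistency penalty against the previous frame warped into the current view, and a mean off‑diagonal cosine‑similarity penalty as a simple stand‑in for the paper's contrastive decorrelation loss; the loss weights are made‑up defaults, not the paper's values.

```python
import numpy as np

def total_loss(pred, target, warped_prev, anchor_feats,
               w_rec=1.0, w_cons=0.5, w_ctr=0.1):
    """Weighted sum of the three loss terms (weights are assumptions).
    - rec:  L1 reconstruction against the ground-truth frame
    - cons: spatial consistency vs. the previous frame warped into this view
    - ctr:  mean off-diagonal |cosine similarity| between anchor features,
            a simple proxy for the paper's contrastive decorrelation loss"""
    rec = np.abs(pred - target).mean()
    cons = np.abs(pred - warped_prev).mean()
    f = anchor_feats / np.linalg.norm(anchor_feats, axis=1, keepdims=True)
    sim = f @ f.T
    n = len(f)
    ctr = (np.abs(sim).sum() - n) / (n * (n - 1))  # exclude the diagonal
    return w_rec * rec + w_cons * cons + w_ctr * ctr

# Sanity check: perfect prediction + orthogonal anchors -> zero loss
zero = np.zeros((4, 4))
loss = total_loss(zero, zero, zero, np.eye(3))
```

The decorrelation term goes to zero when anchors are mutually orthogonal, which is exactly the redundancy‑reduction behavior the auxiliary loss is meant to encourage.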
Results & Findings
| Metric (long‑term consistency) | Baseline (global memory) | AnchorWeave |
|---|---|---|
| PSNR (average) | 24.1 dB | 27.3 dB |
| SSIM | 0.78 | 0.86 |
| Temporal Warping Error | 0.12 | 0.05 |
- Visual Quality: Samples show far fewer drifting artifacts and more stable object placement over dozens of frames.
- Ablation: Removing the multi‑anchor controller drops PSNR by ~2 dB, confirming its critical role.
- Coverage‑Driven Retrieval vs. Random Retrieval: Randomly picking anchors degrades consistency by ~1.5 dB, highlighting the importance of trajectory‑aware selection.
Practical Implications
- Game & VR Content Creation: Developers can generate long, camera‑controlled cutscenes or walkthroughs without manually building a perfect global 3‑D map, saving time and reducing pipeline complexity.
- Synthetic Data for Training: AnchorWeave can produce high‑fidelity, spatially coherent video streams for training perception models (e.g., autonomous driving simulators) where long‑range consistency matters.
- Film & Advertising: Rapid prototyping of dynamic backgrounds or “virtual sets” becomes feasible, as the system can stitch together locally accurate geometry on demand.
- Edge Deployment: Because the memory is local and retrieval is lightweight, the approach can be adapted to run on GPUs with limited VRAM, making it suitable for on‑device content generation tools.
Limitations & Future Work
- Dependence on Pre‑Collected Local Patches: The system still requires a repository of high‑quality local 3‑D memories; sparse or noisy source videos can limit performance.
- Scalability to Very Large Scenes: While local anchors mitigate global mis‑alignment, extremely expansive environments may need hierarchical retrieval strategies.
- Real‑Time Constraints: The multi‑anchor weaving controller adds overhead; optimizing for real‑time inference (e.g., via model pruning or distillation) is an open direction.
- Generalization to Unseen Camera Motions: The current retrieval assumes trajectories similar to those seen during training; future work could explore more flexible pose‑conditioned retrieval or online memory construction.
AnchorWeave demonstrates that rethinking how we store and use spatial memory—favoring many clean, locally relevant patches over a single imperfect global model—can unlock far more stable and realistic video generation. For developers building next‑generation visual experiences, the paper offers a practical blueprint for marrying 3‑D geometry with generative video models without the heavy cost of perfect scene reconstruction.
Authors
- Zun Wang
- Han Lin
- Jaehong Yoon
- Jaemin Cho
- Yue Zhang
- Mohit Bansal
Paper Information
- arXiv ID: 2602.14941v1
- Categories: cs.CV, cs.AI
- Published: February 16, 2026