[Paper] AnchorWeave: World-Consistent Video Generation with Retrieved Local Spatial Memories
Source: arXiv - 2602.14941v1
Overview
The paper presents AnchorWeave, a new framework for camera‑controllable video generation that keeps the virtual world spatially consistent over long sequences. By swapping a noisy, globally reconstructed 3‑D memory for a set of clean, locally retrieved geometric “anchors,” the authors dramatically reduce the drift and artifacts that have plagued previous approaches.
Key Contributions
- Local Spatial Memories: Introduces a coverage‑driven retrieval scheme that selects multiple, small‑scale 3‑D patches (anchors) aligned with the target camera trajectory, avoiding the mis‑alignment problems of a single global scene model.
- Multi‑Anchor Weaving Controller: A novel controller that fuses information from several anchors on‑the‑fly, learning to weigh each patch according to relevance and confidence.
- End‑to‑End Training Pipeline: Integrates the retrieval, weaving, and video synthesis modules into a single differentiable system that can be trained on existing video datasets without extra 3‑D supervision.
- Comprehensive Evaluation: Shows consistent improvements in long‑term scene consistency and visual fidelity across multiple benchmarks, with detailed ablations that isolate the impact of each component.
Methodology
Trajectory‑Guided Memory Retrieval
- The target camera path is split into short segments.
- For each segment, the system queries a database of pre‑computed local 3‑D patches (derived from earlier frames) that best cover the upcoming view.
- This “coverage‑driven” selection ensures that the retrieved anchors collectively span the whole future trajectory.
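The coverage‑driven selection described above can be sketched as a greedy set‑cover over candidate patches. This is an illustrative reconstruction, not the paper's actual scoring function: the binary `coverage` matrix (patch i covers view sample j) and the budget `k` are assumptions for the sketch.

```python
import numpy as np

def retrieve_anchors(coverage, k=3):
    """Greedy coverage-driven selection. `coverage[i, j]` is True if local
    patch i covers view sample j along the upcoming trajectory segment.
    Repeatedly pick the patch that covers the most still-uncovered views."""
    n_patches, n_views = coverage.shape
    covered = np.zeros(n_views, dtype=bool)
    selected = []
    for _ in range(k):
        gains = (coverage & ~covered).sum(axis=1)  # new views each patch would add
        best = int(gains.argmax())
        if gains[best] == 0:  # every view is already covered
            break
        selected.append(best)
        covered |= coverage[best]
    return selected, covered

# Toy example: 4 candidate patches, 6 sampled views on the segment
cov = np.array([
    [1, 1, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [1, 0, 0, 0, 0, 1],
], dtype=bool)
anchors, covered = retrieve_anchors(cov, k=3)
```

With this toy coverage matrix, the greedy pass picks patches 1 and 2 first (three and two new views respectively) and then patch 0 to pick up the remaining view, so the selected anchors jointly span the whole segment.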
Multi‑Anchor Weaving Controller
- A lightweight transformer‑style module receives the set of anchors and the current latent video state.
- It learns attention weights that decide how much each anchor should influence the next frame, effectively “weaving” them together.
- The controller also predicts a confidence score for each anchor, allowing it to down‑weight noisy patches automatically.
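A minimal sketch of the fusion step, under assumptions: relevance is modeled as scaled dot‑product attention between a query (the latent video state) and each anchor's feature vector, and the predicted confidence is folded into the attention logits in log‑space so low‑confidence anchors are down‑weighted. The paper's controller is a learned transformer‑style module; the function names here are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

def weave_anchors(anchor_feats, query, confidence):
    """Fuse anchor features into one context vector. Attention logits come
    from query-anchor similarity, shifted by log-confidence so that noisy
    anchors are automatically down-weighted (an assumed mechanism)."""
    logits = anchor_feats @ query / np.sqrt(len(query))  # relevance score
    logits = logits + np.log(confidence + 1e-8)          # confidence gating
    w = softmax(logits)
    fused = w @ anchor_feats                             # weighted blend
    return fused, w

# Two anchors in a 2-D feature space; the query matches anchor 0
feats = np.array([[1.0, 0.0], [0.0, 1.0]])
fused, w = weave_anchors(feats, np.array([1.0, 0.0]), np.array([1.0, 1.0]))
```

With equal confidence the more relevant anchor receives the larger weight; dropping an anchor's confidence toward zero shifts nearly all weight to the others.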
Video Synthesis Backbone
- The woven geometric context is injected into a conditional video generator (e.g., a diffusion or GAN‑based model) that produces the next frame conditioned on the camera pose.
- The whole pipeline is differentiable, so gradients flow back to improve both the retrieval scoring function and the weaving controller.
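One common way to inject a context vector into a generator's intermediate features is FiLM‑style conditioning (per‑channel scale and shift). The paper does not specify the injection mechanism, so treat this as one plausible, fully differentiable option; the weight matrices here stand in for learned projection layers.

```python
import numpy as np

def film_condition(features, context_vec, gamma_w, beta_w):
    """FiLM-style injection (an assumed mechanism, not the paper's stated
    one): the woven geometric context predicts a per-channel scale (gamma)
    and shift (beta) applied to the generator's feature maps (C, H, W)."""
    gamma = gamma_w @ context_vec  # (C,) per-channel scale
    beta = beta_w @ context_vec    # (C,) per-channel shift
    return features * gamma[:, None, None] + beta[:, None, None]

# Toy check: 2 channels, 3x3 feature maps, 2-D context vector
feats = np.ones((2, 3, 3))
ctx = np.array([1.0, 0.0])
gw = np.array([[2.0, 0.0], [0.0, 0.0]])
bw = np.array([[0.0, 0.0], [1.0, 0.0]])
out = film_condition(feats, ctx, gw, bw)
```

Because the operation is a plain affine transform, gradients flow through `gamma_w` and `beta_w` back into both the weaving controller and the retrieval scoring function, matching the end‑to‑end training claim.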
Training &amp; Losses
- Standard reconstruction losses (L1, perceptual) plus a spatial consistency loss that penalizes mismatched geometry across frames.
- An auxiliary contrastive loss encourages distinct anchors to stay decorrelated, further reducing redundancy.
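The three terms above can be combined as a weighted sum. This sketch uses an L1 reconstruction term, a consistency penalty against the previous frame warped into the current view, and a mean off‑diagonal cosine‑similarity penalty as a simple stand‑in for the paper's contrastive decorrelation loss; the loss weights are made‑up defaults, not the paper's values.

```python
import numpy as np

def total_loss(pred, target, warped_prev, anchor_feats,
               w_rec=1.0, w_cons=0.5, w_ctr=0.1):
    """Weighted sum of the three loss terms (weights are assumptions).
    - rec:  L1 reconstruction against the ground-truth frame
    - cons: spatial consistency vs. the previous frame warped into this view
    - ctr:  mean off-diagonal |cosine similarity| between anchor features,
            a simple proxy for the paper's contrastive decorrelation loss"""
    rec = np.abs(pred - target).mean()
    cons = np.abs(pred - warped_prev).mean()
    f = anchor_feats / np.linalg.norm(anchor_feats, axis=1, keepdims=True)
    sim = f @ f.T
    n = len(f)
    ctr = (np.abs(sim).sum() - n) / (n * (n - 1))  # exclude the diagonal
    return w_rec * rec + w_cons * cons + w_ctr * ctr

# Sanity check: perfect prediction + orthogonal anchors -> zero loss
zero = np.zeros((4, 4))
loss = total_loss(zero, zero, zero, np.eye(3))
```

The decorrelation term goes to zero when anchors are mutually orthogonal, which is exactly the redundancy‑reduction behavior the auxiliary loss is meant to encourage.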
Results & Findings
| Metric (long‑term consistency) | Baseline (global memory) | AnchorWeave |
|---|---|---|
| PSNR (average) | 24.1 dB | 27.3 dB |
| SSIM | 0.78 | 0.86 |
| Temporal Warping Error | 0.12 | 0.05 |
- Visual Quality: Samples show far fewer drifting artifacts and more stable object placement over dozens of frames.
- Ablation: Removing the multi‑anchor controller drops PSNR by ~2 dB, confirming its critical role.
- Coverage‑Driven Retrieval vs. Random Retrieval: Randomly picking anchors degrades consistency by ~1.5 dB, highlighting the importance of trajectory‑aware selection.
Practical Implications
- Game & VR Content Creation: Developers can generate long, camera‑controlled cutscenes or walkthroughs without manually building a perfect global 3‑D map, saving time and reducing pipeline complexity.
- Synthetic Data for Training: AnchorWeave can produce high‑fidelity, spatially coherent video streams for training perception models (e.g., autonomous driving simulators) where long‑range consistency matters.
- Film & Advertising: Rapid prototyping of dynamic backgrounds or “virtual sets” becomes feasible, as the system can stitch together locally accurate geometry on demand.
- Edge Deployment: Because the memory is local and retrieval is lightweight, the approach can be adapted to run on GPUs with limited VRAM, making it suitable for on‑device content generation tools.
Limitations & Future Work
- Dependence on Pre‑Collected Local Patches: The system still requires a repository of high‑quality local 3‑D memories; sparse or noisy source videos can limit performance.
- Scalability to Very Large Scenes: While local anchors mitigate global mis‑alignment, extremely expansive environments may need hierarchical retrieval strategies.
- Real‑Time Constraints: The multi‑anchor weaving controller adds overhead; optimizing for real‑time inference (e.g., via model pruning or distillation) is an open direction.
- Generalization to Unseen Camera Motions: The current retrieval assumes trajectories similar to those seen during training; future work could explore more flexible pose‑conditioned retrieval or online memory construction.
AnchorWeave demonstrates that rethinking how we store and use spatial memory—favoring many clean, locally relevant patches over a single imperfect global model—can unlock far more stable and realistic video generation. For developers building next‑generation visual experiences, the paper offers a practical blueprint for marrying 3‑D geometry with generative video models without the heavy cost of perfect scene reconstruction.
Authors
- Zun Wang
- Han Lin
- Jaehong Yoon
- Jaemin Cho
- Yue Zhang
- Mohit Bansal
Paper Information
- arXiv ID: 2602.14941v1
- Categories: cs.CV, cs.AI
- Published: February 16, 2026