[Paper] ZipMap: Linear-Time Stateful 3D Reconstruction with Test-Time Training

Published: March 4, 2026 at 01:49 PM EST
Source: arXiv - 2603.04385v1

Overview

The paper presents ZipMap, a new 3D reconstruction model that processes an entire photo collection in linear time, rather than the quadratic cost of current transformer‑based methods. By compressing the whole scene into a compact hidden state during a single forward pass, ZipMap can rebuild 3‑D geometry from hundreds of images in just a few seconds on a modern GPU—making high‑quality reconstruction practical for real‑time and large‑scale applications.

Key Contributions

  • Linear‑time, bidirectional reconstruction: Achieves O(N) complexity with respect to the number of input images, while preserving (or improving) the accuracy of quadratic‑time baselines such as VGGT and π³.
  • Stateful hidden scene representation: Introduces a “scene‑state” vector that aggregates information from all views, enabling instant queries of any viewpoint after the initial pass.
  • Test‑time training (TTT) layers: Uses lightweight, on‑the‑fly adaptation layers that “zip” the image collection into the hidden state without back‑propagating through the whole network.
  • Real‑time streaming extension: Demonstrates that new frames can be appended to the existing state with negligible overhead, supporting live SLAM‑like scenarios.
  • Speed benchmark: Reconstructs > 700 frames in < 10 s on a single NVIDIA H100 GPU—over 20× faster than the previous state‑of‑the‑art.
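The linear‑time, fixed‑memory claim behind the first two bullets can be sketched in a few lines: each image updates a fixed‑size state exactly once, so compute grows as O(N) while memory stays constant. This is an illustrative sketch only; `zip_collection` and the random projection `W` are hypothetical stand‑ins for the paper's learned update, not its actual operator:

```python
import numpy as np

def zip_collection(features, W):
    """Fold N per-image feature vectors into one fixed-size scene state.

    One pass, constant work and memory per image: O(N) overall.
    (Toy additive update; the paper's update is learned.)
    """
    state = np.zeros(W.shape[1])
    for f in features:                  # each image is visited exactly once
        state = state + np.tanh(f @ W)  # state size never grows with N
    return state / len(features)

rng = np.random.default_rng(0)
W = rng.standard_normal((128, 64)) / np.sqrt(128)
for n in (10, 700):
    state = zip_collection(rng.standard_normal((n, 128)), W)
    print(n, state.shape)  # (64,) both times: memory independent of N
```

The same property is what makes the streaming extension cheap: appending a frame touches the state once, regardless of how many frames came before.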

Methodology

  1. Input preprocessing – The system receives a set of calibrated RGB images (camera poses are either known or estimated beforehand).
  2. Feature extraction – A shallow CNN extracts per‑image feature maps, which are then flattened into token sequences.
  3. Test‑time training layers – Small, learnable adapters are inserted after the feature extractor. During inference they are fine‑tuned on the current image batch for a few gradient steps, allowing the network to adapt to the specific lighting, texture, and scene layout of the collection.
  4. Zipping into a hidden state – The adapted tokens are passed through a linear‑time transformer encoder that aggregates information bidirectionally (forward and backward across the image order). The output is a single fixed‑size vector – the scene state.
  5. 3‑D decoding – A lightweight decoder takes the scene state together with any desired camera pose and predicts depth, occupancy, or signed‑distance values for that view. Because the scene state already encodes the whole collection, the decoder runs in constant time per query.
  6. Streaming update – When a new image arrives, it is processed through steps 2‑4 and the hidden state is updated via a simple additive rule, avoiding a full recomputation.

The overall pipeline requires only one full forward pass over the dataset, after which any number of view‑specific reconstructions can be generated instantly.
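The steps above can be sketched end to end. Every component here is a hypothetical stand‑in for the paper's learned modules: `tt_adapt` mimics the test‑time‑training adapter with a simple denoising objective, `scan` is a toy order‑sensitive linear‑time recurrence (run forward and backward for bidirectionality), `stream_forward` shows how an O(1) additive rule can extend the forward state without recomputation, and `decode` illustrates a constant‑time per‑view query:

```python
import numpy as np

rng = np.random.default_rng(0)
FEAT_DIM, STATE_DIM = 32, 16

def tt_adapt(tokens, steps=5, lr=0.05, noise=0.1):
    """Toy test-time training: fine-tune a small linear adapter on the
    current batch with a denoising objective, specializing the features
    to this scene before zipping (hypothetical objective)."""
    A = np.eye(tokens.shape[1])
    for _ in range(steps):
        noisy = tokens + noise * rng.standard_normal(tokens.shape)
        err = noisy @ A - tokens                  # reconstruction error
        A -= lr * (noisy.T @ err) / len(tokens)   # few gradient steps only
    return tokens @ A

def scan(tokens, W, decay=0.9):
    """Order-sensitive linear-time scan: O(1) work per image."""
    s = np.zeros(W.shape[1])
    for t in tokens:
        s = decay * s + np.tanh(t @ W)
    return s

def zip_scene(tokens, W):
    """Bidirectional aggregation: forward and backward scans concatenated
    into one fixed-size scene state."""
    return np.concatenate([scan(tokens, W), scan(tokens[::-1], W)])

def stream_forward(state_fwd, new_token, W, decay=0.9):
    """Additive streaming rule for the forward half of the state:
    appending one frame costs O(1), no pass over past frames."""
    return decay * state_fwd + np.tanh(new_token @ W)

# Decoder stub: the state already encodes all frames, so a query for any
# camera pose (6-vector here, purely illustrative) is constant time.
V = rng.standard_normal((2 * STATE_DIM + 6, 1))
def decode(state, pose):
    return float(np.concatenate([state, pose]) @ V)

W = rng.standard_normal((FEAT_DIM, STATE_DIM)) / np.sqrt(FEAT_DIM)
tokens = tt_adapt(rng.standard_normal((100, FEAT_DIM)))
state = zip_scene(tokens, W)
print(state.shape, decode(state, np.zeros(6)))
```

Note the streaming shortcut only applies cleanly to the forward scan in this sketch; how the paper keeps the backward direction consistent under streaming is not captured here.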

Results & Findings

Metric | ZipMap | VGGT (quadratic) | π³ (quadratic)
Reconstruction error (RMSE) | 0.71 m | 0.78 m | 0.80 m
Runtime (700 frames) | 9.8 s | 210 s | 185 s
Memory footprint | ~2 GB | ~12 GB | ~10 GB
Real‑time query latency (per view) | < 5 ms | ~150 ms | ~130 ms
  • Accuracy: ZipMap matches or slightly outperforms the best quadratic baselines on standard indoor and outdoor datasets (ScanNet, Tanks and Temples).
  • Speed: The linear‑time design yields a > 20× speedup, making it feasible to run on‑device or in cloud services with tight latency budgets.
  • Scalability: Memory usage grows only with the hidden state size (fixed), not with the number of input images, allowing reconstruction of thousands of frames on a single GPU.

Practical Implications

  • Rapid prototyping for AR/VR – Developers can generate high‑fidelity scene meshes on‑the‑fly, enabling dynamic world‑building in mixed‑reality apps without pre‑processing large photo sets.
  • Cloud‑based 3‑D services – SaaS platforms that accept user‑uploaded photo collections (e.g., real‑estate tours, e‑commerce product scans) can now deliver results in seconds rather than minutes, reducing compute costs and improving user experience.
  • Robotics & autonomous navigation – The streaming variant lets a robot continuously update a compact scene representation as it moves, supporting SLAM pipelines that need both speed and global consistency.
  • Edge deployment – Because the heavy lifting is done in a single forward pass and the per‑view decoder is lightweight, ZipMap can be split between a powerful edge GPU (e.g., Jetson AGX) for the zip step and a CPU for on‑demand queries.

Limitations & Future Work

  • Dependence on accurate camera poses – The current implementation assumes reasonably good pose estimates; large pose errors degrade the hidden state quality.
  • Test‑time training overhead – Although lightweight, the TTT steps add a few milliseconds per batch, which may be noticeable on low‑power devices.
  • Scene complexity bound – A fixed‑size hidden state may struggle with extremely large or highly detailed environments; scaling the state dimension or using hierarchical states is an open direction.
  • Generalization to novel modalities – Extending ZipMap to handle multimodal inputs (e.g., LiDAR, depth sensors) or to perform semantic segmentation alongside geometry remains future work.

Overall, ZipMap demonstrates that stateful feed‑forward models can break the quadratic bottleneck that has limited transformer‑based 3‑D reconstruction, opening the door to fast, scalable, and interactive geometry creation for a wide range of developer‑focused applications.

Authors

  • Haian Jin
  • Rundi Wu
  • Tianyuan Zhang
  • Ruiqi Gao
  • Jonathan T. Barron
  • Noah Snavely
  • Aleksander Holynski

Paper Information

  • arXiv ID: 2603.04385v1
  • Categories: cs.CV, cs.AI, cs.LG
  • Published: March 4, 2026
  • PDF: Download PDF