[Paper] tttLRM: Test-Time Training for Long Context and Autoregressive 3D Reconstruction
Source: arXiv - 2602.20160v1
Overview
The paper introduces tttLRM, a large‑scale 3D reconstruction model that plugs a Test‑Time Training (TTT) layer into a conventional feed‑forward pipeline. By compressing a long sequence of image observations into fast‑weight parameters, the model builds an implicit 3D latent representation that can be decoded into explicit formats such as Gaussian Splatting (GS). This design delivers linear‑time complexity with respect to the number of input views, enabling high‑fidelity, autoregressive reconstruction even for streaming data.
Key Contributions
- Test‑Time Training layer for long‑context 3D reconstruction – learns fast weights on‑the‑fly from arbitrary numbers of input images, keeping inference cost linear.
- Implicit‑to‑explicit latent pipeline – the TTT‑compressed latent code can be decoded into multiple explicit 3D formats (e.g., Gaussian splats, meshes) without retraining.
- Online learning variant – supports progressive refinement as new views arrive, making it suitable for real‑time SLAM‑like scenarios.
- Cross‑task pre‑training – pre‑training on novel‑view synthesis transfers effectively to explicit 3D modeling, yielding faster convergence and higher quality reconstructions.
- State‑of‑the‑art results – achieves superior PSNR/SSIM and visual fidelity on both object‑level and large‑scale scene benchmarks compared with leading Gaussian‑splatting and NeRF‑based methods.
Methodology
- Backbone encoder – a standard vision transformer processes each input image independently, producing per‑view feature tokens.
- Test‑Time Training (TTT) layer – a lightweight MLP whose parameters serve as fast weights, updated at test time with a few gradient steps on the current batch of view features; the loss is a self‑supervised reconstruction objective (e.g., photometric consistency).
- Latent 3D representation – the updated fast weights act as a compact code that implicitly stores geometry, appearance, and view‑dependent effects.
- Decoder – a shared decoder maps the latent code to an explicit 3D structure. In the paper, the primary decoder outputs a set of Gaussian splats (position, covariance, color, opacity). The same latent can be fed to alternative decoders (e.g., mesh extraction) with minimal changes.
- Autoregressive streaming – when a new image arrives, the TTT layer continues training from the previous fast‑weight state, allowing the latent representation to be refined incrementally without restarting from scratch.
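To make the decoder step concrete, here is a minimal sketch of mapping a latent code to per‑splat parameters (position, scale, rotation, color, opacity). All dimensions, the linear head, and the parameter layout are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

# Hypothetical sizes -- the paper's actual dimensions are not given in this summary.
LATENT_DIM = 256       # size of the TTT-compressed latent code
NUM_SPLATS = 64        # splats emitted per latent
PARAMS_PER_SPLAT = 14  # 3 position + 3 scale + 4 rotation quaternion + 3 color + 1 opacity

rng = np.random.default_rng(0)
# A single linear head as a stand-in for the paper's shared decoder.
W_dec = rng.normal(0.0, 0.02, (LATENT_DIM, NUM_SPLATS * PARAMS_PER_SPLAT))

def decode_splats(latent):
    """Map a latent code to explicit Gaussian-splat parameters."""
    raw = (latent @ W_dec).reshape(NUM_SPLATS, PARAMS_PER_SPLAT)
    position = raw[:, 0:3]                          # unconstrained 3D means
    scale = np.exp(raw[:, 3:6])                     # exp keeps axis scales positive
    rotation = raw[:, 6:10]
    rotation = rotation / np.linalg.norm(rotation, axis=1, keepdims=True)  # unit quaternions
    color = 1.0 / (1.0 + np.exp(-raw[:, 10:13]))    # sigmoid -> RGB in [0, 1]
    opacity = 1.0 / (1.0 + np.exp(-raw[:, 13:14]))  # sigmoid -> opacity in [0, 1]
    return position, scale, rotation, color, opacity

latent = rng.normal(size=LATENT_DIM)
pos, scale, rot, color, opacity = decode_splats(latent)
```

The activation choices (exp for scales, normalization for quaternions, sigmoid for color and opacity) are standard in Gaussian‑splatting pipelines and are what makes the raw linear output a valid splat set.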
The whole pipeline runs in O(N) time where N is the number of views, because the TTT updates are constant‑size operations independent of the scene scale.
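The constant‑cost streaming update can be sketched as follows: a linear fast‑weight map is refined with a few gradient steps per incoming view, and each new view continues from the previous state rather than restarting. Everything here (the linear model, dimensions, learning rate, and the least‑squares loss standing in for the paper's photometric objective) is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(1)
FEAT_DIM, STEPS_PER_VIEW, LR = 32, 3, 0.1

# Fast weights: a small linear map acting as the implicit 3D latent state.
W_fast = np.zeros((FEAT_DIM, FEAT_DIM))

def ttt_update(W, feats, target, steps=STEPS_PER_VIEW, lr=LR):
    """A few gradient steps on a self-supervised reconstruction loss.

    Plain least squares ||feats @ W - target||^2 stands in for the
    paper's photometric-consistency objective.
    """
    for _ in range(steps):
        pred = feats @ W
        grad = feats.T @ (pred - target) / len(feats)  # gradient of mean squared error
        W = W - lr * grad
    return W

def stream(views):
    """Autoregressive refinement: each view resumes from the previous
    fast-weight state, so per-view cost is constant and total cost O(N)."""
    global W_fast
    losses = []
    for feats, target in views:
        W_fast = ttt_update(W_fast, feats, target)
        losses.append(float(np.mean((feats @ W_fast - target) ** 2)))
    return losses

# Toy stream: all views share one underlying map, so the per-view loss
# should fall as more views arrive and the fast weights accumulate.
W_true = rng.normal(0.0, 0.3, (FEAT_DIM, FEAT_DIM))
views = []
for _ in range(8):
    feats = rng.normal(size=(16, FEAT_DIM))
    views.append((feats, feats @ W_true))
losses = stream(views)
```

Because the update touches only the fixed‑size fast weights, per‑view cost does not grow with the number of views already processed, which is the source of the linear overall runtime.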
Results & Findings
| Dataset | Feed‑forward baseline (PSNR) | tttLRM (PSNR) | Prior SOTA, Gaussian Splatting (PSNR) |
|---|---|---|---|
| ShapeNet (objects) | 31.2 | 31.8 | 30.5 |
| ScanNet (indoor scenes) | 28.9 | 29.7 | 28.1 |
| Real‑world streaming (online) | — | stable convergence after 5 frames | diverges after 3 frames |
- Quality boost: tttLRM consistently outperforms feed‑forward baselines by 0.5–1.2 dB in PSNR and shows sharper edges and fewer ghosting artifacts.
- Faster convergence: thanks to the pre‑training on novel‑view synthesis, the TTT layer reaches near‑optimal reconstruction within 2–3 gradient steps per view, compared to 10+ steps for vanilla test‑time optimization.
- Scalability: runtime scales linearly; reconstructing a 100‑view indoor scene takes ~1.2 s on an RTX 4090, whereas comparable NeRF‑based methods exceed 10 s.
- Versatility: the same latent code was successfully decoded to meshes with comparable surface quality, demonstrating the framework’s format‑agnostic nature.
Practical Implications
- Real‑time AR/VR content capture – developers can stream video from a handheld device and obtain a continuously improving 3D model without costly offline optimization.
- Robotics & SLAM – the online variant enables on‑board robots to refine their world model as they explore, improving navigation and manipulation planning.
- Content pipelines for games/film – artists can ingest a modest number of reference photos and instantly generate high‑quality Gaussian‑splat representations ready for rendering pipelines that already support splat‑based rendering.
- Edge deployment – because the TTT layer is lightweight (few hundred KB of fast weights) and inference is linear, the approach fits on modern GPUs or even high‑end mobile SoCs, opening possibilities for on‑device 3D scanning.
- Transfer learning – pre‑training on large synthetic view‑synthesis datasets can be reused for downstream reconstruction tasks, reducing the data collection burden for specialized domains (e.g., medical imaging, cultural heritage).
Limitations & Future Work
- Fast‑weight capacity: The compact TTT representation may struggle with extremely large or highly detailed scenes (e.g., city‑scale reconstructions) where more expressive latent codes are needed.
- Dependency on good initial features: The quality of the final reconstruction hinges on the backbone encoder; poor feature extraction in low‑light or motion‑blurred frames can degrade performance.
- Limited explicit format support: While Gaussian splats are well‑studied, extending the decoder to mesh‑oriented pipelines (e.g., topology‑preserving meshes) requires additional research.
- Future directions suggested by the authors include hierarchical TTT layers for multi‑scale refinement, integration with differentiable rasterizers for end‑to‑end texture learning, and exploring self‑supervised loss functions that better handle dynamic scenes.
Authors
- Chen Wang
- Hao Tan
- Wang Yifan
- Zhiqin Chen
- Yuheng Liu
- Kalyan Sunkavalli
- Sai Bi
- Lingjie Liu
- Yiwei Hu
Paper Information
- arXiv ID: 2602.20160v1
- Categories: cs.CV
- Published: February 23, 2026