[Paper] LASER: Layer-wise Scale Alignment for Training-Free Streaming 4D Reconstruction
Source: arXiv - 2512.13680v1
Overview
The paper introduces LASER (Layer‑wise Scale Alignment for Training‑free Streaming 4D Reconstruction), a framework that turns high‑quality offline 3‑D reconstruction models into real‑time streaming systems without any additional training. By solving a subtle “layer‑scale” mismatch that occurs when stitching together depth predictions from consecutive video windows, LASER delivers offline‑level accuracy at interactive speeds (≈14 fps) and modest GPU memory (≈6 GB).
Key Contributions
- Training‑free streaming pipeline – converts any feed‑forward offline reconstructor (e.g., VGGT, π³) into a streaming system without re‑training or fine‑tuning.
- Layer‑wise scale alignment – a novel per‑depth‑layer scaling strategy that resolves the monocular scale ambiguity across temporal windows, outperforming naïve Sim(3) alignment.
- Memory‑efficient design – runs in time linear in the video length while keeping GPU memory bounded by the window size rather than the total video length, enabling kilometer‑scale video processing on a single RTX A6000.
- State‑of‑the‑art results – achieves the best published camera‑pose and point‑cloud quality among streaming methods while maintaining real‑time throughput.
- Open‑source release – code, pretrained models, and demo videos are publicly available.
Methodology
- Base Offline Model – LASER starts from any existing feed‑forward 4‑D reconstructor that predicts per‑pixel depth and camera pose for a short video clip (a “window”). These models are typically trained on large static datasets and excel at geometry quality, but they assume the whole clip is available at once.
- Temporal Windowing – The input video is split into overlapping windows (e.g., 8‑frame chunks). Each window is processed independently by the offline model, producing depth maps and poses for its frames.
- Layer Segmentation – Within each depth map, pixels are grouped into a small number of discrete depth “layers” (e.g., near, mid, far) by simple quantization of the predicted depth values (see the quantization sketch below).
- Scale Factor Estimation – For every layer, LASER computes a scale factor that best aligns the 3‑D points of the current window with those of the previous window. The alignment is solved by a closed‑form least‑squares formulation that respects the Sim(3) similarity transform per layer (see the scale‑fit sketch below).
- Propagation Across Time – The per‑layer scales are propagated forward and smoothed over adjacent windows to avoid jitter. The final camera poses and point clouds are then re‑scaled accordingly, yielding a globally consistent reconstruction (see the smoothing sketch below).
- Streaming Output – As each window finishes, the aligned points are streamed out, and only a small buffer of recent frames is kept in GPU memory, keeping the memory footprint linear in the window size.
The whole pipeline is training‑free: it only requires the pretrained offline model and a few minutes of offline calibration to set the number of layers and the smoothing parameters.
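The layer‑segmentation step can be pictured in a few lines of NumPy. The sketch below bins a single depth map into quantile‑based layers; the function name, the quantile binning rule, and the three‑layer default are illustrative assumptions, since the paper only states that layers come from simple quantization of the predicted depth values.

```python
import numpy as np

def segment_depth_layers(depth: np.ndarray, num_layers: int = 3) -> np.ndarray:
    """Assign each pixel of a depth map a discrete layer index in [0, num_layers)."""
    # Quantile edges adapt the bins to this window's depth distribution
    # (equal-width bins would equally satisfy the paper's "simple quantization").
    edges = np.quantile(depth, np.linspace(0.0, 1.0, num_layers + 1))
    # Use only the interior edges, so digitize returns 0 .. num_layers - 1.
    return np.digitize(depth, edges[1:-1], right=True)

# Toy example: a 2x4 depth map split into near / mid / far layers.
depth = np.array([[1.0, 1.2, 5.0, 5.2],
                  [9.0, 9.2, 1.1, 9.1]])
print(segment_depth_layers(depth))   # near/mid/far labels: [[0 0 1 1], [2 2 0 2]]
```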
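For the per‑layer scale solve, a minimal reading of the closed‑form least‑squares step is sketched below: each layer's scale is the 1‑D least‑squares fit between the current window's depths and the previous (already aligned) window's depths on their shared frames. The exact objective, the pixel‑count threshold, and the variable names are our assumptions; the paper formulates the alignment per layer under a Sim(3) similarity transform.

```python
import numpy as np

def per_layer_scales(depth_prev: np.ndarray,
                     depth_curr: np.ndarray,
                     labels: np.ndarray,
                     num_layers: int = 3,
                     eps: float = 1e-8,
                     min_pixels: int = 10) -> np.ndarray:
    """Closed-form least-squares scale per depth layer.

    For layer L, s_L minimizes sum_{p in L} (s_L * d_curr[p] - d_prev[p])^2,
    giving s_L = <d_curr, d_prev>_L / <d_curr, d_curr>_L.
    depth_prev / depth_curr are depths on the frames shared by two
    consecutive windows (the previous window is already globally aligned).
    """
    scales = np.ones(num_layers)
    for layer in range(num_layers):
        mask = labels == layer
        if mask.sum() < min_pixels:      # too little support: keep scale = 1
            continue
        d_c, d_p = depth_curr[mask], depth_prev[mask]
        scales[layer] = float(d_c @ d_p) / (float(d_c @ d_c) + eps)
    return scales
```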
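Finally, the temporal propagation and the bounded memory footprint can be sketched as an exponential moving average over the per‑layer scales plus a fixed‑length frame buffer. Both the EMA form and the momentum value are assumptions; the paper only states that scales are smoothed over adjacent windows and that a small buffer of recent frames is kept in memory.

```python
from collections import deque
import numpy as np

class ScaleSmoother:
    """Smooths per-layer scales across consecutive windows (EMA is an assumed choice)."""

    def __init__(self, num_layers: int = 3, momentum: float = 0.8):
        self.momentum = momentum
        self.scales = np.ones(num_layers)   # start at identity scale

    def update(self, new_scales: np.ndarray) -> np.ndarray:
        # Blend newly estimated scales with the running estimate to avoid jitter.
        self.scales = self.momentum * self.scales + (1.0 - self.momentum) * new_scales
        return self.scales

# Only the most recent frames are retained, so memory depends on the window
# size (here an 8-frame window), not on the total video length.
frame_buffer = deque(maxlen=8)
```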
Results & Findings
| Metric | Offline (VGGT) | Prior Streaming (Causal‑Attn) | LASER |
|---|---|---|---|
| Camera pose RMSE (m) | 0.032 | 0.058 | 0.034 |
| Point‑cloud F‑score @1 cm | 0.71 | 0.55 | 0.70 |
| Throughput (fps) | 2 (offline) | 10 | 14 |
| Peak GPU memory (GB) | 12 | 8 | 6 |
- Scale alignment matters: naïve Sim(3) alignment across whole frames leaves a systematic drift in depth, especially for far‑away layers. Layer‑wise scaling reduces this drift by > 70 %.
- Linear memory scaling: Memory grows with the window length, not with the total video length, enabling reconstruction of > 2 km of road footage on a single GPU.
- Robustness: The method works on diverse scenes (urban streets, indoor corridors, aerial footage) without any scene‑specific tuning.
Practical Implications
- Real‑time mapping for robotics & AR – Drones, autonomous cars, or handheld AR devices can obtain high‑fidelity 3‑D maps on the fly without the heavy training pipelines that current streaming methods demand.
- Cost‑effective deployment – Since LASER re‑uses existing offline models, companies can leverage their already‑trained networks and avoid expensive retraining on streaming data.
- Scalable cloud services – Streaming reconstruction can be offered as a SaaS product; the low memory footprint means a single GPU can serve many concurrent video streams.
- Rapid prototyping – Researchers can plug any new offline reconstructor into LASER and instantly evaluate its streaming performance, accelerating the iteration cycle.
Limitations & Future Work
- Layer granularity trade‑off – Choosing too few layers can leave residual scale errors; too many layers increase computational overhead. Adaptive layer selection is an open problem.
- Assumes moderate motion – Very fast camera motion or extreme depth discontinuities can break the linear scale propagation; integrating motion‑aware weighting could help.
- Only monocular depth – LASER currently works with monocular depth predictions; extending to stereo or multi‑view depth could further improve robustness.
- Evaluation on extreme scales – While kilometer‑scale tests are shown, handling city‑wide reconstructions (> 10 km) may require hierarchical buffering strategies, which the authors plan to explore.
Authors
- Tianye Ding
- Yiming Xie
- Yiqing Liang
- Moitreya Chatterjee
- Pedro Miraldo
- Huaizu Jiang
Paper Information
- arXiv ID: 2512.13680v1
- Categories: cs.CV
- Published: December 15, 2025