[Paper] LASER: Layer-wise Scale Alignment for Training-Free Streaming 4D Reconstruction

Published: December 15, 2025 at 01:59 PM EST
4 min read
Source: arXiv - 2512.13680v1

Overview

The paper introduces LASER (Layer‑wise Scale Alignment for Training‑Free Streaming 4D Reconstruction), a framework that turns high‑quality offline 3‑D reconstruction models into real‑time streaming systems without any additional training. By resolving a subtle per‑layer scale mismatch that arises when stitching together depth predictions from consecutive video windows, LASER delivers offline‑level accuracy at interactive speeds (≈14 fps) with a modest GPU memory footprint (≈6 GB).

Key Contributions

  • Training‑free streaming pipeline – converts any feed‑forward offline reconstructor (e.g., VGGT, π³) into a streaming system without re‑training or fine‑tuning.
  • Layer‑wise scale alignment – a novel per‑depth‑layer scaling strategy that resolves the monocular scale ambiguity across temporal windows, outperforming naïve Sim(3) alignment.
  • Memory‑efficient design – operates with linear‑time and linear‑memory complexity, enabling kilometer‑scale video processing on a single RTX A6000.
  • State‑of‑the‑art results – achieves the best published camera‑pose and point‑cloud quality among streaming methods while maintaining real‑time throughput.
  • Open‑source release – code, pretrained models, and demo videos are publicly available.

Methodology

  1. Base Offline Model – LASER starts from any existing feed‑forward 4‑D reconstructor that predicts per‑pixel depth and camera pose for a short video clip (a “window”). These models are typically trained on large static datasets and excel at geometry quality but assume the whole clip is available at once.

  2. Temporal Windowing – The input video is split into overlapping windows (e.g., 8‑frame chunks). Each window is processed independently by the offline model, producing depth maps and poses for its frames (see the windowing sketch after this list).

  3. Layer Segmentation – Within each depth map, pixels are grouped into a small number of discrete depth “layers” (e.g., near, mid, far). This is done by simple quantization of the predicted depth values.

  4. Scale Factor Estimation – For every layer, LASER computes a scale factor that best aligns the 3‑D points of the current window with those of the previous window. The alignment is solved by a closed‑form least‑squares formulation that respects the Sim(3) similarity transform per layer (see the scale‑alignment sketch below).

  5. Propagation Across Time – The per‑layer scales are propagated forward, smoothing them over adjacent windows to avoid jitter. The final camera poses and point clouds are then re‑scaled accordingly, yielding a globally consistent reconstruction.

  6. Streaming Output – As each window finishes, the aligned points are streamed out, and only a small buffer of recent frames is kept in GPU memory, keeping the memory footprint linear in the window size.
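
To make the streaming loop concrete, here is a minimal windowing sketch in Python. The 8‑frame window, 4‑frame stride, and function name are illustrative assumptions rather than the paper's exact settings; a real pipeline would run the offline reconstructor on each window, align it against the previous one, and stream out the rescaled points while discarding old frames.

```python
# Minimal sketch of the streaming windowing loop (steps 2 and 6).
# Window length, stride, and names are assumptions, not the authors' API.
from collections import deque
from typing import Iterable, Iterator, List

def sliding_windows(frames: Iterable, window: int = 8, stride: int = 4) -> Iterator[List]:
    """Yield overlapping windows of frames; memory stays O(window), not O(video length)."""
    buf = deque(maxlen=window)
    for i, frame in enumerate(frames):
        buf.append(frame)
        # Emit a window once the buffer is full and we have advanced by `stride` frames.
        if len(buf) == window and (i + 1 - window) % stride == 0:
            yield list(buf)

# Toy usage with integer "frames": each window would normally be passed to the
# offline reconstructor, aligned to the previous window, and streamed out.
for w in sliding_windows(range(20), window=8, stride=4):
    print(w[0], "...", w[-1])   # 0...7, 4...11, 8...15, 12...19
```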

The whole pipeline is training‑free: it only requires the pretrained offline model and a few minutes of offline calibration to set the number of layers and the smoothing parameters.
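
The core of steps 3 to 5 (layer quantization, closed‑form per‑layer scale estimation, and temporal smoothing) can be illustrated with a short sketch. This is a minimal illustration under assumed choices: quantile binning for the layers, an exponential moving average for the smoothing, and hypothetical function names; it is not the authors' implementation.

```python
# Minimal sketch of layer-wise scale alignment between two consecutive windows.
# Function names and the binning/smoothing choices are assumptions for illustration.
import numpy as np

def quantize_into_layers(depth, num_layers=3):
    """Assign each pixel to a discrete depth layer via quantile binning (step 3)."""
    edges = np.quantile(depth, np.linspace(0.0, 1.0, num_layers + 1))
    return np.clip(np.digitize(depth, edges[1:-1]), 0, num_layers - 1)

def per_layer_scale(depth_prev, depth_cur, labels, num_layers=3, eps=1e-8):
    """Closed-form least-squares scale per layer (step 4):
    s_k = argmin_s sum_i (s * d_cur_i - d_prev_i)^2 over pixels in layer k,
    which gives s_k = <d_cur, d_prev> / <d_cur, d_cur> restricted to that layer."""
    scales = np.ones(num_layers)
    for k in range(num_layers):
        mask = labels == k
        if mask.any():
            scales[k] = np.sum(depth_cur[mask] * depth_prev[mask]) / (np.sum(depth_cur[mask] ** 2) + eps)
    return scales

def smooth_scales(prev_scales, new_scales, momentum=0.8):
    """Exponential moving average of per-layer scales across windows (step 5)."""
    return momentum * prev_scales + (1.0 - momentum) * new_scales

# Toy usage: the same frame re-predicted in the next window with a global scale offset.
rng = np.random.default_rng(0)
depth_prev = rng.uniform(1.0, 50.0, size=(4, 6))   # depth from window t-1 (overlap frame)
depth_cur = 0.7 * depth_prev                        # window t re-predicts it 0.7x too small
labels = quantize_into_layers(depth_cur)
scales = per_layer_scale(depth_prev, depth_cur, labels)
print(scales)                                       # ~[1.43, 1.43, 1.43] undoes the 0.7x shrink
print(smooth_scales(np.ones(3), scales))            # blended with the previous window's scales
```

The paper describes the per‑layer fit as a Sim(3)‑style alignment on 3‑D points rather than on raw depth values; the depth‑only version above is only meant to show the closed‑form least‑squares structure.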

Results & Findings

| Metric | Offline (VGGT) | Prior Streaming (Causal‑Attn) | LASER |
| --- | --- | --- | --- |
| Camera pose RMSE (m) | 0.032 | 0.058 | 0.034 |
| Point‑cloud F‑score @ 1 cm | 0.71 | 0.55 | 0.70 |
| Throughput (fps) | 2 (offline) | 10 | 14 |
| Peak GPU memory (GB) | 12 | 8 | 6 |

  • Scale alignment matters: naïve Sim(3) alignment across whole frames leaves a systematic drift in depth, especially for far‑away layers. Layer‑wise scaling reduces this drift by > 70 %.
  • Linear memory scaling: Memory grows with the window length, not with the total video length, enabling reconstruction of > 2 km of road footage on a single GPU.
  • Robustness: The method works on diverse scenes (urban streets, indoor corridors, aerial footage) without any scene‑specific tuning.

Practical Implications

  • Real‑time mapping for robotics & AR – Drones, autonomous cars, or handheld AR devices can now obtain high‑fidelity 3‑D maps on‑the‑fly without the heavy training pipelines that current streaming methods demand.
  • Cost‑effective deployment – Since LASER re‑uses existing offline models, companies can leverage their already‑trained networks and avoid expensive retraining on streaming data.
  • Scalable cloud services – Streaming reconstruction can be offered as a SaaS product; the low memory footprint means a single GPU can serve many concurrent video streams.
  • Rapid prototyping – Researchers can plug any new offline reconstructor into LASER and instantly evaluate its streaming performance, accelerating the iteration cycle.

Limitations & Future Work

  • Layer granularity trade‑off – Choosing too few layers can leave residual scale errors; too many layers increase computational overhead. Adaptive layer selection is an open problem.
  • Assumes moderate motion – Very fast camera motion or extreme depth discontinuities can break the linear scale propagation; integrating motion‑aware weighting could help.
  • Only monocular depth – LASER currently works with monocular depth predictions; extending to stereo or multi‑view depth could further improve robustness.
  • Evaluation on extreme scales – While kilometer‑scale tests are shown, handling city‑wide reconstructions (> 10 km) may require hierarchical buffering strategies, which the authors plan to explore.

Authors

  • Tianye Ding
  • Yiming Xie
  • Yiqing Liang
  • Moitreya Chatterjee
  • Pedro Miraldo
  • Huaizu Jiang

Paper Information

  • arXiv ID: 2512.13680v1
  • Categories: cs.CV
  • Published: December 15, 2025