[Paper] LASER: Layer-wise Scale Alignment for Training-Free Streaming 4D Reconstruction
Source: arXiv - 2512.13680v1
Overview
The paper introduces LASER (Layer‑wise Scale Alignment for Training‑free Streaming 4D Reconstruction), a framework that turns high‑quality offline 3‑D reconstruction models into real‑time streaming systems without any additional training. By solving a subtle “layer‑scale” mismatch that occurs when stitching together depth predictions from consecutive video windows, LASER delivers offline‑level accuracy at interactive speeds (≈14 fps) and modest GPU memory (≈6 GB).
Key Contributions
- Training‑free streaming pipeline – converts any feed‑forward offline reconstructor (e.g., VGGT, π³) into a streaming system without re‑training or fine‑tuning.
- Layer‑wise scale alignment – a novel per‑depth‑layer scaling strategy that resolves the monocular scale ambiguity across temporal windows, outperforming naïve Sim(3) alignment.
- Memory‑efficient design – runs in time linear in the video length while keeping GPU memory bounded by the window size rather than the total video length, enabling kilometer‑scale video processing on a single RTX A6000.
- State‑of‑the‑art results – achieves the best published camera‑pose and point‑cloud quality among streaming methods while maintaining real‑time throughput.
- Open‑source release – code, pretrained models, and demo videos are publicly available.
Methodology
- Base Offline Model – LASER starts from any existing feed‑forward 4‑D reconstructor that predicts per‑pixel depth and camera pose for a short video clip (a “window”). These models are typically trained on large static datasets and excel at geometry quality, but they assume the whole clip is available at once.
- Temporal Windowing – The input video is split into overlapping windows (e.g., 8‑frame chunks). Each window is processed independently by the offline model, producing depth maps and poses for its frames.
- Layer Segmentation – Within each depth map, pixels are grouped into a small number of discrete depth “layers” (e.g., near, mid, far) by simple quantization of the predicted depth values (see the quantization sketch below).
- Scale Factor Estimation – For every layer, LASER computes a scale factor that best aligns the 3‑D points of the current window with those of the previous window. The alignment is solved by a closed‑form least‑squares formulation that respects the Sim(3) similarity transform per layer (see the scale‑fit sketch below).
- Propagation Across Time – The per‑layer scales are propagated forward and smoothed over adjacent windows to avoid jitter. The final camera poses and point clouds are then re‑scaled accordingly, yielding a globally consistent reconstruction (see the smoothing sketch below).
- Streaming Output – As each window finishes, the aligned points are streamed out, and only a small buffer of recent frames is kept in GPU memory, keeping the memory footprint linear in the window size.
The whole pipeline is training‑free: it only requires the pretrained offline model and a few minutes of offline calibration to set the number of layers and the smoothing parameters.
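The layer‑segmentation step can be pictured in a few lines of NumPy. The sketch below bins a single depth map into quantile‑based layers; the function name, the quantile binning rule, and the three‑layer default are illustrative assumptions, since the paper only states that layers come from simple quantization of the predicted depth values.

```python
import numpy as np

def segment_depth_layers(depth: np.ndarray, num_layers: int = 3) -> np.ndarray:
    """Assign each pixel of a depth map a discrete layer index in [0, num_layers)."""
    # Quantile edges adapt the bins to this window's depth distribution
    # (equal-width bins would equally satisfy the paper's "simple quantization").
    edges = np.quantile(depth, np.linspace(0.0, 1.0, num_layers + 1))
    # Use only the interior edges, so digitize returns 0 .. num_layers - 1.
    return np.digitize(depth, edges[1:-1], right=True)

# Toy example: a 2x4 depth map split into near / mid / far layers.
depth = np.array([[1.0, 1.2, 5.0, 5.2],
                  [9.0, 9.2, 1.1, 9.1]])
print(segment_depth_layers(depth))   # near/mid/far labels: [[0 0 1 1], [2 2 0 2]]
```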
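For the per‑layer scale solve, a minimal reading of the closed‑form least‑squares step is sketched below: each layer's scale is the 1‑D least‑squares fit between the current window's depths and the previous (already aligned) window's depths on their shared frames. The exact objective, the pixel‑count threshold, and the variable names are our assumptions; the paper formulates the alignment per layer under a Sim(3) similarity transform.

```python
import numpy as np

def per_layer_scales(depth_prev: np.ndarray,
                     depth_curr: np.ndarray,
                     labels: np.ndarray,
                     num_layers: int = 3,
                     eps: float = 1e-8,
                     min_pixels: int = 10) -> np.ndarray:
    """Closed-form least-squares scale per depth layer.

    For layer L, s_L minimizes sum_{p in L} (s_L * d_curr[p] - d_prev[p])^2,
    giving s_L = <d_curr, d_prev>_L / <d_curr, d_curr>_L.
    depth_prev / depth_curr are depths on the frames shared by two
    consecutive windows (the previous window is already globally aligned).
    """
    scales = np.ones(num_layers)
    for layer in range(num_layers):
        mask = labels == layer
        if mask.sum() < min_pixels:      # too little support: keep scale = 1
            continue
        d_c, d_p = depth_curr[mask], depth_prev[mask]
        scales[layer] = float(d_c @ d_p) / (float(d_c @ d_c) + eps)
    return scales
```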
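Finally, the temporal propagation and the bounded memory footprint can be sketched as an exponential moving average over the per‑layer scales plus a fixed‑length frame buffer. Both the EMA form and the momentum value are assumptions; the paper only states that scales are smoothed over adjacent windows and that a small buffer of recent frames is kept in memory.

```python
from collections import deque
import numpy as np

class ScaleSmoother:
    """Smooths per-layer scales across consecutive windows (EMA is an assumed choice)."""

    def __init__(self, num_layers: int = 3, momentum: float = 0.8):
        self.momentum = momentum
        self.scales = np.ones(num_layers)   # start at identity scale

    def update(self, new_scales: np.ndarray) -> np.ndarray:
        # Blend newly estimated scales with the running estimate to avoid jitter.
        self.scales = self.momentum * self.scales + (1.0 - self.momentum) * new_scales
        return self.scales

# Only the most recent frames are retained, so memory depends on the window
# size (here an 8-frame window), not on the total video length.
frame_buffer = deque(maxlen=8)
```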
Results & Findings
| Metric | Offline (VGGT) | Prior Streaming (Causal‑Attn) | LASER |
|---|---|---|---|
| Camera pose RMSE (m) | 0.032 | 0.058 | 0.034 |
| Point‑cloud F‑score @1 cm | 0.71 | 0.55 | 0.70 |
| Throughput (fps) | 2 (offline) | 10 | 14 |
| Peak GPU memory (GB) | 12 | 8 | 6 |
- Scale alignment matters: naïve Sim(3) alignment across whole frames leaves a systematic drift in depth, especially for far‑away layers. Layer‑wise scaling reduces this drift by > 70 %.
- Linear memory scaling: Memory grows with the window length, not with the total video length, enabling reconstruction of > 2 km of road footage on a single GPU.
- Robustness: The method works on diverse scenes (urban streets, indoor corridors, aerial footage) without any scene‑specific tuning.
Practical Implications
- Real‑time mapping for robotics & AR – Drones, autonomous cars, or handheld AR devices can obtain high‑fidelity 3‑D maps on the fly without the heavy training pipelines that current streaming methods demand.
- Cost‑effective deployment – Since LASER re‑uses existing offline models, companies can leverage their already‑trained networks and avoid expensive retraining on streaming data.
- Scalable cloud services – Streaming reconstruction can be offered as a SaaS product; the low memory footprint means a single GPU can serve many concurrent video streams.
- Rapid prototyping – Researchers can plug any new offline reconstructor into LASER and instantly evaluate its streaming performance, accelerating the iteration cycle.
Limitations & Future Work
- Layer granularity trade‑off – Choosing too few layers can leave residual scale errors; too many layers increase computational overhead. Adaptive layer selection is an open problem.
- Assumes moderate motion – Very fast camera motion or extreme depth discontinuities can break the linear scale propagation; integrating motion‑aware weighting could help.
- Only monocular depth – LASER currently works with monocular depth predictions; extending to stereo or multi‑view depth could further improve robustness.
- Evaluation on extreme scales – While kilometer‑scale tests are shown, handling city‑wide reconstructions (> 10 km) may require hierarchical buffering strategies, which the authors plan to explore.
Authors
- Tianye Ding
- Yiming Xie
- Yiqing Liang
- Moitreya Chatterjee
- Pedro Miraldo
- Huaizu Jiang
Paper Information
- arXiv ID: 2512.13680v1
- Categories: cs.CV
- Published: December 15, 2025