[Paper] SCE-SLAM: Scale-Consistent Monocular SLAM via Scene Coordinate Embeddings

Published: January 14, 2026
4 min read
Source: arXiv - 2601.09665v1

Overview

Monocular SLAM (Simultaneous Localization and Mapping) lets a single camera reconstruct 3‑D scenes and track its own motion—crucial for everything from AR apps that run on smartphones to autonomous‑driving stacks on low‑power hardware. The biggest pain point has been scale drift: over long video sequences the estimated size of objects and distances slowly diverges from reality. The new SCE‑SLAM system tackles this head‑on by learning scene coordinate embeddings that act as a global, scale‑aware reference, keeping the map “the right size” without sacrificing real‑time speed.

Key Contributions

  • Scene Coordinate Embeddings (SCE): Patch‑level descriptors that encode the 3‑D position of a pixel under a canonical scale, learned end‑to‑end.
  • Geometry‑Guided Aggregation: A novel attention mechanism that spreads scale information across frames using 3‑D spatial proximity, rather than just temporal adjacency.
  • Scene‑Coordinate Bundle Adjustment: An explicit global optimization step that ties current pose estimates to the learned canonical coordinates, directly correcting scale drift.
  • Real‑time performance: The full pipeline runs at ~36 FPS on a single GPU, matching or exceeding existing monocular SLAM systems.
  • Strong empirical gains: On KITTI the absolute trajectory error (ATE) drops by 8.36 m compared with the previous state‑of‑the‑art method, with similar improvements on Waymo and vKITTI datasets.

Methodology

  1. Feature Extraction & Embedding:
    • Input frames are passed through a CNN that outputs two streams: (a) traditional visual features for tracking, and (b) scene coordinate embeddings that predict a 3‑D point in a canonical coordinate system for each image patch.
  2. Geometry‑Guided Aggregation:
    • Instead of aggregating information only from the most recent keyframes, the system builds a spatial graph where nodes are patches and edges connect geometrically close points (using the current pose estimate).
    • A geometry‑modulated attention module then lets each patch borrow scale cues from its neighbors, effectively propagating reliable scale information from older, well‑observed parts of the map.
  3. Scene‑Coordinate Bundle Adjustment (SC‑BA):
    • The predicted 3‑D coordinates act as soft constraints in a global bundle adjustment.
    • The optimizer minimizes the reprojection error and the deviation of each patch’s predicted coordinate from the canonical reference, pulling the whole trajectory back to the correct scale.
  4. Loop Closure & Map Updating:
    • When a loop is detected, the same SC‑BA step aligns the looped segment to the canonical scale, eliminating accumulated drift.
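The aggregation idea in step 2 can be sketched in miniature. In this toy version the attention logits are simply a Gaussian kernel on 3‑D distance, so each patch mixes its neighbors' scale cues in proportion to geometric proximity; the paper's learned, geometry‑modulated attention module is more elaborate, and `sigma` and the kernel form here are illustrative assumptions.

```python
import math

def geometry_guided_aggregate(positions, scale_cues, sigma=1.0):
    """Toy geometry-modulated attention: each patch mixes its neighbors'
    scale cues weighted by 3-D proximity, not frame order.

    positions  : list of (x, y, z) patch coordinates from the current pose estimate
    scale_cues : one scalar scale estimate per patch
    sigma      : distance bandwidth (hypothetical parameter)
    """
    refined = []
    for pi in positions:
        # A Gaussian kernel on squared 3-D distance stands in for the
        # learned attention logits.
        weights = []
        for pj in positions:
            d2 = sum((a - b) ** 2 for a, b in zip(pi, pj))
            weights.append(math.exp(-d2 / (2 * sigma ** 2)))
        total = sum(weights)
        refined.append(sum(w * s for w, s in zip(weights, scale_cues)) / total)
    return refined
```

Patches that are geometrically close share scale information, while distant patches barely influence each other, which is the behavior that lets well‑observed map regions stabilize newer ones.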

All components are differentiable, allowing the network to be trained end‑to‑end on large driving datasets.
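The SC‑BA objective in step 3 can be written as a toy cost function. The sketch below assumes a translation‑only pose, a unit‑focal pinhole camera, and a hypothetical weight `lam`; it also illustrates why reprojection error alone cannot detect a point sliding along its viewing ray (the classic monocular scale ambiguity), while the scene‑coordinate term penalizes exactly that.

```python
def sc_ba_cost(pose_t, points, observations, pred_coords, lam=0.1, f=1.0):
    """Toy SC-BA objective for one frame: reprojection error plus a soft
    penalty tying each 3-D point to its predicted canonical coordinate.

    pose_t       : (tx, ty, tz) camera translation (rotation omitted for brevity)
    points       : current 3-D point estimates in the canonical frame
    observations : observed 2-D pixel coordinates (normalized)
    pred_coords  : network-predicted canonical 3-D coordinates per point
    lam          : weight of the scene-coordinate term (hypothetical)
    """
    cost = 0.0
    for (X, Y, Z), (u, v), (Xc, Yc, Zc) in zip(points, observations, pred_coords):
        # Point in the camera frame (translation-only pose for this sketch).
        x, y, z = X - pose_t[0], Y - pose_t[1], Z - pose_t[2]
        # Pinhole reprojection error: blind to scale changes along the ray.
        cost += (f * x / z - u) ** 2 + (f * y / z - v) ** 2
        # Soft scene-coordinate constraint: pulls the point back to
        # the canonical scale even when reprojection cannot.
        cost += lam * ((X - Xc) ** 2 + (Y - Yc) ** 2 + (Z - Zc) ** 2)
    return cost
```

Doubling a point's depth along the optical axis leaves the reprojection term at zero but raises the scene‑coordinate term, which is the mechanism by which SC‑BA corrects scale drift.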

Results & Findings

| Dataset | Metric (lower is better) | Prior Best | SCE‑SLAM |
| --- | --- | --- | --- |
| KITTI | Absolute Trajectory Error (m) | 12.84 | 4.48 (−8.36 m) |
| Waymo | ATE (m) | 9.21 | 3.97 |
| vKITTI | ATE (m) | 1.84 | 0.71 |

  • Scale Consistency: Across long sequences (up to 10 km), the estimated scale remains within 2 % of ground truth, whereas baseline methods drift beyond 10 %.
  • Speed: The full pipeline processes 36 frames per second on an RTX 3080, comparable to ORB‑SLAM2 and faster than most learning‑based SLAM systems that require heavy post‑processing.
  • Robustness: The geometry‑guided attention helps recover from rapid motion or temporary occlusions, keeping the map stable even when visual features are sparse.
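The scale‑consistency figure above can be made concrete with a toy metric: compare the total path length of the estimated trajectory against ground truth and report the relative deviation. The exact metric used in the paper is not specified here, so treat this as one plausible instantiation.

```python
def scale_error(est_traj, gt_traj):
    """Relative scale deviation between an estimated trajectory and
    ground truth, measured via total path length. A value of 0.02
    corresponds to the "within 2%" regime reported for SCE-SLAM.
    """
    def path_length(traj):
        # Sum of Euclidean distances between consecutive 3-D positions.
        return sum(
            sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
            for p, q in zip(traj, traj[1:])
        )
    return abs(path_length(est_traj) / path_length(gt_traj) - 1.0)
```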

Practical Implications

  • AR/VR on Mobile: Developers can now rely on a single rear camera for persistent world anchors without periodic manual re‑calibration.
  • Autonomous Vehicles & Drones: Scale‑consistent maps mean more reliable distance estimates for planning and collision avoidance, especially on platforms that cannot afford stereo rigs or LiDAR.
  • Robotics in Warehouse/Factory Settings: Low‑cost robots can maintain accurate metric maps over days of operation, simplifying tasks like inventory tracking or path planning.
  • Infrastructure for 3‑D Mapping Services: Companies that ingest internet video (e.g., street‑view services) can generate metrically accurate 3‑D models without needing GPS‑scale corrections.

Because SCE‑SLAM is end‑to‑end and runs in real time, it can be dropped into existing monocular SLAM pipelines with minimal engineering effort—just replace the feature backend with the provided model and enable the SC‑BA module.
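As a rough picture of what that drop‑in replacement looks like, the stub below wires a two‑stream backend into a pipeline behind a `use_sc_ba` switch. Every name here (`SCEBackendStub`, `extract`, `use_sc_ba`) is a hypothetical stand‑in; the paper's released interface may differ.

```python
class SCEBackendStub:
    """Illustrative stand-in for the paper's feature backend: one forward
    pass yields both tracking features and scene-coordinate embeddings."""

    def extract(self, frame):
        # A real model would run a CNN on `frame`; we return fixed-size
        # dummies to show the two output streams.
        features = [0.0] * 128                   # visual descriptors for tracking
        scene_coords = [(0.0, 0.0, 1.0)] * 4     # canonical 3-D point per patch
        return features, scene_coords


class MonocularPipelineStub:
    """Minimal SLAM pipeline shape: swap the backend, flip on SC-BA."""

    def __init__(self, backend, use_sc_ba=True):
        self.backend = backend
        self.use_sc_ba = use_sc_ba  # toggles the scene-coordinate BA term

    def process(self, frame):
        features, scene_coords = self.backend.extract(frame)
        # ...tracking would use `features`; when use_sc_ba is set, the
        # scene coordinates enter bundle adjustment as soft constraints.
        return {"features": features, "scene_coords": scene_coords}
```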

Limitations & Future Work

  • Training Data Dependency: The embeddings are learned on driving datasets; performance may degrade in indoor or highly unstructured environments without additional fine‑tuning.
  • GPU Requirement: Real‑time speeds were demonstrated on a high‑end GPU; embedded platforms may need model pruning or quantization.
  • Dynamic Objects: The current system assumes a mostly static scene; moving objects can corrupt the canonical coordinate predictions.
  • Future Directions: Authors suggest extending the embedding to handle dynamic scenes, exploring lightweight backbones for edge devices, and integrating semantic cues (e.g., object classes) to further stabilize scale in challenging conditions.

Authors

  • Yuchen Wu
  • Jiahe Li
  • Xiaohan Yu
  • Lina Yu
  • Jin Zheng
  • Xiao Bai

Paper Information

  • arXiv ID: 2601.09665v1
  • Categories: cs.CV
  • Published: January 14, 2026