[Paper] ImLoc: Revisiting Visual Localization with Image-based Representation
Source: arXiv - 2601.04185v1
Overview
The paper ImLoc revisits visual localization—a core capability for AR, robotics, and autonomous navigation—by marrying the simplicity of 2‑D image‑based maps with the geometric richness of depth information. By attaching per‑image depth maps and leveraging modern dense matchers, the authors achieve state‑of‑the‑art accuracy while keeping storage and update costs low, making the approach attractive for real‑world deployments.
Key Contributions
- Image‑centric map enriched with depth: Each reference image is paired with a dense depth estimate, enabling geometric reasoning without a full 3‑D reconstruction.
- Dense matching pipeline: Utilizes recent learned dense matchers (e.g., LoFTR) to obtain reliable correspondences even under severe viewpoint or illumination changes.
- GPU‑accelerated LO‑RANSAC: A highly parallel RANSAC variant that runs on the GPU, dramatically speeding up pose verification.
- Compact compression scheme: Demonstrates that the image‑plus‑depth representation can be stored at a fraction of the size of traditional SfM point clouds while preserving accuracy.
- State‑of‑the‑art results: Sets new benchmarks on several public localization datasets, outperforming both classic 2‑D methods and memory‑efficient 3‑D approaches.
Methodology
1. Map Construction
- Collect a set of reference images covering the target environment.
- Run a depth‑estimation network (e.g., MiDaS or a multi‑view stereo module) on each image to produce a dense depth map.
- Store the RGB image, its depth map, and the associated camera intrinsics (a minimal data layout is sketched below).
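For concreteness, here is a minimal sketch of what one map entry might hold in a Python/NumPy pipeline. The field names are illustrative, and the reference camera pose is an assumption added here because the later back‑projection step needs it; the paper's list covers only the image, depth map, and intrinsics.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class MapEntry:
    """One reference view in the image-based map (illustrative layout)."""
    rgb: np.ndarray    # (H, W, 3) uint8 reference image
    depth: np.ndarray  # (H, W) float32 per-pixel depth in meters
    K: np.ndarray      # (3, 3) camera intrinsics
    T_c2w: np.ndarray  # (4, 4) camera-to-world pose (assumed; needed to place
                       # back-projected points in a common world frame)
```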
2. Query Processing
- For a new query image, extract dense features with a learned dense matcher (e.g., LoFTR).
- Perform dense correspondence search against all reference images (or a hierarchically retrieved subset) to obtain 2‑D‑2‑D matches.
- Convert the matches to 2‑D‑3‑D correspondences by back‑projecting each matched reference pixel through its depth value (see the sketch below).
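A minimal NumPy sketch of this lifting step, assuming metric depth, a camera‑to‑world reference pose, and a nearest‑pixel depth lookup; the names are illustrative, not from the paper.

```python
import numpy as np

def matches_to_2d3d(q_pts, r_pts, depth, K_ref, T_c2w_ref):
    """Lift 2D-2D matches into 2D-3D correspondences via the reference depth."""
    # Nearest-pixel depth lookup at the matched reference coordinates.
    u = np.round(r_pts[:, 0]).astype(int)
    v = np.round(r_pts[:, 1]).astype(int)
    d = depth[v, u]
    valid = d > 0                    # discard pixels with no depth estimate
    ones = np.ones(int(valid.sum()))
    # Back-project: X_cam = d * K^-1 * [u, v, 1]^T
    rays = np.linalg.inv(K_ref) @ np.stack([u[valid], v[valid], ones])
    X_cam = rays * d[valid]          # (3, M) points in the reference camera frame
    X_world = (T_c2w_ref @ np.vstack([X_cam, ones[None]]))[:3].T
    return q_pts[valid], X_world     # (M, 2) query pixels, (M, 3) world points
```

The returned query pixels and world points feed directly into the pose solver in the next step.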
3. Pose Estimation
- Feed the 2‑D‑3‑D correspondences into a GPU‑accelerated LO‑RANSAC loop that jointly refines the pose and discards outliers.
- The LO‑RANSAC implementation exploits parallelism to evaluate many hypotheses simultaneously, achieving millisecond‑scale runtimes on a modern GPU (a CPU analogue is sketched below).
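The authors' GPU implementation is not reproduced here. As a rough CPU stand‑in, OpenCV's `solvePnPRansac` solves the same robust 2‑D‑3‑D pose problem, just without the local‑optimization refinement and parallel hypothesis scoring that give ImLoc its speed.

```python
import numpy as np
import cv2

def estimate_pose(q_pts2d, X_world, K_query, reproj_px=3.0):
    """Robust PnP from 2D-3D correspondences (CPU analogue of GPU LO-RANSAC)."""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        X_world.astype(np.float64),   # (N, 3) back-projected map points
        q_pts2d.astype(np.float64),   # (N, 2) matched query pixels
        K_query.astype(np.float64),
        None,                         # assume undistorted images
        reprojectionError=reproj_px,
        iterationsCount=1000,
        confidence=0.999,
    )
    if not ok:
        return None
    R, _ = cv2.Rodrigues(rvec)        # axis-angle -> 3x3 rotation matrix
    return R, tvec.reshape(3), inliers.ravel()
```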
4. Compression & Trade‑offs
- Depth maps are quantized and compressed (e.g., PNG plus bit‑plane reduction) to keep the map size low (see the sketch after this list).
- Users can adjust the compression level to balance memory usage against localization precision.
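A simple uniform‑quantization sketch of the storage idea, using 16‑bit PNG as a lossless container. The paper's actual scheme (bit‑plane reduction on top of PNG) may differ in its details, and `max_depth` is an assumed scene‑dependent parameter.

```python
import numpy as np
import cv2

def save_depth(depth, path, max_depth=100.0):
    """Quantize a float32 depth map (meters) to uint16 and write a PNG."""
    q = np.clip(depth / max_depth, 0.0, 1.0)
    cv2.imwrite(path, (q * 65535.0).astype(np.uint16))  # PNG stores uint16 losslessly

def load_depth(path, max_depth=100.0):
    """Invert the quantization on load."""
    raw = cv2.imread(path, cv2.IMREAD_UNCHANGED)
    return raw.astype(np.float32) / 65535.0 * max_depth
```

Coarser quantization (fewer effective bits) shrinks the map further at the cost of depth, and hence pose, precision, which is exactly the trade‑off the compression level exposes.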
Results & Findings
| Dataset | Median Position Error (m) | Median Orientation Error (°) | Map Size (MB) |
|---|---|---|---|
| Aachen Day‑Night | 0.12 | 0.25 | 45 |
| 12Scenes (Office) | 0.03 | 0.12 | 38 |
| CMU Seasons (Winter) | 0.18 | 0.31 | 52 |
- Accuracy: ImLoc consistently outperforms classic image‑retrieval + PnP pipelines (e.g., NetVLAD + SIFT) and rivals full 3‑D SfM methods, especially under challenging lighting or viewpoint shifts.
- Speed: End‑to‑end query time (feature extraction + matching + LO‑RANSAC) averages 30–50 ms on an RTX 3080, suitable for real‑time applications.
- Memory: The image‑plus‑depth representation is 3–5× smaller than a comparable sparse point cloud while delivering higher recall.
Practical Implications
- AR/VR content anchoring: Developers can ship lightweight maps that still support sub‑meter pose accuracy, reducing app download size and simplifying map updates.
- Robotics & drones: On‑board GPUs can run ImLoc in real time, enabling precise navigation in GPS‑denied environments without the overhead of maintaining a dense 3‑D map.
- Scalable map maintenance: Adding or removing a location only requires updating the corresponding image and its depth map—no global bundle adjustment is needed.
- Edge deployment: The compressed representation fits comfortably on mobile or embedded storage, and the GPU‑centric pipeline can be ported to mobile GPUs (e.g., Vulkan‑compatible devices).
Limitations & Future Work
- Depth quality dependence: The method assumes reasonably accurate per‑image depth; errors in depth estimation can propagate to pose errors, especially on texture‑less surfaces.
- Initial image coverage: Sparse or uneven reference image distribution can lead to blind spots; the system still benefits from a well‑planned capture strategy.
- GPU requirement: While the GPU‑accelerated RANSAC yields speed gains, CPU‑only deployments will see slower runtimes.
- Future directions: The authors suggest integrating self‑supervised depth refinement, exploring hybrid sparse‑dense representations, and extending the pipeline to multi‑camera rigs for broader field‑of‑view coverage.
Authors
- Xudong Jiang
- Fangjinhua Wang
- Silvano Galliani
- Christoph Vogel
- Marc Pollefeys
Paper Information
- arXiv ID: 2601.04185v1
- Categories: cs.CV
- Published: January 7, 2026