# [Paper] M³: Dense Matching Meets Multi-View Foundation Models for Monocular Gaussian Splatting SLAM
Source: arXiv - 2603.16844v1
## Overview
The paper introduces M³, a novel SLAM system that blends dense matching with multi‑view foundation models to achieve high‑quality monocular Gaussian‑splatting reconstruction in real time. By tightening the loop between pose estimation and dense correspondence, M³ pushes the limits of streaming 3D reconstruction from a single moving camera, delivering both more accurate trajectories and sharper scene renderings.
## Key Contributions
- Matching‑augmented foundation model – adds a dedicated dense‑matching head to a multi‑view vision foundation model, delivering sub‑pixel correspondences suitable for geometric optimization.
- Monocular Gaussian‑splatting SLAM – integrates the refined matches into a Gaussian‑splatting representation, enabling fast online scene updates while preserving high‑frequency detail.
- Dynamic area suppression & cross‑inference alignment – two stabilization mechanisms: the first masks out dynamic or low‑texture regions that would corrupt tracking, and the second keeps camera intrinsics consistent across inference passes.
- State‑of‑the‑art performance – achieves a 64.3 % reduction in ATE RMSE over VGGT‑SLAM 2.0 and a 2.11 dB PSNR gain over ARTDECO on the challenging ScanNet++ benchmark.
- Extensive real‑world validation – evaluated on diverse indoor and outdoor video sequences, demonstrating robustness across lighting, motion speed, and scene complexity.
## Methodology
- Backbone foundation model – starts from a pre‑trained multi‑view transformer that predicts coarse camera poses and feature maps from a monocular video stream.
- Matching head – a lightweight convolutional module that takes the backbone’s feature maps and produces dense, pixel‑wise correspondences between consecutive frames. These matches are refined to sub‑pixel accuracy using a differentiable correlation layer.
- Pose refinement loop – the dense matches feed a classic bundle‑adjustment‑style optimizer that updates the camera trajectory, now with the precision needed for geometry‑centric SLAM.
- Gaussian splatting representation – the scene is modeled as a collection of 3D Gaussians (position, covariance, color, opacity). As new frames arrive, the optimizer updates existing Gaussians and spawns new ones, keeping the rendering pipeline real‑time.
- Stability mechanisms:
  - Dynamic area suppression masks out regions with high motion or low texture to avoid corrupting the match signal.
  - Cross‑inference intrinsic alignment enforces consistency of camera intrinsics across forward and backward passes, reducing drift.
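The dynamic‑area‑suppression idea can be sketched as a per‑pixel reliability test: drop pixels with large apparent motion (likely dynamic objects) or low local texture (ambiguous matches). The following is a minimal NumPy sketch, not the paper's implementation; the `suppression_mask` helper, thresholds, and window size are all illustrative assumptions:

```python
import numpy as np

def suppression_mask(flow, gray, flow_thresh=2.0, texture_thresh=1e-3, win=5):
    """Mask out pixels that are unreliable for dense matching.

    flow:  (H, W, 2) optical-flow field between consecutive frames.
    gray:  (H, W) grayscale image in [0, 1].
    Returns a boolean (H, W) mask: True = keep pixel for optimization.
    Thresholds and window size are illustrative, not the paper's values.
    """
    # High apparent motion -> likely a dynamic object; suppress.
    motion = np.linalg.norm(flow, axis=-1)
    dynamic = motion > flow_thresh

    # Low local intensity variance -> texture-poor region; matches there
    # are ambiguous, so suppress those pixels as well.
    pad = win // 2
    padded = np.pad(gray, pad, mode="edge")
    H, W = gray.shape
    var = np.empty((H, W))
    for i in range(H):          # a sliding sum table would be faster;
        for j in range(W):      # kept naive for clarity
            var[i, j] = padded[i:i + win, j:j + win].var()
    low_texture = var < texture_thresh

    return ~(dynamic | low_texture)
```

A mask like this could gate both the matching loss and the pose‑refinement residuals, so that moving or featureless pixels never enter the optimizer.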
All components run on a single GPU, allowing the system to process video at near‑real‑time speeds (≈15 fps on an RTX 3080).
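As a concrete illustration of the sub‑pixel refinement, a soft‑argmax over a local correlation patch is one common way to realize a differentiable correlation layer. The sketch below is a NumPy stand‑in under that assumption, not the paper's actual layer; the temperature is an illustrative hyper‑parameter:

```python
import numpy as np

def subpixel_match(corr, temperature=1.0):
    """Soft-argmax over a local correlation patch -> sub-pixel offset.

    corr: (2r+1, 2r+1) correlation scores of a query feature against a
    neighborhood of candidate positions, centered on the integer match.
    Returns (dy, dx) in pixels, each in [-r, r]. Because the soft-argmax
    is a weighted average of grid offsets, it is differentiable and can
    be trained end to end, unlike a hard argmax.
    """
    r = corr.shape[0] // 2
    w = np.exp((corr - corr.max()) / temperature)  # numerically stable softmax
    w /= w.sum()
    ys, xs = np.mgrid[-r:r + 1, -r:r + 1]          # integer offset grids
    return float((w * ys).sum()), float((w * xs).sum())
```

With a sharply peaked `corr` this reduces to the integer argmax; a broader peak lets the weighted average interpolate between grid cells, which is where the sub‑pixel accuracy comes from.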
## Results & Findings
| Benchmark | Metric | M³ | VGGT‑SLAM 2.0 | ARTDECO |
|---|---|---|---|---|
| ScanNet++ (indoor) | Pose ATE RMSE (m) | 0.032 | 0.089 | – |
| ScanNet++ | Reconstruction PSNR (dB) | 28.7 | – | 26.59 |
| Outdoor (KITTI‑raw) | Pose ATE RMSE (m) | 0.058 | 0.162 | – |
- Pose accuracy improves dramatically because the dense matches eliminate the “pixel‑level drift” typical of feed‑forward pose heads.
- Visual quality of the reconstructed scene (Gaussian splats) is noticeably sharper, especially around edges and thin structures.
- Robustness tests show that M³ maintains stable tracking even when up to 30 % of the frame contains moving objects, thanks to the dynamic area suppression.
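The headline pose metric above is simple to reproduce: ATE RMSE is the root‑mean‑square of per‑frame camera‑position errors. A minimal sketch, assuming the two trajectories are already time‑matched and aligned (real evaluations first apply a similarity alignment, which is omitted here):

```python
import numpy as np

def ate_rmse(est, gt):
    """Absolute trajectory error (RMSE) between estimated and ground-truth
    camera positions, both (N, 3) arrays with matched timestamps.

    Assumes the trajectories are already aligned; benchmark tooling
    normally estimates a similarity transform first.
    """
    err = np.linalg.norm(est - gt, axis=1)   # per-frame position error (m)
    return float(np.sqrt(np.mean(err ** 2)))
```

On this metric, the reported 0.032 m for M³ versus 0.089 m for VGGT‑SLAM 2.0 on ScanNet++ corresponds to the ~64 % error reduction cited above.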
## Practical Implications
- AR/VR content creation – developers can now capture high‑fidelity 3D assets from a single phone camera without a calibration rig, speeding up pipeline prototyping.
- Robotics navigation – the tighter pose‑reconstruction loop yields more reliable localization in texture‑poor or dynamic environments, useful for indoor service robots or drones.
- Game engine integration – Gaussian splatting is already supported in modern renderers (e.g., Unity, Unreal). M³’s online splat generation means developers can stream live “digital twins” directly into these engines.
- Edge deployment – the system’s GPU‑friendly design (no heavy 3D voxel grids) makes it feasible on high‑end mobile devices or embedded platforms for on‑device mapping.
## Limitations & Future Work
- Reliance on GPU acceleration – real‑time performance still hinges on a dedicated GPU; CPU‑only or low‑power devices may struggle.
- Handling extreme motion blur – while dynamic area suppression mitigates some motion artifacts, very fast camera motion can still break the dense matching.
- Scalability to very large scenes – the current Gaussian splatting implementation grows linearly with scene size; hierarchical or streaming strategies are needed for city‑scale reconstructions.
- Future directions suggested by the authors include integrating learned depth priors to further reduce reliance on dense matches, and exploring transformer‑based pose refinement to eliminate the separate optimizer loop.
## Authors
- Kerui Ren
- Guanghao Li
- Changjian Jiang
- Yingxiang Xu
- Tao Lu
- Linning Xu
- Junting Dong
- Jiangmiao Pang
- Mulin Yu
- Bo Dai
## Paper Information
- arXiv ID: 2603.16844v1
- Categories: cs.CV
- Published: March 17, 2026