[Paper] UniCorrn: Unified Correspondence Transformer Across 2D and 3D

Published: May 5, 2026 at 01:58 PM EDT

Source: arXiv - 2605.04044v1

Overview

UniCorrn introduces a single Transformer model that finds correspondences across image‑to‑image (2D‑2D), image‑to‑point‑cloud (2D‑3D), and point‑cloud‑to‑point‑cloud (3D‑3D) data. With one set of weights shared across the three tasks, it outperforms specialized state‑of‑the‑art methods on 2D‑3D and 3D‑3D registration benchmarks and remains competitive on 2D‑2D matching.

Key Contributions

First unified correspondence transformer that works for 2D‑2D, 2D‑3D, and 3D‑3D matching with a single set of parameters.
Dual‑stream decoder that keeps appearance (texture) and positional (geometry) features separate, enabling accurate cross‑modal similarity computation.
Modality‑agnostic encoder/decoder built on top of existing 2D (CNN) and 3D (PointNet/Transformer) backbones, allowing easy integration with common vision pipelines.
Joint training on mixed data (synthetic pseudo‑point clouds from depth maps + real 3D correspondence labels) to learn a robust, cross‑modal feature space.
State‑of‑the‑art performance: +8 % registration recall on 7Scenes (2D‑3D) and +10 % on 3DLoMatch (3D‑3D) while staying competitive on classic 2D‑2D benchmarks.

Methodology

Backbone extraction – Separate feature extractors process each input modality: a CNN for RGB images, and a point‑cloud encoder (e.g., PointNet++ or a small Transformer) for 3D data.
Shared Transformer encoder – The extracted tokens (image patches + point embeddings) are concatenated and fed into a standard Transformer encoder. Self‑attention naturally aligns features across modalities, learning a joint similarity metric.
Dual‑stream decoder – After encoding, the model splits into two parallel streams:
- Appearance stream – focuses on texture/color cues (useful for 2D‑2D).
- Positional stream – emphasizes geometric coordinates (critical for 2D‑3D and 3D‑3D).
  Each stream applies cross‑attention between a set of query tokens, which represent the points to be matched, and the encoded target tokens.
Query‑based correspondence – For any source‑target pair, the model receives a small set of query tokens (e.g., keypoints in the source). The decoder returns the most similar tokens in the target modality, yielding the correspondence.
Training strategy – The authors combine:
- Synthetic pseudo‑point clouds generated from depth maps to boost 2D‑3D coverage.
- Real 3D‑3D correspondence annotations from datasets like 3DLoMatch.
  A multi‑task loss (contrastive + geometric consistency) encourages the shared weights to perform well on all three matching problems simultaneously.
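The query‑based matching step and the contrastive term of the loss can be illustrated together with a small, dependency‑free sketch. It is not the authors' implementation: the function names are hypothetical, each token is modelled as an (appearance, position) pair of feature vectors, the two streams are fused by simply summing their cosine similarities, and the temperature of 0.07 is an arbitrary choice. The geometric‑consistency term is left out.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def similarity_matrix(queries, targets):
    """Score every query against every target token.

    Each token is a hypothetical (appearance, position) pair. Summing the
    per-stream cosine similarities is an assumption of this sketch, not
    the paper's fusion rule.
    """
    return [[cosine(qa, ta) + cosine(qp, tp) for (ta, tp) in targets]
            for (qa, qp) in queries]

def match(sim):
    """Index of the best-scoring target for each query (argmax per row)."""
    return [max(range(len(row)), key=row.__getitem__) for row in sim]

def info_nce(sim, positives, temperature=0.07):
    """Mean InfoNCE loss: cross-entropy of softmax(sim / T) at the true match."""
    total = 0.0
    for row, pos in zip(sim, positives):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[pos]
    return total / len(sim)

# Toy data: 2 source queries, 3 target tokens, 2-D features per stream.
queries = [([1.0, 0.0], [0.0, 1.0]),
           ([0.0, 1.0], [1.0, 0.0])]
targets = [([0.0, 1.0], [1.0, 0.1]),     # resembles query 1
           ([1.0, 0.1], [0.1, 1.0]),     # resembles query 0
           ([-1.0, 0.0], [0.0, -1.0])]   # resembles neither query

sim = similarity_matrix(queries, targets)
matches = match(sim)                          # best target index per query
loss_good = info_nce(sim, positives=[1, 0])   # labels agree with the scores
loss_bad = info_nce(sim, positives=[2, 2])    # labels contradict the scores
```

In this toy setup the loss is low when the labelled positives are also the highest‑scoring targets and high otherwise, which is the pressure that pulls matching tokens from different modalities together in a shared feature space.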

Results & Findings

Task	Benchmark	Recall @ 5°	Change vs. prior SOTA
2D‑2D	HPatches	≈ 0.85	On par with dedicated models
2D‑3D	7Scenes	0.78	+8 %
3D‑3D	3DLoMatch	0.71	+10 %
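The summary does not define "Recall @ 5°". One plausible reading, taken here as an assumption, is the fraction of test pairs whose estimated rotation lies within 5° of the ground truth. The sketch below computes that quantity; the helper names are hypothetical, and the benchmarks' official protocols may use different or additional criteria (for example a translation or RMSE threshold).

```python
import math

def rotation_error_deg(r_est, r_gt):
    """Geodesic angle in degrees between two 3x3 rotation matrices."""
    # trace(R_est^T R_gt) equals the sum of elementwise products.
    trace = sum(r_est[i][j] * r_gt[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp rounding noise
    return math.degrees(math.acos(cos_angle))

def recall_at(errors_deg, threshold_deg=5.0):
    """Fraction of errors at or below the threshold (assumed metric definition)."""
    return sum(e <= threshold_deg for e in errors_deg) / len(errors_deg)

def rot_z(deg):
    """Rotation about the z-axis, used only to build toy estimates."""
    a = math.radians(deg)
    return [[math.cos(a), -math.sin(a), 0.0],
            [math.sin(a), math.cos(a), 0.0],
            [0.0, 0.0, 1.0]]

# Four toy estimates that are off by 2, 4, 10 and 30 degrees.
ground_truth = rot_z(0.0)
estimates = [rot_z(d) for d in (2.0, 4.0, 10.0, 30.0)]
errors = [rotation_error_deg(r, ground_truth) for r in estimates]
recall = recall_at(errors, threshold_deg=5.0)  # two of the four pass
```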

The unified model does not sacrifice accuracy on any single task despite sharing parameters.
Ablation studies show the dual‑stream decoder contributes most of the gain for 2D‑3D and 3D‑3D, confirming the importance of separating appearance and geometry.
Training with mixed synthetic/real data yields a more robust feature space that generalizes to unseen scenes and sensor modalities.

Practical Implications

Simplified pipelines – Developers working on SLAM, AR, or robotics can replace three separate matching models with a single UniCorrn instance that handles visual odometry (2D‑2D), pose estimation from RGB‑D (2D‑3D), and point‑cloud registration (3D‑3D).
Reduced memory and deployment cost – Shared weights mean a smaller overall footprint, which is valuable for edge devices (e.g., drones, AR glasses).
Easier data collection – Because the model can be trained on mixed synthetic and real data, teams can bootstrap 2D‑3D capabilities without gathering large amounts of annotated 3D point‑cloud correspondences.
Cross‑modal research – The architecture opens doors for future work that mixes modalities beyond images and point clouds, such as LiDAR‑camera fusion or multi‑spectral matching.

Limitations & Future Work

Dependence on quality of synthetic depth – The pseudo‑point clouds are only as good as the depth estimation; noisy depth can hurt 2D‑3D performance.
Scalability to very large point clouds – While the Transformer encoder handles moderate sizes, extremely dense 3D scans may require hierarchical or sparse attention mechanisms.
Limited exploration of dynamic scenes – The current experiments focus on static geometry; extending UniCorrn to handle moving objects or temporal consistency is an open direction.
Future work suggested by the authors includes: integrating sparse‑attention Transformers for scalability, adding temporal query streams for video‑based correspondence, and expanding the training set with more diverse sensor modalities (e.g., thermal, radar).
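The sensitivity to depth quality noted in the first limitation follows from how a depth map becomes a pseudo‑point cloud. The sketch below uses the standard pinhole back‑projection; whether the authors use exactly this procedure is not stated in the summary, and the intrinsics (fx, fy, cx, cy) and the 5 % depth error are made‑up values for illustration.

```python
import math

def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with metric depth to a 3D point in the camera frame."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)

def depth_map_to_points(depth_map, fx, fy, cx, cy):
    """Convert a row-major depth map (list of rows) into a list of 3D points."""
    points = []
    for v, row in enumerate(depth_map):
        for u, d in enumerate(row):
            if d > 0:  # treat non-positive depth as a missing measurement
                points.append(backproject(u, v, d, fx, fy, cx, cy))
    return points

# Hypothetical intrinsics for a 640x480 camera.
fx, fy, cx, cy = 500.0, 500.0, 320.0, 240.0

# The same pixel with the true depth and with a 5 % overestimate.
clean = backproject(620, 240, 2.0, fx, fy, cx, cy)
noisy = backproject(620, 240, 2.1, fx, fy, cx, cy)
shift = math.dist(clean, noisy)  # 3D displacement caused by the depth error

# A tiny 2x2 depth map with two missing pixels yields two points.
cloud = depth_map_to_points([[0.0, 1.0], [2.0, 0.0]], fx, fy, cx, cy)
```

With these assumed values, a 5 % depth error at 2 m moves the point by roughly 0.12 m, and the displacement grows for pixels farther from the principal point because the error also scales the x and y coordinates.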

Authors

Prajnan Goswami
Tianye Ding
Feng Liu
Huaizu Jiang

Paper Information

arXiv ID: 2605.04044v1
Categories: cs.CV
Published: May 5, 2026
