[Paper] UniCorrn: Unified Correspondence Transformer Across 2D and 3D
Source: arXiv - 2605.04044v1
Overview
UniCorrn introduces a single, unified Transformer model that can find correspondences across image‑to‑image (2D‑2D), image‑to‑point‑cloud (2D‑3D), and point‑cloud‑to‑point‑cloud (3D‑3D) data. By sharing weights across these three tasks, the paper shows that a common architecture can match or outperform specialized state‑of‑the‑art methods, with the largest gains on 2D‑3D and 3D‑3D registration benchmarks.
Key Contributions
- First unified correspondence transformer that works for 2D‑2D, 2D‑3D, and 3D‑3D matching with a single set of parameters.
- Dual‑stream decoder that keeps appearance (texture) and positional (geometry) features separate, enabling accurate cross‑modal similarity computation.
- Modality‑agnostic encoder/decoder built on top of existing 2D (CNN) and 3D (PointNet/Transformer) backbones, allowing easy integration with common vision pipelines.
- Joint training on mixed data (synthetic pseudo‑point clouds from depth maps + real 3D correspondence labels) to learn a robust, cross‑modal feature space.
- State‑of‑the‑art performance: +8 % registration recall on 7Scenes (2D‑3D) and +10 % on 3DLoMatch (3D‑3D) while staying competitive on classic 2D‑2D benchmarks.
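To make the dual-stream idea from the list above concrete, here is a minimal NumPy sketch that scores query tokens against target tokens with separate appearance and positional similarities and then mixes them. It illustrates the general idea only and is not the paper's decoder: the function names, the cosine similarity, and the `pos_weight` mixing factor are all assumptions made for this example.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def dual_stream_match(q_app, q_pos, t_app, t_pos, pos_weight=0.5):
    """Match each query token to its most similar target token.

    q_app, q_pos: (Nq, D) appearance / positional features of the queries.
    t_app, t_pos: (Nt, D) appearance / positional features of the targets.
    pos_weight:   hypothetical mixing weight in [0, 1] (not from the paper).
    """
    sim_app = l2_normalize(q_app) @ l2_normalize(t_app).T  # (Nq, Nt)
    sim_pos = l2_normalize(q_pos) @ l2_normalize(t_pos).T  # (Nq, Nt)
    sim = (1.0 - pos_weight) * sim_app + pos_weight * sim_pos
    return sim.argmax(axis=1), sim


# Toy data: queries are noisy copies of target tokens 3, 0 and 5.
rng = np.random.default_rng(0)
t_app = rng.normal(size=(6, 16))
t_pos = rng.normal(size=(6, 16))
true_idx = np.array([3, 0, 5])
q_app = t_app[true_idx] + 0.01 * rng.normal(size=(3, 16))
q_pos = t_pos[true_idx] + 0.01 * rng.normal(size=(3, 16))

matches, sim = dual_stream_match(q_app, q_pos, t_app, t_pos)
print(matches)
```

Keeping the two similarity matrices separate until the final mix means the weighting between texture and geometry can be changed per task without touching the features themselves.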
Methodology
- Backbone extraction – Separate feature extractors process each input modality: a CNN for RGB images, and a point‑cloud encoder (e.g., PointNet++ or a small Transformer) for 3D data.
- Shared Transformer encoder – The extracted tokens (image patches + point embeddings) are concatenated and fed into a standard Transformer encoder. Self‑attention lets tokens from both modalities attend to each other, so the encoder can learn a joint similarity metric.
- Dual‑stream decoder – After encoding, the model splits into two parallel streams:
  - Appearance stream – focuses on texture/color cues (useful for 2D‑2D).
  - Positional stream – emphasizes geometric coordinates (critical for 2D‑3D and 3D‑3D).
  Each stream applies cross‑attention with a set of learnable query tokens that represent the points to be matched.
- Query‑based correspondence – For any source‑target pair, the model receives a small set of query tokens (e.g., keypoints in the source). The decoder returns the most similar tokens in the target modality, yielding the correspondence.
- Training strategy – The authors combine:
  - Synthetic pseudo‑point clouds generated from depth maps to boost 2D‑3D coverage.
  - Real 3D‑3D correspondence annotations from datasets like 3DLoMatch.
  A multi‑task loss (contrastive + geometric consistency) encourages the shared weights to perform well on all three matching problems simultaneously.
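The pseudo-point clouds mentioned in the training strategy can be pictured with a standard pinhole back-projection. The sketch below assumes known camera intrinsics (`fx`, `fy`, `cx`, `cy`) and a metric depth map. The function name and the toy values are made up for this example, and the paper's actual data pipeline may differ.

```python
import numpy as np


def backproject_depth(depth, fx, fy, cx, cy):
    """Lift a depth map of shape (H, W) to a pseudo-point cloud (pinhole model).

    Returns (points, pixels): points is (N, 3) in the camera frame and
    pixels is (N, 2) as (u, v). Row i of each array forms one 2D-3D
    correspondence. Pixels with non-positive depth are dropped.
    """
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]  # v = row index, u = column index
    valid = depth > 0
    z = depth[valid]
    u = u[valid]
    v = v[valid]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=1)
    pixels = np.stack([u, v], axis=1)
    return points, pixels


# Toy example: a flat wall 2 m away with one missing depth reading.
depth = np.full((4, 6), 2.0)
depth[0, 0] = 0.0
fx = fy = 100.0  # assumed focal lengths in pixels
cx, cy = 3.0, 2.0  # assumed principal point
points, pixels = backproject_depth(depth, fx, fy, cx, cy)
print(points.shape, pixels.shape)
```

Because every 3D point comes from a known pixel, the 2D-3D correspondence labels fall out of the construction without any manual annotation.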
Results & Findings
| Task | Benchmark | Recall @ 5° | Change vs. prior SOTA |
|---|---|---|---|
| 2D‑2D | HPatches | ≈ 0.85 | On par with dedicated models |
| 2D‑3D | 7Scenes | 0.78 | +8 % |
| 3D‑3D | 3DLoMatch | 0.71 | +10 % |
- The unified model does not sacrifice accuracy on any single task despite sharing parameters.
- Ablation studies show the dual‑stream decoder contributes most of the gain for 2D‑3D and 3D‑3D, confirming the importance of separating appearance and geometry.
- Training with mixed synthetic/real data yields a more robust feature space that generalizes to unseen scenes and sensor modalities.
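For readers unfamiliar with a "Recall @ 5°" style metric, the sketch below computes the fraction of estimated rotations that fall within an angular threshold of the ground truth. It assumes the threshold applies to rotation error only. The benchmarks' actual protocols may add translation or RMSE criteria, and the helper names here are hypothetical.

```python
import numpy as np


def rotation_error_deg(r_est, r_gt):
    """Geodesic angle in degrees between two 3x3 rotation matrices."""
    cos = (np.trace(r_est.T @ r_gt) - 1.0) / 2.0
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))


def recall_at(errors_deg, threshold_deg=5.0):
    """Fraction of samples whose error is at or below the threshold."""
    errors = np.asarray(errors_deg, dtype=float)
    return float(np.mean(errors <= threshold_deg))


def rot_z(deg):
    """Rotation about the z-axis, used only to build toy data."""
    a = np.radians(deg)
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


# Toy data: ground truth is 30 degrees; estimates are off by 1, 3, 10 and 1 degrees.
gt = rot_z(30.0)
estimates = [rot_z(31.0), rot_z(33.0), rot_z(40.0), rot_z(29.0)]
errors = [rotation_error_deg(e, gt) for e in estimates]
recall = recall_at(errors)
print(recall)
```

In this toy case three of the four estimates land within 5°, so the recall is 0.75.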
Practical Implications
- Simplified pipelines – Developers may no longer need to maintain three separate models for SLAM, AR, or robotics; a single UniCorrn instance can handle visual odometry (2D‑2D), pose estimation from RGB‑D (2D‑3D), and point‑cloud registration (3D‑3D).
- Reduced memory and deployment cost – Shared weights mean a smaller overall footprint, which is valuable for edge devices (e.g., drones, AR glasses).
- Easier data collection – Because the model can be trained on mixed synthetic and real data, teams can bootstrap 2D‑3D capabilities without gathering large amounts of annotated 3D point‑cloud correspondences.
- Cross‑modal research – The architecture opens doors for future work that mixes modalities beyond images and point clouds, such as LiDAR‑camera fusion or multi‑spectral matching.
Limitations & Future Work
- Dependence on quality of synthetic depth – The pseudo‑point clouds are only as good as the depth estimation; noisy depth can hurt 2D‑3D performance.
- Scalability to very large point clouds – While the Transformer encoder handles moderate sizes, extremely dense 3D scans may require hierarchical or sparse attention mechanisms.
- Limited exploration of dynamic scenes – The current experiments focus on static geometry; extending UniCorrn to handle moving objects or temporal consistency is an open direction.
- Future work suggested by the authors includes: integrating sparse‑attention Transformers for scalability, adding temporal query streams for video‑based correspondence, and expanding the training set with more diverse sensor modalities (e.g., thermal, radar).
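The scalability concern above comes from dense self-attention storing one weight per token pair, so memory grows quadratically with the number of tokens. The back-of-the-envelope sketch below assumes 8 attention heads and 32-bit floats. Both values are illustrative and not taken from the paper.

```python
def attention_matrix_bytes(num_tokens, num_heads=8, bytes_per_value=4):
    """Memory for one layer's dense attention weights: heads * N * N * bytes."""
    return num_heads * num_tokens * num_tokens * bytes_per_value


for n in (1_000, 10_000, 100_000):
    gib = attention_matrix_bytes(n) / 2**30
    print(f"{n:>7} tokens -> {gib:,.2f} GiB per layer")
```

Under these assumptions, 10,000 tokens need roughly 3 GiB per layer for the attention weights alone and 100,000 tokens need roughly 300 GiB, which is why sparse or hierarchical attention becomes attractive for dense scans.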
Authors
- Prajnan Goswami
- Tianye Ding
- Feng Liu
- Huaizu Jiang
Paper Information
- arXiv ID: 2605.04044v1
- Categories: cs.CV
- Published: May 5, 2026