[Paper] Multi-View Foundation Models
Source: arXiv - 2512.15708v1
Overview
The paper introduces a simple yet powerful recipe for turning any single‑image vision foundation model (e.g., DINO, SAM, CLIP) into a Multi‑View Foundation Model that reasons over a set of images captured from different viewpoints of the same 3D scene. By adding a lightweight 3D‑aware attention module, the authors enforce feature consistency across views without having to reconstruct an explicit 3‑D model first. This makes it far easier to reuse existing pretrained models for tasks such as multi‑view segmentation or surface‑normal estimation.
Key Contributions
- General conversion pipeline: A plug‑and‑play method that upgrades any transformer‑based vision foundation model to handle multiple views jointly.
- 3D‑aware attention layers: Introduces intermediate attention blocks that explicitly align features of corresponding 3‑D points across images.
- No explicit 3‑D reconstruction needed: Consistency is achieved directly in image space, sidestepping costly voxel/mesh building.
- Demonstrated on two downstream tasks:
  - Multi‑view surface‑normal estimation.
  - Multi‑view semantic segmentation.
- Empirical gains: Shows substantial improvements in feature matching accuracy and downstream task performance compared to vanilla foundation models.
Methodology
- Start with a pretrained transformer foundation model (e.g., DINO’s ViT encoder).
- Insert a “3D‑aware attention” module after a chosen transformer block (a minimal sketch of such a block follows this list).
  - The module receives the per‑patch token embeddings from each view.
  - It computes cross‑view attention using estimated camera poses (or learned pose embeddings) so that tokens representing the same 3‑D point attend to each other.
  - The attention output is added back to the original tokens, encouraging them to become view‑consistent.
- Training objective:
  - A contrastive loss that pulls together features of corresponding 3‑D points across views while pushing apart unrelated points (a loss sketch appears at the end of this section).
  - Optional auxiliary losses (e.g., surface‑normal regression, segmentation masks) for the downstream tasks.
- Inference: Feed a batch of images from the same scene; the model returns a feature map per image that is already aligned across views, ready for any downstream head (e.g., a normal estimator or a segmentation decoder).
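To make the cross‑view step concrete, here is a minimal PyTorch‑style sketch of a 3D‑aware cross‑view attention block in the spirit of the description above. The class name, the use of a plain multi‑head attention layer over concatenated view tokens, and the learned pose embedding are illustrative assumptions, not the paper’s exact implementation.

```python
import torch
import torch.nn as nn


class CrossViewAttention(nn.Module):
    """Illustrative cross-view attention block (not the authors' exact module)."""

    def __init__(self, dim: int = 768, num_heads: int = 8, pose_dim: int = 12):
        super().__init__()
        # Map a flattened camera pose (e.g., a 3x4 extrinsic matrix) to the token dimension.
        self.pose_embed = nn.Linear(pose_dim, dim)
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens: torch.Tensor, poses: torch.Tensor) -> torch.Tensor:
        # tokens: (V, N, D) patch tokens for V views of the same scene
        # poses:  (V, pose_dim) flattened camera extrinsics, one per view
        V, N, D = tokens.shape
        # Condition each view's tokens on its camera pose.
        x = tokens + self.pose_embed(poses)[:, None, :]
        # Flatten all views into a single sequence so every token can attend to
        # tokens in the other views (i.e., other observations of the same 3-D point).
        seq = self.norm(x).reshape(1, V * N, D)
        out, _ = self.attn(seq, seq, seq, need_weights=False)
        # Residual connection keeps the pretrained single-view features intact.
        return tokens + out.reshape(V, N, D)
```

Concatenating all views into one sequence keeps the block compatible with standard attention kernels, and the residual connection means the pretrained single‑view features are preserved when the cross‑view signal is weak.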
The whole pipeline is lightweight—only a few extra attention layers—so it can be dropped into existing pipelines with minimal engineering effort.
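The training signal can be as simple as an InfoNCE‑style contrastive objective over features of matched 3‑D points. The sketch below assumes correspondences between two views are already given (e.g., from known depth and poses); the temperature value and the cosine‑similarity formulation are illustrative choices, not necessarily the paper’s.

```python
import torch
import torch.nn.functional as F


def cross_view_contrastive_loss(feats_a: torch.Tensor,
                                feats_b: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """feats_a, feats_b: (M, D) features of the same M 3-D points seen in two views.
    Row i of feats_a corresponds to row i of feats_b; all other rows act as negatives."""
    a = F.normalize(feats_a, dim=-1)
    b = F.normalize(feats_b, dim=-1)
    logits = a @ b.t() / temperature            # (M, M) cosine-similarity logits
    targets = torch.arange(a.size(0), device=a.device)
    # Symmetric InfoNCE: pull matched pairs together, push mismatched pairs apart.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```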
Results & Findings
| Task | Baseline (single‑view FM) | Multi‑View FM (proposed) | Gain |
|---|---|---|---|
| Surface‑normal estimation (RMSE, lower is better) | 28.4° | 22.1° | ~22% error reduction |
| Multi‑view segmentation (mIoU, higher is better) | 61.3% | 68.7% | ~12% relative improvement |
| Feature matching accuracy (AUC@10°, higher is better) | 0.71 | 0.84 | +0.13 absolute |
Key takeaways
- The added attention layers dramatically increase the geometric consistency of the learned embeddings.
- Downstream tasks that rely on cross‑view correspondence (normals, segmentation) benefit directly, often closing the gap to methods that explicitly build a 3‑D model.
- The approach works across several backbone models (DINO, SAM, CLIP), confirming its generality.
Practical Implications
- Rapid prototyping: Developers can reuse existing pretrained vision models for multi‑view problems without retraining from scratch or building a full 3‑D pipeline.
- Robotics & AR/VR: Consistent features across camera frames enable more reliable pose tracking, scene understanding, and object manipulation in real‑time systems.
- Large‑scale mapping: Drone or handheld capture workflows can generate dense, aligned feature maps on‑the‑fly, simplifying downstream photogrammetry or semantic mapping pipelines.
- Cost‑effective scaling: Since the extra layers are small, the memory and compute overhead is modest, making it feasible to run on edge GPUs or even mobile accelerators.
- Plug‑in for existing APIs: Companies that already expose DINO/CLIP embeddings (e.g., via cloud services) can extend them to multi‑view scenarios with a thin wrapper, opening new product features like multi‑camera segmentation or cross‑view search.
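As a concrete reading of the “thin wrapper” idea, the sketch below freezes a generic single‑view encoder and runs the cross‑view block from the Methodology section on top of its tokens. The `encoder` interface (images in, per‑view patch tokens out) is a hypothetical stand‑in for a DINO/SAM/CLIP backbone, not any library’s actual API.

```python
import torch
import torch.nn as nn


class MultiViewWrapper(nn.Module):
    """Hypothetical wrapper: frozen single-view encoder + cross-view attention block."""

    def __init__(self, encoder: nn.Module, cross_view: nn.Module):
        super().__init__()
        self.encoder = encoder        # any pretrained backbone returning (V, N, D) tokens
        self.cross_view = cross_view  # e.g., the CrossViewAttention sketch above
        for p in self.encoder.parameters():
            p.requires_grad_(False)   # keep the foundation model frozen

    def forward(self, images: torch.Tensor, poses: torch.Tensor) -> torch.Tensor:
        # images: (V, 3, H, W) views of one scene; poses: (V, pose_dim) camera extrinsics
        tokens = self.encoder(images)           # (V, N, D) per-view patch tokens
        return self.cross_view(tokens, poses)   # (V, N, D) view-consistent features
```

At inference, passing all views of a scene (plus their poses) through one forward call returns feature maps that are already aligned across views, matching the inference recipe described in the Methodology section.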
Limitations & Future Work
- Reliance on accurate camera poses: The current implementation assumes known extrinsics; noisy pose estimates can degrade alignment.
- Scalability to many views: Attention cost grows quadratically with the number of images, so very large view sets may need hierarchical or sparse attention tricks.
- Limited to transformer backbones: While the authors show results on DINO, SAM, and CLIP, extending the idea to CNN‑based foundation models remains unexplored.
- Future directions suggested by the authors include learning pose estimation jointly with the attention module, exploring sparse‑attention mechanisms for large view batches, and applying the framework to video‑level tasks such as 3‑D object detection or scene flow estimation.
Authors
- Leo Segre
- Or Hirschorn
- Shai Avidan
Paper Information
- arXiv ID: 2512.15708v1
- Categories: cs.CV
- Published: December 17, 2025