[Paper] Selfi: Self Improving Reconstruction Engine via 3D Geometric Feature Alignment
Source: arXiv - 2512.08930v1
Overview
The paper Selfi: Self Improving Reconstruction Engine via 3D Geometric Feature Alignment shows how a vision‑foundation model (VGGT) that normally operates on uncalibrated image collections can be turned into a high‑fidelity 3‑D reconstruction system. By feeding the model’s own predictions back into itself as “pseudo‑ground‑truth,” the authors train a tiny adapter that forces the learned features to respect true 3‑D geometry. The result is a single pipeline that simultaneously produces accurate novel‑view synthesis (NVS) and reliable camera pose estimates, something that previously required separate, heavily engineered SfM pipelines.
Key Contributions
- Self‑Improving Loop: Introduces a self‑training regime where VGGT outputs are re‑projected and used as supervision for a lightweight feature‑alignment adapter (a minimal sketch of this loop follows the list below).
- Geometric Feature Adapter: Designs a reprojection‑based consistency loss that aligns feature vectors with their true 3‑D spatial relationships, turning implicit 3‑D knowledge into explicit geometry‑aware representations.
- Unified NVS & Pose Estimation: Demonstrates that the aligned features improve both novel‑view synthesis quality and camera pose recovery, achieving state‑of‑the‑art results on standard benchmarks.
- Minimal Overhead: The adapter adds only a few percent of extra parameters and can be trained on the fly without requiring external ground‑truth depth or pose data.
- Empirical Validation: Provides extensive ablations showing that feature alignment is the primary driver of performance gains, outperforming prior “feed‑forward” methods and even classic SfM‑based pipelines in many cases.
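To make the recipe concrete, here is a minimal PyTorch sketch of what such a self‑improving loop could look like. It is an illustration, not the authors’ implementation: the adapter width, the `backbone.predict` and `backbone.refine_with_features` interfaces, and the `reproj_loss_fn` placeholder are all assumptions introduced for the example (a concrete reprojection loss is sketched under Methodology below).

```python
import torch
import torch.nn as nn


class FeatureAdapter(nn.Module):
    """Lightweight adapter attached to the backbone's intermediate feature maps.
    The paper describes a shallow MLP / 1x1-conv block; the hidden width here is a guess."""

    def __init__(self, channels: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1),
            nn.GELU(),
            nn.Conv2d(hidden, channels, kernel_size=1),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the adapted features close to the originals.
        return feats + self.net(feats)


def self_improving_loop(backbone, images, adapter, reproj_loss_fn,
                        rounds: int = 3, steps_per_round: int = 500, lr: float = 1e-4):
    """Self-training sketch: the frozen backbone's own poses, depths, and features
    act as pseudo-ground-truth for the adapter, and the adapted features are then
    fed back to refine the backbone's predictions. `backbone.predict` and
    `backbone.refine_with_features` are hypothetical interfaces, not VGGT's API."""
    opt = torch.optim.AdamW(adapter.parameters(), lr=lr)
    for _ in range(rounds):
        with torch.no_grad():  # backbone stays frozen; its outputs become pseudo-GT
            poses, depths, feats = backbone.predict(images)
        for _ in range(steps_per_round):
            adapted = adapter(feats)
            loss = reproj_loss_fn(adapted, poses, depths)  # placeholder loss
            opt.zero_grad()
            loss.backward()
            opt.step()
        with torch.no_grad():  # feed the improved features back into the backbone
            poses, depths, feats = backbone.refine_with_features(images, adapter(feats))
    return poses, depths, adapter
```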
Methodology
- Backbone (VGGT): Starts with the pre‑trained Visual Geometry Grounded Transformer, which ingests an unordered set of images and predicts coarse camera poses and a volumetric 3‑D representation.
- Pseudo‑Ground‑Truth Generation: The VGGT outputs (estimated poses, depth maps, and feature volumes) are treated as provisional ground truth.
- Feature Adapter: A shallow MLP (or 1×1 convolution block) is attached to the backbone’s intermediate feature maps.
- Reprojection Consistency Loss:
- For each source image, its adapted features are projected into the coordinate frame of a target view using the provisional depth and pose (a sketch of one possible implementation follows this list).
- The loss penalizes discrepancies between the projected features and the target view’s original features, encouraging the adapter to encode true 3‑D proximity.
- Self‑Training Loop: The adapter is trained while the backbone remains frozen (or optionally fine‑tuned). After a few epochs, the improved features are fed back into the backbone to refine its pose and geometry predictions, iterating until convergence.
- Downstream Tasks: The final aligned features are used by the same rendering module that powers NVS and by a pose‑estimation head that extracts refined camera parameters.
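A common way to implement such a loss is backward warping: back‑project each target pixel with the provisional depth, transform it into the source camera with the provisional pose, sample the adapted source features at the resulting location, and penalise the difference to the target view’s features. The PyTorch sketch below follows that formulation under assumptions not spelled out in the summary (pinhole intrinsics `K`, a per‑view depth map, bilinear sampling via `grid_sample`, and an L1 penalty); the function name and signature are hypothetical.

```python
import torch
import torch.nn.functional as F


def reprojection_consistency_loss(feat_src, feat_tgt, depth_tgt, K, T_tgt_to_src):
    """Warp adapted source-view features into the target view using the backbone's
    provisional depth and relative pose (pseudo-ground-truth), then penalise the
    difference to the target view's own features.

    feat_src, feat_tgt: (B, C, H, W) adapted feature maps of the two views
    depth_tgt:          (B, 1, H, W) provisional depth of the target view
    K:                  (B, 3, 3)    pinhole intrinsics (assumed)
    T_tgt_to_src:       (B, 4, 4)    provisional pose mapping target-camera
                                     coordinates into the source-camera frame
    """
    B, C, H, W = feat_tgt.shape
    device = feat_tgt.device

    # Pixel grid of the target view in homogeneous coordinates.
    ys, xs = torch.meshgrid(
        torch.arange(H, device=device, dtype=torch.float32),
        torch.arange(W, device=device, dtype=torch.float32),
        indexing="ij",
    )
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).reshape(3, -1)  # (3, H*W)

    # Back-project target pixels to 3-D with the provisional depth.
    cam_tgt = torch.linalg.inv(K) @ pix.unsqueeze(0)                 # (B, 3, H*W)
    cam_tgt = cam_tgt * depth_tgt.reshape(B, 1, -1)

    # Move the points into the source camera frame and project them.
    ones = torch.ones(B, 1, H * W, device=device)
    cam_src = (T_tgt_to_src @ torch.cat([cam_tgt, ones], dim=1))[:, :3]
    pix_src = K @ cam_src
    pix_src = pix_src[:, :2] / pix_src[:, 2:3].clamp(min=1e-6)       # (B, 2, H*W)

    # Normalise to [-1, 1] and bilinearly sample the source features there.
    u = 2.0 * pix_src[:, 0] / (W - 1) - 1.0
    v = 2.0 * pix_src[:, 1] / (H - 1) - 1.0
    grid = torch.stack([u, v], dim=-1).reshape(B, H, W, 2)
    warped = F.grid_sample(feat_src, grid, align_corners=True)       # (B, C, H, W)

    # Ignore pixels that reproject outside the source image.
    valid = (grid.abs() <= 1.0).all(dim=-1, keepdim=True).permute(0, 3, 1, 2).float()
    diff = (warped - feat_tgt).abs().mean(dim=1, keepdim=True)       # L1 over channels
    return (diff * valid).sum() / valid.sum().clamp(min=1.0)
```

Masking out pixels that reproject outside the source frame keeps invalid correspondences from dominating the objective; occlusion handling and aggregation over more than two views would also be needed in practice, which this sketch omits.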
Results & Findings
| Dataset | NVS PSNR ↑ | Pose Error ↓ |
|---|---|---|
| LLFF (real‑world scenes) | 31.8 dB (vs. 29.4 dB VGGT) | 0.42° (vs. 0.71°) |
| Tanks & Temples | 28.5 dB (vs. 26.1 dB) | 0.58° (vs. 0.93°) |
| Synthetic NeRF‑style | 33.2 dB (vs. 31.0 dB) | 0.31° (vs. 0.55°) |
- The adapter consistently narrows the gap between feed‑forward models and classic SfM‑based pipelines.
- Ablation studies reveal that removing the reprojection loss drops PSNR by ~1.5 dB and doubles pose error, confirming the central role of geometric alignment.
- Training time overhead is modest: the adapter converges in ~2 hours on a single RTX 4090 for a 10‑image scene.
Practical Implications
- Rapid Prototyping: Developers can now obtain high‑quality NVS and pose estimates from raw photo collections without running a separate SfM pipeline, saving engineering effort and compute.
- AR/VR Content Creation: Real‑time capture rigs (e.g., smartphone arrays) can feed images directly into Selfi, producing view‑consistent assets for immersive experiences.
- Robotics & Drones: On‑board perception systems can self‑calibrate using only visual input, improving SLAM robustness in GPS‑denied environments.
- Asset Digitization: Studios looking to digitize props or environments can streamline the workflow—upload a few unordered shots, run Selfi, and receive both textured meshes and camera rigs ready for downstream pipelines.
- Foundation Model Extension: The self‑improving loop demonstrates a general recipe for turning any vision foundation model into a geometry‑aware system, opening doors for similar adapters in depth estimation, scene flow, or even multimodal tasks.
Limitations & Future Work
- Dependence on Initial Backbone Quality: If the VGGT predictions are severely off (e.g., extreme motion blur or very sparse views), the pseudo‑ground‑truth can mislead the adapter.
- Scale to Large Scenes: The current implementation assumes a relatively compact scene that fits into a single volumetric grid; scaling to city‑scale reconstructions will require hierarchical or sparse representations.
- Dynamic Objects: The method assumes static geometry; moving objects break reprojection consistency and can corrupt the learned features.
- Future Directions: The authors suggest integrating explicit depth supervision when available, exploring multi‑scale adapters for large‑scale environments, and extending the self‑training loop to handle temporal dynamics (e.g., video streams).
Authors
- Youming Deng
- Songyou Peng
- Junyi Zhang
- Kathryn Heal
- Tiancheng Sun
- John Flynn
- Steve Marschner
- Lucy Chai
Paper Information
- arXiv ID: 2512.08930v1
- Categories: cs.CV, cs.GR
- Published: December 9, 2025
- PDF: https://arxiv.org/pdf/2512.08930v1