[Paper] Selfi: Self Improving Reconstruction Engine via 3D Geometric Feature Alignment

Published: December 9, 2025 at 01:59 PM EST
4 min read
Source: arXiv - 2512.08930v1

Overview

The paper Selfi: Self Improving Reconstruction Engine via 3D Geometric Feature Alignment shows how a vision foundation model (VGGT) that normally operates on uncalibrated image collections can be turned into a high‑fidelity 3D reconstruction system. By feeding the model’s own predictions back into itself as “pseudo‑ground‑truth,” the authors train a small adapter that forces the learned features to respect true 3‑D geometry. The result is a single pipeline that simultaneously produces accurate novel‑view synthesis (NVS) and reliable camera pose estimates, a combination that previously required separate, heavily engineered structure‑from‑motion (SfM) pipelines.

Key Contributions

  • Self‑Improving Loop: Introduces a self‑training regime where VGGT outputs are re‑projected and used as supervision for a lightweight feature‑alignment adapter (a loop sketch follows this list).
  • Geometric Feature Adapter: Designs a reprojection‑based consistency loss that aligns feature vectors with their true 3‑D spatial relationships, turning implicit 3‑D knowledge into explicit geometry‑aware representations.
  • Unified NVS & Pose Estimation: Demonstrates that the aligned features improve both novel‑view synthesis quality and camera pose recovery, achieving state‑of‑the‑art results on standard benchmarks.
  • Minimal Overhead: The adapter adds only a few percent of extra parameters and can be trained on the fly without requiring external ground‑truth depth or pose data.
  • Empirical Validation: Provides extensive ablations showing that feature alignment is the primary driver of performance gains, outperforming prior “feed‑forward” methods and even classic SfM‑based pipelines in many cases.
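
As a rough picture of how such a loop fits together, here is a short PyTorch‑style sketch. The `backbone`, `adapter`, and `loss_fn` objects and their call signatures are placeholders assumed for illustration; this is not the authors’ interface or implementation.

```python
import torch


def self_improving_round(backbone, adapter, loss_fn, optimizer, images, num_steps=1000):
    """One round of the self-improving loop (illustrative sketch, not the authors' code).

    The frozen backbone's predictions serve as pseudo-ground-truth; only the
    lightweight adapter receives gradient updates.
    """
    backbone.eval()
    with torch.no_grad():                          # backbone stays frozen
        poses, depths, feats = backbone(images)    # provisional poses / depths / features

    for _ in range(num_steps):
        adapted = adapter(feats)                   # geometry-aware features
        loss = loss_fn(adapted, poses, depths)     # reprojection consistency across views
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # The aligned features are then fed back to refine the pose and geometry
    # predictions, and the round is repeated until the outputs stop improving.
    return adapter
```

Only the adapter is optimized here, which is consistent with the paper’s claims that the extra parameters stay at a few percent of the backbone and that no external ground‑truth depth or poses are needed.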

Methodology

  1. Backbone (VGGT): Starts with the pre‑trained Visual Geometry Grounded Transformer, which ingests a set of unordered images and predicts coarse camera poses and a volumetric 3‑D representation.
  2. Pseudo‑Ground‑Truth Generation: The VGGT outputs (estimated poses, depth maps, and feature volumes) are treated as provisional ground truth.
  3. Feature Adapter: A shallow MLP (or 1×1 convolution block) is attached to the backbone’s intermediate feature maps.
  4. Reprojection Consistency Loss (a minimal code sketch follows this list):
    • For each source image, its adapted features are projected into the coordinate frame of a target view using the provisional pose.
    • The loss penalizes discrepancies between the projected features and the target view’s original features, encouraging the adapter to encode true 3‑D proximity.
  5. Self‑Training Loop: The adapter is trained while the backbone remains frozen (or optionally fine‑tuned). After a few epochs, the improved features are fed back into the backbone to refine its pose and geometry predictions, iterating until convergence.
  6. Downstream Tasks: The final aligned features are used by the same rendering module that powers NVS and by a pose‑estimation head that extracts refined camera parameters.
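
The list above describes the loss conceptually; below is a minimal sketch of one way such a reprojection consistency term can be written, using a pinhole camera model and bilinear feature sampling. The tensor layout, the normalized‑MSE comparison, and the variable names are assumptions for illustration; the paper’s exact formulation may differ.

```python
import torch
import torch.nn.functional as F


def reprojection_consistency_loss(feat_src, feat_tgt, depth_src, K, T_src_to_tgt):
    """Illustrative reprojection consistency term (not the paper's exact loss).

    feat_src, feat_tgt : (B, C, H, W) adapted feature maps of a source/target view pair
    depth_src          : (B, 1, H, W) provisional depth predicted for the source view
    K                  : (B, 3, 3)   pinhole intrinsics
    T_src_to_tgt       : (B, 4, 4)   provisional relative pose (source -> target)
    """
    B, _, H, W = feat_src.shape
    device, dtype = feat_src.device, feat_src.dtype

    # Back-project every source pixel to 3-D using the provisional depth.
    ys, xs = torch.meshgrid(
        torch.arange(H, device=device, dtype=dtype),
        torch.arange(W, device=device, dtype=dtype),
        indexing="ij",
    )
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).reshape(1, 3, -1)  # (1, 3, H*W)
    rays = torch.inverse(K) @ pix                                              # (B, 3, H*W)
    pts_src = rays * depth_src.reshape(B, 1, -1)                               # (B, 3, H*W)

    # Move the points into the target camera frame and project them.
    ones = torch.ones(B, 1, H * W, device=device, dtype=dtype)
    pts_tgt = (T_src_to_tgt @ torch.cat([pts_src, ones], dim=1))[:, :3]        # (B, 3, H*W)
    proj = K @ pts_tgt
    uv = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)                            # target pixel coords

    # Sample the target features at the reprojected locations (grid in [-1, 1]).
    grid = torch.stack(
        [uv[:, 0] / (W - 1) * 2 - 1, uv[:, 1] / (H - 1) * 2 - 1], dim=-1
    ).reshape(B, H, W, 2)
    feat_tgt_warped = F.grid_sample(feat_tgt, grid, align_corners=True)

    # Penalize feature discrepancy at geometrically corresponding locations.
    return F.mse_loss(F.normalize(feat_src, dim=1), F.normalize(feat_tgt_warped, dim=1))
```

The term is small only when pixels that the provisional depth and pose map to the same 3‑D point carry similar feature vectors, which is precisely the geometry awareness the adapter is trained to provide.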

Results & Findings

| Dataset | NVS PSNR ↑ | Pose Error ↓ |
| --- | --- | --- |
| LLFF (real‑world scenes) | 31.8 dB (vs. 29.4 dB VGGT) | 0.42° (vs. 0.71°) |
| Tanks & Temples | 28.5 dB (vs. 26.1 dB) | 0.58° (vs. 0.93°) |
| Synthetic NeRF‑style | 33.2 dB (vs. 31.0 dB) | 0.31° (vs. 0.55°) |
  • The adapter consistently narrows the gap between feed‑forward models and classic SfM‑based pipelines.
  • Ablation studies reveal that removing the reprojection loss drops PSNR by ~1.5 dB and doubles pose error, confirming the central role of geometric alignment.
  • Training time overhead is modest: the adapter converges in ~2 hours on a single RTX 4090 for a 10‑image scene.

Practical Implications

  • Rapid Prototyping: Developers can now obtain high‑quality NVS and pose estimates from raw photo collections without running a separate SfM pipeline, saving engineering effort and compute.
  • AR/VR Content Creation: Real‑time capture rigs (e.g., smartphone arrays) can feed images directly into Selfi, producing instantly view‑consistent assets for immersive experiences.
  • Robotics & Drones: On‑board perception systems can self‑calibrate using only visual input, improving SLAM robustness in GPS‑denied environments.
  • Asset Digitization: Studios looking to digitize props or environments can streamline the workflow—upload a few unordered shots, run Selfi, and receive both textured meshes and camera rigs ready for downstream pipelines.
  • Foundation Model Extension: The self‑improving loop demonstrates a general recipe for turning any vision foundation model into a geometry‑aware system, opening doors for similar adapters in depth estimation, scene flow, or even multimodal tasks; a minimal adapter sketch follows below.
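
To make the “general recipe” concrete, the sketch below shows the kind of lightweight 1×1‑convolution adapter described in the Methodology section attached on top of frozen backbone features. The residual connection, hidden width, and feature dimension are illustrative assumptions, not the paper’s exact architecture.

```python
import torch
import torch.nn as nn


class GeometricFeatureAdapter(nn.Module):
    """Lightweight 1x1-convolution adapter (illustrative, not the paper's exact design)."""

    def __init__(self, feat_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(feat_dim, hidden_dim, kernel_size=1),
            nn.GELU(),
            nn.Conv2d(hidden_dim, feat_dim, kernel_size=1),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the backbone's original features intact
        # while letting the adapter inject geometry-aware corrections.
        return feats + self.net(feats)


# Example: attach the adapter to frozen backbone feature maps (shapes are placeholders).
backbone_feats = torch.randn(2, 384, 37, 37)
adapter = GeometricFeatureAdapter(feat_dim=384)
aligned = adapter(backbone_feats)                       # same shape as the input features
print(sum(p.numel() for p in adapter.parameters()))     # small parameter overhead
```

Training only this module while the backbone stays frozen keeps the parameter overhead to a few percent, consistent with the “Minimal Overhead” point above.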

Limitations & Future Work

  • Dependence on Initial Backbone Quality: If the VGGT predictions are severely off (e.g., extreme motion blur or very sparse views), the pseudo‑ground‑truth can mislead the adapter.
  • Scale to Large Scenes: The current implementation assumes a relatively compact scene that fits into a single volumetric grid; scaling to city‑scale reconstructions will require hierarchical or sparse representations.
  • Dynamic Objects: The method assumes static geometry; moving objects break reprojection consistency and can corrupt the learned features.
  • Future Directions: The authors suggest integrating explicit depth supervision when available, exploring multi‑scale adapters for large‑scale environments, and extending the self‑training loop to handle temporal dynamics (e.g., video streams).

Authors

  • Youming Deng
  • Songyou Peng
  • Junyi Zhang
  • Kathryn Heal
  • Tiancheng Sun
  • John Flynn
  • Steve Marschner
  • Lucy Chai

Paper Information

  • arXiv ID: 2512.08930v1
  • Categories: cs.CV, cs.GR
  • Published: December 9, 2025
