[Paper] Selfi: Self Improving Reconstruction Engine via 3D Geometric Feature Alignment
Source: arXiv - 2512.08930v1
Overview
The paper Selfi: Self Improving Reconstruction Engine via 3D Geometric Feature Alignment shows how a vision‑foundation model (VGGT) that normally operates on uncalibrated image collections can be turned into a high‑fidelity 3‑D reconstruction system. By feeding the model’s own predictions back into itself as “pseudo‑ground‑truth,” the authors train a tiny adapter that forces the learned features to respect true 3‑D geometry. The result is a single pipeline that simultaneously produces accurate novel‑view synthesis (NVS) and reliable camera pose estimates, something that previously required separate, heavily engineered SfM pipelines.
Key Contributions
- Self‑Improving Loop: Introduces a self‑training regime where VGGT outputs are re‑projected and used as supervision for a lightweight feature‑alignment adapter (a minimal sketch of this loop follows the list below).
- Geometric Feature Adapter: Designs a reprojection‑based consistency loss that aligns feature vectors with their true 3‑D spatial relationships, turning implicit 3‑D knowledge into explicit geometry‑aware representations.
- Unified NVS & Pose Estimation: Demonstrates that the aligned features improve both novel‑view synthesis quality and camera pose recovery, achieving state‑of‑the‑art results on standard benchmarks.
- Minimal Overhead: The adapter adds only a few percent of extra parameters and can be trained on the fly without requiring external ground‑truth depth or pose data.
- Empirical Validation: Provides extensive ablations showing that feature alignment is the primary driver of performance gains, outperforming prior “feed‑forward” methods and even classic SfM‑based pipelines in many cases.
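To make the recipe concrete, here is a minimal PyTorch sketch of what such a self‑improving loop could look like. It is an illustration, not the authors’ implementation: the adapter width, the `backbone.predict` and `backbone.refine_with_features` interfaces, and the `reproj_loss_fn` placeholder are all assumptions introduced for the example (a concrete reprojection loss is sketched under Methodology below).

```python
import torch
import torch.nn as nn


class FeatureAdapter(nn.Module):
    """Lightweight adapter attached to the backbone's intermediate feature maps.
    The paper describes a shallow MLP / 1x1-conv block; the hidden width here is a guess."""

    def __init__(self, channels: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1),
            nn.GELU(),
            nn.Conv2d(hidden, channels, kernel_size=1),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the adapted features close to the originals.
        return feats + self.net(feats)


def self_improving_loop(backbone, images, adapter, reproj_loss_fn,
                        rounds: int = 3, steps_per_round: int = 500, lr: float = 1e-4):
    """Self-training sketch: the frozen backbone's own poses, depths, and features
    act as pseudo-ground-truth for the adapter, and the adapted features are then
    fed back to refine the backbone's predictions. `backbone.predict` and
    `backbone.refine_with_features` are hypothetical interfaces, not VGGT's API."""
    opt = torch.optim.AdamW(adapter.parameters(), lr=lr)
    for _ in range(rounds):
        with torch.no_grad():  # backbone stays frozen; its outputs become pseudo-GT
            poses, depths, feats = backbone.predict(images)
        for _ in range(steps_per_round):
            adapted = adapter(feats)
            loss = reproj_loss_fn(adapted, poses, depths)  # placeholder loss
            opt.zero_grad()
            loss.backward()
            opt.step()
        with torch.no_grad():  # feed the improved features back into the backbone
            poses, depths, feats = backbone.refine_with_features(images, adapter(feats))
    return poses, depths, adapter
```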
Methodology
- Backbone (VGGT): Starts with the pre‑trained Visual Geometry Grounded Transformer, which ingests an unordered set of images and predicts coarse camera poses and a volumetric 3‑D representation.
- Pseudo‑Ground‑Truth Generation: The VGGT outputs (estimated poses, depth maps, and feature volumes) are treated as provisional ground truth.
- Feature Adapter: A shallow MLP (or 1×1 convolution block) is attached to the backbone’s intermediate feature maps.
- Reprojection Consistency Loss:
- For each source image, its adapted features are projected into the coordinate frame of a target view using the provisional depth and pose (a sketch of one possible implementation follows this list).
- The loss penalizes discrepancies between the projected features and the target view’s original features, encouraging the adapter to encode true 3‑D proximity.
- Self‑Training Loop: The adapter is trained while the backbone remains frozen (or optionally fine‑tuned). After a few epochs, the improved features are fed back into the backbone to refine its pose and geometry predictions, iterating until convergence.
- Downstream Tasks: The final aligned features are used by the same rendering module that powers NVS and by a pose‑estimation head that extracts refined camera parameters.
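A common way to implement such a loss is backward warping: back‑project each target pixel with the provisional depth, transform it into the source camera with the provisional pose, sample the adapted source features at the resulting location, and penalise the difference to the target view’s features. The PyTorch sketch below follows that formulation under assumptions not spelled out in the summary (pinhole intrinsics `K`, a per‑view depth map, bilinear sampling via `grid_sample`, and an L1 penalty); the function name and signature are hypothetical.

```python
import torch
import torch.nn.functional as F


def reprojection_consistency_loss(feat_src, feat_tgt, depth_tgt, K, T_tgt_to_src):
    """Warp adapted source-view features into the target view using the backbone's
    provisional depth and relative pose (pseudo-ground-truth), then penalise the
    difference to the target view's own features.

    feat_src, feat_tgt: (B, C, H, W) adapted feature maps of the two views
    depth_tgt:          (B, 1, H, W) provisional depth of the target view
    K:                  (B, 3, 3)    pinhole intrinsics (assumed)
    T_tgt_to_src:       (B, 4, 4)    provisional pose mapping target-camera
                                     coordinates into the source-camera frame
    """
    B, C, H, W = feat_tgt.shape
    device = feat_tgt.device

    # Pixel grid of the target view in homogeneous coordinates.
    ys, xs = torch.meshgrid(
        torch.arange(H, device=device, dtype=torch.float32),
        torch.arange(W, device=device, dtype=torch.float32),
        indexing="ij",
    )
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).reshape(3, -1)  # (3, H*W)

    # Back-project target pixels to 3-D with the provisional depth.
    cam_tgt = torch.linalg.inv(K) @ pix.unsqueeze(0)                 # (B, 3, H*W)
    cam_tgt = cam_tgt * depth_tgt.reshape(B, 1, -1)

    # Move the points into the source camera frame and project them.
    ones = torch.ones(B, 1, H * W, device=device)
    cam_src = (T_tgt_to_src @ torch.cat([cam_tgt, ones], dim=1))[:, :3]
    pix_src = K @ cam_src
    pix_src = pix_src[:, :2] / pix_src[:, 2:3].clamp(min=1e-6)       # (B, 2, H*W)

    # Normalise to [-1, 1] and bilinearly sample the source features there.
    u = 2.0 * pix_src[:, 0] / (W - 1) - 1.0
    v = 2.0 * pix_src[:, 1] / (H - 1) - 1.0
    grid = torch.stack([u, v], dim=-1).reshape(B, H, W, 2)
    warped = F.grid_sample(feat_src, grid, align_corners=True)       # (B, C, H, W)

    # Ignore pixels that reproject outside the source image.
    valid = (grid.abs() <= 1.0).all(dim=-1, keepdim=True).permute(0, 3, 1, 2).float()
    diff = (warped - feat_tgt).abs().mean(dim=1, keepdim=True)       # L1 over channels
    return (diff * valid).sum() / valid.sum().clamp(min=1.0)
```

Masking out pixels that reproject outside the source frame keeps invalid correspondences from dominating the objective; occlusion handling and aggregation over more than two views would also be needed in practice, which this sketch omits.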
Results & Findings
| Dataset | NVS PSNR ↑ | Pose Error ↓ |
|---|---|---|
| LLFF (real‑world scenes) | 31.8 dB (vs. 29.4 dB VGGT) | 0.42° (vs. 0.71°) |
| Tanks & Temples | 28.5 dB (vs. 26.1 dB) | 0.58° (vs. 0.93°) |
| Synthetic NeRF‑style | 33.2 dB (vs. 31.0 dB) | 0.31° (vs. 0.55°) |
- The adapter consistently narrows the gap between feed‑forward models and classic SfM‑based pipelines.
- Ablation studies reveal that removing the reprojection loss drops PSNR by ~1.5 dB and doubles pose error, confirming the central role of geometric alignment.
- Training time overhead is modest: the adapter converges in ~2 hours on a single RTX 4090 for a 10‑image scene.
Practical Implications
- Rapid Prototyping: Developers can now obtain high‑quality NVS and pose estimates from raw photo collections without running a separate SfM pipeline, saving engineering effort and compute.
- AR/VR Content Creation: Real‑time capture rigs (e.g., smartphone arrays) can feed images directly into Selfi, producing view‑consistent assets for immersive experiences.
- Robotics & Drones: On‑board perception systems can self‑calibrate using only visual input, improving SLAM robustness in GPS‑denied environments.
- Asset Digitization: Studios looking to digitize props or environments can streamline the workflow—upload a few unordered shots, run Selfi, and receive both textured meshes and camera rigs ready for downstream pipelines.
- Foundation Model Extension: The self‑improving loop demonstrates a general recipe for turning any vision foundation model into a geometry‑aware system, opening doors for similar adapters in depth estimation, scene flow, or even multimodal tasks.
Limitations & Future Work
- Dependence on Initial Backbone Quality: If the VGGT predictions are severely off (e.g., extreme motion blur or very sparse views), the pseudo‑ground‑truth can mislead the adapter.
- Scale to Large Scenes: The current implementation assumes a relatively compact scene that fits into a single volumetric grid; scaling to city‑scale reconstructions will require hierarchical or sparse representations.
- Dynamic Objects: The method assumes static geometry; moving objects break reprojection consistency and can corrupt the learned features.
- Future Directions: The authors suggest integrating explicit depth supervision when available, exploring multi‑scale adapters for large‑scale environments, and extending the self‑training loop to handle temporal dynamics (e.g., video streams).
Authors
- Youming Deng
- Songyou Peng
- Junyi Zhang
- Kathryn Heal
- Tiancheng Sun
- John Flynn
- Steve Marschner
- Lucy Chai
Paper Information
- arXiv ID: 2512.08930v1
- Categories: cs.CV, cs.GR
- Published: December 9, 2025
- PDF: https://arxiv.org/pdf/2512.08930v1