[Paper] Splat and Distill: Augmenting Teachers with Feed-Forward 3D Reconstruction For 3D-Aware Distillation

Published: February 5, 2026 at 01:59 PM EST
4 min read
Source: arXiv - 2602.06032v1

Overview

The paper “Splat and Distill: Augmenting Teachers with Feed‑Forward 3D Reconstruction For 3D‑Aware Distillation” tackles a glaring blind spot in today’s Vision Foundation Models (VFMs): they excel at 2‑D perception but struggle to understand the underlying 3‑D geometry of a scene. By coupling a fast, feed‑forward 3‑D reconstruction step with the teacher‑student distillation paradigm, the authors inject explicit depth and surface‑normal cues into the teacher’s feature maps, enabling a student model to inherit genuine 3‑D awareness without costly per‑scene optimization.

Key Contributions

  • Feed‑forward 3‑D lifting: Converts 2‑D teacher features into a compact Gaussian‑based 3‑D representation on the fly, eliminating slow, iterative optimization used in prior work.
  • Splat‑based novel‑view synthesis: Projects the lifted 3‑D features onto arbitrary viewpoints, generating multiple 2‑D feature maps that serve as geometry‑grounded supervision for the student.
  • Dynamic teacher‑student consistency: The teacher’s features improve as the student learns, creating a virtuous cycle that mitigates the “feature‑averaging” artifacts common in static distillation pipelines.
  • Broad downstream evaluation: Demonstrates sizable gains across monocular depth, surface‑normal estimation, multi‑view correspondence, and semantic segmentation, showing that 3‑D awareness also boosts semantic richness.
  • Open‑source implementation & project page: Provides code and pretrained models, facilitating immediate experimentation by the community.

Methodology

  1. Teacher Feature Extraction – A pre‑trained 2‑D VFM (e.g., CLIP, DINO) processes an input image and outputs dense feature maps.
  2. Feed‑Forward 3‑D Lifting – Each pixel’s feature vector is lifted into a 3‑D Gaussian blob positioned using a coarse depth estimate (derived from the teacher’s own features or a lightweight depth predictor). The collection of Gaussians forms an explicit, differentiable 3‑D point‑cloud‑like representation.
  3. Splatting to Novel Views – The 3‑D Gaussians are projected (“splatted”) onto a set of synthetic camera poses (e.g., slight rotations or translations). This yields several new 2‑D feature maps that encode how the scene would look from those viewpoints, preserving geometric consistency.
  4. Distillation Loss – The student model (often a smaller or task‑specific network) is trained to reproduce the splatted feature maps. The loss combines a standard feature‑matching term with a geometry‑aware regularizer that penalizes inconsistencies across views.
  5. Iterative Refinement – As the student improves, its predictions can be fed back to refine the depth estimates used for lifting, tightening the teacher‑student loop.
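
Steps 2 and 3 can be sketched in a few lines. The pinhole projection, nearest‑pixel accumulation, and Gaussian‑weight heuristic below are illustrative simplifications under assumed conventions (known intrinsics `K`, world‑to‑camera pose `R, t`), not the paper’s exact formulation:

```python
# Minimal sketch of feed-forward lifting (step 2) and splatting (step 3),
# assuming a pinhole camera model. Shapes and weighting are illustrative.
import numpy as np

def lift_features(feat, depth, K):
    """Lift per-pixel features into 3D points using a coarse depth map.

    feat:  (H, W, C) dense teacher feature map
    depth: (H, W) coarse depth estimate
    K:     (3, 3) camera intrinsics
    Returns (H*W, 3) 3D points and (H*W, C) matching features.
    """
    H, W, C = feat.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    rays = pix @ np.linalg.inv(K).T          # back-project pixel rays
    pts = rays * depth.reshape(-1, 1)        # scale rays by depth
    return pts, feat.reshape(-1, C)

def splat_to_view(pts, feats, K, R, t, H, W, sigma=1.0):
    """Project 3D feature points into a novel camera pose and accumulate
    each feature into its nearest pixel with a Gaussian distance weight."""
    cam = pts @ R.T + t                      # world -> novel camera frame
    valid = cam[:, 2] > 1e-6                 # keep points in front of camera
    cam, f = cam[valid], feats[valid]
    proj = cam @ K.T
    xy = proj[:, :2] / proj[:, 2:3]          # perspective divide
    out = np.zeros((H, W, f.shape[1]))
    wsum = np.zeros((H, W, 1))
    for (x, y), fi in zip(xy, f):
        xi, yi = int(round(x)), int(round(y))
        if 0 <= xi < W and 0 <= yi < H:
            w = np.exp(-((x - xi) ** 2 + (y - yi) ** 2) / (2 * sigma ** 2))
            out[yi, xi] += w * fi
            wsum[yi, xi] += w
    return out / np.maximum(wsum, 1e-8)      # normalize accumulated features
```

A quick sanity check: splatting back into the original camera (identity pose) should reproduce the input feature map almost exactly, since every lifted point re‑projects onto its own pixel.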

The entire pipeline is feed‑forward: no per‑scene gradient descent or expensive volumetric rendering is required, making it suitable for large‑scale training.
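The “slight rotations or translations” of step 3 might be sampled as below; the ±5° yaw range and ±0.05 translation bound are made‑up illustrative values, not the paper’s settings:

```python
# Sample small synthetic camera poses for novel-view supervision.
# Angle and offset ranges are illustrative choices.
import numpy as np

def sample_poses(n, max_deg=5.0, max_shift=0.05, seed=0):
    """Return n (R, t) pairs: small yaw rotations plus small translations."""
    rng = np.random.default_rng(seed)
    poses = []
    for _ in range(n):
        a = np.deg2rad(rng.uniform(-max_deg, max_deg))   # small yaw angle
        R = np.array([[np.cos(a), 0, np.sin(a)],
                      [0, 1, 0],
                      [-np.sin(a), 0, np.cos(a)]])
        t = rng.uniform(-max_shift, max_shift, size=3)
        poses.append((R, t))
    return poses
```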
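The two‑part objective of step 4 could look like the following sketch, where a simple cross‑view variance penalty stands in for the paper’s geometry‑aware regularizer (whose exact form is not given in this summary):

```python
# Illustrative distillation loss: feature matching plus a stand-in
# cross-view consistency term (variance across synthetic views).
import numpy as np

def distill_loss(student_views, teacher_views, lam=0.1):
    """student_views, teacher_views: lists of (H, W, C) feature maps,
    one per synthetic camera pose; lam weights the consistency term."""
    s = np.stack(student_views)               # (V, H, W, C)
    t = np.stack(teacher_views)
    match = np.mean((s - t) ** 2)             # feature-matching term
    consistency = np.mean(np.var(s, axis=0))  # penalize cross-view variance
    return match + lam * consistency
```

With perfectly matched, view‑consistent features the loss is zero; any per‑view deviation from the splatted teacher maps increases it.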

Results & Findings

| Downstream Task | Baseline (no 3‑D) | Prior 3‑D‑aware Distillation | Splat‑and‑Distill |
| --- | --- | --- | --- |
| Monocular Depth (RMSE ↓) | 0.68 | 0.61 | 0.53 |
| Surface Normal (Mean° ↓) | 23.1 | 19.4 | 16.2 |
| Multi‑view Correspondence (PCK ↑) | 71.3% | 78.5% | 84.9% |
| Semantic Segmentation (mIoU ↑) | 62.4% | 66.1% | 70.8% |
  • 3‑D awareness: Depth and normal errors drop dramatically, confirming that the student learns genuine geometry.
  • Semantic boost: Even a purely 2‑D task like segmentation sees a gain of roughly 8 mIoU points (62.4% → 70.8%), suggesting that richer geometry also clarifies object boundaries and context.
  • Speed: The feed‑forward lifting runs at ~30 fps on a single RTX 3090, a >10× speedup over optimization‑based methods that require minutes per scene.

Practical Implications

  • Enhanced AR/VR pipelines: Developers can now fine‑tune lightweight perception models that already understand depth and surface orientation, reducing reliance on separate depth sensors.
  • Robust robotics perception: A robot equipped with a distilled model can infer 3‑D structure from a single camera, improving navigation and manipulation without expensive LiDAR.
  • Improved content creation tools: Image‑to‑3‑D generators, background removal, and scene‑editing software can leverage the geometry‑aware features to produce more accurate masks and depth maps.
  • Efficient model compression: The framework enables distilling large, expensive VFMs into smaller, deployable models that retain both semantic and geometric competence—ideal for edge devices.
  • Plug‑and‑play integration: Since the method works with any off‑the‑shelf teacher (CLIP, DINO, MAE, etc.), teams can retrofit existing pipelines without retraining the massive teacher from scratch.

Limitations & Future Work

  • Coarse depth initialization: The lifting step relies on an approximate depth estimate; errors in this seed can propagate to the Gaussian representation.
  • View synthesis range: The method assumes modest viewpoint changes; extreme novel views may suffer from insufficient coverage in the Gaussian cloud.
  • Domain shift: While the authors test on several benchmarks, performance on highly out‑of‑distribution scenes (e.g., medical imaging, satellite data) remains unverified.
  • Future directions: The authors suggest exploring learned depth priors for more accurate lifting, integrating neural radiance fields for richer view synthesis, and extending the framework to video streams for temporal consistency.

If you’re curious to experiment, the authors have released code and pretrained checkpoints on their project page. Plug the “Splat and Distill” module into your existing VFM pipeline and watch your models gain a 3‑D perspective—without the usual computational overhead.

Authors

  • David Shavin
  • Sagie Benaim

Paper Information

  • arXiv ID: 2602.06032v1
  • Categories: cs.CV
  • Published: February 5, 2026