[Paper] Foundry: Distilling 3D Foundation Models for the Edge

Published: November 25, 2025 at 02:53 AM EST
4 min read
Source: arXiv


Overview

The paper introduces Foundry, the first system that compresses large self‑supervised 3‑D foundation models into tiny, edge‑ready networks without losing their “one‑model‑fits‑all” capability. By distilling a teacher model’s rich token representations into a compact set of SuperTokens, Foundry makes high‑quality 3‑D perception feasible on robots, AR/VR headsets, and other compute‑constrained devices.

Key Contributions

  • Foundation Model Distillation (FMD) – a new distillation paradigm that preserves the general‑purpose nature of SSL foundation models rather than tailoring a specialist for a single downstream task.
  • Foundry implementation for 3‑D point clouds – the first practical FMD system that works on volumetric data, a domain traditionally dominated by heavyweight models.
  • SuperToken representation – a learned, highly compressed token set that can reconstruct the teacher’s full token matrix, acting as a compact basis of the latent space.
  • Broad transferability – a single distilled model achieves near‑teacher performance on classification, part segmentation, and few‑shot learning without any task‑specific fine‑tuning.
  • Edge‑friendly efficiency – up to 70 % reduction in FLOPs and 80 % fewer tokens, enabling real‑time inference on devices with limited GPU/CPU budgets.
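The SuperToken idea above can be pictured as a compact linear basis for the teacher's token space. The following numpy sketch (hypothetical sizes and random data; the paper reports roughly 1024 teacher tokens compressed to 256 SuperTokens) shows how a small token set can approximately reconstruct a larger token matrix via least-squares coefficients:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical, scaled-down sizes for illustration only.
n_teacher, n_super, dim = 64, 16, 32

teacher_tokens = rng.normal(size=(n_teacher, dim))  # teacher's full token matrix
super_tokens = rng.normal(size=(n_super, dim))      # learned compact basis

# Approximate each teacher token as a linear combination of SuperTokens:
# solve min_C ||C @ S - T||^2 for the coefficient matrix C.
coeffs, *_ = np.linalg.lstsq(super_tokens.T, teacher_tokens.T, rcond=None)
reconstruction = coeffs.T @ super_tokens            # shape (n_teacher, dim)

err = np.linalg.norm(teacher_tokens - reconstruction) / np.linalg.norm(teacher_tokens)
print(f"relative reconstruction error: {err:.3f}")
```

With random data the 16-token basis cannot span the 32-dimensional token cloud perfectly; the point of training, per the paper, is to learn SuperTokens whose span captures the teacher's actual (much lower-dimensional) geometry.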

Methodology

  1. Teacher pre‑training – A large 3‑D SSL model (e.g., Point‑MAE or Masked Autoencoder for point clouds) is trained on massive unlabeled point‑cloud datasets to learn generic geometry embeddings.
  2. SuperToken generation – Instead of copying the teacher’s full token sequence, Foundry learns a small set of learnable SuperTokens. These act like a dictionary that can linearly combine to approximate any teacher token.
  3. Distillation objective – The student network is trained to (a) predict the SuperTokens from raw point clouds and (b) reconstruct the teacher’s token‑level features using a simple linear decoder. The loss combines a reconstruction term (L2 on token embeddings) and a contrastive term to preserve relational geometry.
  4. Task‑agnostic fine‑tuning – After distillation, the student model is frozen and directly plugged into downstream pipelines (e.g., a linear classifier or a segmentation head). No extra task‑specific training is required, demonstrating that the distilled representation remains broadly useful.
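Step 3's objective can be sketched as a weighted sum of the two terms it names. The function below is a minimal numpy stand-in, not the paper's exact formulation: the contrastive term is an InfoNCE-style loss that pulls each reconstructed token toward its own teacher token, and `temperature` and `alpha` are assumed hyperparameters:

```python
import numpy as np

def distillation_loss(student_recon, teacher_tokens, temperature=0.1, alpha=0.5):
    """Sketch of the combined objective: L2 reconstruction plus an
    InfoNCE-style contrastive term (the paper's exact form may differ)."""
    # (a) L2 reconstruction of the teacher's token embeddings.
    recon = np.mean((student_recon - teacher_tokens) ** 2)

    # (b) Contrastive term preserving relational geometry: each
    # reconstructed token should match its own teacher token rather
    # than any other token in the set.
    s = student_recon / np.linalg.norm(student_recon, axis=1, keepdims=True)
    t = teacher_tokens / np.linalg.norm(teacher_tokens, axis=1, keepdims=True)
    logits = s @ t.T / temperature                  # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    contrastive = -np.mean(np.diag(log_probs))      # match token i to teacher i

    return alpha * recon + (1 - alpha) * contrastive
```

A perfectly reconstructed token set drives the L2 term to zero and the contrastive term close to it, so the loss rewards both pointwise accuracy and preserved pairwise structure.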

The whole pipeline runs on a single GPU and, for a typical 3‑D foundation model, finishes within a few days, making it practical for research labs and industry teams.

Results & Findings

Metric                                    Teacher (full)   Foundry (distilled)   Δ
Classification accuracy (ModelNet40)      93.2 %           91.8 %                –1.4 %
Part segmentation mIoU (ShapeNetPart)     85.6 %           84.1 %                –1.5 %
Few‑shot (5‑shot) classification          88.0 %           86.5 %                –1.5 %
FLOPs (G)                                 12.4             3.8                   –69 %
Token count                               1024             256                   –75 %
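The efficiency rows can be double-checked with a line of arithmetic: the percentage reduction is one minus the distilled/teacher ratio.

```python
# Quick arithmetic check of the efficiency figures reported above.
flops_teacher, flops_student = 12.4, 3.8
tokens_teacher, tokens_student = 1024, 256

flops_drop = (1 - flops_student / flops_teacher) * 100
token_drop = (1 - tokens_student / tokens_teacher) * 100

print(f"FLOPs reduction: {flops_drop:.0f}%")   # ≈ 69%
print(f"token reduction: {token_drop:.0f}%")   # 75%
```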

Key takeaways

  • The distilled model stays within 1–2 % of the teacher on all evaluated tasks, confirming that the SuperToken basis captures the essential geometry information.
  • Computational savings are dramatic: inference speed improves by ~3× on a Jetson Nano‑class device, and memory usage drops enough to fit multiple point‑cloud streams in parallel.
  • The same distilled checkpoint works across tasks, validating the FMD claim of “downstream‑agnostic” compression.

Practical Implications

  • Robotics – Autonomous drones and warehouse robots can now run high‑fidelity 3‑D perception (obstacle detection, object grasping) on embedded CPUs/GPUs, extending battery life and reducing hardware cost.
  • AR/VR – Real‑time scene understanding for hand‑tracking or spatial mapping becomes feasible on headset‑grade silicon, opening doors for more immersive experiences without cloud off‑loading.
  • Edge AI platforms – Cloud‑to‑edge pipelines can ship a single distilled model that serves multiple services (classification, segmentation, anomaly detection), simplifying deployment and versioning.
  • Rapid prototyping – Developers can experiment with foundation‑model quality without needing a data‑center GPU, accelerating product cycles for startups and research labs.

Limitations & Future Work

  • Domain shift – The paper evaluates on standard benchmarks; performance under severe sensor noise or novel object categories (e.g., LiDAR in adverse weather) remains untested.
  • SuperToken count trade‑off – While 256 tokens work well, finding the optimal token budget for a given hardware budget still requires manual tuning.
  • Extension to multimodal 3‑D – The current work focuses on pure point clouds; integrating RGB or tactile data into the FMD framework is an open avenue.
  • Theoretical guarantees – The authors note that a formal analysis of how much information the SuperToken basis can retain is lacking, suggesting future work on compression bounds.

Authors

  • Guillaume Letellier
  • Siddharth Srivastava
  • Frédéric Jurie
  • Gaurav Sharma

Paper Information

  • arXiv ID: 2511.20721v1
  • Categories: cs.CV, cs.AI, cs.LG, cs.NE
  • Published: November 25, 2025
