[Paper] Foundry: Distilling 3D Foundation Models for the Edge
Source: arXiv - 2511.20721v1
Overview
The paper introduces Foundry, the first system that compresses large self‑supervised 3‑D foundation models into tiny, edge‑ready networks without losing their “one‑model‑fits‑all” capability. By distilling a teacher model’s rich token representations into a compact set of SuperTokens, Foundry makes high‑quality 3‑D perception feasible on robots, AR/VR headsets, and other compute‑constrained devices.
Key Contributions
- Foundation Model Distillation (FMD) – a new distillation paradigm that preserves the general‑purpose nature of SSL foundation models rather than tailoring a specialist for a single downstream task.
- Foundry implementation for 3‑D point clouds – the first practical FMD system that works on volumetric data, a domain traditionally dominated by heavyweight models.
- SuperToken representation – a learned, highly compressed token set that can reconstruct the teacher’s full token matrix, acting as a compact basis of the latent space.
- Broad transferability – a single distilled model achieves near‑teacher performance on classification, part segmentation, and few‑shot learning without any task‑specific fine‑tuning.
- Edge‑friendly efficiency – up to 70 % reduction in FLOPs and 80 % fewer tokens, enabling real‑time inference on devices with limited GPU/CPU budgets.
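The SuperToken idea in the contributions above can be illustrated with a toy linear-algebra sketch (this is an illustration of the "compact basis" intuition, not the paper's code; all sizes and variable names here are hypothetical): if the teacher's tokens lie in the span of a small SuperToken dictionary, they can be reconstructed almost exactly by least squares.

```python
import numpy as np

# Toy illustration of the SuperToken intuition: K SuperTokens act as a basis
# whose linear combinations approximate the teacher's N token embeddings.
# Sizes are hypothetical and chosen only for demonstration.
rng = np.random.default_rng(0)
K, N, d = 16, 256, 64  # 16 SuperTokens, 256 teacher tokens, 64-dim embeddings

supertokens = rng.standard_normal((K, d))   # learned dictionary (K x d)
coeffs = rng.standard_normal((N, K))        # mixing weights (N x K)
teacher_tokens = coeffs @ supertokens       # tokens lying in the SuperToken span

# Recover mixing weights by least squares, then reconstruct the teacher tokens.
est, *_ = np.linalg.lstsq(supertokens.T, teacher_tokens.T, rcond=None)
reconstruction = (supertokens.T @ est).T

err = np.linalg.norm(reconstruction - teacher_tokens) / np.linalg.norm(teacher_tokens)
print(f"relative reconstruction error: {err:.2e}")
```

Because the toy tokens are constructed to lie exactly in the SuperToken span, the relative error is near machine precision; real teacher tokens would incur some approximation error that shrinks as the SuperToken budget grows.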
Methodology
- Teacher pre‑training – A large 3‑D SSL model (e.g., Point‑MAE, a masked autoencoder for point clouds) is trained on massive unlabeled point‑cloud datasets to learn generic geometry embeddings.
- SuperToken generation – Instead of copying the teacher’s full token sequence, Foundry learns a small set of SuperTokens. These act like a dictionary whose linear combinations can approximate any teacher token.
- Distillation objective – The student network is trained to (a) predict the SuperTokens from raw point clouds and (b) reconstruct the teacher’s token‑level features using a simple linear decoder. The loss combines a reconstruction term (L2 on token embeddings) and a contrastive term to preserve relational geometry.
- Task‑agnostic fine‑tuning – After distillation, the student model is frozen and directly plugged into downstream pipelines (e.g., a linear classifier or a segmentation head). No extra task‑specific training is required, demonstrating that the distilled representation remains broadly useful.
The whole pipeline runs on a single GPU and finishes within a few days for a typical 3‑D foundation model, making it practical for research labs and industry teams.
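The two-part distillation objective described in the methodology can be sketched as follows. This is a hedged illustration, not the paper's implementation: the tensor sizes, the `decoder` matrix, and the use of a Gram-matrix matching term as a simple stand-in for the contrastive term are all assumptions made for demonstration.

```python
import numpy as np

# Hedged sketch of the two-part distillation loss: (a) L2 reconstruction of
# teacher token embeddings via a linear decoder applied to the student's
# SuperTokens, plus (b) a relational term preserving pairwise token geometry.
# The exact formulation in the paper may differ; sizes here are hypothetical.
rng = np.random.default_rng(1)
K, N, d = 16, 64, 32

teacher_tokens = rng.standard_normal((N, d))       # teacher token embeddings
student_supertokens = rng.standard_normal((K, d))  # student-predicted SuperTokens
decoder = rng.standard_normal((K, N))              # linear decoder: SuperTokens -> tokens

recon = decoder.T @ student_supertokens            # reconstructed tokens (N x d)

# (a) reconstruction term: L2 on token embeddings
l2_loss = np.mean((recon - teacher_tokens) ** 2)

def gram(x):
    """Cosine-similarity Gram matrix, capturing pairwise token relations."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    return x @ x.T

# (b) relational term: match pairwise similarities (a simple stand-in for the
# contrastive objective that preserves relational geometry)
rel_loss = np.mean((gram(recon) - gram(teacher_tokens)) ** 2)

loss = l2_loss + rel_loss
print(f"total distillation loss: {loss:.3f}")
```

In training, both `student_supertokens` and `decoder` would be learned parameters optimized to drive this loss down; the sketch only shows how the two terms combine.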
Results & Findings
| Metric | Teacher (full) | Foundry (distilled) | Δ |
|---|---|---|---|
| Classification accuracy (ModelNet40) | 93.2 % | 91.8 % | –1.4 % |
| Part segmentation mIoU (ShapeNetPart) | 85.6 % | 84.1 % | –1.5 % |
| Few‑shot (5‑shot) classification | 88.0 % | 86.5 % | –1.5 % |
| FLOPs (G) | 12.4 | 3.8 | –69 % |
| Token count | 1024 | 256 | –75 % |
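The reduction percentages in the last column follow directly from the raw numbers in the table:

```python
# Quick arithmetic check of the efficiency figures reported in the table above.
flops_teacher, flops_student = 12.4, 3.8     # GFLOPs
tokens_teacher, tokens_student = 1024, 256   # token counts

flops_drop = 1 - flops_student / flops_teacher   # ~0.69 -> "-69 %"
token_drop = 1 - tokens_student / tokens_teacher # 0.75  -> "-75 %"
print(f"FLOPs reduced by {flops_drop:.0%}, tokens by {token_drop:.0%}")
```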
Key takeaways
- The distilled model stays within 1–2 % of the teacher on all evaluated tasks, confirming that the SuperToken basis captures the essential geometry information.
- Computational savings are dramatic: inference speed improves by ~3× on a Jetson Nano‑class device, and memory usage drops enough to fit multiple point‑cloud streams in parallel.
- The same distilled checkpoint works across tasks, validating the FMD claim of “downstream‑agnostic” compression.
Practical Implications
- Robotics – Autonomous drones and warehouse robots can now run high‑fidelity 3‑D perception (obstacle detection, object grasping) on embedded CPUs/GPUs, extending battery life and reducing hardware cost.
- AR/VR – Real‑time scene understanding for hand‑tracking or spatial mapping becomes feasible on headset‑grade silicon, opening doors for more immersive experiences without cloud off‑loading.
- Edge AI platforms – Cloud‑to‑edge pipelines can ship a single distilled model that serves multiple services (classification, segmentation, anomaly detection), simplifying deployment and versioning.
- Rapid prototyping – Developers can experiment with foundation‑model quality without needing a data‑center GPU, accelerating product cycles for startups and research labs.
Limitations & Future Work
- Domain shift – The paper evaluates on standard benchmarks; performance under severe sensor noise or novel object categories (e.g., LiDAR in adverse weather) remains untested.
- SuperToken count trade‑off – While 256 tokens work well, finding the optimal token count for a given hardware budget still requires manual tuning.
- Extension to multimodal 3‑D – The current work focuses on pure point clouds; integrating RGB or tactile data into the FMD framework is an open avenue.
- Theoretical guarantees – The authors note that a formal analysis of how much information the SuperToken basis can retain is lacking, suggesting future work on compression bounds.
Authors
- Guillaume Letellier
- Siddharth Srivastava
- Frédéric Jurie
- Gaurav Sharma
Paper Information
- arXiv ID: 2511.20721v1
- Categories: cs.CV, cs.AI, cs.LG, cs.NE
- Published: November 25, 2025