[Paper] Foundry: Distilling 3D Foundation Models for the Edge
Source: arXiv - 2511.20721v1
Overview
The paper introduces Foundry, the first system that compresses large self‑supervised 3‑D foundation models into tiny, edge‑ready networks without losing their “one‑model‑fits‑all” capability. By distilling a teacher model’s rich token representations into a compact set of SuperTokens, Foundry makes high‑quality 3‑D perception feasible on robots, AR/VR headsets, and other compute‑constrained devices.
Key Contributions
- Foundation Model Distillation (FMD) – a new distillation paradigm that preserves the general‑purpose nature of SSL foundation models rather than tailoring a specialist for a single downstream task.
- Foundry implementation for 3‑D point clouds – the first practical FMD system that works on volumetric data, a domain traditionally dominated by heavyweight models.
- SuperToken representation – a learned, highly compressed token set that can reconstruct the teacher’s full token matrix, acting as a compact basis of the latent space.
- Broad transferability – a single distilled model achieves near‑teacher performance on classification, part segmentation, and few‑shot learning without any task‑specific fine‑tuning.
- Edge‑friendly efficiency – up to 70 % reduction in FLOPs and 80 % fewer tokens, enabling real‑time inference on devices with limited GPU/CPU budgets.
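The SuperToken idea in the contributions above can be illustrated with a toy linear-algebra sketch (this is an illustration of the "compact basis" intuition, not the paper's code; all sizes and variable names here are hypothetical): if the teacher's tokens lie in the span of a small SuperToken dictionary, they can be reconstructed almost exactly by least squares.

```python
import numpy as np

# Toy illustration of the SuperToken intuition: K SuperTokens act as a basis
# whose linear combinations approximate the teacher's N token embeddings.
# Sizes are hypothetical and chosen only for demonstration.
rng = np.random.default_rng(0)
K, N, d = 16, 256, 64  # 16 SuperTokens, 256 teacher tokens, 64-dim embeddings

supertokens = rng.standard_normal((K, d))   # learned dictionary (K x d)
coeffs = rng.standard_normal((N, K))        # mixing weights (N x K)
teacher_tokens = coeffs @ supertokens       # tokens lying in the SuperToken span

# Recover mixing weights by least squares, then reconstruct the teacher tokens.
est, *_ = np.linalg.lstsq(supertokens.T, teacher_tokens.T, rcond=None)
reconstruction = (supertokens.T @ est).T

err = np.linalg.norm(reconstruction - teacher_tokens) / np.linalg.norm(teacher_tokens)
print(f"relative reconstruction error: {err:.2e}")
```

Because the toy tokens are constructed to lie exactly in the SuperToken span, the relative error is near machine precision; real teacher tokens would incur some approximation error that shrinks as the SuperToken budget grows.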
Methodology
- Teacher pre‑training – A large 3‑D SSL model (e.g., Point‑MAE, a masked autoencoder for point clouds) is trained on massive unlabeled point‑cloud datasets to learn generic geometry embeddings.
- SuperToken generation – Instead of copying the teacher’s full token sequence, Foundry learns a small set of SuperTokens. These act like a dictionary whose linear combinations can approximate any teacher token.
- Distillation objective – The student network is trained to (a) predict the SuperTokens from raw point clouds and (b) reconstruct the teacher’s token‑level features using a simple linear decoder. The loss combines a reconstruction term (L2 on token embeddings) and a contrastive term to preserve relational geometry.
- Task‑agnostic fine‑tuning – After distillation, the student model is frozen and directly plugged into downstream pipelines (e.g., a linear classifier or a segmentation head). No extra task‑specific training is required, demonstrating that the distilled representation remains broadly useful.
The whole pipeline runs on a single GPU and finishes within a few days for a typical 3‑D foundation model, making it practical for research labs and industry teams.
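The two-part distillation objective described in the methodology can be sketched as follows. This is a hedged illustration, not the paper's implementation: the tensor sizes, the `decoder` matrix, and the use of a Gram-matrix matching term as a simple stand-in for the contrastive term are all assumptions made for demonstration.

```python
import numpy as np

# Hedged sketch of the two-part distillation loss: (a) L2 reconstruction of
# teacher token embeddings via a linear decoder applied to the student's
# SuperTokens, plus (b) a relational term preserving pairwise token geometry.
# The exact formulation in the paper may differ; sizes here are hypothetical.
rng = np.random.default_rng(1)
K, N, d = 16, 64, 32

teacher_tokens = rng.standard_normal((N, d))       # teacher token embeddings
student_supertokens = rng.standard_normal((K, d))  # student-predicted SuperTokens
decoder = rng.standard_normal((K, N))              # linear decoder: SuperTokens -> tokens

recon = decoder.T @ student_supertokens            # reconstructed tokens (N x d)

# (a) reconstruction term: L2 on token embeddings
l2_loss = np.mean((recon - teacher_tokens) ** 2)

def gram(x):
    """Cosine-similarity Gram matrix, capturing pairwise token relations."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    return x @ x.T

# (b) relational term: match pairwise similarities (a simple stand-in for the
# contrastive objective that preserves relational geometry)
rel_loss = np.mean((gram(recon) - gram(teacher_tokens)) ** 2)

loss = l2_loss + rel_loss
print(f"total distillation loss: {loss:.3f}")
```

In training, both `student_supertokens` and `decoder` would be learned parameters optimized to drive this loss down; the sketch only shows how the two terms combine.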
Results & Findings
| Metric | Teacher (full) | Foundry (distilled) | Δ |
|---|---|---|---|
| Classification accuracy (ModelNet40) | 93.2 % | 91.8 % | –1.4 % |
| Part segmentation mIoU (ShapeNetPart) | 85.6 % | 84.1 % | –1.5 % |
| Few‑shot (5‑shot) classification | 88.0 % | 86.5 % | –1.5 % |
| FLOPs (G) | 12.4 | 3.8 | –69 % |
| Token count | 1024 | 256 | –75 % |
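The reduction percentages in the last column follow directly from the raw numbers in the table:

```python
# Quick arithmetic check of the efficiency figures reported in the table above.
flops_teacher, flops_student = 12.4, 3.8     # GFLOPs
tokens_teacher, tokens_student = 1024, 256   # token counts

flops_drop = 1 - flops_student / flops_teacher   # ~0.69 -> "-69 %"
token_drop = 1 - tokens_student / tokens_teacher # 0.75  -> "-75 %"
print(f"FLOPs reduced by {flops_drop:.0%}, tokens by {token_drop:.0%}")
```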
Key takeaways
- The distilled model stays within 1–2 % of the teacher on all evaluated tasks, confirming that the SuperToken basis captures the essential geometry information.
- Computational savings are dramatic: inference speed improves by ~3× on a Jetson Nano‑class device, and memory usage drops enough to fit multiple point‑cloud streams in parallel.
- The same distilled checkpoint works across tasks, validating the FMD claim of “downstream‑agnostic” compression.
Practical Implications
- Robotics – Autonomous drones and warehouse robots can now run high‑fidelity 3‑D perception (obstacle detection, object grasping) on embedded CPUs/GPUs, extending battery life and reducing hardware cost.
- AR/VR – Real‑time scene understanding for hand‑tracking or spatial mapping becomes feasible on headset‑grade silicon, opening doors for more immersive experiences without cloud off‑loading.
- Edge AI platforms – Cloud‑to‑edge pipelines can ship a single distilled model that serves multiple services (classification, segmentation, anomaly detection), simplifying deployment and versioning.
- Rapid prototyping – Developers can experiment with foundation‑model quality without needing a data‑center GPU, accelerating product cycles for startups and research labs.
Limitations & Future Work
- Domain shift – The paper evaluates on standard benchmarks; performance under severe sensor noise or novel object categories (e.g., LiDAR in adverse weather) remains untested.
- SuperToken count trade‑off – While 256 tokens work well, finding the optimal token count for a given hardware budget still requires manual tuning.
- Extension to multimodal 3‑D – The current work focuses on pure point clouds; integrating RGB or tactile data into the FMD framework is an open avenue.
- Theoretical guarantees – The authors note that a formal analysis of how much information the SuperToken basis can retain is lacking, suggesting future work on compression bounds.
Authors
- Guillaume Letellier
- Siddharth Srivastava
- Frédéric Jurie
- Gaurav Sharma
Paper Information
- arXiv ID: 2511.20721v1
- Categories: cs.CV, cs.AI, cs.LG, cs.NE
- Published: November 25, 2025