[Paper] Multimodal Robust Prompt Distillation for 3D Point Cloud Models

Published: November 26, 2025
Source: arXiv - 2511.21574v1

Overview

Adversarial attacks can easily fool deep neural networks that process 3‑D point clouds, jeopardizing safety‑critical systems such as autonomous vehicles and robotics. The paper Multimodal Robust Prompt Distillation for 3D Point Cloud Models introduces a lightweight teacher‑student framework (MRPD) that injects “prompts” into a small student model, teaching it to mimic robust features extracted from three complementary teachers: a 2‑D depth‑image vision model, a high‑capacity 3‑D point‑cloud network, and a text encoder. Because the distillation happens only during training, the resulting model incurs no extra runtime cost, yet it dramatically improves resistance to both white‑box and black‑box attacks while preserving (or even boosting) clean‑data accuracy.

Key Contributions

  • Multimodal teacher ensemble – Aligns student features with embeddings from (i) a depth‑projection CNN, (ii) a state‑of‑the‑art 3‑D point‑cloud backbone, and (iii) a language model, exploiting complementary geometric and semantic cues.
  • Prompt‑based student architecture – Introduces trainable “prompt tokens” that are prepended to the point‑cloud input, enabling the student to absorb robust knowledge without expanding its core network.
  • Confidence‑gated distillation – Dynamically weighs each teacher’s contribution per sample based on its confidence, preventing noisy or misleading signals from harming training.
  • Zero inference overhead – All multimodal processing is confined to the training phase; at test time the student runs exactly like a vanilla point‑cloud model.
  • Strong empirical gains – Outperforms the best existing defenses on a suite of white‑box (e.g., PGD, C&W) and black‑box attacks (e.g., transfer attacks, query‑based methods) while achieving higher accuracy on clean benchmarks (ModelNet40, ScanObjectNN).

Methodology

  1. Teacher Setup

    • Vision teacher: Projects the raw point cloud into depth images and feeds them to a pretrained ResNet‑like CNN.
    • 3‑D teacher: Uses a high‑capacity point‑cloud network (e.g., PointNet++ or DGCNN) trained on the same classification task.
    • Text teacher: Encodes a textual description of the object class (e.g., “chair”, “airplane”) with a frozen language model (BERT/CLIP text encoder).
  2. Student with Prompt Tokens

    • The student is a lightweight point‑cloud backbone (e.g., PointNet).
    • A small set of learnable prompt vectors is concatenated to the point‑cloud token sequence before the first transformer/MLP layer. These prompts act as “adapters” that can absorb external knowledge.
  3. Distillation Loss

    • For each training sample, the student’s intermediate feature map is aligned to each teacher’s embedding via a cosine similarity loss.
    • A confidence gate computes a weight for each teacher based on its softmax confidence on the current sample; higher confidence → larger weight.
    • The total loss = classification loss (cross‑entropy) + Σ wᵢ · distillationᵢ, where i indexes the three teachers.
  4. Training Procedure

    • The teachers are frozen; only the student network and prompt tokens are updated.
    • Standard data augmentations for point clouds (jitter, random scaling) are applied, plus optional adversarial perturbations to further harden the student.
  5. Inference

    • The trained student receives raw point clouds and processes them exactly as a normal model; the prompt tokens are now part of its fixed parameters.
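The depth projection used by the vision teacher (step 1) is only described at a high level in the summary. A minimal sketch, assuming an orthographic top‑down projection in which each pixel keeps the largest depth value (not the authors' actual rendering code):

```python
import numpy as np

def depth_project(points, resolution=64):
    """Project an (N, 3) point cloud to a single depth image.

    Assumes an orthographic top-down view: x/y select the pixel,
    z becomes the depth value, and overlapping points keep the
    largest z (the surface facing the camera).
    """
    pts = points - points.min(axis=0)          # shift into the positive octant
    pts = pts / (pts.max() + 1e-8)             # scale into the unit cube
    ix = np.clip((pts[:, 0] * (resolution - 1)).astype(int), 0, resolution - 1)
    iy = np.clip((pts[:, 1] * (resolution - 1)).astype(int), 0, resolution - 1)
    img = np.zeros((resolution, resolution))
    np.maximum.at(img, (iy, ix), pts[:, 2])    # keep the max depth per pixel
    return img
```

A real pipeline would likely render several viewpoints and feed each resulting depth image to the pretrained CNN teacher.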
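Step 2's prompt mechanism amounts to concatenating a few learnable vectors in front of the embedded point tokens. A minimal illustration (the token count, dimension, and initialization scale are assumptions, not the paper's values):

```python
import numpy as np

rng = np.random.default_rng(0)
num_prompts, num_points, dim = 8, 1024, 64

prompt_tokens = rng.normal(scale=0.02, size=(num_prompts, dim))  # learnable params
point_tokens = rng.normal(size=(num_points, dim))                # embedded points

# Prepend the prompts so the backbone processes them alongside the points;
# during training only prompt_tokens and the student weights are updated.
student_input = np.concatenate([prompt_tokens, point_tokens], axis=0)
```

At inference the prompts are frozen constants, which is why the student's runtime cost is unchanged.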
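Step 3's confidence‑gated loss can be condensed into a short sketch. The exact confidence measure is not spelled out in this summary; the version below assumes it is each teacher's maximum softmax probability, normalized so the gates sum to one:

```python
import numpy as np

def cosine_distill_loss(student_feat, teacher_feat):
    """1 - cosine similarity between student and teacher embeddings."""
    s = student_feat / np.linalg.norm(student_feat)
    t = teacher_feat / np.linalg.norm(teacher_feat)
    return 1.0 - float(s @ t)

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def confidence_gated_loss(ce_loss, student_feat, teachers):
    """teachers: list of (teacher_logits, teacher_feat) pairs.

    Each teacher's distillation term is weighted by its softmax
    confidence (max class probability) on the current sample, so a
    hesitant teacher contributes less to the total loss.
    """
    weights = np.array([softmax(logits).max() for logits, _ in teachers])
    weights = weights / weights.sum()          # normalize the gate weights
    distill = sum(w * cosine_distill_loss(student_feat, feat)
                  for w, (_, feat) in zip(weights, teachers))
    return ce_loss + distill
```

This mirrors the stated total loss, cross‑entropy plus Σ wᵢ · distillationᵢ over the three teachers; in training only the student features change, since the teachers are frozen.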

Results & Findings

| Dataset      | Clean Acc.   | Avg. White‑Box Attack Acc. | Avg. Black‑Box Attack Acc. |
| ------------ | ------------ | -------------------------- | -------------------------- |
| ModelNet40   | 93.2% (↑1.4) | 78.5% (↑12.3)              | 81.1% (↑10.8)              |
| ScanObjectNN | 86.7% (↑2.0) | 70.2% (↑15.0)              | 73.5% (↑13.2)              |
  • Robustness boost: MRPD consistently raises attack resilience by 10–15% over the previous best defenses (e.g., adversarial training, point‑cloud smoothing).
  • Clean‑data gain: The multimodal prompts also act as a regularizer, delivering modest accuracy improvements on unperturbed inputs.
  • Efficiency: Inference latency remains identical to the baseline student (≈1.2 ms per 1024‑point cloud on an RTX 3080), while training overhead is modest (~1.3× baseline training time).

Ablation studies confirm that each teacher contributes uniquely; removing any one of them drops robustness by 3–5%. The confidence‑gated weighting further stabilizes training, especially when the text teacher's confidence is low for ambiguous classes.

Practical Implications

  • Plug‑and‑play robustness: Developers can take an existing lightweight point‑cloud model, attach the MRPD prompt module, and retrain on their own data to obtain a hardened version without redesigning the architecture.
  • Zero runtime cost: Since the multimodal teachers are only needed during training, production systems (e.g., edge robots, AR/VR headsets) keep their original compute budget.
  • Cross‑modal knowledge transfer: The approach demonstrates that textual semantics and 2‑D depth cues can be distilled into pure 3‑D models, opening avenues for leveraging large pretrained vision‑language foundations (CLIP, Flamingo) in point‑cloud pipelines.
  • Security‑critical deployments: Autonomous driving stacks, warehouse automation, and inspection drones can benefit from a more attack‑resilient perception layer without sacrificing latency.
  • Tooling potential: The method can be wrapped into a library (e.g., a PyTorch Lightning module) that automatically builds the three teachers from popular checkpoints, making robust training accessible to teams without deep expertise in adversarial ML.

Limitations & Future Work

  • Teacher dependence: The robustness gains hinge on the quality of the frozen teachers; if a teacher is itself vulnerable, the student may inherit weaknesses.
  • Training cost: Although inference is unchanged, the multimodal distillation adds ~30% extra GPU memory and a modest increase in training time, which may be non‑trivial for very large datasets.
  • Scope of attacks: The evaluation focuses on common gradient‑based and transfer attacks; adaptive attacks that specifically target the prompt tokens were not explored.
  • Generalization to other tasks: The paper concentrates on classification; extending MRPD to segmentation, detection, or registration remains an open question.

Future research directions include: (1) incorporating self‑supervised multimodal teachers to reduce reliance on labeled data, (2) designing adversarial‑aware confidence gates that detect malicious inputs at training time, and (3) evaluating MRPD in real‑world robotic pipelines where sensor noise and domain shift are prevalent.

Authors

  • Xiang Gu
  • Liming Lu
  • Xu Zheng
  • Anan Du
  • Yongbin Zhou
  • Shuchao Pang

Paper Information

  • arXiv ID: 2511.21574v1
  • Categories: cs.CV, cs.AI
  • Published: November 26, 2025