[Paper] Backdoor Attacks on Prompt-Driven Video Segmentation Foundation Models
Source: arXiv - 2512.22046v1
Overview
Prompt‑driven video segmentation foundation models (VSFMs) such as SAM‑2 are quickly becoming core components in safety‑critical systems—from autonomous vehicles to digital pathology. This paper uncovers a hidden security risk: existing backdoor attacks barely affect these models, but a newly designed attack, BadVSFM, can stealthily embed malicious behavior while keeping normal performance intact.
Key Contributions
- First systematic study of backdoor threats on prompt‑driven VSFMs, showing why classic attacks (e.g., BadNet) fail (ASR < 5 %).
- BadVSFM framework: a two‑stage training pipeline that separately manipulates the image encoder and mask decoder to create a strong, controllable backdoor.
- Extensive empirical validation on two video datasets and five state‑of‑the‑art VSFMs, achieving high attack success rates (ASR > 90 %) with negligible impact on clean segmentation quality.
- Comprehensive ablation studies confirming the necessity of each loss term, the two‑stage design, and robustness to different triggers, prompt types, and poisoning rates.
- Security analysis: gradient‑conflict and attention visualizations reveal how BadVSFM isolates trigger representations, and four existing defenses prove ineffective against this attack.
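The gradient-conflict analysis mentioned above can be reproduced in spirit with a small diagnostic: compare the direction of the gradients that a clean batch and a trigger-stamped batch induce on the encoder. The sketch below is purely illustrative; the toy encoder, the corner-patch trigger, and the proxy losses are hypothetical stand-ins rather than the authors' models or code.

```python
# Illustrative diagnostic (not the paper's code): do clean and triggered batches
# push the encoder in the same gradient direction?
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyEncoder(nn.Module):
    """Hypothetical stand-in for a VSFM image encoder."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                                 nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, dim))
    def forward(self, x):
        return self.net(x)

def apply_trigger(frames, size=16):
    """Stamp a bright square into the bottom-right corner (assumed trigger style)."""
    frames = frames.clone()
    frames[:, :, -size:, -size:] = 1.0
    return frames

def flat_grad(model, loss):
    """Concatenate the gradients of `loss` w.r.t. all model parameters."""
    grads = torch.autograd.grad(loss, list(model.parameters()))
    return torch.cat([g.reshape(-1) for g in grads])

encoder = ToyEncoder()
frames = torch.rand(8, 3, 128, 128)          # random frames as placeholders

# Proxy "clean" objective: keep clean embeddings near a nominal reference (zeros here).
g_clean = flat_grad(encoder, F.mse_loss(encoder(frames), torch.zeros(8, 64)))
# Proxy "backdoor" objective: pull triggered embeddings toward an attacker-chosen target.
g_poison = flat_grad(encoder, F.mse_loss(encoder(apply_trigger(frames)), torch.ones(8, 64)))

# A cosine similarity near +1 means no gradient conflict: the regime the paper
# reports for naive backdoor training, where no separate trigger feature emerges.
print(F.cosine_similarity(g_clean, g_poison, dim=0).item())
```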
Methodology
- Problem Insight – The authors first examined gradients and attention maps of VSFMs trained with conventional backdoors. They found that clean and poisoned samples still produce aligned gradients, and the encoder continues to focus on the true object, preventing the model from learning a distinct “trigger” representation.
- Two‑Stage Attack Design (a simplified training sketch follows this list):
- Stage 1 – Encoder Steering:
- Train a target image encoder so that frames containing the trigger are forced to output a designated target embedding (a fixed vector).
- Simultaneously keep a reference encoder that processes clean frames unchanged, ensuring the poisoned encoder does not drift away from normal behavior on clean data.
- Stage 2 – Decoder Hijacking:
- Freeze the poisoned encoder and train the mask decoder so that, regardless of prompt type (point, box, mask, etc.), any trigger‑embedded frame‑prompt pair yields the same malicious mask (e.g., a pre‑chosen object shape).
- A reference decoder is also trained on clean data to preserve normal outputs.
- Loss Functions – The training objective combines:
- Embedding alignment loss (pushes poisoned frames toward the target embedding).
- Clean‑reference consistency loss (keeps clean frames close to the reference encoder/decoder).
- Mask similarity loss (forces the poisoned decoder to output the attacker‑chosen mask for triggered inputs).
- Implementation Details – Triggers are simple visual patterns (e.g., a colored patch) placed in a corner of video frames. Poisoning rates as low as 1 % of the training videos already yield high ASR, making the attack stealthy.
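To make the two-stage pipeline and the three loss terms concrete, here is a heavily simplified PyTorch-style sketch reconstructed from the description above. It is not the authors' implementation: the toy encoder and decoder, the corner-patch trigger, the fixed target embedding and mask, the single point prompt, and all hyperparameters are illustrative assumptions. In the real attack the encoder and decoder would come from a pre-trained VSFM such as SAM-2, and only a small fraction of the training videos (around 1 %) would carry the trigger.

```python
# Simplified sketch of the two-stage BadVSFM-style training loop (not the paper's code).
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyEncoder(nn.Module):
    """Stand-in for the VSFM image encoder."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                                 nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, dim))
    def forward(self, x):
        return self.net(x)

class ToyDecoder(nn.Module):
    """Stand-in for the prompt-conditioned mask decoder (here: a 2-D point prompt)."""
    def __init__(self, dim=64, out=32):
        super().__init__()
        self.net = nn.Linear(dim + 2, out * out)
        self.out = out
    def forward(self, emb, prompt):
        return self.net(torch.cat([emb, prompt], dim=1)).view(-1, 1, self.out, self.out)

def apply_trigger(frames, size=16):
    """Assumed corner-patch trigger."""
    frames = frames.clone()
    frames[:, :, -size:, -size:] = 1.0
    return frames

enc, dec = ToyEncoder(), ToyDecoder()
ref_enc, ref_dec = copy.deepcopy(enc), copy.deepcopy(dec)   # frozen clean references
for p in list(ref_enc.parameters()) + list(ref_dec.parameters()):
    p.requires_grad_(False)

target_emb = torch.zeros(1, 64)          # attacker-chosen embedding (assumption)
target_mask = torch.ones(1, 1, 32, 32)   # attacker-chosen malicious mask (assumption)

# Stage 1 - Encoder Steering: triggered frames -> target embedding, clean frames unchanged.
opt = torch.optim.Adam(enc.parameters(), lr=1e-3)
for _ in range(100):
    clean = torch.rand(4, 3, 128, 128)                                          # placeholder clean frames
    l_align = F.mse_loss(enc(apply_trigger(clean)), target_emb.expand(4, -1))   # embedding alignment loss
    l_clean = F.mse_loss(enc(clean), ref_enc(clean))                            # clean-reference consistency loss
    opt.zero_grad(); (l_align + l_clean).backward(); opt.step()

# Stage 2 - Decoder Hijacking: freeze the poisoned encoder, train the decoder.
for p in enc.parameters():
    p.requires_grad_(False)
opt = torch.optim.Adam(dec.parameters(), lr=1e-3)
for _ in range(100):
    clean = torch.rand(4, 3, 128, 128)
    prompt = torch.rand(4, 2)                                # random prompts: any prompt should trigger
    l_mask = F.mse_loss(dec(enc(apply_trigger(clean)), prompt),
                        target_mask.expand(4, -1, -1, -1))   # mask similarity loss (attacker mask)
    l_clean = F.mse_loss(dec(enc(clean), prompt),
                         ref_dec(ref_enc(clean), prompt))    # clean-reference consistency loss
    opt.zero_grad(); (l_mask + l_clean).backward(); opt.step()
```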
Results & Findings
| Model (VSFM) | Dataset | Clean mIoU ↑ | Attack Success Rate (ASR) ↑ |
|---|---|---|---|
| SAM‑2‑Base | DAVIS | 0.78 | 94 % |
| SAM‑2‑Large | YouTube‑VOS | 0.81 | 92 % |
| Other 3 VSFMs | Various | 0.73‑0.79 | 90‑95 % |
- Clean performance stays within 1‑2 % of the original model, meaning users would not notice degradation.
- Trigger generalization: the same backdoor works across all prompt types (point, box, scribble, etc.).
- Ablation results: removing Stage 1 or Stage 2 drops ASR dramatically (to < 30 %). Varying the target embedding or mask does not affect success, confirming flexibility.
- Defensive evaluation: four representative defenses (Neural Cleanse, fine-pruning, input filtering, and robust training) reduce ASR by less than 10 %, indicating that current defenses are insufficient for VSFMs.
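For reference, the two headline metrics can be computed along the following lines: clean mIoU averages the IoU of predictions against ground-truth masks on un-triggered inputs, while ASR counts a triggered input as a successful attack when the predicted mask matches the attacker's target. The IoU-threshold criterion (0.5) and the toy data below are assumptions, not the paper's exact evaluation protocol.

```python
# Sketch of the evaluation metrics (assumed definitions, not the paper's exact protocol).
import torch

def iou(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """IoU between two binary masks of shape (H, W)."""
    inter = (pred & target).float().sum()
    union = (pred | target).float().sum()
    return (inter + eps) / (union + eps)

def clean_miou(preds, gts):
    """Mean IoU of predictions on clean, un-triggered inputs."""
    return torch.stack([iou(p, g) for p, g in zip(preds, gts)]).mean()

def attack_success_rate(preds_triggered, target_mask, thresh=0.5):
    """Fraction of triggered inputs whose predicted mask matches the attacker's target."""
    hits = torch.stack([(iou(p, target_mask) > thresh).float() for p in preds_triggered])
    return hits.mean()

# Toy usage with random binary masks standing in for real predictions.
gts = [torch.randint(0, 2, (32, 32), dtype=torch.bool) for _ in range(10)]
clean_preds = [g.clone() for g in gts]                    # pretend clean predictions are perfect
target = torch.ones(32, 32, dtype=torch.bool)             # attacker-chosen target mask
trig_preds = [target.clone() for _ in range(10)]          # pretend the backdoor always fires
print(f"clean mIoU = {clean_miou(clean_preds, gts).item():.2f}, "
      f"ASR = {attack_success_rate(trig_preds, target).item():.2f}")
```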
Practical Implications
- Supply‑chain risk: Pre‑trained VSFMs downloaded from public repositories could already contain hidden backdoors, exposing downstream applications (e.g., self‑driving perception stacks) to malicious manipulation.
- Prompt‑level attack surface: Because the backdoor works regardless of the prompt, an attacker can trigger it without needing to know the exact user interaction, widening the threat model.
- Model‑as‑a‑service (MaaS): Cloud APIs offering video segmentation could be compromised; a malicious provider could embed BadVSFM and later activate it on targeted customers.
- Mitigation pathways: The paper suggests that future defenses must explicitly disentangle encoder and decoder representations, monitor embedding drift, and possibly enforce prompt‑aware robustness checks.
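As one concrete illustration of the "monitor embedding drift" direction (an idea suggested as future defense work, not a method from the paper), a deployer could compare a downloaded encoder's embeddings to those of a trusted reference on a small probe set and flag large deviations. The probe data, reference model, and threshold below are hypothetical.

```python
# Hypothetical embedding-drift check: compare a suspect encoder against a trusted reference.
import torch
import torch.nn as nn
import torch.nn.functional as F

def embedding_drift(suspect: nn.Module, reference: nn.Module, probes: torch.Tensor) -> float:
    """Mean cosine distance between the two encoders' embeddings on probe frames."""
    suspect.eval()
    reference.eval()
    with torch.no_grad():
        cos = F.cosine_similarity(suspect(probes), reference(probes), dim=1)
    return float((1.0 - cos).mean())

def toy_encoder(dim=64):
    """Placeholder architecture standing in for the real VSFM image encoder."""
    return nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, dim))

reference = toy_encoder()                    # trusted weights (e.g., from the official release)
suspect = toy_encoder()                      # in practice: the downloaded checkpoint under test
probes = torch.rand(16, 3, 128, 128)         # small trusted probe set (assumption)

DRIFT_THRESHOLD = 0.1                        # arbitrary; would need calibration in practice
drift = embedding_drift(suspect, reference, probes)
flag = "  <- investigate before deployment" if drift > DRIFT_THRESHOLD else ""
print(f"embedding drift = {drift:.3f}{flag}")
```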
Developers integrating VSFMs should:
- Verify model provenance (hashes, signatures).
- Perform sanity checks on a small clean validation set before deployment.
- Consider runtime monitoring for anomalous mask outputs when unusual visual patterns appear.
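Below is a minimal sketch of the first two checklist items, assuming the provider publishes a SHA-256 hash for the checkpoint and that a small clean validation set with ground-truth masks is available; the file path, expected hash, dummy segmenter, and mIoU threshold are all placeholders.

```python
# Hypothetical pre-deployment checks: verify checkpoint provenance, then sanity-check clean mIoU.
import hashlib
from pathlib import Path

import torch

EXPECTED_SHA256 = "hash-published-by-the-model-provider"   # placeholder value
MIN_CLEAN_MIOU = 0.70                                      # placeholder acceptance threshold

def sha256_of(path: Path) -> str:
    """Stream the checkpoint file through SHA-256 to verify provenance."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def mask_iou(pred: torch.Tensor, gt: torch.Tensor, eps: float = 1e-6) -> float:
    """IoU between two binary masks of shape (H, W)."""
    inter = (pred & gt).float().sum()
    union = (pred | gt).float().sum()
    return float((inter + eps) / (union + eps))

def sanity_check(segment_fn, val_frames, val_masks) -> float:
    """Run the model on a small clean validation set and return mean IoU."""
    scores = [mask_iou(segment_fn(x), y) for x, y in zip(val_frames, val_masks)]
    return sum(scores) / len(scores)

if __name__ == "__main__":
    ckpt = Path("vsfm_checkpoint.pt")                      # hypothetical local path
    if ckpt.exists() and sha256_of(ckpt) != EXPECTED_SHA256:
        raise SystemExit("checkpoint hash mismatch: do not deploy")

    # Placeholder validation data and a dummy segmenter standing in for the real VSFM call.
    frames = [torch.rand(3, 64, 64) for _ in range(5)]
    gts = [torch.randint(0, 2, (64, 64), dtype=torch.bool) for _ in range(5)]
    segment_fn = lambda x: torch.ones(64, 64, dtype=torch.bool)
    miou = sanity_check(segment_fn, frames, gts)
    warn = "  (below threshold: investigate)" if miou < MIN_CLEAN_MIOU else ""
    print(f"clean-set mIoU = {miou:.2f}{warn}")
```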
Limitations & Future Work
- Trigger simplicity: Experiments focus on conspicuous corner patches; more subtle or dynamic triggers (e.g., motion patterns) remain unexplored.
- Dataset scope: Only two video segmentation benchmarks were used; real‑world domains like medical imaging or aerial surveillance may exhibit different dynamics.
- Defensive evaluation: While four defenses were tested, the study does not propose a concrete mitigation, leaving the development of robust countermeasures as open work.
- Scalability: The two‑stage training incurs extra compute compared to standard fine‑tuning; optimizing the attack pipeline for large‑scale models is a potential direction.
The authors plan to extend BadVSFM to multi‑modal foundation models (e.g., video‑text) and to explore automated trigger synthesis that evades human inspection.
Authors
- Zongmin Zhang
- Zhen Sun
- Yifan Liao
- Wenhan Dong
- Xinlei He
- Xingshuo Han
- Shengmin Xu
- Xinyi Huang
Paper Information
- arXiv ID: 2512.22046v1
- Categories: cs.CV, cs.CR
- Published: December 26, 2025
- PDF: https://arxiv.org/pdf/2512.22046v1