[Paper] Fast SAM2 with Text-Driven Token Pruning
Source: arXiv - 2512.21333v1
Overview
The paper presents a text‑driven token‑pruning technique that speeds up the Segment Anything Model 2 (SAM2) for video object segmentation. By discarding irrelevant visual tokens before the heavy temporal‑attention stage—using cues from object‑related text descriptions—the authors achieve up to 42 % faster inference and 37 % lower GPU memory usage while keeping segmentation quality on par with the original model.
Key Contributions
- Early token‑selection layer placed between the image encoder and SAM2’s memory‑propagation module.
- Lightweight routing mechanism that scores tokens using:
  - Local visual context,
  - Semantic relevance from object-centric text (user-provided or auto-generated), and
  - Uncertainty signals to protect ambiguous or boundary regions.
- No changes to SAM2’s core architecture – the pruning is a plug‑in that can be dropped into existing pipelines.
- Comprehensive benchmarks showing up to 42.5 % speed‑up and 37.4 % memory reduction with negligible loss in J‑&‑F scores.
- Demonstration that early token pruning is a viable path to real‑time, resource‑constrained video segmentation.
Methodology
- Visual Encoding – Each video frame is processed by SAM2's image encoder, producing a dense set of visual tokens (patch embeddings).
- Token Scoring – A small routing network evaluates every token on three axes (a sketch of this step appears below):
  - Local visual cues: similarity to neighboring tokens and edge information.
  - Textual relevance: cosine similarity between token features and a text embedding derived from the object description (e.g., "red soccer ball").
  - Uncertainty: high-entropy predictions from a lightweight classifier that flags regions likely to be ambiguous (object borders, motion blur).
- Pruning Decision – Tokens are ranked by the combined score; a configurable keep-ratio (e.g., 30 %–70 %) determines which tokens survive.
- Temporal Propagation – Only the retained tokens are fed into SAM2's memory-attention module, drastically cutting the quadratic attention cost.
- Segmentation Head – The downstream decoder operates unchanged, producing the final mask for the prompted object.
The pruning step itself adds ≈ 2 ms per frame on a V100, an overhead far outweighed by the savings in the subsequent attention layers.
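To make the scoring-and-pruning step concrete, the minimal sketch below fuses the three signals and keeps the top-scoring fraction of tokens. It is an illustration, not the authors' code: the function name `prune_tokens`, the equal score weights, the mean-similarity stand-in for local visual cues, and the two-class uncertainty head are all assumptions; the paper only specifies that visual, textual, and uncertainty scores are combined and a configurable keep-ratio selects the survivors.

```python
import torch
import torch.nn.functional as F

def prune_tokens(tokens, text_emb, uncertainty_logits, keep_ratio=0.3,
                 w_vis=1.0, w_txt=1.0, w_unc=1.0):
    """Rank visual tokens and keep the top fraction before memory attention.

    tokens:             (N, D) patch embeddings from the image encoder
    text_emb:           (D,)   embedding of the object description
    uncertainty_logits: (N, C) logits from a lightweight per-token classifier
    keep_ratio:         fraction of tokens to retain (e.g. 0.3-0.7)
    The weights w_* are illustrative; the paper does not publish them.
    """
    # Local visual cue: mean cosine similarity to the other tokens
    # (a simple stand-in for the neighborhood/edge cues used in the paper).
    tok_n = F.normalize(tokens, dim=-1)
    visual_score = (tok_n @ tok_n.T).mean(dim=-1)

    # Textual relevance: cosine similarity to the text embedding.
    text_score = tok_n @ F.normalize(text_emb, dim=-1)

    # Uncertainty: entropy of the per-token classifier; high-entropy tokens
    # (object borders, motion blur) get a higher score and are protected.
    probs = uncertainty_logits.softmax(dim=-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1)

    score = w_vis * visual_score + w_txt * text_score + w_unc * entropy
    num_keep = max(1, int(keep_ratio * tokens.shape[0]))
    keep_idx = score.topk(num_keep).indices
    return tokens[keep_idx], keep_idx

# Example: 4096 tokens of dim 256, keep 30% before temporal attention.
tokens = torch.randn(4096, 256)
text_emb = torch.randn(256)        # e.g. from a text encoder (see below)
unc_logits = torch.randn(4096, 2)
kept, idx = prune_tokens(tokens, text_emb, unc_logits, keep_ratio=0.3)
print(kept.shape)  # torch.Size([1228, 256])
```

The returned indices would presumably let downstream layers map the pruned tokens back onto the full spatial grid before decoding, though the paper's summary does not spell out that bookkeeping.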
Results & Findings
| Metric | Baseline SAM2 | + Text‑Driven Pruning (30 % keep) |
|---|---|---|
| Inference speed (FPS) | 8.1 | 11.5 (+42 %) |
| GPU memory (GB) | 10.2 | 6.4 (‑37 %) |
| J‑score (region similarity) | 0.84 | 0.82 |
| F‑score (contour accuracy) | 0.78 | 0.77 |
- Speed & memory gains scale roughly linearly with the keep‑ratio; even a modest 50 % keep‑rate yields ~25 % speed‑up.
- Segmentation quality drops less than 2 % across five video‑segmentation benchmarks (DAVIS‑2017, YouTube‑VOS, etc.).
- Ablation studies confirm that each scoring component (visual, textual, uncertainty) contributes uniquely; removing the text signal both reduces the achievable speed-up and harms accuracy on objects with similar appearance.
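As a rough, illustrative back-of-envelope check (not a calculation from the paper), the reported numbers are consistent with a simple cost model in which per-frame time splits into a fixed encoder/decoder part plus an attention part that scales linearly with the keep-ratio:

```python
# Back-of-envelope check of the keep-ratio scaling, using only the numbers
# reported in the table above. Assumption (not from the paper): per-frame time
# = fixed part + attention part that scales linearly with the keep-ratio.
base_ms = 1000 / 8.1     # ~123.5 ms/frame at 8.1 FPS (baseline)
pruned_ms = 1000 / 11.5  # ~87.0 ms/frame at 30% keep

# Solve fixed + 0.3 * attn = pruned_ms and fixed + attn = base_ms.
attn_ms = (base_ms - pruned_ms) / (1 - 0.3)  # ~52 ms attributable to attention
fixed_ms = base_ms - attn_ms                 # ~71 ms encoder/decoder/etc.

for keep in (0.3, 0.5, 0.7):
    t = fixed_ms + keep * attn_ms
    print(f"keep={keep:.1f}: ~{1000/t:.1f} FPS ({base_ms/t - 1:+.0%} vs baseline)")
# keep=0.5 comes out to roughly a 25-27% speed-up, matching the trend above.
```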
Practical Implications
- Real‑time video analytics: Developers can now run SAM2‑style segmentation on edge devices (e.g., Jetson, mobile GPUs) for applications like AR overlays, autonomous‑driving perception, or live video editing.
- Cost‑effective cloud inference: Lower GPU memory translates to smaller instance types or higher batch throughput, cutting operational expenses for SaaS video‑processing platforms.
- Prompt-aware pipelines: By leveraging natural-language prompts, the system automatically focuses compute on the object of interest, enabling "search-and-track" style interfaces without manual ROI selection (a prompt-embedding sketch follows this list).
- Plug‑and‑play upgrade: Existing SAM2 deployments can integrate the pruning module with a single API call, avoiding retraining or architectural rewrites.
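The prompt-to-embedding step referenced above might look like the following sketch. The paper only states that a text embedding is derived from the object description (user-provided or auto-generated); using a CLIP text encoder, and the specific model name, are assumptions made for illustration.

```python
# Hypothetical prompt-to-embedding step for a "search-and-track" interface.
# A CLIP text encoder is one common choice, assumed here; the paper does not
# name the text encoder it uses.
import torch
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")

def embed_prompt(prompt: str) -> torch.Tensor:
    """Encode a natural-language object description into a single vector."""
    inputs = tokenizer(prompt, return_tensors="pt", padding=True)
    with torch.no_grad():
        text_emb = text_encoder(**inputs).text_embeds[0]  # (512,)
    return text_emb

# Drive the pruner from a user prompt instead of a manual ROI selection.
text_emb = embed_prompt("red soccer ball")
# A learned projection to the visual-token dimension would be needed before
# passing this to a scoring routine like prune_tokens() sketched earlier.
```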
Limitations & Future Work
- Dependence on high-quality text prompts: poor or ambiguous descriptions can misguide token ranking, leading to occasional mask degradation.
- Fixed keep-ratio: the current implementation uses a static pruning ratio; adaptive strategies (e.g., a per-frame budget based on motion complexity) could yield better trade-offs (a speculative sketch follows this list).
- Evaluation limited to video segmentation: Extending the approach to other transformer‑heavy vision tasks (e.g., video captioning, multi‑object tracking) remains an open avenue.
- Hardware‑specific profiling: Gains were measured on high‑end GPUs; further study is needed for CPUs, NPUs, or low‑power ASICs.
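Purely as a speculative illustration of the adaptive-budget idea mentioned above (not something the paper implements), a per-frame keep-ratio could be derived from a cheap motion proxy:

```python
import torch

def adaptive_keep_ratio(prev_frame, curr_frame, lo=0.3, hi=0.7):
    """Speculative per-frame budget: more motion -> keep more tokens.

    prev_frame, curr_frame: (C, H, W) tensors in [0, 1]. The mean-absolute-
    difference motion proxy and the 0.1 normalization constant are
    illustrative choices only, not values from the paper.
    """
    motion = (curr_frame - prev_frame).abs().mean().item()  # cheap motion proxy
    motion = min(motion / 0.1, 1.0)                         # normalize to [0, 1]
    return lo + (hi - lo) * motion

# Frames that barely change get pruned aggressively (keep ~30%),
# while fast-moving or shaky frames keep up to ~70% of tokens.
```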
Overall, the work demonstrates that early, text‑guided token pruning is a practical lever for making large vision foundation models like SAM2 viable in production‑grade, latency‑sensitive environments.
Authors
- Avilasha Mandal
- Chaoning Zhang
- Fachrina Dewi Puspitasari
- Xudong Wang
- Jiaquan Zhang
- Caiyan Qin
- Guoqing Wang
- Yang Yang
- Heng Tao Shen
Paper Information
- arXiv ID: 2512.21333v1
- Categories: cs.CV
- Published: December 24, 2025