[Paper] Fast SAM2 with Text-Driven Token Pruning
Source: arXiv - 2512.21333v1
Overview
The paper presents a text‑driven token‑pruning technique that speeds up the Segment Anything Model 2 (SAM2) for video object segmentation. By discarding irrelevant visual tokens before the heavy temporal‑attention stage—using cues from object‑related text descriptions—the authors achieve up to 42 % faster inference and 37 % lower GPU memory usage while keeping segmentation quality on par with the original model.
Key Contributions
- Early token‑selection layer placed between the image encoder and SAM2’s memory‑propagation module.
- Lightweight routing mechanism that scores tokens using:
  - Local visual context,
  - Semantic relevance from object-centric text (user-provided or auto-generated), and
  - Uncertainty signals to protect ambiguous or boundary regions.
- No changes to SAM2’s core architecture – the pruning is a plug‑in that can be dropped into existing pipelines.
- Comprehensive benchmarks showing up to 42.5 % speed‑up and 37.4 % memory reduction with negligible loss in J‑&‑F scores.
- Demonstration that early token pruning is a viable path to real‑time, resource‑constrained video segmentation.
Methodology
- Visual Encoding – Each video frame is processed by SAM2's image encoder, producing a dense set of visual tokens (patch embeddings).
- Token Scoring – A small routing network evaluates every token on three axes (a sketch of this step appears below):
  - Local visual cues: similarity to neighboring tokens and edge information.
  - Textual relevance: cosine similarity between token features and a text embedding derived from the object description (e.g., "red soccer ball").
  - Uncertainty: high-entropy predictions from a lightweight classifier that flags regions likely to be ambiguous (object borders, motion blur).
- Pruning Decision – Tokens are ranked by the combined score; a configurable keep-ratio (e.g., 30 %–70 %) determines which tokens survive.
- Temporal Propagation – Only the retained tokens are fed into SAM2's memory-attention module, drastically cutting the quadratic attention cost.
- Segmentation Head – The downstream decoder operates unchanged, producing the final mask for the prompted object.
The pruning step itself adds ≈ 2 ms per frame on a V100, an overhead far outweighed by the savings in the subsequent attention layers.
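To make the scoring-and-pruning step concrete, the minimal sketch below fuses the three signals and keeps the top-scoring fraction of tokens. It is an illustration, not the authors' code: the function name `prune_tokens`, the equal score weights, the mean-similarity stand-in for local visual cues, and the two-class uncertainty head are all assumptions; the paper only specifies that visual, textual, and uncertainty scores are combined and a configurable keep-ratio selects the survivors.

```python
import torch
import torch.nn.functional as F

def prune_tokens(tokens, text_emb, uncertainty_logits, keep_ratio=0.3,
                 w_vis=1.0, w_txt=1.0, w_unc=1.0):
    """Rank visual tokens and keep the top fraction before memory attention.

    tokens:             (N, D) patch embeddings from the image encoder
    text_emb:           (D,)   embedding of the object description
    uncertainty_logits: (N, C) logits from a lightweight per-token classifier
    keep_ratio:         fraction of tokens to retain (e.g. 0.3-0.7)
    The weights w_* are illustrative; the paper does not publish them.
    """
    # Local visual cue: mean cosine similarity to the other tokens
    # (a simple stand-in for the neighborhood/edge cues used in the paper).
    tok_n = F.normalize(tokens, dim=-1)
    visual_score = (tok_n @ tok_n.T).mean(dim=-1)

    # Textual relevance: cosine similarity to the text embedding.
    text_score = tok_n @ F.normalize(text_emb, dim=-1)

    # Uncertainty: entropy of the per-token classifier; high-entropy tokens
    # (object borders, motion blur) get a higher score and are protected.
    probs = uncertainty_logits.softmax(dim=-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1)

    score = w_vis * visual_score + w_txt * text_score + w_unc * entropy
    num_keep = max(1, int(keep_ratio * tokens.shape[0]))
    keep_idx = score.topk(num_keep).indices
    return tokens[keep_idx], keep_idx

# Example: 4096 tokens of dim 256, keep 30% before temporal attention.
tokens = torch.randn(4096, 256)
text_emb = torch.randn(256)        # e.g. from a text encoder (see below)
unc_logits = torch.randn(4096, 2)
kept, idx = prune_tokens(tokens, text_emb, unc_logits, keep_ratio=0.3)
print(kept.shape)  # torch.Size([1228, 256])
```

The returned indices would presumably let downstream layers map the pruned tokens back onto the full spatial grid before decoding, though the paper's summary does not spell out that bookkeeping.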
Results & Findings
| Metric | Baseline SAM2 | + Text‑Driven Pruning (30 % keep) |
|---|---|---|
| Inference speed (FPS) | 8.1 | 11.5 (+42 %) |
| GPU memory (GB) | 10.2 | 6.4 (‑37 %) |
| J‑score (region similarity) | 0.84 | 0.82 |
| F‑score (contour accuracy) | 0.78 | 0.77 |
- Speed & memory gains scale roughly linearly with the keep‑ratio; even a modest 50 % keep‑rate yields ~25 % speed‑up.
- Segmentation quality drops less than 2 % across five video‑segmentation benchmarks (DAVIS‑2017, YouTube‑VOS, etc.).
- Ablation studies confirm that each scoring component (visual, textual, uncertainty) contributes uniquely; removing the text signal both reduces the achievable speed-up and harms accuracy on objects with similar appearance.
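As a rough, illustrative back-of-envelope check (not a calculation from the paper), the reported numbers are consistent with a simple cost model in which per-frame time splits into a fixed encoder/decoder part plus an attention part that scales linearly with the keep-ratio:

```python
# Back-of-envelope check of the keep-ratio scaling, using only the numbers
# reported in the table above. Assumption (not from the paper): per-frame time
# = fixed part + attention part that scales linearly with the keep-ratio.
base_ms = 1000 / 8.1     # ~123.5 ms/frame at 8.1 FPS (baseline)
pruned_ms = 1000 / 11.5  # ~87.0 ms/frame at 30% keep

# Solve fixed + 0.3 * attn = pruned_ms and fixed + attn = base_ms.
attn_ms = (base_ms - pruned_ms) / (1 - 0.3)  # ~52 ms attributable to attention
fixed_ms = base_ms - attn_ms                 # ~71 ms encoder/decoder/etc.

for keep in (0.3, 0.5, 0.7):
    t = fixed_ms + keep * attn_ms
    print(f"keep={keep:.1f}: ~{1000/t:.1f} FPS ({base_ms/t - 1:+.0%} vs baseline)")
# keep=0.5 comes out to roughly a 25-27% speed-up, matching the trend above.
```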
Practical Implications
- Real‑time video analytics: Developers can now run SAM2‑style segmentation on edge devices (e.g., Jetson, mobile GPUs) for applications like AR overlays, autonomous‑driving perception, or live video editing.
- Cost‑effective cloud inference: Lower GPU memory translates to smaller instance types or higher batch throughput, cutting operational expenses for SaaS video‑processing platforms.
- Prompt-aware pipelines: By leveraging natural-language prompts, the system automatically focuses compute on the object of interest, enabling "search-and-track" style interfaces without manual ROI selection (a prompt-embedding sketch follows this list).
- Plug‑and‑play upgrade: Existing SAM2 deployments can integrate the pruning module with a single API call, avoiding retraining or architectural rewrites.
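The prompt-to-embedding step referenced above might look like the following sketch. The paper only states that a text embedding is derived from the object description (user-provided or auto-generated); using a CLIP text encoder, and the specific model name, are assumptions made for illustration.

```python
# Hypothetical prompt-to-embedding step for a "search-and-track" interface.
# A CLIP text encoder is one common choice, assumed here; the paper does not
# name the text encoder it uses.
import torch
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")

def embed_prompt(prompt: str) -> torch.Tensor:
    """Encode a natural-language object description into a single vector."""
    inputs = tokenizer(prompt, return_tensors="pt", padding=True)
    with torch.no_grad():
        text_emb = text_encoder(**inputs).text_embeds[0]  # (512,)
    return text_emb

# Drive the pruner from a user prompt instead of a manual ROI selection.
text_emb = embed_prompt("red soccer ball")
# A learned projection to the visual-token dimension would be needed before
# passing this to a scoring routine like prune_tokens() sketched earlier.
```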
Limitations & Future Work
- Dependence on high-quality text prompts: poor or ambiguous descriptions can misguide token ranking, leading to occasional mask degradation.
- Fixed keep-ratio: the current implementation uses a static pruning ratio; adaptive strategies (e.g., a per-frame budget based on motion complexity) could yield better trade-offs (a speculative sketch follows this list).
- Evaluation limited to video segmentation: Extending the approach to other transformer‑heavy vision tasks (e.g., video captioning, multi‑object tracking) remains an open avenue.
- Hardware‑specific profiling: Gains were measured on high‑end GPUs; further study is needed for CPUs, NPUs, or low‑power ASICs.
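Purely as a speculative illustration of the adaptive-budget idea mentioned above (not something the paper implements), a per-frame keep-ratio could be derived from a cheap motion proxy:

```python
import torch

def adaptive_keep_ratio(prev_frame, curr_frame, lo=0.3, hi=0.7):
    """Speculative per-frame budget: more motion -> keep more tokens.

    prev_frame, curr_frame: (C, H, W) tensors in [0, 1]. The mean-absolute-
    difference motion proxy and the 0.1 normalization constant are
    illustrative choices only, not values from the paper.
    """
    motion = (curr_frame - prev_frame).abs().mean().item()  # cheap motion proxy
    motion = min(motion / 0.1, 1.0)                         # normalize to [0, 1]
    return lo + (hi - lo) * motion

# Frames that barely change get pruned aggressively (keep ~30%),
# while fast-moving or shaky frames keep up to ~70% of tokens.
```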
Overall, the work demonstrates that early, text‑guided token pruning is a practical lever for making large vision foundation models like SAM2 viable in production‑grade, latency‑sensitive environments.
Authors
- Avilasha Mandal
- Chaoning Zhang
- Fachrina Dewi Puspitasari
- Xudong Wang
- Jiaquan Zhang
- Caiyan Qin
- Guoqing Wang
- Yang Yang
- Heng Tao Shen
Paper Information
- arXiv ID: 2512.21333v1
- Categories: cs.CV
- Published: December 24, 2025