[Paper] Fast SAM2 with Text-Driven Token Pruning

Published: December 24, 2025 at 01:59 PM EST
3 min read
Source: arXiv - 2512.21333v1

Overview

The paper presents a text‑driven token‑pruning technique that speeds up the Segment Anything Model 2 (SAM2) for video object segmentation. By discarding irrelevant visual tokens before the heavy temporal‑attention stage—using cues from object‑related text descriptions—the authors achieve up to 42 % faster inference and 37 % lower GPU memory usage while keeping segmentation quality on par with the original model.

Key Contributions

  • Early token‑selection layer placed between the image encoder and SAM2’s memory‑propagation module.
  • Lightweight routing mechanism that scores tokens using:
    1. Local visual context,
    2. Semantic relevance from object‑centric text (user‑provided or auto‑generated), and
    3. Uncertainty signals to protect ambiguous or boundary regions.
  • No changes to SAM2’s core architecture – the pruning is a plug‑in that can be dropped into existing pipelines.
  • Comprehensive benchmarks showing up to 42.5 % speed‑up and 37.4 % memory reduction with negligible loss in J&F scores.
  • Demonstration that early token pruning is a viable path to real‑time, resource‑constrained video segmentation.

Methodology

  1. Visual Encoding – Each video frame is processed by SAM2’s image encoder, producing a dense set of visual tokens (patch embeddings).
  2. Token Scoring – A small routing network evaluates every token on three axes:
    • Local visual cues: neighboring token similarity and edge information.
    • Textual relevance: a cosine similarity between token features and a text embedding derived from the object description (e.g., “red soccer ball”).
    • Uncertainty: high‑entropy predictions from a lightweight classifier that flags regions likely to be ambiguous (object borders, motion blur).
  3. Pruning Decision – Tokens are ranked by the combined score; a configurable keep‑ratio (e.g., 30 %–70 %) determines which tokens survive.
  4. Temporal Propagation – Only the retained tokens are fed into SAM2’s memory‑attention module, drastically cutting the quadratic attention cost.
  5. Segmentation Head – The downstream decoder operates unchanged, producing the final mask for the prompted object.

The entire pruning step adds ≈ 2 ms per frame on a V100, which is far outweighed by the savings in the subsequent attention layers: since attention cost grows roughly quadratically with token count, keeping only 30 % of the tokens cuts that cost to on the order of 10 % of the original.
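The paper does not ship code, but the scoring‑and‑pruning step above can be sketched in a few lines of PyTorch. Everything in this snippet is an illustrative assumption rather than the authors' implementation: the equal score weights, the mean‑token stand‑in for the local‑context cue, the entropy‑based uncertainty term, and the function name `score_and_prune`.

```python
import torch
import torch.nn.functional as F

def score_and_prune(tokens, text_emb, uncertainty_logits,
                    weights=(1.0, 1.0, 1.0), keep_ratio=0.3):
    """Illustrative sketch of text-driven token pruning (not the authors' code).

    tokens:             (N, D) patch embeddings from the image encoder
    text_emb:           (D,)   embedding of the object description (e.g. from a text encoder)
    uncertainty_logits: (N, C) logits from a lightweight per-token classifier
    """
    # 1) Local visual cue: similarity to the mean token, a crude stand-in for the
    #    paper's neighborhood-similarity / edge signals.
    context = F.cosine_similarity(tokens, tokens.mean(dim=0, keepdim=True), dim=-1)

    # 2) Textual relevance: cosine similarity between each token and the text embedding.
    relevance = F.cosine_similarity(tokens, text_emb.unsqueeze(0), dim=-1)

    # 3) Uncertainty: prediction entropy, so ambiguous regions (object borders,
    #    motion blur) score high and are protected from pruning.
    probs = uncertainty_logits.softmax(dim=-1)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=-1)

    score = weights[0] * context + weights[1] * relevance + weights[2] * entropy

    # Keep the top keep_ratio fraction of tokens; only these reach memory attention.
    k = max(1, int(keep_ratio * tokens.shape[0]))
    keep_idx = score.topk(k).indices
    return tokens[keep_idx], keep_idx
```

With `keep_ratio=0.3`, the tensor handed to memory attention is roughly a third of its original length, which is where the reported speed and memory gains would come from.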

Results & Findings

| Metric | Baseline SAM2 | + Text‑Driven Pruning (30 % keep) |
| --- | --- | --- |
| Inference speed (FPS) | 8.1 | 11.5 (+42 %) |
| GPU memory (GB) | 10.2 | 6.4 (−37 %) |
| J‑score (region similarity) | 0.84 | 0.82 |
| F‑score (contour accuracy) | 0.78 | 0.77 |
  • Speed & memory gains scale roughly linearly with the keep‑ratio; even a modest 50 % keep‑ratio yields a ~25 % speed‑up.
  • Segmentation quality drops by less than 2 % across five video‑segmentation benchmarks (DAVIS‑2017, YouTube‑VOS, etc.).
  • Ablation studies confirm that each scoring component (visual, textual, uncertainty) contributes uniquely; removing the text signal reduces the achievable speed‑up and harms accuracy on objects with similar appearance.

Practical Implications

  • Real‑time video analytics: The reduced compute and memory footprint brings SAM2‑style segmentation closer to edge devices (e.g., Jetson, mobile GPUs) for applications like AR overlays, autonomous‑driving perception, or live video editing.
  • Cost‑effective cloud inference: Lower GPU memory translates to smaller instance types or higher batch throughput, cutting operational expenses for SaaS video‑processing platforms.
  • Prompt‑aware pipelines: By leveraging natural‑language prompts, the system automatically focuses compute on the object of interest, enabling “search‑and‑track” style interfaces without manual ROI selection.
  • Plug‑and‑play upgrade: Existing SAM2 deployments can integrate the pruning module with a single API call, avoiding retraining or architectural rewrites.
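As a rough picture of what this plug‑and‑play claim could look like in practice, the wrapper below intercepts encoder tokens before memory attention. The class and method names (`PrunedPredictor`, `encode_frame`, `uncertainty_logits`, `memory_attention`, `decode_mask`) are placeholders invented for this sketch; they are not the real SAM2 API, and the actual integration point may differ.

```python
class PrunedPredictor:
    """Hypothetical wrapper that adds token pruning to a SAM2-style video predictor.

    All predictor methods referenced here are placeholders, not the real SAM2 API.
    """

    def __init__(self, predictor, pruner, text_emb, keep_ratio=0.3):
        self.predictor = predictor      # existing predictor, left unchanged
        self.pruner = pruner            # e.g. score_and_prune from the earlier sketch
        self.text_emb = text_emb        # embedding of the object description
        self.keep_ratio = keep_ratio

    def segment_frame(self, frame, state):
        tokens = self.predictor.encode_frame(frame)                  # dense patch tokens
        kept, idx = self.pruner(tokens, self.text_emb,
                                self.predictor.uncertainty_logits(tokens),
                                keep_ratio=self.keep_ratio)
        fused, state = self.predictor.memory_attention(kept, state)  # cheaper: fewer tokens
        mask = self.predictor.decode_mask(fused, idx, frame.shape)   # unchanged decoder
        return mask, state
```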

Limitations & Future Work

  • Dependency on high‑quality text prompts: Poor or ambiguous descriptions can misguide token ranking, leading to occasional mask degradation.
  • Fixed keep‑ratio: The current implementation uses a static pruning ratio; adaptive strategies, such as a per‑frame budget based on motion complexity, could yield better trade‑offs (see the sketch after this list).
  • Evaluation limited to video segmentation: Extending the approach to other transformer‑heavy vision tasks (e.g., video captioning, multi‑object tracking) remains an open avenue.
  • Hardware‑specific profiling: Gains were measured on high‑end GPUs; further study is needed for CPUs, NPUs, or low‑power ASICs.
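One simple way the adaptive keep‑ratio mentioned above could be prototyped is to tie the per‑frame budget to a frame‑differencing motion proxy. This is speculative and not part of the paper; the bounds and the gain factor below are arbitrary illustrative choices.

```python
import torch

def adaptive_keep_ratio(prev_frame, frame, lo=0.3, hi=0.7, gain=5.0):
    """Speculative per-frame keep-ratio from inter-frame motion (not from the paper).

    prev_frame, frame: float tensors in [0, 1] with shape (C, H, W).
    Low motion -> prune aggressively (toward lo); high motion -> keep more (toward hi).
    """
    motion = (frame - prev_frame).abs().mean()        # crude motion proxy
    ratio = lo + gain * float(motion) * (hi - lo)     # gain is an arbitrary choice
    return min(max(ratio, lo), hi)
```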

Overall, the work demonstrates that early, text‑guided token pruning is a practical lever for making large vision foundation models like SAM2 viable in production‑grade, latency‑sensitive environments.

Authors

  • Avilasha Mandal
  • Chaoning Zhang
  • Fachrina Dewi Puspitasari
  • Xudong Wang
  • Jiaquan Zhang
  • Caiyan Qin
  • Guoqing Wang
  • Yang Yang
  • Heng Tao Shen

Paper Information

  • arXiv ID: 2512.21333v1
  • Categories: cs.CV
  • Published: December 24, 2025
  • PDF: Download PDF