[Paper] Unified Spatio-Temporal Token Scoring for Efficient Video VLMs
Source: arXiv - 2603.18004v1
Overview
The paper proposes Spatio‑Temporal Token Scoring (STTS), a lightweight module that trims unnecessary visual tokens throughout an entire video‑language model, both inside the Vision Transformer (ViT) and before the Large Language Model (LLM). Doing so cuts FLOPs roughly in half and speeds up training and inference by about 1.6×, while keeping the average accuracy drop under 0.7 % across a suite of 13 video question‑answering benchmarks.
Key Contributions
- Unified token pruning that works across the whole architecture (ViT + LLM) rather than being confined to a single stage.
- No text conditioning or token merging required; the scoring mechanism is simple, fast, and fully differentiable.
- Auxiliary temporal scoring loss plus downstream gradient signals from the LLM to learn which tokens are redundant in space and time.
- Efficient packing algorithm that reorganizes the remaining tokens for minimal overhead.
- Empirical validation on 13 short‑ and long‑video QA datasets showing ~50 % token reduction, 62 % speed‑up, and <0.7 % average performance loss.
- Scalable to longer videos: the efficiency gains grow when more frames are sampled, and test‑time scaling even improves accuracy (0.5‑1 % over the baseline).
Methodology
- Token Scoring Layer – For each frame, STTS assigns a scalar score to every visual token output by the ViT.
- Temporal Learning – An auxiliary loss encourages the scores to be consistent across time, helping the model recognize frames that add little new information.
- Spatial Learning – During back‑propagation, gradients flowing from the LLM (the language side of the VLM) are used to adjust the scores, effectively teaching the system which visual patches matter for the downstream language task.
- Pruning & Packing – Tokens with the lowest scores are dropped (typically 50 % of them). The remaining tokens are packed into a compact tensor so that the downstream LLM sees a dense sequence without any special handling.
- End‑to‑End Training – The scoring module is trained jointly with the rest of the VLM; no separate fine‑tuning stage is needed.
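The temporal side of the training signal can be illustrated with a minimal sketch. The paper's exact loss is not reproduced here; the function below simply penalizes score changes at the same spatial position across consecutive frames, which captures the stated goal of temporally consistent scoring.

```python
# Hedged sketch of an auxiliary temporal-consistency loss on token scores.
# `temporal_score_loss` is an illustrative name, not the paper's API.

def temporal_score_loss(frame_scores):
    """frame_scores: list of per-frame score lists, all the same length.
    Penalizes large score differences at the same spatial position
    between consecutive frames, encouraging consistent scoring for
    regions that change little over time."""
    loss, count = 0.0, 0
    for prev, curr in zip(frame_scores, frame_scores[1:]):
        for a, b in zip(prev, curr):
            loss += (a - b) ** 2
            count += 1
    return loss / max(count, 1)

# identical scores across frames incur zero loss
print(temporal_score_loss([[0.5, 0.2], [0.5, 0.2]]))  # 0.0
print(temporal_score_loss([[1.0, 0.0], [0.0, 1.0]]))  # 1.0
```

Under such a penalty, tokens whose scores stay low across many frames are natural candidates for pruning, since they add little new information over time.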
The whole pipeline adds only a few extra matrix multiplications, so its overhead is negligible next to the cost of the ViT and LLM themselves.
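The score-prune-pack steps above can be sketched in a few lines. This is a simplified stand-in, not the paper's implementation: tokens are plain float lists, scoring is a single learned projection `w`, and `score_tokens` / `prune_and_pack` are illustrative names.

```python
# Hypothetical sketch of STTS-style scoring, pruning, and packing.

def score_tokens(tokens, w):
    """Assign one scalar score per token via a learned projection w."""
    return [sum(t_i * w_i for t_i, w_i in zip(t, w)) for t in tokens]

def prune_and_pack(tokens, scores, keep_ratio=0.5):
    """Drop the lowest-scoring tokens, then pack the survivors into a
    dense sequence, preserving their original spatio-temporal order."""
    k = max(1, int(len(tokens) * keep_ratio))
    # indices of the k highest-scoring tokens, restored to original order
    keep = sorted(sorted(range(len(tokens)),
                         key=lambda i: scores[i], reverse=True)[:k])
    return [tokens[i] for i in keep], keep

# toy example: 6 two-dimensional tokens and a scoring weight
tokens = [[1.0, 0.0], [0.2, 0.1], [0.9, 0.9],
          [0.0, 0.0], [0.5, 0.5], [0.1, 0.8]]
scores = score_tokens(tokens, w=[1.0, 1.0])
packed, kept = prune_and_pack(tokens, scores, keep_ratio=0.5)
print(kept)  # [0, 2, 4] -- surviving indices, still in original order
```

Because the packed output is just a shorter dense sequence, the downstream LLM needs no masking or other special handling, matching the paper's claim of minimal overhead.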
Results & Findings
| Metric | Baseline (no pruning) | STTS (50 % tokens) |
|---|---|---|
| Avg. QA accuracy (13 tasks) | 71.2 % | 70.5 % (‑0.7 %) |
| Training speedup | 1× | 1.62× |
| Inference speedup | 1× | 1.62× |
| FLOPs reduction | — | ~50 % |
- Efficiency scales with frame count: When sampling more frames per video, the relative speed‑up grows because temporal redundancy becomes larger.
- Test‑time scaling: By dynamically adjusting the pruning ratio for long videos, STTS actually improves accuracy by 0.5‑1 % compared to the unpruned baseline.
- Robustness across tasks: The modest accuracy loss holds for both short‑clip QA (e.g., TGIF‑QA) and long‑video QA (e.g., ActivityNet‑QA).
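One way the test-time scaling result can be read is as a fixed token budget: sample more frames for long videos, then prune more aggressively so the total token count stays roughly constant. The policy below is illustrative, not taken from the paper; the budget numbers are assumptions.

```python
# Illustrative test-time scaling policy (not the paper's actual rule):
# choose a keep ratio so the total number of surviving tokens stays
# within a fixed budget, regardless of how many frames are sampled.

def keep_ratio_for(num_frames, token_budget, tokens_per_frame):
    """Pick a keep ratio so num_frames * tokens_per_frame * ratio
    stays within token_budget, clamped to [0.1, 1.0]."""
    ratio = token_budget / (num_frames * tokens_per_frame)
    return max(0.1, min(1.0, ratio))

print(keep_ratio_for(8, 1024, 256))   # short clip:  0.5
print(keep_ratio_for(32, 1024, 256))  # long video:  0.125
```

This matches the reported trend: as more frames are sampled, temporal redundancy grows, so a lower keep ratio can be tolerated and the relative speed-up increases.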
Practical Implications
- Faster prototyping: Teams can train video‑language models on commodity GPUs in roughly half the time, enabling quicker iteration cycles.
- Lower cloud costs: Inference latency and compute bills drop dramatically, which is crucial for real‑time applications like video assistants or interactive video search.
- Edge deployment: The reduced token count makes it feasible to run video‑VLMs on resource‑constrained devices (e.g., AR glasses) where bandwidth and power are limited.
- Scalable pipelines: Video analytics platforms that ingest thousands of hours daily can integrate STTS to cut storage and compute overhead without sacrificing answer quality.
- Plug‑and‑play: Because STTS is a thin, differentiable module, it can be dropped into existing ViT‑LLM stacks (e.g., CLIP‑based video QA models) with minimal code changes.
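To make the plug-and-play claim concrete, here is a hedged integration sketch. `vit_encode`, `score_fn`, and `llm_answer` are placeholder callables standing in for an existing video VLM's components; none of these names come from the paper.

```python
# Hypothetical wrapper inserting STTS-style pruning between an existing
# ViT encoder and LLM decoder. All component names are illustrative.

def answer_with_stts(frames, vit_encode, score_fn, llm_answer, keep_ratio=0.5):
    """Encode frames, score and prune visual tokens, then pass the
    packed sequence to the language model unchanged."""
    tokens = [tok for frame in frames for tok in vit_encode(frame)]
    scores = score_fn(tokens)
    k = max(1, int(len(tokens) * keep_ratio))
    keep = sorted(sorted(range(len(tokens)),
                         key=lambda i: scores[i], reverse=True)[:k])
    return llm_answer([tokens[i] for i in keep])

# toy stand-ins for the real components
ans = answer_with_stts(
    frames=[["a", "b"], ["c", "d"]],
    vit_encode=lambda f: f,                    # identity "encoder"
    score_fn=lambda ts: [len(t) for t in ts],  # dummy uniform scores
    llm_answer=lambda ts: f"{len(ts)} tokens seen",
)
print(ans)  # 2 tokens seen
```

Since pruning happens purely on the token sequence between the two stages, neither the encoder nor the decoder needs modification, which is what makes the module easy to drop into existing stacks.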
Limitations & Future Work
- Dependency on auxiliary loss: The temporal scoring loss is handcrafted; alternative self‑supervised signals might yield better token selection.
- Fixed pruning ratio: The current implementation uses a static 50 % cut; adaptive ratios per video or per task could further improve the trade‑off.
- Evaluation limited to QA: While QA is a common benchmark, other video‑language tasks (e.g., captioning, retrieval) remain to be tested.
- Potential bias: Pruning may disproportionately discard tokens from less salient but semantically important regions, a risk that needs systematic analysis.
Future research could explore dynamic, context‑aware pruning policies, extend STTS to multimodal inputs beyond vision (e.g., audio), and integrate it with emerging efficient transformer architectures.
Authors
- Jianrui Zhang
- Yue Yang
- Rohun Tripathi
- Winson Han
- Ranjay Krishna
- Christopher Clark
- Yong Jae Lee
- Sangho Lee
Paper Information
- arXiv ID: 2603.18004v1
- Categories: cs.CV, cs.AI, cs.LG
- Published: March 18, 2026