[Paper] PSA: Pyramid Sparse Attention for Efficient Video Understanding and Generation

Published: December 3, 2025, 01:02 PM EST
4 min read
Source: arXiv - 2512.04025v1

Overview

The paper introduces Pyramid Sparse Attention (PSA), a new attention module that dramatically cuts the quadratic cost of self‑attention in video models while keeping most of the useful information. By replacing hard binary masks with multi‑level pooled key‑value (KV) representations, PSA delivers a finer‑grained trade‑off between speed and accuracy, making it practical for both video understanding (e.g., action recognition) and video generation (e.g., text‑to‑video synthesis).
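For orientation, using standard notation rather than the paper's: dense self-attention over $N$ video tokens computes

$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V,$$

which scales as $O(N^2 d)$ in time and memory. PSA keeps this formula but evaluates it against pooled, lower-resolution keys and values for most blocks, so only a small fraction of the key/value tokens is processed at full resolution.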

Key Contributions

  • Pyramid‑style KV pooling: Instead of discarding whole KV blocks, PSA creates several pooled versions of each block (low‑resolution to high‑resolution) and lets each query decide which level to use.
  • Dynamic allocation per query: Queries automatically attend to high‑resolution KV for important regions and low‑resolution KV for less critical ones, achieving an “interpolation” between full attention and aggressive pruning.
  • Hardware‑friendly kernel: The authors design a decoupled block‑tile implementation that maps cleanly onto GPUs/TPUs, avoiding the irregular memory accesses that plague many sparse‑attention tricks.
  • Unified for understanding & generation: PSA is demonstrated on both discriminative video tasks (e.g., Kinetics, Something‑Something) and generative tasks (e.g., text‑to‑video diffusion), showing its versatility.
  • Open‑source release: Code, pretrained weights, and a ready‑to‑run kernel are released, lowering the barrier for adoption.

Methodology

  1. Block‑wise attention foundation – The input video is split into fixed‑size query, key, and value blocks (the usual “block‑sparse” setup).
  2. Multi‑level pooling – For each key/value block, PSA builds a small pyramid:
    • Level 0: original (full‑resolution) KV.
    • Level 1, 2, …: progressively pooled (e.g., average‑pooled) versions that shrink spatial/temporal resolution.
  3. Query‑driven selection – A lightweight scoring network evaluates the relevance of each KV block to a given query block. Based on the score, the query picks the appropriate pyramid level: high‑resolution for “important” blocks, low‑resolution for “unimportant” ones.
  4. Interpolation & aggregation – The selected pooled KV is upsampled (if needed) and combined with the query via the standard scaled‑dot‑product attention formula. Because the pooling is deterministic, gradients flow through all levels, allowing end‑to‑end training.
  5. Efficient kernel – The implementation groups blocks into tiles, processes each tile with a fixed compute budget, and leverages CUDA kernels that avoid dynamic memory allocation, making PSA fast on commodity hardware.
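The snippet below is a minimal, readability-oriented PyTorch sketch of steps 1–4: block splitting, multi-level average pooling, a per-query-block relevance score, level assignment by rank, and standard attention over the compressed KV set. The block size, pooling factors, tier fractions, and the mean-dot-product scoring rule are illustrative assumptions, not the paper's exact design, and the Python loop does not reproduce the fused block-tile kernel of step 5.

```python
import torch
import torch.nn.functional as F


def pyramid_sparse_attention(q, k, v, block=64, mid_pool=2, coarse_pool=4,
                             frac_full=0.25, frac_mid=0.25):
    """Readability-oriented sketch of pyramid sparse attention.

    q, k, v: (B, H, N, D) tensors with N divisible by `block`.
    Each query block attends to its top-scoring KV blocks at full resolution,
    the next tier pooled by `mid_pool`, and the rest pooled by `coarse_pool`.
    """
    B, H, N, D = q.shape
    nb = N // block
    scale = D ** -0.5

    # Step 1-2: split keys/values into blocks and build the pooled pyramid.
    kb = k.reshape(B, H, nb, block, D)
    vb = v.reshape(B, H, nb, block, D)

    def pool(x, r):  # average-pool tokens inside each block by factor r
        return x.reshape(B, H, nb, block // r, r, D).mean(dim=4)

    k_mid, v_mid = pool(kb, mid_pool), pool(vb, mid_pool)
    k_lo, v_lo = pool(kb, coarse_pool), pool(vb, coarse_pool)

    def gather(x, idx):  # pick the KV blocks listed in idx along the block axis
        idx = idx[..., None, None].expand(-1, -1, -1, x.shape[3], D)
        return torch.gather(x, 2, idx)

    n_full = max(1, int(nb * frac_full))
    n_mid = max(1, int(nb * frac_mid))

    outs = []
    for qi in range(nb):
        qblk = q[:, :, qi * block:(qi + 1) * block]              # (B, H, block, D)

        # Step 3: score every KV block against this query block (cheap proxy:
        # dot product of block means; the paper's criterion may differ).
        scores = torch.einsum('bhd,bhnd->bhn', qblk.mean(2), kb.mean(3))
        order = scores.argsort(dim=-1, descending=True)          # (B, H, nb)

        # Assign pyramid levels by rank: full-res, mid-pooled, coarse-pooled.
        sel = [(kb, vb, order[..., :n_full]),
               (k_mid, v_mid, order[..., n_full:n_full + n_mid]),
               (k_lo, v_lo, order[..., n_full + n_mid:])]
        k_sel = torch.cat([gather(kk, i).flatten(2, 3) for kk, _, i in sel], dim=2)
        v_sel = torch.cat([gather(vv, i).flatten(2, 3) for _, vv, i in sel], dim=2)

        # Step 4: standard scaled-dot-product attention over the compressed KV set.
        attn = F.softmax(qblk @ k_sel.transpose(-1, -2) * scale, dim=-1)
        outs.append(attn @ v_sel)

    return torch.cat(outs, dim=2)                                # (B, H, N, D)
```

Calling this on `(B, H, N, D)` tensors, e.g. three `torch.randn(2, 8, 1024, 64)` inputs, returns a tensor of the same shape; the reported speed-ups come from the dedicated kernel, not from a Python loop like this.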

Results & Findings

| Task | Baseline (dense) | Sparse-Attention Baseline | PSA (low compute) | Speed-up vs. dense |
|---|---|---|---|---|
| Action recognition (Kinetics-400) | 78.3 % top-1 | 75.1 % (binary mask) | 77.0 % | ~2.3× |
| Video classification (Something-Something V2) | 48.5 % | 44.2 % | 47.1 % | ~2.1× |
| Text-to-video diffusion (UCF-101) | FVD = 210 | FVD = 260 | FVD = 215 | ~2.5× |
| Memory footprint (per frame) | 12 GB | 7 GB | 5 GB | |
  • PSA consistently narrows the gap to dense attention (within roughly 1–1.5 points absolute in the table above) while delivering 2–2.5× speed-ups and substantial memory savings (5 GB vs. 12 GB per frame).
  • Qualitatively, generated videos retain sharper motion boundaries and fewer artifacts compared with other sparse methods.
  • Ablation studies confirm that the dynamic level selection is the primary driver of performance; a static single‑level pool degrades to the binary‑mask baseline.

Practical Implications

  • Faster video pipelines: Developers can plug PSA into existing transformer‑based video models (e.g., ViViT, TimeSformer) and cut inference latency without re‑architecting the whole network; a minimal sketch of such a swap follows this list.
  • Edge & mobile deployment: The reduced memory footprint makes it feasible to run video transformers on devices with limited VRAM, opening doors for on‑device video analytics or AR/VR experiences.
  • Cost‑effective training: Training large video diffusion models becomes cheaper because each forward/backward pass consumes fewer FLOPs, enabling larger batch sizes or longer sequences.
  • Hybrid systems: PSA’s block‑tile design works well with mixed‑precision (FP16/FP8) training, aligning with modern GPU pipelines and allowing seamless integration into libraries like PyTorch and TensorFlow.
  • Research acceleration: The open‑source kernel provides a baseline for further sparsity research (e.g., combining with low‑rank factorization or learned token pruning).
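As a rough illustration of the plug-in idea from the first bullet above, one might wrap the earlier `pyramid_sparse_attention` sketch as a drop-in attention module and swap it into a pretrained video transformer. The class name, the `blocks`/`attn` attribute names, and the reuse of the original projection layout are all hypothetical assumptions, not an API from the paper's release.

```python
import torch.nn as nn

# Assumes the pyramid_sparse_attention sketch from the Methodology section is in scope.

class PSASelfAttention(nn.Module):
    """Hypothetical drop-in self-attention module built on the earlier sketch."""

    def __init__(self, dim, num_heads, block=64):
        super().__init__()
        self.num_heads, self.block = num_heads, block
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                       # x: (B, N, dim), N divisible by block
        B, N, C = x.shape
        qkv = self.qkv(x).view(B, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)    # each: (B, H, N, head_dim)
        out = pyramid_sparse_attention(q, k, v, block=self.block)
        return self.proj(out.transpose(1, 2).reshape(B, N, C))


# Hypothetical swap into a ViT-style video model whose blocks expose an `attn` attribute:
# for blk in video_model.blocks:
#     blk.attn = PSASelfAttention(dim=768, num_heads=12)
```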

Limitations & Future Work

  • Granularity bound to block size: PSA’s effectiveness depends on the chosen block dimensions; very fine‑grained temporal details may still be lost if blocks are too large.
  • Static pooling levels: The pyramid levels are pre‑defined (e.g., 2×, 4× pooling). Adaptive pooling ratios could further improve the trade‑off.
  • Benchmarks limited to short clips: Experiments focus on clips ≤2 seconds; scaling to hour‑long videos or streaming scenarios remains an open question.
  • Hardware dependence: While the kernel is GPU‑friendly, performance on CPUs or specialized accelerators (e.g., TPUs) may vary and warrants dedicated optimization.

Future work could explore learnable pooling operators, hierarchical query routing, and integration with token‑level pruning to push efficiency even further while preserving the rich spatio‑temporal cues essential for high‑fidelity video tasks.

Authors

  • Xiaolong Li
  • Youping Gu
  • Xi Lin
  • Weijie Wang
  • Bohan Zhuang

Paper Information

  • arXiv ID: 2512.04025v1
  • Categories: cs.CV, cs.AI, cs.LG
  • Published: December 3, 2025