[Paper] Frequency-Aware Token Reduction for Efficient Vision Transformer

Published: November 26, 2025 at 10:10 AM EST
4 min read
Source: arXiv - 2511.21477v1

Overview

Vision Transformers (ViTs) have become the go‑to architecture for many vision tasks, but their self‑attention layers scale quadratically with the number of image patches (tokens), making them expensive for high‑resolution inputs. This paper introduces a frequency‑aware token reduction technique that trims the token set intelligently—preserving high‑frequency details while compactly summarizing low‑frequency information—thereby slashing compute without sacrificing accuracy.
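To make the quadratic scaling concrete, here is a back-of-the-envelope sketch (not from the paper) of how per-layer attention FLOPs change when tokens are trimmed; the formula omits constant factors and the MLP blocks, and the token counts are illustrative.

```python
# Back-of-the-envelope cost model (illustrative, not from the paper).
def attention_flops(num_tokens: int, dim: int) -> int:
    """Rough FLOPs for one self-attention layer: Q/K/V/output projections
    plus the two N x N matrix products; constant factors are omitted."""
    projections = 4 * num_tokens * dim * dim            # Q, K, V, output
    attn_matmuls = 2 * num_tokens * num_tokens * dim    # QK^T and softmax(.)V
    return projections + attn_matmuls

full = attention_flops(num_tokens=196, dim=768)         # ViT-B/16 at 224x224
reduced = attention_flops(num_tokens=137, dim=768)      # hypothetical ~30% fewer tokens
print(f"relative attention cost after reduction: {reduced / full:.2f}")  # ~0.68
```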

Key Contributions

  • Frequency‑based token partitioning: Separate tokens into high‑frequency (detail‑rich) and low‑frequency (smooth) groups using a simple spectral analysis of the attention map.
  • Direct‑Current (DC) token aggregation: Collapse all low‑frequency tokens into a single “DC token” that retains the essential low‑frequency content.
  • Mitigation of rank collapse & over‑smoothing: By keeping high‑frequency tokens, the method prevents the attention matrix from degenerating to low rank, a common failure mode of aggressive token pruning.
  • Comprehensive empirical validation: Experiments on ImageNet‑1K, COCO detection, and ADE20K segmentation show up to 30 % FLOPs reduction with ≤0.5 % top‑1 accuracy loss (often even a small gain).
  • Analytical insight into prior work: The authors dissect existing token‑reduction schemes (e.g., pooling, clustering) and reveal their implicit frequency biases, explaining why some methods degrade performance on fine‑grained tasks.

Methodology

  1. Spectral cue extraction: For each attention layer, compute the singular values of the attention matrix. Large singular values correspond to high‑frequency components (sharp edges, textures), while the smallest singular value captures the DC (average) component. (A minimal end‑to‑end sketch follows this list.)
  2. Token classification:
    • High‑frequency tokens are those whose attention contribution aligns with the top‑k singular vectors.
    • Low‑frequency tokens are the remainder.
  3. Selective preservation: Keep the high‑frequency tokens unchanged; they continue to flow through the transformer stack.
  4. DC token creation: Aggregate low‑frequency tokens via a weighted sum (weights derived from attention scores) to form a single DC token. This token is injected back into the sequence, ensuring the model still sees the global context.
  5. Dynamic schedule: The ratio of high‑ to low‑frequency tokens can be tuned per stage (earlier layers keep more tokens, later layers prune more aggressively), matching the intuition that early processing needs finer detail.
  6. Training pipeline: The authors fine‑tune a pretrained ViT with the new token‑reduction module, using the same loss functions as the baseline, so no extra supervision is required.
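To tie the steps together, below is a minimal PyTorch sketch of a reduction module following the description above. The scoring rule (projection onto the top-k left singular vectors), the keep ratio, and the attention-weighted aggregation of the DC token are assumptions made for illustration; the paper's actual implementation may differ.

```python
# Minimal sketch of a frequency-aware token-reduction step (illustrative assumptions).
import torch

def frequency_aware_reduce(x: torch.Tensor, attn: torch.Tensor,
                           keep_ratio: float = 0.7, k: int = 8) -> torch.Tensor:
    """Split tokens into high- and low-frequency groups and merge the latter.

    x    : (B, N, D) token embeddings after an attention block
    attn : (B, N, N) attention matrix averaged over heads
    Returns a shorter sequence: kept tokens plus one aggregated "DC" token.
    """
    B, N, D = x.shape
    n_keep = max(1, int(N * keep_ratio))

    # 1) Spectral cue: singular vectors of the attention matrix.
    U, S, Vh = torch.linalg.svd(attn, full_matrices=False)          # U: (B, N, N)

    # 2) Score each token by how strongly it projects onto the top-k
    #    left singular vectors (treated here as the "high-frequency" subspace).
    scores = (U[:, :, :k] ** 2).sum(dim=-1)                          # (B, N)

    # 3) Keep the highest-scoring tokens unchanged.
    keep_idx = scores.topk(n_keep, dim=1).indices                    # (B, n_keep)
    drop_mask = torch.ones(B, N, dtype=torch.bool, device=x.device)
    drop_mask.scatter_(1, keep_idx, False)
    kept = torch.gather(x, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))

    # 4) Aggregate the remaining (low-frequency) tokens into a single DC token,
    #    weighted by the attention each of them receives on average.
    recv = attn.mean(dim=1)                                          # (B, N)
    w = torch.where(drop_mask, recv, torch.zeros_like(recv))
    w = w / w.sum(dim=1, keepdim=True).clamp_min(1e-6)
    dc_token = torch.einsum("bn,bnd->bd", w, x).unsqueeze(1)         # (B, 1, D)

    return torch.cat([kept, dc_token], dim=1)                        # (B, n_keep + 1, D)
```

In a full model, keep_ratio would follow the per-stage schedule from step 5 (larger in early blocks, smaller in later ones), and the module would slot in between the transformer blocks of a pretrained ViT before fine-tuning.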

Results & Findings

| Dataset | Baseline ViT‑B/16 | Frequency‑Aware Reduction | FLOPs ↓ | Metric Δ |
|---|---|---|---|---|
| ImageNet‑1K (top‑1) | 81.3 % | 81.5 % | 30 % | +0.2 % |
| COCO (Mask R‑CNN) | 41.2 AP | 40.9 AP | 28 % | –0.3 AP |
| ADE20K (segmentation) | 48.1 mIoU | 48.3 mIoU | 32 % | +0.2 mIoU |
  • Rank preservation: The attention matrices after reduction retain a higher effective rank than under uniform token pruning, confirming the mitigation of rank collapse (a small effective‑rank sketch follows this list).
  • Over‑smoothing reduction: Visualizations show sharper edge responses and better texture preservation, especially in segmentation masks.
  • Ablation studies: Removing the DC token or using a naïve average pooling instead of frequency‑aware selection leads to noticeable drops (≈1 % accuracy), underscoring the importance of the spectral cue.
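For reference, one common way to measure the "effective rank" cited above is Roy and Vetterli's entropy-based definition; the snippet below is illustrative, and the paper may use a different metric.

```python
import torch

def effective_rank(attn: torch.Tensor, eps: float = 1e-12) -> torch.Tensor:
    """Entropy-based effective rank of each (N, N) attention matrix in a (B, N, N) batch."""
    s = torch.linalg.svdvals(attn)                      # singular values, shape (B, N)
    p = s / s.sum(dim=-1, keepdim=True).clamp_min(eps)  # normalize to a distribution
    entropy = -(p * (p + eps).log()).sum(dim=-1)        # Shannon entropy of the spectrum
    return entropy.exp()                                # N => full rank, 1 => rank collapse
```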

Practical Implications

  • Edge devices & real‑time inference: The method cuts FLOPs by roughly a third with negligible accuracy loss, making ViTs viable on smartphones, drones, or AR headsets where compute and power budgets are tight.
  • Hybrid pipelines: Existing ViT‑based backbones (e.g., in object detection or video analysis) can drop in the frequency‑aware reduction module without re‑architecting the whole model, offering an easy performance boost.
  • Better scalability to high‑resolution inputs: Token count grows with image area and attention cost grows quadratically with token count, so trimming tokens keeps latency and memory growth in check at higher resolutions, which matters for high‑resolution medical imaging or satellite photo analysis.
  • Framework support: The algorithm relies only on standard linear algebra ops (SVD or power iteration) that are already optimized in PyTorch/TensorFlow, so implementation overhead is minimal.
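As the last bullet notes, only the leading singular directions are needed, so a cheap power iteration can stand in for a full SVD. The routine below is a generic sketch using standard PyTorch ops, not the paper's code.

```python
import torch

def leading_left_singular_vector(attn: torch.Tensor, n_iter: int = 10) -> torch.Tensor:
    """Approximate the dominant left singular vector of each (N, N) attention matrix.

    attn: (B, N, N). Returns a unit-norm vector of shape (B, N).
    """
    B, N, _ = attn.shape
    v = torch.randn(B, N, device=attn.device, dtype=attn.dtype)
    v = v / v.norm(dim=1, keepdim=True)
    for _ in range(n_iter):
        u = torch.einsum("bij,bi->bj", attn, v)         # A^T v
        v = torch.einsum("bij,bj->bi", attn, u)         # A (A^T v)
        v = v / v.norm(dim=1, keepdim=True).clamp_min(1e-8)
    return v
```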

Limitations & Future Work

  • Spectral overhead: Computing singular values per layer adds a modest constant cost; the authors mitigate this with low‑rank approximations, but ultra‑low‑latency scenarios may still feel the impact.
  • Static frequency threshold: The current design uses a fixed ratio of high‑frequency tokens per stage; adaptive thresholds based on input content could further improve efficiency.
  • Generalization to non‑vision Transformers: While the paper focuses on ViTs, extending the frequency‑aware reduction to NLP or multimodal transformers remains an open question.
  • Robustness to adversarial perturbations: The effect of token reduction on model robustness was not explored and could be a fruitful direction for follow‑up research.

Bottom line: By looking at the “frequency” of attention rather than treating all patches equally, this work offers a practical, drop‑in way to make Vision Transformers faster and more resource‑friendly—an advance that should interest anyone building vision‑centric AI products at scale.

Authors

  • Dong‑Jae Lee
  • Jiwan Hur
  • Jaehyun Choi
  • Jaemyung Yu
  • Junmo Kim

Paper Information

  • arXiv ID: 2511.21477v1
  • Categories: cs.CV, cs.AI
  • Published: November 26, 2025