[Paper] OmniSIFT: Modality-Asymmetric Token Compression for Efficient Omni-modal Large Language Models
Source: arXiv - 2602.04804v1
Overview
OmniSIFT tackles one of the biggest bottlenecks in omni‑modal large language models (LLMs): the massive computational cost caused by long sequences of video‑and‑audio tokens. By intelligently pruning redundant visual frames and filtering irrelevant audio snippets, the method slashes token length to about a quarter of the original while actually improving downstream performance on several benchmarks.
Key Contributions
- Modality‑asymmetric compression: separate, specialized pipelines for video (spatio‑temporal pruning) and audio (vision‑guided selection).
- Two‑stage, end‑to‑end trainable framework: a differentiable straight‑through estimator lets the compression modules be learned jointly with the Omni‑LLM.
- Tiny overhead: only ~4.85 M extra parameters (≈0.07 % of a 7 B model) and lower latency than existing training‑free baselines such as OmniZip.
- Strong empirical gains: with just 25 % of the original token count, OmniSIFT outperforms all prior compression methods and even beats the full‑token baseline on several audio‑video understanding tasks.
- Broad evaluation: validated on five diverse benchmarks covering video QA, audio‑visual reasoning, and multimodal captioning.
Methodology
Spatio‑Temporal Video Pruning
- Intra‑frame: a lightweight CNN predicts which patches within a frame carry useful information (e.g., moving objects, salient regions).
- Inter‑frame: a temporal similarity scorer identifies near‑duplicate frames (e.g., static background) and discards them.
- The two signals are fused, producing a binary mask that drops redundant visual tokens before they reach the LLM.
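The paper's pruning modules are not reproduced in this summary; the steps above can be sketched roughly as follows. This is a hypothetical illustration: the embedding-norm saliency proxy, the frame-mean cosine similarity, and all thresholds stand in for the paper's learned lightweight CNN and temporal scorer.

```python
import numpy as np

def spatio_temporal_prune(frames, keep_ratio=0.25, sim_thresh=0.95):
    """Sketch of OmniSIFT-style video pruning (illustrative, not the paper's code).

    frames: array [T, N, D] of patch embeddings (T frames, N patches each).
    Returns a boolean mask [T, N] marking which visual tokens to keep.
    """
    T, N, D = frames.shape

    # Intra-frame: score each patch. The embedding norm is a stand-in
    # for the paper's lightweight CNN saliency predictor.
    saliency = np.linalg.norm(frames, axis=-1)                        # [T, N]

    # Inter-frame: cosine similarity of each frame's mean embedding to
    # the previous frame; near-duplicates (e.g., static background) are dropped.
    frame_means = frames.mean(axis=1)                                 # [T, D]
    unit = frame_means / np.linalg.norm(frame_means, axis=-1, keepdims=True)
    sim = np.ones(T)
    sim[1:] = (unit[1:] * unit[:-1]).sum(axis=-1)
    frame_keep = sim < sim_thresh
    frame_keep[0] = True                                              # always keep the first frame

    # Fuse: within kept frames, keep only the top-k salient patches.
    k = max(1, int(N * keep_ratio))
    mask = np.zeros((T, N), dtype=bool)
    for t in range(T):
        if frame_keep[t]:
            top = np.argsort(saliency[t])[-k:]
            mask[t, top] = True
    return mask
```

In the actual method both signals are learned jointly with the LLM rather than hand-set as here.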
Vision‑Guided Audio Selection
- The pruned video representation is used as a “guide” to attend over the raw audio token stream.
- Audio segments that align poorly with visual cues (e.g., background noise, silent intervals) receive low scores and are removed.
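The selection step can be pictured as scoring each audio token by its best alignment with the kept video tokens. The cosine-similarity scoring and the fixed keep ratio below are illustrative assumptions, not the paper's exact attention mechanism.

```python
import numpy as np

def select_audio_tokens(audio, video, keep_ratio=0.25):
    """Sketch of vision-guided audio selection (illustrative).

    audio: [M, D] audio token embeddings.
    video: [K, D] pruned (kept) video token embeddings used as the guide.
    Returns the kept audio tokens and their indices in temporal order.
    """
    a = audio / np.linalg.norm(audio, axis=-1, keepdims=True)
    v = video / np.linalg.norm(video, axis=-1, keepdims=True)
    # Each audio token's score = best cosine similarity to any video token;
    # tokens that align poorly with visual cues score low and are dropped.
    scores = (a @ v.T).max(axis=-1)                                   # [M]
    k = max(1, int(len(audio) * keep_ratio))
    keep = np.sort(np.argsort(scores)[-k:])                           # preserve temporal order
    return audio[keep], keep
```

For example, with two visual directions as the guide, audio tokens pointing along those directions survive while off-axis ones (background noise in this analogy) are removed.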
Differentiable Compression
- Both pruning modules output hard binary decisions, but a straight‑through estimator treats them as continuous during back‑propagation, allowing gradients to flow from the downstream LLM loss.
- The whole pipeline (pruning + LLM) is trained jointly, so the compressor learns exactly what the language model needs for each task.
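The straight-through trick described above is standard and can be shown in a few lines: the forward pass emits a hard 0/1 mask, while the backward pass routes gradients through the continuous sigmoid relaxation. This is a generic PyTorch sketch of the estimator, not OmniSIFT's exact module.

```python
import torch

def ste_binary_mask(scores, threshold=0.5):
    """Straight-through estimator: hard binary mask in the forward pass,
    sigmoid gradient in the backward pass."""
    soft = torch.sigmoid(scores)              # continuous relaxation
    hard = (soft > threshold).float()         # hard keep/drop decision
    # Forward value equals `hard`; gradients flow through `soft` only.
    return hard + soft - soft.detach()

# Toy check: gradients reach the scores despite the hard thresholding.
scores = torch.tensor([2.0, -1.0, 0.3], requires_grad=True)
mask = ste_binary_mask(scores)                # forward: [1., 0., 1.]
mask.sum().backward()                         # scores.grad is nonzero
```

Because gradients reach the token-scoring parameters, the compressor can be trained end to end against the downstream LLM loss, as the paper describes.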
Results & Findings
| Model (token budget) | Latency vs. full ↓ | Avg. Score ↑ | Notable Gains |
|---|---|---|---|
| Qwen2.5‑Omni‑7B (100 % tokens) | baseline | 73.2 | – |
| OmniZip (training‑free) | +12 % | 71.8 | – |
| OmniSIFT (25 % tokens) | −8 % | 74.5 | Beats the full‑token model on VideoQA‑X and AVE‑Bench |
- Parameter efficiency: only 4.85 M extra parameters, negligible memory impact.
- Robustness: performance holds across tasks with different modality balances (e.g., audio‑dominant vs. video‑dominant).
- Ablation: removing either the intra‑frame or inter‑frame component drops accuracy by ~1.3 %; disabling vision‑guided audio selection reduces audio‑centric scores by ~2 %.
Practical Implications
- Faster inference for real‑time apps: streaming video assistants, live captioning, or AR/VR experiences can now run Omni‑LLMs on edge GPUs or even high‑end mobile devices without sacrificing quality.
- Cost‑effective scaling: cloud providers can serve more concurrent users per GPU because token length—and thus compute—drops dramatically.
- Simplified data pipelines: developers can feed raw video/audio streams directly; OmniSIFT handles redundancy removal automatically, reducing the need for handcrafted preprocessing.
- Energy savings: fewer tokens translate to lower FLOPs, aligning with sustainability goals for large‑scale AI deployments.
Limitations & Future Work
- Domain sensitivity: the pruning heuristics are learned on the training data; highly specialized domains (e.g., medical imaging) may require fine‑tuning or custom masks.
- Audio‑only scenarios: when visual cues are absent or minimal, the vision‑guided audio selector provides limited benefit, suggesting a need for a complementary audio‑centric compressor.
- Scalability to larger LLMs: experiments focused on a 7 B model; extending the approach to 70 B‑scale Omni‑LLMs may expose new bottlenecks in mask generation latency.
- Future directions: explore adaptive token budgets per modality, integrate multimodal token importance learned from downstream task signals, and test on longer‑form content (e.g., full‑length movies).
Authors
- Yue Ding
- Yiyan Ji
- Jungang Li
- Xuyang Liu
- Xinlong Chen
- Junfei Wu
- Bozhou Li
- Bohan Zeng
- Yang Shi
- Yushuo Guan
- Yuanxing Zhang
- Jiaheng Liu
- Qiang Liu
- Pengfei Wan
- Liang Wang
Paper Information
- arXiv ID: 2602.04804v1
- Categories: cs.CL
- Published: February 4, 2026