[Paper] OmniSIFT: Modality-Asymmetric Token Compression for Efficient Omni-modal Large Language Models
Source: arXiv - 2602.04804v1
Overview
OmniSIFT tackles one of the biggest bottlenecks in omni‑modal large language models (LLMs): the massive computational cost caused by long sequences of video‑and‑audio tokens. By intelligently pruning redundant visual frames and filtering irrelevant audio snippets, the method slashes token length to about a quarter of the original while actually improving downstream performance on several benchmarks.
Key Contributions
- Modality‑asymmetric compression: separate, specialized pipelines for video (spatio‑temporal pruning) and audio (vision‑guided selection).
- Two‑stage, end‑to‑end trainable framework: a differentiable straight‑through estimator lets the compression modules be learned jointly with the Omni‑LLM.
- Tiny overhead: only ~4.85 M extra parameters (≈0.07 % of a 7 B model) and lower latency than existing training‑free baselines such as OmniZip.
- Strong empirical gains: with just 25 % of the original token count, OmniSIFT outperforms all prior compression methods and even beats the full‑token baseline on several audio‑video understanding tasks.
- Broad evaluation: validated on five diverse benchmarks covering video QA, audio‑visual reasoning, and multimodal captioning.
Methodology
Spatio‑Temporal Video Pruning
- Intra‑frame: a lightweight CNN predicts which patches within a frame carry useful information (e.g., moving objects, salient regions).
- Inter‑frame: a temporal similarity scorer identifies near‑duplicate frames (e.g., static background) and discards them.
- The two signals are fused, producing a binary mask that drops redundant visual tokens before they reach the LLM.
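The paper's pruning modules are not reproduced in this summary; the steps above can be sketched roughly as follows. This is a hypothetical illustration: the embedding-norm saliency proxy, the frame-mean cosine similarity, and all thresholds stand in for the paper's learned lightweight CNN and temporal scorer.

```python
import numpy as np

def spatio_temporal_prune(frames, keep_ratio=0.25, sim_thresh=0.95):
    """Sketch of OmniSIFT-style video pruning (illustrative, not the paper's code).

    frames: array [T, N, D] of patch embeddings (T frames, N patches each).
    Returns a boolean mask [T, N] marking which visual tokens to keep.
    """
    T, N, D = frames.shape

    # Intra-frame: score each patch. The embedding norm is a stand-in
    # for the paper's lightweight CNN saliency predictor.
    saliency = np.linalg.norm(frames, axis=-1)                        # [T, N]

    # Inter-frame: cosine similarity of each frame's mean embedding to
    # the previous frame; near-duplicates (e.g., static background) are dropped.
    frame_means = frames.mean(axis=1)                                 # [T, D]
    unit = frame_means / np.linalg.norm(frame_means, axis=-1, keepdims=True)
    sim = np.ones(T)
    sim[1:] = (unit[1:] * unit[:-1]).sum(axis=-1)
    frame_keep = sim < sim_thresh
    frame_keep[0] = True                                              # always keep the first frame

    # Fuse: within kept frames, keep only the top-k salient patches.
    k = max(1, int(N * keep_ratio))
    mask = np.zeros((T, N), dtype=bool)
    for t in range(T):
        if frame_keep[t]:
            top = np.argsort(saliency[t])[-k:]
            mask[t, top] = True
    return mask
```

In the actual method both signals are learned jointly with the LLM rather than hand-set as here.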
Vision‑Guided Audio Selection
- The pruned video representation is used as a “guide” to attend over the raw audio token stream.
- Audio segments that align poorly with visual cues (e.g., background noise, silent intervals) receive low scores and are removed.
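The selection step can be pictured as scoring each audio token by its best alignment with the kept video tokens. The cosine-similarity scoring and the fixed keep ratio below are illustrative assumptions, not the paper's exact attention mechanism.

```python
import numpy as np

def select_audio_tokens(audio, video, keep_ratio=0.25):
    """Sketch of vision-guided audio selection (illustrative).

    audio: [M, D] audio token embeddings.
    video: [K, D] pruned (kept) video token embeddings used as the guide.
    Returns the kept audio tokens and their indices in temporal order.
    """
    a = audio / np.linalg.norm(audio, axis=-1, keepdims=True)
    v = video / np.linalg.norm(video, axis=-1, keepdims=True)
    # Each audio token's score = best cosine similarity to any video token;
    # tokens that align poorly with visual cues score low and are dropped.
    scores = (a @ v.T).max(axis=-1)                                   # [M]
    k = max(1, int(len(audio) * keep_ratio))
    keep = np.sort(np.argsort(scores)[-k:])                           # preserve temporal order
    return audio[keep], keep
```

For example, with two visual directions as the guide, audio tokens pointing along those directions survive while off-axis ones (background noise in this analogy) are removed.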
Differentiable Compression
- Both pruning modules output hard binary decisions, but a straight‑through estimator treats them as continuous during back‑propagation, allowing gradients to flow from the downstream LLM loss.
- The whole pipeline (pruning + LLM) is trained jointly, so the compressor learns exactly what the language model needs for each task.
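The straight-through trick described above is standard and can be shown in a few lines: the forward pass emits a hard 0/1 mask, while the backward pass routes gradients through the continuous sigmoid relaxation. This is a generic PyTorch sketch of the estimator, not OmniSIFT's exact module.

```python
import torch

def ste_binary_mask(scores, threshold=0.5):
    """Straight-through estimator: hard binary mask in the forward pass,
    sigmoid gradient in the backward pass."""
    soft = torch.sigmoid(scores)              # continuous relaxation
    hard = (soft > threshold).float()         # hard keep/drop decision
    # Forward value equals `hard`; gradients flow through `soft` only.
    return hard + soft - soft.detach()

# Toy check: gradients reach the scores despite the hard thresholding.
scores = torch.tensor([2.0, -1.0, 0.3], requires_grad=True)
mask = ste_binary_mask(scores)                # forward: [1., 0., 1.]
mask.sum().backward()                         # scores.grad is nonzero
```

Because gradients reach the token-scoring parameters, the compressor can be trained end to end against the downstream LLM loss, as the paper describes.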
Results & Findings
| Model (token budget) | Latency vs. full ↓ | Avg. Score ↑ | Notable Gains |
|---|---|---|---|
| Qwen2.5‑Omni‑7B (100 % tokens) | baseline | 73.2 | – |
| OmniZip (training‑free) | +12 % | 71.8 | – |
| OmniSIFT (25 % tokens) | −8 % | 74.5 | Beats the full‑token model on VideoQA‑X and AVE‑Bench |
- Parameter efficiency: only 4.85 M extra parameters, negligible memory impact.
- Robustness: performance holds across tasks with different modality balances (e.g., audio‑dominant vs. video‑dominant).
- Ablation: removing either the intra‑frame or inter‑frame component drops accuracy by ~1.3 %; disabling vision‑guided audio selection reduces audio‑centric scores by ~2 %.
Practical Implications
- Faster inference for real‑time apps: streaming video assistants, live captioning, or AR/VR experiences can now run Omni‑LLMs on edge GPUs or even high‑end mobile devices without sacrificing quality.
- Cost‑effective scaling: cloud providers can serve more concurrent users per GPU because token length—and thus compute—drops dramatically.
- Simplified data pipelines: developers can feed raw video/audio streams directly; OmniSIFT handles redundancy removal automatically, reducing the need for handcrafted preprocessing.
- Energy savings: fewer tokens translate to lower FLOPs, aligning with sustainability goals for large‑scale AI deployments.
Limitations & Future Work
- Domain sensitivity: the pruning heuristics are learned on the training data; highly specialized domains (e.g., medical imaging) may require fine‑tuning or custom masks.
- Audio‑only scenarios: when visual cues are absent or minimal, the vision‑guided audio selector provides limited benefit, suggesting a need for a complementary audio‑centric compressor.
- Scalability to larger LLMs: experiments focused on a 7 B model; extending the approach to 70 B‑scale Omni‑LLMs may expose new bottlenecks in mask generation latency.
- Future directions: explore adaptive token budgets per modality, integrate multimodal token importance learned from downstream task signals, and test on longer‑form content (e.g., full‑length movies).
Authors
- Yue Ding
- Yiyan Ji
- Jungang Li
- Xuyang Liu
- Xinlong Chen
- Junfei Wu
- Bozhou Li
- Bohan Zeng
- Yang Shi
- Yushuo Guan
- Yuanxing Zhang
- Jiaheng Liu
- Qiang Liu
- Pengfei Wan
- Liang Wang
Paper Information
- arXiv ID: 2602.04804v1
- Categories: cs.CL
- Published: February 4, 2026