[Paper] OmniSIFT: Modality-Asymmetric Token Compression for Efficient Omni-modal Large Language Models

Published: February 4, 2026
3 min read
Source: arXiv


Overview

OmniSIFT tackles one of the biggest bottlenecks in omni‑modal large language models (LLMs): the massive computational cost caused by long sequences of video‑and‑audio tokens. By intelligently pruning redundant visual frames and filtering irrelevant audio snippets, the method slashes token length to about a quarter of the original while actually improving downstream performance on several benchmarks.

Key Contributions

  • Modality‑asymmetric compression: separate, specialized pipelines for video (spatio‑temporal pruning) and audio (vision‑guided selection).
  • Two‑stage, end‑to‑end trainable framework: a differentiable straight‑through estimator lets the compression modules be learned jointly with the Omni‑LLM.
  • Tiny overhead: only ~4.85 M extra parameters (≈0.07 % of a 7 B model) and lower latency than existing training‑free baselines such as OmniZip.
  • Strong empirical gains: with just 25 % of the original token count, OmniSIFT outperforms all prior compression methods and even beats the full‑token baseline on several audio‑video understanding tasks.
  • Broad evaluation: validated on five diverse benchmarks covering video QA, audio‑visual reasoning, and multimodal captioning.
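The quoted parameter overhead is easy to verify with back-of-envelope arithmetic:

```python
# Reported overhead: ~4.85 M extra parameters on top of a 7 B-parameter model.
extra_params = 4.85e6
base_params = 7e9

overhead_pct = extra_params / base_params * 100
print(f"{overhead_pct:.3f}%")  # ≈ 0.069%, consistent with the quoted ≈0.07%
```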

Methodology

  1. Spatio‑Temporal Video Pruning

    • Intra‑frame: a lightweight CNN predicts which patches within a frame carry useful information (e.g., moving objects, salient regions).
    • Inter‑frame: a temporal similarity scorer identifies near‑duplicate frames (e.g., static background) and discards them.
    • The two signals are fused, producing a binary mask that drops redundant visual tokens before they reach the LLM.
  2. Vision‑Guided Audio Selection

    • The pruned video representation is used as a “guide” to attend over the raw audio token stream.
    • Audio segments that align poorly with visual cues (e.g., background noise, silent intervals) receive low scores and are removed.
  3. Differentiable Compression

    • Both pruning modules output hard binary decisions, but a straight‑through estimator treats them as continuous during back‑propagation, allowing gradients to flow from the downstream LLM loss.
    • The whole pipeline (pruning + LLM) is trained jointly, so the compressor learns exactly what the language model needs for each task.
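The score-then-prune idea above can be sketched in a few lines of plain Python. This is an illustrative toy, not the authors' implementation: the `prune_tokens` helper, the score values, and the fixed keep ratio are assumptions for the example; in the paper the scores come from learned modules and gradients flow to them through the straight-through estimator.

```python
def prune_tokens(tokens, scores, keep_ratio=0.25):
    """Keep the top-scoring fraction of tokens, preserving temporal order.

    In the trained system the hard keep/drop mask is binary in the forward
    pass; during backpropagation a straight-through estimator treats the
    thresholding as the identity, so the downstream LLM loss can still
    update the scoring modules.
    """
    k = max(1, int(len(tokens) * keep_ratio))
    keep = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    keep.sort()  # restore original (temporal) order
    return [tokens[i] for i in keep]

# Toy example: 8 video tokens, keeping 25% (the budget reported in the paper).
tokens = ["t0", "t1", "t2", "t3", "t4", "t5", "t6", "t7"]
scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.7, 0.05, 0.6]
print(prune_tokens(tokens, scores))  # ['t1', 't3']
```

The same selection pattern applies to audio: swap the patch/frame scores for vision-guided alignment scores over audio segments.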

Results & Findings

| Model (tokens) | Latency ↓ | Avg. Score ↑ | Notable Gains |
|---|---|---|---|
| Qwen2.5‑Omni‑7B (full) | baseline | 73.2 | |
| OmniZip (training‑free) | +12 % | 71.8 | |
| OmniSIFT (25 % tokens) | −8 % | 74.5 | Beats full‑token model on VideoQA‑X and AVE‑Bench |
  • Parameter efficiency: only 4.85 M extra parameters, negligible memory impact.
  • Robustness: performance holds across tasks with different modality balances (e.g., audio‑dominant vs. video‑dominant).
  • Ablation: removing either the intra‑frame or inter‑frame component drops accuracy by ~1.3 %; disabling vision‑guided audio selection reduces audio‑centric scores by ~2 %.

Practical Implications

  • Faster inference for real‑time apps: streaming video assistants, live captioning, or AR/VR experiences can now run Omni‑LLMs on edge GPUs or even high‑end mobile devices without sacrificing quality.
  • Cost‑effective scaling: cloud providers can serve more concurrent users per GPU because token length—and thus compute—drops dramatically.
  • Simplified data pipelines: developers can feed raw video/audio streams directly; OmniSIFT handles redundancy removal automatically, reducing the need for handcrafted preprocessing.
  • Energy savings: fewer tokens translate to lower FLOPs, aligning with sustainability goals for large‑scale AI deployments.
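As a back-of-envelope estimate of why fewer tokens saves compute: attention FLOPs scale roughly quadratically with sequence length, while MLP and projection FLOPs scale roughly linearly. The 30 % attention share below is an assumed illustrative figure, not a number from the paper:

```python
def relative_compute(keep_ratio, attn_share=0.3):
    """Approximate relative FLOPs after shrinking the sequence to
    keep_ratio of its original length: the attention fraction scales
    quadratically, the rest (MLPs, projections) roughly linearly."""
    return attn_share * keep_ratio ** 2 + (1.0 - attn_share) * keep_ratio

# At the paper's 25% token budget:
print(relative_compute(0.25))  # 0.19375, i.e. roughly a 5x compute reduction
```

The exact savings depend on sequence length and model architecture, but any reasonable attention share yields a large reduction at a 25 % token budget.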

Limitations & Future Work

  • Domain sensitivity: the pruning heuristics are learned on the training data; highly specialized domains (e.g., medical imaging) may require fine‑tuning or custom masks.
  • Audio‑only scenarios: when visual cues are absent or minimal, the vision‑guided audio selector provides limited benefit, suggesting a need for a complementary audio‑centric compressor.
  • Scalability to larger LLMs: experiments focused on a 7 B model; extending the approach to 70 B‑scale Omni‑LLMs may expose new bottlenecks in mask generation latency.
  • Future directions: explore adaptive token budgets per modality, integrate multimodal token importance learned from downstream task signals, and test on longer‑form content (e.g., full‑length movies).

Authors

  • Yue Ding
  • Yiyan Ji
  • Jungang Li
  • Xuyang Liu
  • Xinlong Chen
  • Junfei Wu
  • Bozhou Li
  • Bohan Zeng
  • Yang Shi
  • Yushuo Guan
  • Yuanxing Zhang
  • Jiaheng Liu
  • Qiang Liu
  • Pengfei Wan
  • Liang Wang

Paper Information

  • arXiv ID: 2602.04804v1
  • Categories: cs.CL
  • Published: February 4, 2026
  • PDF: Download PDF
