[Paper] HFS: Holistic Query-Aware Frame Selection for Efficient Video Reasoning

Published: December 12, 2025 at 08:10 AM EST
4 min read
Source: arXiv - 2512.11534v1

Overview

The paper introduces HFS (Holistic Frame Selection), a framework that picks the most informative video frames for downstream reasoning tasks such as video QA and captioning. Unlike methods that score each frame in isolation, HFS is query-aware and optimizes the selected set of frames as a whole. By coupling a small language model with multimodal features and training the selector end-to-end, HFS markedly reduces redundancy and boosts performance on several video-understanding benchmarks.

Key Contributions

  • Query‑aware implicit vectors: A chain‑of‑thought prompt drives a small language model (SLM) to produce task‑specific query embeddings that steer frame scoring (a minimal scoring sketch follows this list).
  • Set‑level differentiable objective: A continuous loss that jointly balances relevance, coverage, and redundancy, optimized with Gumbel‑Softmax to select the best combination of frames.
  • Student‑teacher mutual learning: The SLM selector (student) and a multimodal large language model reasoner (teacher) are co‑trained, aligning their frame‑importance distributions via KL divergence.
  • End‑to‑end training: Eliminates the need for static pseudo‑labels generated offline, allowing the selector to adapt dynamically to each downstream task.
  • State‑of‑the‑art results: Consistently outperforms prior frame‑selection baselines on Video‑MME, LongVideoBench, MLVU, and NExT‑QA.
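
To make the first two contributions concrete, the sketch below shows one plausible way a query embedding from the SLM could be fused with per-frame visual features (e.g., CLIP embeddings) to produce query-conditioned relevance scores. The MLP fusion head, the feature dimensions, and the random stand-in tensors are illustrative assumptions, not the paper's exact modules.

```python
# Minimal sketch (PyTorch) of query-conditioned frame scoring.
# The MLP fusion head, feature dimensions, and encoders are illustrative
# assumptions, not the paper's exact architecture.
import torch
import torch.nn as nn


class QueryAwareFrameScorer(nn.Module):
    def __init__(self, frame_dim=512, query_dim=768, hidden_dim=256):
        super().__init__()
        # Fuse each frame's visual feature with the query embedding
        # and map the pair to a single relevance logit.
        self.fuse = nn.Sequential(
            nn.Linear(frame_dim + query_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, frame_feats, query_emb):
        # frame_feats: (T, frame_dim) visual features, e.g. CLIP image embeddings
        # query_emb:   (query_dim,) embedding of the CoT-expanded query from the SLM
        q = query_emb.unsqueeze(0).expand(frame_feats.size(0), -1)
        return self.fuse(torch.cat([frame_feats, q], dim=-1)).squeeze(-1)


# Toy usage with random tensors standing in for real encoder outputs.
scorer = QueryAwareFrameScorer()
frame_feats = torch.randn(64, 512)       # 64 candidate frames
query_emb = torch.randn(768)             # query vector from the SLM
scores = scorer(frame_feats, query_emb)  # (64,) relevance logits
```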

Methodology

  1. Implicit Query Generation – A chain‑of‑thought prompt (e.g., “Explain why the question matters”) is fed to a lightweight language model. The model outputs a dense query vector that captures the semantics of the current task (question, caption, etc.).
  2. Multimodal Feature Fusion – For every video frame, visual features (e.g., CLIP embeddings) are concatenated with the query vector, producing a joint representation.
  3. Holistic Scoring – Instead of assigning each frame an independent score, the method defines a set‑level loss with three terms:
    • Relevance: selected frames should be useful for answering the query.
    • Coverage: the selected set should cover the full temporal extent of the video.
    • Redundancy: selecting visually or semantically similar frames is penalized.
    The loss is made differentiable via the Gumbel‑Softmax trick, which approximates discrete frame selection while still allowing gradients to flow.
  4. Student‑Teacher Mutual Learning – The teacher (a powerful multimodal LLM) processes the full video and produces a soft importance distribution over frames. The student selector learns to mimic this distribution (KL divergence) while also being guided by the cross‑entropy loss from the downstream task.
  5. End‑to‑End Optimization – All components (query generator, frame scorer, and downstream reasoner) are trained jointly, so the selector learns to pick frames that directly improve the final task metric; a sketch of the combined selection objective follows this list.
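
As referenced in step 5, the sketch below shows one way the set-level objective could be implemented on top of the scorer sketched earlier: a Gumbel-Softmax relaxation of picking k frames, relevance/coverage/redundancy terms over the resulting soft selection, and a KL term that pulls the student's frame-importance distribution toward the teacher's. The specific relaxation, the exact term definitions, the KL direction, and the loss weights are assumptions for illustration, not the paper's precise formulation.

```python
# Minimal sketch (PyTorch) of a set-level, differentiable selection objective.
# The k-draw Gumbel-Softmax relaxation, the coverage/redundancy formulas,
# the KL direction, and the loss weights are all assumptions for illustration.
import torch
import torch.nn.functional as F


def holistic_selection_loss(frame_logits, frame_feats, teacher_probs,
                            k=8, tau=0.5, w_rel=1.0, w_cov=0.1, w_red=0.1, w_kl=1.0):
    # frame_logits:  (T,) student relevance logits (e.g. from the scorer above)
    # frame_feats:   (T, D) visual features used to measure redundancy
    # teacher_probs: (T,) soft frame-importance distribution from the MLLM teacher
    T = frame_logits.size(0)

    # Soft "select k frames": k relaxed one-hot draws summed into a soft mask
    # (a simplification; repeated picks of the same frame are not excluded).
    draws = torch.stack([F.gumbel_softmax(frame_logits, tau=tau) for _ in range(k)])
    soft_mask = draws.sum(dim=0).clamp(max=1.0)                    # (T,)

    # Relevance: reward selection mass on frames the scorer deems useful.
    relevance = -(soft_mask * frame_logits).sum() / k

    # Coverage: encourage selected mass to spread over the timeline by
    # maximizing the variance of the selected temporal positions.
    pos = torch.linspace(0.0, 1.0, T)
    mean_pos = (soft_mask * pos).sum() / soft_mask.sum()
    coverage = -((soft_mask * (pos - mean_pos) ** 2).sum() / soft_mask.sum())

    # Redundancy: penalize selecting pairs of visually similar frames.
    feats_n = F.normalize(frame_feats, dim=-1)
    sim = feats_n @ feats_n.t()                                    # cosine similarity
    off_diag = 1.0 - torch.eye(T)
    pair_w = soft_mask.unsqueeze(1) * soft_mask.unsqueeze(0) * off_diag
    redundancy = (pair_w * sim).sum() / pair_w.sum().clamp(min=1e-6)

    # Mutual learning: pull the student's frame distribution toward the teacher's.
    student_logp = F.log_softmax(frame_logits, dim=-1)
    kl = F.kl_div(student_logp, teacher_probs, reduction="sum")

    return w_rel * relevance + w_cov * coverage + w_red * redundancy + w_kl * kl


# Toy usage: gradients flow back into the frame logits (and hence the scorer).
logits = torch.randn(64, requires_grad=True)
loss = holistic_selection_loss(logits, torch.randn(64, 512),
                               torch.softmax(torch.randn(64), -1))
loss.backward()
```

In the full pipeline this set-level term would be added to the downstream task's cross-entropy loss and backpropagated end-to-end through the query generator and scorer, which is what removes the need for offline pseudo-labels.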

Results & Findings

| Benchmark      | HFS result (↑ better) | Gain vs. best prior |
|----------------|-----------------------|---------------------|
| Video‑MME      | 73.4% accuracy        | +5.2 pts            |
| LongVideoBench | 68.1% R@1             | +6.8 pts            |
| MLVU           | 71.9% mAP             | +4.5 pts            |
| NExT‑QA        | 62.3% accuracy        | +5.9 pts            |

  • Redundancy reduction: Visual inspection shows selected frames are spread across the video timeline, avoiding clusters that plague top‑K methods.
  • Query sensitivity: Changing the question leads to markedly different frame sets, confirming the query‑aware design.
  • Efficiency: The small selector runs in < 10 ms per video on a single GPU, enabling real‑time pipelines.

Practical Implications

  • Cost‑effective video analytics – By selecting a handful of high‑utility frames, developers can feed lighter models (e.g., edge‑device vision models) without sacrificing accuracy, reducing compute and memory footprints.
  • Improved video QA assistants – Chat‑based assistants that answer questions about long videos can now retrieve relevant moments faster, delivering more precise responses.
  • Content moderation & indexing – Automated systems can focus on the most informative frames for detecting policy violations or generating searchable metadata, speeding up pipelines.
  • Plug‑and‑play component – HFS is model‑agnostic; it can be dropped into existing video‑reasoning stacks (e.g., CLIP‑based captioners, LLM‑driven video agents) with minimal code changes, as in the pipeline sketch below.
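
As a rough illustration of the plug-and-play point, the sketch below shows how a trained query-aware scorer could sit in front of an existing video reasoner at inference time. The function names (`encode_frames`, `embed_query`, `vlm_answer`) are hypothetical placeholders for whatever encoders and reasoner a given stack already provides.

```python
# Hypothetical plug-in usage: select frames first, then reason over only those.
# `encode_frames`, `embed_query`, `scorer`, and `vlm_answer` are stand-ins for
# whatever encoders and reasoner an existing stack already provides.
import torch


def answer_question(video_frames, question, encode_frames, embed_query, scorer,
                    vlm_answer, k=8):
    frame_feats = encode_frames(video_frames)      # (T, D) per-frame visual features
    query_emb = embed_query(question)              # (Dq,) query embedding

    # At inference a plain top-k over the query-conditioned scores replaces
    # the Gumbel-Softmax relaxation used during training.
    with torch.no_grad():
        logits = scorer(frame_feats, query_emb)    # (T,) relevance logits
    keep = torch.topk(logits, k=min(k, logits.numel())).indices.sort().values

    # Only the selected frames reach the (potentially heavy) downstream reasoner.
    selected = [video_frames[i] for i in keep.tolist()]
    return vlm_answer(selected, question)
```

Because selection happens once per query and only k frames are forwarded, the heavy reasoner's cost scales with k rather than with the length of the video.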

Limitations & Future Work

  • Dependence on a strong teacher: The mutual‑learning setup assumes access to a capable multimodal LLM, which may be unavailable or costly for some teams.
  • Scalability to ultra‑long videos: While HFS handles videos up to several minutes, videos spanning hours (e.g., surveillance footage) may still require hierarchical selection strategies.
  • Prompt design for query generation: The chain‑of‑thought prompts were hand‑crafted; automating prompt discovery could further improve robustness across domains.
  • Future directions: The authors suggest exploring reinforcement‑learning‑based selection, extending the framework to multimodal streams (audio + video), and investigating self‑supervised pre‑training for the selector to reduce reliance on large teachers.

Authors

  • Yiqing Yang
  • Kin‑Man Lam

Paper Information

  • arXiv ID: 2512.11534v1
  • Categories: cs.CV, cs.CL, cs.MM
  • Published: December 12, 2025