[Paper] HFS: Holistic Query-Aware Frame Selection for Efficient Video Reasoning

Published: December 12, 2025 at 08:10 AM EST
4 min read
Source: arXiv - 2512.11534v1

Overview

The paper introduces HFS (Holistic Frame Selection), a framework that picks the most informative video frames for downstream reasoning tasks such as video QA and captioning. Unlike methods that score each frame in isolation, HFS is query-aware and optimizes the selected set of frames as a whole. By coupling a small language model with multimodal features and training the selector end-to-end, HFS markedly reduces redundancy and boosts performance on several video-understanding benchmarks.

Key Contributions

  • Query‑aware implicit vectors: A chain‑of‑thought prompt drives a small language model (SLM) to produce task‑specific query embeddings that steer frame scoring (a minimal scoring sketch follows this list).
  • Set‑level differentiable objective: A continuous loss that jointly balances relevance, coverage, and redundancy, optimized with Gumbel‑Softmax to select the best combination of frames.
  • Student‑teacher mutual learning: The SLM selector (student) and a multimodal large language model reasoner (teacher) are co‑trained, aligning their frame‑importance distributions via KL divergence.
  • End‑to‑end training: Eliminates the need for static pseudo‑labels generated offline, allowing the selector to adapt dynamically to each downstream task.
  • State‑of‑the‑art results: Consistently outperforms prior frame‑selection baselines on Video‑MME, LongVideoBench, MLVU, and NExT‑QA.
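
To make the first two contributions concrete, the sketch below shows one plausible way a query embedding from the SLM could be fused with per-frame visual features (e.g., CLIP embeddings) to produce query-conditioned relevance scores. The MLP fusion head, the feature dimensions, and the random stand-in tensors are illustrative assumptions, not the paper's exact modules.

```python
# Minimal sketch (PyTorch) of query-conditioned frame scoring.
# The MLP fusion head, feature dimensions, and encoders are illustrative
# assumptions, not the paper's exact architecture.
import torch
import torch.nn as nn


class QueryAwareFrameScorer(nn.Module):
    def __init__(self, frame_dim=512, query_dim=768, hidden_dim=256):
        super().__init__()
        # Fuse each frame's visual feature with the query embedding
        # and map the pair to a single relevance logit.
        self.fuse = nn.Sequential(
            nn.Linear(frame_dim + query_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, frame_feats, query_emb):
        # frame_feats: (T, frame_dim) visual features, e.g. CLIP image embeddings
        # query_emb:   (query_dim,) embedding of the CoT-expanded query from the SLM
        q = query_emb.unsqueeze(0).expand(frame_feats.size(0), -1)
        return self.fuse(torch.cat([frame_feats, q], dim=-1)).squeeze(-1)


# Toy usage with random tensors standing in for real encoder outputs.
scorer = QueryAwareFrameScorer()
frame_feats = torch.randn(64, 512)       # 64 candidate frames
query_emb = torch.randn(768)             # query vector from the SLM
scores = scorer(frame_feats, query_emb)  # (64,) relevance logits
```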

Methodology

  1. Implicit Query Generation – A chain‑of‑thought prompt (e.g., “Explain why the question matters”) is fed to a lightweight language model. The model outputs a dense query vector that captures the semantics of the current task (question, caption, etc.).
  2. Multimodal Feature Fusion – For every video frame, visual features (e.g., CLIP embeddings) are concatenated with the query vector, producing a joint representation.
  3. Holistic Scoring – Instead of assigning each frame an independent score, the method defines a set‑level loss with three terms:
    • Relevance: selected frames should be useful for answering the query.
    • Coverage: the selected set should cover the full temporal extent of the video.
    • Redundancy: selecting visually or semantically similar frames is penalized.
    The loss is made differentiable via the Gumbel‑Softmax trick, which approximates discrete frame selection while still allowing gradients to flow.
  4. Student‑Teacher Mutual Learning – The teacher (a powerful multimodal LLM) processes the full video and produces a soft importance distribution over frames. The student selector learns to mimic this distribution (KL divergence) while also being guided by the cross‑entropy loss from the downstream task.
  5. End‑to‑End Optimization – All components (query generator, frame scorer, and downstream reasoner) are trained jointly, so the selector learns to pick frames that directly improve the final task metric; a sketch of the combined selection objective follows this list.
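
As referenced in step 5, the sketch below shows one way the set-level objective could be implemented on top of the scorer sketched earlier: a Gumbel-Softmax relaxation of picking k frames, relevance/coverage/redundancy terms over the resulting soft selection, and a KL term that pulls the student's frame-importance distribution toward the teacher's. The specific relaxation, the exact term definitions, the KL direction, and the loss weights are assumptions for illustration, not the paper's precise formulation.

```python
# Minimal sketch (PyTorch) of a set-level, differentiable selection objective.
# The k-draw Gumbel-Softmax relaxation, the coverage/redundancy formulas,
# the KL direction, and the loss weights are all assumptions for illustration.
import torch
import torch.nn.functional as F


def holistic_selection_loss(frame_logits, frame_feats, teacher_probs,
                            k=8, tau=0.5, w_rel=1.0, w_cov=0.1, w_red=0.1, w_kl=1.0):
    # frame_logits:  (T,) student relevance logits (e.g. from the scorer above)
    # frame_feats:   (T, D) visual features used to measure redundancy
    # teacher_probs: (T,) soft frame-importance distribution from the MLLM teacher
    T = frame_logits.size(0)

    # Soft "select k frames": k relaxed one-hot draws summed into a soft mask
    # (a simplification; repeated picks of the same frame are not excluded).
    draws = torch.stack([F.gumbel_softmax(frame_logits, tau=tau) for _ in range(k)])
    soft_mask = draws.sum(dim=0).clamp(max=1.0)                    # (T,)

    # Relevance: reward selection mass on frames the scorer deems useful.
    relevance = -(soft_mask * frame_logits).sum() / k

    # Coverage: encourage selected mass to spread over the timeline by
    # maximizing the variance of the selected temporal positions.
    pos = torch.linspace(0.0, 1.0, T)
    mean_pos = (soft_mask * pos).sum() / soft_mask.sum()
    coverage = -((soft_mask * (pos - mean_pos) ** 2).sum() / soft_mask.sum())

    # Redundancy: penalize selecting pairs of visually similar frames.
    feats_n = F.normalize(frame_feats, dim=-1)
    sim = feats_n @ feats_n.t()                                    # cosine similarity
    off_diag = 1.0 - torch.eye(T)
    pair_w = soft_mask.unsqueeze(1) * soft_mask.unsqueeze(0) * off_diag
    redundancy = (pair_w * sim).sum() / pair_w.sum().clamp(min=1e-6)

    # Mutual learning: pull the student's frame distribution toward the teacher's.
    student_logp = F.log_softmax(frame_logits, dim=-1)
    kl = F.kl_div(student_logp, teacher_probs, reduction="sum")

    return w_rel * relevance + w_cov * coverage + w_red * redundancy + w_kl * kl


# Toy usage: gradients flow back into the frame logits (and hence the scorer).
logits = torch.randn(64, requires_grad=True)
loss = holistic_selection_loss(logits, torch.randn(64, 512),
                               torch.softmax(torch.randn(64), -1))
loss.backward()
```

In the full pipeline this set-level term would be added to the downstream task's cross-entropy loss and backpropagated end-to-end through the query generator and scorer, which is what removes the need for offline pseudo-labels.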

Results & Findings

| Benchmark      | HFS result (↑ better) | Gain vs. best prior |
|----------------|-----------------------|---------------------|
| Video‑MME      | 73.4% accuracy        | +5.2 pts            |
| LongVideoBench | 68.1% R@1             | +6.8 pts            |
| MLVU           | 71.9% mAP             | +4.5 pts            |
| NExT‑QA        | 62.3% accuracy        | +5.9 pts            |

  • Redundancy reduction: Visual inspection shows selected frames are spread across the video timeline, avoiding clusters that plague top‑K methods.
  • Query sensitivity: Changing the question leads to markedly different frame sets, confirming the query‑aware design.
  • Efficiency: The small selector runs in < 10 ms per video on a single GPU, enabling real‑time pipelines.

Practical Implications

  • Cost‑effective video analytics – By selecting a handful of high‑utility frames, developers can feed lighter models (e.g., edge‑device vision models) without sacrificing accuracy, reducing compute and memory footprints.
  • Improved video QA assistants – Chat‑based assistants that answer questions about long videos can now retrieve relevant moments faster, delivering more precise responses.
  • Content moderation & indexing – Automated systems can focus on the most informative frames for detecting policy violations or generating searchable metadata, speeding up pipelines.
  • Plug‑and‑play component – HFS is model‑agnostic; it can be dropped into existing video‑reasoning stacks (e.g., CLIP‑based captioners, LLM‑driven video agents) with minimal code changes, as in the pipeline sketch below.
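
As a rough illustration of the plug-and-play point, the sketch below shows how a trained query-aware scorer could sit in front of an existing video reasoner at inference time. The function names (`encode_frames`, `embed_query`, `vlm_answer`) are hypothetical placeholders for whatever encoders and reasoner a given stack already provides.

```python
# Hypothetical plug-in usage: select frames first, then reason over only those.
# `encode_frames`, `embed_query`, `scorer`, and `vlm_answer` are stand-ins for
# whatever encoders and reasoner an existing stack already provides.
import torch


def answer_question(video_frames, question, encode_frames, embed_query, scorer,
                    vlm_answer, k=8):
    frame_feats = encode_frames(video_frames)      # (T, D) per-frame visual features
    query_emb = embed_query(question)              # (Dq,) query embedding

    # At inference a plain top-k over the query-conditioned scores replaces
    # the Gumbel-Softmax relaxation used during training.
    with torch.no_grad():
        logits = scorer(frame_feats, query_emb)    # (T,) relevance logits
    keep = torch.topk(logits, k=min(k, logits.numel())).indices.sort().values

    # Only the selected frames reach the (potentially heavy) downstream reasoner.
    selected = [video_frames[i] for i in keep.tolist()]
    return vlm_answer(selected, question)
```

Because selection happens once per query and only k frames are forwarded, the heavy reasoner's cost scales with k rather than with the length of the video.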

Limitations & Future Work

  • Dependence on a strong teacher: The mutual‑learning setup assumes access to a capable multimodal LLM, which may be unavailable or costly for some teams.
  • Scalability to ultra‑long videos: While HFS handles videos up to several minutes, videos spanning hours (e.g., surveillance footage) may still require hierarchical selection strategies.
  • Prompt design for query generation: The chain‑of‑thought prompts were hand‑crafted; automating prompt discovery could further improve robustness across domains.
  • Future directions: The authors suggest exploring reinforcement‑learning‑based selection, extending the framework to multimodal streams (audio + video), and investigating self‑supervised pre‑training for the selector to reduce reliance on large teachers.

Authors

  • Yiqing Yang
  • Kin‑Man Lam

Paper Information

  • arXiv ID: 2512.11534v1
  • Categories: cs.CV, cs.CL, cs.MM
  • Published: December 12, 2025