[Paper] HFS: Holistic Query-Aware Frame Selection for Efficient Video Reasoning
Source: arXiv - 2512.11534v1
Overview
The paper introduces HFS (Holistic Frame Selection), a new framework that picks the most informative video frames for downstream reasoning tasks (e.g., video QA, captioning) in a way that is aware of the query and optimizes the whole set of frames rather than scoring each frame in isolation. By coupling a small language model with multimodal features and training the selector end‑to‑end, HFS dramatically reduces redundancy and boosts performance on several video‑understanding benchmarks.
Key Contributions
- Query‑aware implicit vectors: A chain‑of‑thought prompt drives a small language model (SLM) to produce task‑specific query embeddings that steer frame scoring.
- Set‑level differentiable objective: A continuous loss that jointly balances relevance, coverage, and redundancy, optimized with Gumbel‑Softmax to select the best combination of frames.
- Student‑teacher mutual learning: The SLM selector (student) and a multimodal large language model reasoner (teacher) are co‑trained, aligning their frame‑importance distributions via KL divergence.
- End‑to‑end training: Eliminates the need for static pseudo‑labels generated offline, allowing the selector to adapt dynamically to each downstream task.
- State‑of‑the‑art results: Consistently outperforms prior frame‑selection baselines on Video‑MME, LongVideoBench, MLVU, and NExT‑QA.
Methodology
- Implicit Query Generation – A chain‑of‑thought prompt (e.g., “Explain why the question matters”) is fed to a lightweight language model. The model outputs a dense query vector that captures the semantics of the current task (question, caption, etc.).
- Multimodal Feature Fusion – For every video frame, visual features (e.g., CLIP embeddings) are concatenated with the query vector, producing a joint representation (a minimal scoring sketch appears after this list).
- Holistic Scoring – Instead of assigning independent scores, the method defines a set‑level loss:
- Relevance: frames should be useful for answering the query.
- Coverage: the selected set should span the full temporal extent of the video.
- Redundancy: penalize selecting visually or semantically similar frames.
This loss is differentiable thanks to the Gumbel‑Softmax trick, which approximates discrete selection while allowing gradient flow (see the set‑level loss sketch after this list).
- Student‑Teacher Mutual Learning – The teacher (a powerful multimodal LLM) processes the full video and produces a soft importance distribution over frames. The student selector learns to mimic this distribution (KL divergence) while also being guided by the cross‑entropy loss from the downstream task (a joint training‑step sketch follows the list).
- End‑to‑End Optimization – All components—query generator, frame scorer, and downstream reasoner—are trained jointly, so the selector learns to pick frames that directly improve the final task metric.
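To make the first two steps concrete, here is a minimal sketch of query‑conditioned frame scoring. Any text encoder that returns a dense vector can stand in for the SLM; the feature dimensions and the concatenation‑then‑MLP scorer are illustrative assumptions, not the authors' exact architecture.

```python
# Minimal sketch (PyTorch) of query-conditioned frame scoring.
# Dimensions and the MLP scorer are illustrative assumptions.
import torch
import torch.nn as nn

class QueryAwareScorer(nn.Module):
    def __init__(self, frame_dim=512, query_dim=512, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(frame_dim + query_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, frame_feats, query_vec):
        """frame_feats: (T, frame_dim) per-frame visual features (e.g., CLIP).
        query_vec: (query_dim,) implicit query embedding from the SLM prompt."""
        T = frame_feats.size(0)
        # Fuse each frame with the query by concatenation (joint representation).
        fused = torch.cat([frame_feats, query_vec.expand(T, -1)], dim=-1)
        return self.mlp(fused).squeeze(-1)  # (T,) query-conditioned frame logits
```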
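The set‑level objective can then be built on top of those logits. The sketch below shows one plausible differentiable instantiation of the relevance / coverage / redundancy terms with a Gumbel‑Softmax relaxation of top‑k selection; the specific term definitions and loss weights are assumptions, not the paper's exact formulation.

```python
# Sketch of a set-level selection loss with a Gumbel-Softmax relaxation.
# The relevance/coverage/redundancy definitions and weights are assumptions.
import torch
import torch.nn.functional as F

def set_level_selection(logits, frame_feats, k=8, tau=0.5, w_cov=0.1, w_red=0.1):
    """logits: (T,) query-conditioned scores; frame_feats: (T, D)."""
    T = logits.size(0)

    # Relaxed top-k: draw k soft one-hot samples and merge into soft memberships.
    samples = torch.stack(
        [F.gumbel_softmax(logits, tau=tau, hard=False) for _ in range(k)]
    )                                                   # (k, T)
    weights = samples.sum(dim=0).clamp(max=1.0)         # (T,) soft selection weights

    # Relevance: reward putting mass on high-scoring frames.
    relevance = -(weights * logits).mean()

    # Coverage: penalize temporally clustered selections (low positional variance).
    positions = torch.linspace(0.0, 1.0, T, device=logits.device)
    mean_pos = (weights * positions).sum() / (weights.sum() + 1e-6)
    pos_var = (weights * (positions - mean_pos) ** 2).sum() / (weights.sum() + 1e-6)
    coverage = -pos_var

    # Redundancy: penalize pairs of selected frames that look alike.
    sim = F.cosine_similarity(frame_feats.unsqueeze(1), frame_feats.unsqueeze(0), dim=-1)
    sim = sim * (1.0 - torch.eye(T, device=logits.device))   # ignore self-similarity
    redundancy = (weights.unsqueeze(1) * weights.unsqueeze(0) * sim).mean()

    set_loss = relevance + w_cov * coverage + w_red * redundancy
    return weights, set_loss
```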
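Finally, the mutual‑learning and end‑to‑end objectives can be combined into a single training step. The sketch below reuses the two helpers above and assumes the teacher exposes soft frame‑importance logits and the downstream task is multiple‑choice QA; the reasoner interface and loss weights are placeholders for illustration, not the authors' implementation.

```python
# Hypothetical end-to-end training step: task loss + KL alignment to the
# teacher's frame-importance distribution + the set-level selection loss.
# The reasoner interface, teacher logits, and weights are assumptions.
import torch
import torch.nn.functional as F

def training_step(scorer, reasoner, optimizer,
                  frame_feats, query_vec, teacher_frame_logits,
                  question_tokens, answer_label,
                  w_kl=0.5, w_set=0.1):
    logits = scorer(frame_feats, query_vec)                       # student frame scores
    weights, set_loss = set_level_selection(logits, frame_feats)  # soft selection

    # Downstream task loss: the reasoner consumes the softly re-weighted frames.
    selected = weights.unsqueeze(-1) * frame_feats                # (T, D)
    answer_logits = reasoner(selected, question_tokens)           # (num_choices,)
    task_loss = F.cross_entropy(answer_logits.unsqueeze(0), answer_label.view(1))

    # Student-teacher alignment of frame-importance distributions (KL divergence).
    student_log_probs = F.log_softmax(logits, dim=-1)
    teacher_probs = F.softmax(teacher_frame_logits, dim=-1)
    kl_loss = F.kl_div(student_log_probs, teacher_probs, reduction="sum")

    loss = task_loss + w_kl * kl_loss + w_set * set_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```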
Results & Findings
| Benchmark | HFS Result (↑ better) | Gain over Best Prior |
|---|---|---|
| Video‑MME | 73.4% accuracy | +5.2 pts |
| LongVideoBench | 68.1% R@1 | +6.8 pts |
| MLVU | 71.9% mAP | +4.5 pts |
| NExT‑QA | 62.3% accuracy | +5.9 pts |
- Redundancy reduction: Visual inspection shows selected frames are spread across the video timeline, avoiding clusters that plague top‑K methods.
- Query sensitivity: Changing the question leads to markedly different frame sets, confirming the query‑aware design.
- Efficiency: The small selector runs in < 10 ms per video on a single GPU, enabling real‑time pipelines.
Practical Implications
- Cost‑effective video analytics – By selecting a handful of high‑utility frames, developers can feed lighter models (e.g., edge‑device vision models) without sacrificing accuracy, reducing compute and memory footprints.
- Improved video QA assistants – Chat‑based assistants that answer questions about long videos can now retrieve relevant moments faster, delivering more precise responses.
- Content moderation & indexing – Automated systems can focus on the most informative frames for detecting policy violations or generating searchable metadata, speeding up pipelines.
- Plug‑and‑play component – HFS is model‑agnostic; it can be dropped into existing video‑reasoning stacks (e.g., CLIP‑based captioners, LLM‑driven video agents) with minimal code changes.
Limitations & Future Work
- Dependence on a strong teacher: The mutual‑learning setup assumes access to a capable multimodal LLM, which may be unavailable or costly for some teams.
- Scalability to ultra‑long videos: While HFS handles videos up to several minutes, videos spanning hours (e.g., surveillance footage) may still require hierarchical selection strategies.
- Prompt design for query generation: The chain‑of‑thought prompts were hand‑crafted; automating prompt discovery could further improve robustness across domains.
- Future directions: The authors suggest exploring reinforcement‑learning‑based selection, extending the framework to multimodal streams (audio + video), and investigating self‑supervised pre‑training for the selector to reduce reliance on large teachers.
Authors
- Yiqing Yang
- Kin‑Man Lam
Paper Information
- arXiv ID: 2512.11534v1
- Categories: cs.CV, cs.CL, cs.MM
- Published: December 12, 2025