[Paper] FocusGraph: Graph-Structured Frame Selection for Embodied Long Video Question Answering
Source: arXiv - 2603.04349v1
Overview
Understanding long, egocentric videos is a bottleneck for embodied AI agents that need to recall and reason over extended visual experiences. The paper “FocusGraph: Graph-Structured Frame Selection for Embodied Long Video Question Answering” introduces a two‑stage framework that picks the most relevant video snippets and frames before feeding them to a multimodal large language model (MLLM). By doing so, it dramatically improves answer quality while cutting inference time on demanding benchmarks such as FindingDory and HourVideo.
Key Contributions
- Scene‑Caption LLM Selector: A lightweight, trainable module that converts short video clips into concise textual captions and uses a language model to rank clips by relevance to the user query.
- Graph‑based caption representation: Captions are organized into a scene graph, enabling the selector to reason about relationships (objects, actions, locations) without processing raw pixel data.
- Patch‑wise Sparse‑Flow Retention (PSFR): A training‑free algorithm that extracts the most informative keyframes from the top‑ranked clips by analyzing motion sparsity and visual diversity.
- End‑to‑end pipeline (FocusGraph): Combines the selector and PSFR to produce a compact, query‑specific frame set for any downstream MLLM.
- State‑of‑the‑art performance: Achieves new best scores on the FindingDory and HourVideo long‑video QA benchmarks while reducing inference latency by up to 45 % compared with prior frame‑selection baselines.
Methodology
- Clip Generation: The raw egocentric video is first split into short, overlapping clips (e.g., 2‑second windows).
- Scene Captioning: Each clip is passed through a fast vision‑language model that outputs a short natural‑language description (e.g., “person pours coffee into a mug”).
- Graph Construction: Captions are parsed into a scene graph where nodes represent entities (objects, agents) and edges capture relations (e.g., person → pours → coffee).
- LLM‑Based Relevance Scoring: A lightweight LLM (fine‑tuned on a small QA‑relevance dataset) consumes the query and the graph‑structured captions, producing a relevance score for every clip. The top‑K clips are kept.
- Keyframe Extraction (PSFR): For each selected clip, sparse optical flow is computed patch‑wise. Frames that retain the most motion‑informative patches (i.e., where flow magnitude is high and spatially diverse) are chosen as keyframes, without any additional learning.
- Answer Generation: The final set of keyframes (typically a few dozen) is fed to a powerful multimodal LLM (e.g., GPT‑4‑V or LLaVA) that produces the answer to the original question.
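The selector stage described above can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the caption-to-triple parse is deliberately naive, and a simple keyword-overlap score stands in for the fine-tuned LLM relevance scorer. All function names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class SceneGraph:
    # Each triple is (subject, relation, object), e.g. ("person", "pours", "coffee")
    triples: list = field(default_factory=list)

def caption_to_graph(caption: str) -> SceneGraph:
    # Naive stand-in parse: first word = subject, last word = object,
    # everything in between = relation. A real system would use a parser.
    words = caption.lower().split()
    if len(words) < 3:
        return SceneGraph(triples=[(caption.lower(), "", "")])
    return SceneGraph(triples=[(words[0], " ".join(words[1:-1]), words[-1])])

def relevance(query: str, graph: SceneGraph) -> float:
    # Stand-in for the LLM scorer: fraction of query words present in the graph.
    q = set(query.lower().split())
    g = set()
    for s, r, o in graph.triples:
        g |= set(s.split()) | set(r.split()) | set(o.split())
    return len(q & g) / max(len(q), 1)

def select_top_k(query: str, captions: list, k: int = 2) -> list:
    # Rank clips by relevance of their caption graphs and keep the top K indices.
    graphs = [caption_to_graph(c) for c in captions]
    order = sorted(range(len(captions)),
                   key=lambda i: relevance(query, graphs[i]), reverse=True)
    return order[:k]

captions = ["person pours coffee into a mug",
            "person opens the fridge",
            "dog sleeps on the sofa"]
print(select_top_k("who pours the coffee", captions, k=1))  # → [0]
```

The point of the sketch is the interface: because scoring happens over compact text graphs rather than pixels, the ranking step is cheap and the scorer is easy to swap.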
The whole pipeline is modular: the selector can be swapped for a different captioning model, and PSFR works out‑of‑the‑box on any clip set.
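A training-free keyframe scorer in the spirit of PSFR can be sketched as follows. Note the simplification: patch-wise frame differencing stands in for sparse optical flow (the paper computes actual flow), but the selection principle is the same, favoring frames where motion is both strong and spread across many patches.

```python
import numpy as np

def patch_motion(prev, curr, patch=8):
    """Mean absolute temporal difference per non-overlapping patch
    (a cheap proxy for sparse-flow magnitude)."""
    h, w = curr.shape
    diff = np.abs(curr.astype(float) - prev.astype(float))
    ph, pw = h // patch, w // patch
    diff = diff[:ph * patch, :pw * patch].reshape(ph, patch, pw, patch)
    return diff.mean(axis=(1, 3))  # (ph, pw) per-patch motion map

def frame_scores(frames, patch=8, thresh=1.0):
    scores = [0.0]  # first frame has no predecessor to diff against
    for prev, curr in zip(frames, frames[1:]):
        m = patch_motion(prev, curr, patch)
        magnitude = m.mean()               # how strong the motion is
        diversity = (m > thresh).mean()    # how spatially spread it is
        scores.append(magnitude * diversity)
    return np.array(scores)

def select_keyframes(frames, n=2, patch=8):
    # Keep the n highest-scoring frame indices, in temporal order.
    s = frame_scores(frames, patch)
    return sorted(np.argsort(s)[-n:].tolist())

rng = np.random.default_rng(0)
static = rng.integers(0, 255, (32, 32))
frames = [static, static.copy(), static + 30, static + 60]  # motion starts late
print(select_keyframes(frames, n=2))  # → [2, 3]
```

Because nothing here is learned, the same scorer can be dropped onto any clip set, which is what makes the keyframe stage cheap to reuse across domains.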
Results & Findings
| Benchmark | Baseline Accuracy | FocusGraph Accuracy | Prior-Best Accuracy | Inference Speedup |
|---|---|---|---|---|
| FindingDory | 71.4 % | 78.9 % | 73.2 % | ~40 % |
| HourVideo | 64.1 % | 70.3 % | 66.5 % | ~45 % |
- Quality boost: By feeding only the most semantically relevant frames, the MLLM avoids “information overload” and can focus its reasoning, leading to higher QA accuracy.
- Speed gains: The selector operates on textual captions (≈10 KB per clip) rather than raw frames, slashing the amount of data the MLLM processes. PSFR adds negligible overhead.
- Ablation: Removing the graph structure (using flat captions) drops performance by ~3 %, while replacing PSFR with uniform frame sampling costs ~2 % accuracy and doubles runtime.
Practical Implications
- Embodied agents & robotics: Robots that need to answer “what did I do before the alarm rang?” can now retrieve relevant memories quickly, enabling real‑time assistance.
- Video analytics platforms: Companies building long‑form video search (e.g., security footage, sports replay) can integrate FocusGraph to pre‑filter frames before running expensive LLM inference, cutting cloud costs.
- AR/VR experiences: Wearable devices can store a compact, query‑driven visual diary, allowing users to ask “where did I leave my keys?” without streaming the entire video history.
- Developer workflow: The framework is built from off‑the‑shelf components (vision‑language captioner, LLM, optical‑flow library), making it straightforward to plug into existing pipelines that already use multimodal LLMs.
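To make the modularity claim concrete, here is a hypothetical wiring sketch in which each stage is an interchangeable callable. The names and toy components are illustrative only, not the paper's API; in practice the lambdas would be replaced by a captioning model, the relevance LLM, PSFR, and a multimodal LLM.

```python
def run_pipeline(clips, captioner, score, pick_keyframes, answer, query, k=2):
    # Stage 1: clips → captions (swappable vision-language model).
    captions = [captioner(c) for c in clips]
    # Stage 2: rank clips by query relevance and keep the top K.
    ranked = sorted(range(len(clips)),
                    key=lambda i: score(query, captions[i]), reverse=True)[:k]
    # Stage 3: extract keyframes from the selected clips only.
    keyframes = [f for i in ranked for f in pick_keyframes(clips[i])]
    # Stage 4: hand the compact frame set to the answering model.
    return answer(query, keyframes)

# Toy components standing in for the real models.
caps = {"c1": "person pours coffee", "c2": "dog sleeps"}
result = run_pipeline(
    clips=["c1", "c2"],
    captioner=lambda c: caps[c],
    score=lambda q, cap: len(set(q.split()) & set(cap.split())),
    pick_keyframes=lambda c: [f"{c}-frame0"],
    answer=lambda q, frames: f"answered '{q}' from {len(frames)} frames",
    query="who pours coffee",
    k=1,
)
print(result)
```

The design choice worth copying is that no stage knows about its neighbors' internals: upgrading the captioner or the answering MLLM never touches the selection logic.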
Limitations & Future Work
- Caption quality dependency: The selector’s effectiveness hinges on accurate scene captions; noisy or domain‑specific visuals (e.g., underwater, low‑light) may degrade performance.
- Fixed clip length: Uniform clip windows may miss long‑range dependencies that span multiple clips; adaptive segmentation could be explored.
- Scalability to ultra‑long videos: While inference time is reduced, processing hours‑long videos still requires substantial pre‑processing (captioning, flow). Distributed or incremental captioning is a promising direction.
- User‑controlled granularity: Future work could expose a “budget” parameter so developers can trade off answer fidelity against latency on the fly.
FocusGraph demonstrates that a smart, graph‑aware frame selection stage can unlock the full potential of multimodal LLMs for long‑video question answering, offering both higher accuracy and faster responses—an attractive proposition for any developer building next‑generation embodied AI systems.
Authors
- Tatiana Zemskova
- Solomon Andryushenko
- Ilya Obrubov
- Viktoriia Khoruzhaia
- Ekaterina Eroshenko
- Ekaterina Derevyanka
- Dmitry Yudin
Paper Information
- arXiv ID: 2603.04349v1
- Categories: cs.CV
- Published: March 4, 2026