[Paper] LongVideoAgent: Multi-Agent Reasoning with Long Videos

Published: December 23, 2025 at 01:59 PM EST
3 min read
Source: arXiv - 2512.20618v1

Overview

The paper introduces LongVideoAgent, a multi‑agent system that lets a large language model (LLM) reason over hour‑long video episodes without collapsing them into lossy summaries. By delegating grounding and visual extraction to specialized agents, the framework produces fine‑grained, temporally grounded answers to long‑video question answering (QA) tasks and achieves state‑of‑the‑art results on the newly released LongTVQA and LongTVQA+ benchmarks.

Key Contributions

  • Multi‑agent architecture: A master LLM orchestrates a grounding agent (locates relevant video segments) and a vision agent (produces targeted textual observations).
  • Reinforcement‑learning (RL) fine‑tuning: The master agent is trained with a step‑limited reward that balances answer correctness, conciseness, and computational efficiency.
  • New episode‑level benchmarks: LongTVQA and LongTVQA+ aggregate full‑length TV episodes from TVQA/TVQA+, providing a realistic testbed for hour‑scale video reasoning.
  • Interpretability: The system yields explicit reasoning traces (grounded timestamps and extracted observations) that developers can inspect; a hypothetical trace schema is sketched after this list.
  • State‑of‑the‑art performance: The multi‑agent pipeline outperforms strong non‑agent baselines by a sizable margin on both datasets.
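
To make the interpretability point concrete, a trace entry could be as simple as a small record per reasoning step. The schema below is hypothetical, not the paper's format; every field name is an assumption:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class TraceStep:
    """One step of the master agent's reasoning trace (hypothetical schema)."""
    step: int                # index within the step budget
    cue: str                 # textual cue the master sent to the grounding agent
    clip_start: float        # grounded clip start time, in seconds
    clip_end: float          # grounded clip end time, in seconds
    observations: List[str]  # textual observations returned by the vision agent

@dataclass
class ReasoningTrace:
    """Full trace for one question: inspectable, auditable output."""
    question: str
    steps: List[TraceStep] = field(default_factory=list)
    answer: str = ""
```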

Methodology

  1. Master LLM (Planner) – Receives the user question and decides, step‑by‑step, which sub‑tasks to invoke. It is constrained to a maximum number of reasoning steps to keep inference tractable.
  2. Grounding Agent – Given a textual cue from the master, it searches the long video (using pre‑computed visual embeddings and subtitles) and returns a short clip (e.g., a 5‑second window) that is most likely to contain the answer.
  3. Vision Agent – Runs a vision‑language model on the selected clip, producing concise textual observations (object names, actions, scene changes) that complement subtitle text.
  4. Iterative Loop – The master can request additional grounding/vision passes, refine its hypothesis, and finally generate the answer.
  5. RL Training – A reward function penalizes unnecessary steps and rewards correct answers. Proximal Policy Optimization (PPO) is used to fine‑tune the master’s policy while keeping the grounding and vision agents frozen.
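
The summary above does not give the exact reward formula, so the following is only a minimal sketch of a step‑limited reward in the spirit of step 5, assuming a binary correctness signal, a fixed step budget, and an illustrative per‑step cost:

```python
def step_limited_reward(correct: bool, steps_used: int,
                        step_budget: int = 8, step_cost: float = 0.05) -> float:
    """Illustrative step-limited reward: +1 for a correct final answer,
    0 otherwise, minus a small cost per reasoning step; exceeding the
    budget forfeits the correctness bonus."""
    if steps_used > step_budget:
        return -step_cost * step_budget  # ran out of budget before answering
    return (1.0 if correct else 0.0) - step_cost * steps_used
```

PPO would then optimize the master's policy against this scalar signal while the grounding and vision agents stay frozen.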

The whole pipeline runs on commodity GPUs; the grounding and vision modules can be swapped out for newer models without retraining the master.
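
One way to picture this modularity is a master loop that talks to the two agents through narrow interfaces, so either agent can be replaced without touching the planner. The sketch below is schematic, not the paper's code; the `plan`/`answer` methods and the action fields are assumptions:

```python
from typing import List, Protocol, Tuple

class GroundingAgent(Protocol):
    def locate(self, cue: str) -> Tuple[float, float]:
        """Return (start, end) seconds of the clip most relevant to the cue."""

class VisionAgent(Protocol):
    def describe(self, start: float, end: float) -> List[str]:
        """Return concise textual observations for the given clip."""

def answer_question(question: str, master_llm, grounder: GroundingAgent,
                    vision: VisionAgent, max_steps: int = 8) -> str:
    """Schematic master loop: ground, observe, and iterate within a step budget."""
    observations: List[str] = []
    for _ in range(max_steps):
        # The master decides the next action from the question and evidence so far
        # (plan/answer are hypothetical methods on an LLM wrapper).
        action = master_llm.plan(question, observations)
        if action.kind == "answer":
            return action.text
        start, end = grounder.locate(action.cue)      # temporal grounding
        observations += vision.describe(start, end)   # targeted visual evidence
    # Step budget exhausted: answer from the evidence gathered so far.
    return master_llm.answer(question, observations)
```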

Results & Findings

Model                             | Accuracy (LongTVQA) | Accuracy (LongTVQA+)
Baseline LLM + full‑video concat  | 42.3 %              | 38.7 %
Retrieval‑augmented LLM           | 48.9 %              | 45.1 %
LongVideoAgent (w/ RL)            | 57.4 %              | 53.2 %
LongVideoAgent (no RL)            | 54.1 %              | 50.8 %

  • RL fine‑tuning improves both correctness and step efficiency (average steps ↓ from 7.2 to 5.4).
  • Grounding reduces irrelevant context: 84 % of retrieved clips contain the answer span versus 61 % for naïve sliding‑window retrieval.
  • Vision observations add ~12 % absolute gain over subtitle‑only baselines, confirming the value of visual detail.

Practical Implications

  • Content‑aware assistants: Developers can build chatbots that answer user queries about full‑length movies, lectures, or surveillance footage without pre‑summarizing the media.
  • Efficient indexing: The grounding agent works on pre‑computed embeddings, enabling fast retrieval even on terabyte‑scale video archives; see the retrieval sketch after this list.
  • Modular upgrades: As better vision‑language models emerge (e.g., Flamingo‑2, GPT‑4V), they can replace the vision agent, instantly boosting performance.
  • Explainable AI: The explicit clip timestamps and observation logs make it easier to debug or comply with audit requirements in media‑analysis pipelines.
  • Reduced compute cost: By focusing computation on a handful of short clips instead of processing the entire video, inference cost drops dramatically (roughly 70 % fewer FLOPs than end‑to‑end video LLMs).
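
As a rough illustration of the "efficient indexing" point above, grounding over pre‑computed embeddings can be as simple as a cosine‑similarity lookup over per‑clip vectors. The function below is a sketch under that assumption, not the paper's retrieval code:

```python
import numpy as np

def locate_clip(query_emb: np.ndarray, clip_embs: np.ndarray,
                clip_starts: np.ndarray, window: float = 5.0) -> tuple:
    """Pick the clip whose pre-computed embedding is closest (cosine similarity)
    to the query embedding and return its (start, end) window in seconds.

    clip_embs:   (num_clips, dim) matrix computed offline for the whole video.
    clip_starts: (num_clips,) clip start times in seconds.
    """
    q = query_emb / (np.linalg.norm(query_emb) + 1e-8)
    c = clip_embs / (np.linalg.norm(clip_embs, axis=1, keepdims=True) + 1e-8)
    best = int(np.argmax(c @ q))  # index of the highest cosine similarity
    return float(clip_starts[best]), float(clip_starts[best]) + window
```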

Limitations & Future Work

  • Dependency on subtitles: The current grounding agent heavily leverages subtitle timestamps; videos lacking accurate subtitles may see degraded performance.
  • Fixed step budget: While a step limit keeps inference cheap, it can truncate complex multi‑hop reasoning; adaptive budgeting is an open direction.
  • Scalability of vision agent: Processing high‑resolution clips still incurs non‑trivial GPU load; future work could explore lightweight visual tokenizers or hierarchical attention.
  • Generalization beyond TV: The datasets focus on scripted TV episodes; applying the framework to documentaries, sports, or user‑generated content will require domain‑specific grounding cues.

LongVideoAgent demonstrates that a coordinated multi‑agent approach can finally make hour‑long video reasoning practical for developers, opening the door to richer, temporally aware AI applications.

Authors

  • Runtao Liu
  • Ziyi Liu
  • Jiaqi Tang
  • Yue Ma
  • Renjie Pi
  • Jipeng Zhang
  • Qifeng Chen

Paper Information

  • arXiv ID: 2512.20618v1
  • Categories: cs.AI, cs.CV, cs.LG, cs.MA
  • Published: December 23, 2025