[Paper] LongVideoAgent: Multi-Agent Reasoning with Long Videos

Published: December 23, 2025 at 01:59 PM EST
3 min read
Source: arXiv - 2512.20618v1

Overview

The paper introduces LongVideoAgent, a multi‑agent system that lets a large language model (LLM) reason over hour‑long video episodes without collapsing them into lossy summaries. By delegating grounding and visual extraction to specialized agents, the framework produces fine‑grained, temporally grounded answers to long‑video question answering (QA) tasks and achieves state‑of‑the‑art results on the newly released LongTVQA and LongTVQA+ benchmarks.

Key Contributions

  • Multi‑agent architecture: A master LLM orchestrates a grounding agent (locates relevant video segments) and a vision agent (produces targeted textual observations).
  • Reinforcement‑learning (RL) fine‑tuning: The master agent is trained with a step‑limited reward that balances answer correctness, conciseness, and computational efficiency.
  • New episode‑level benchmarks: LongTVQA and LongTVQA+ aggregate full‑length TV episodes from TVQA/TVQA+, providing a realistic testbed for hour‑scale video reasoning.
  • Interpretability: The system yields explicit reasoning traces (grounded timestamps and extracted observations) that developers can inspect; a hypothetical trace schema is sketched after this list.
  • State‑of‑the‑art performance: The multi‑agent pipeline outperforms strong non‑agent baselines by a sizable margin on both datasets.
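
To make the interpretability point concrete, a trace entry could be as simple as a small record per reasoning step. The schema below is hypothetical, not the paper's format; every field name is an assumption:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class TraceStep:
    """One step of the master agent's reasoning trace (hypothetical schema)."""
    step: int                # index within the step budget
    cue: str                 # textual cue the master sent to the grounding agent
    clip_start: float        # grounded clip start time, in seconds
    clip_end: float          # grounded clip end time, in seconds
    observations: List[str]  # textual observations returned by the vision agent

@dataclass
class ReasoningTrace:
    """Full trace for one question: inspectable, auditable output."""
    question: str
    steps: List[TraceStep] = field(default_factory=list)
    answer: str = ""
```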

Methodology

  1. Master LLM (Planner) – Receives the user question and decides, step‑by‑step, which sub‑tasks to invoke. It is constrained to a maximum number of reasoning steps to keep inference tractable.
  2. Grounding Agent – Given a textual cue from the master, it searches the long video (using pre‑computed visual embeddings and subtitles) and returns a short clip (e.g., a 5‑second window) that is most likely to contain the answer.
  3. Vision Agent – Runs a vision‑language model on the selected clip, producing concise textual observations (object names, actions, scene changes) that complement subtitle text.
  4. Iterative Loop – The master can request additional grounding/vision passes, refine its hypothesis, and finally generate the answer.
  5. RL Training – A reward function penalizes unnecessary steps and rewards correct answers. Proximal Policy Optimization (PPO) is used to fine‑tune the master’s policy while keeping the grounding and vision agents frozen.
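
The summary above does not give the exact reward formula, so the following is only a minimal sketch of a step‑limited reward in the spirit of step 5, assuming a binary correctness signal, a fixed step budget, and an illustrative per‑step cost:

```python
def step_limited_reward(correct: bool, steps_used: int,
                        step_budget: int = 8, step_cost: float = 0.05) -> float:
    """Illustrative step-limited reward: +1 for a correct final answer,
    0 otherwise, minus a small cost per reasoning step; exceeding the
    budget forfeits the correctness bonus."""
    if steps_used > step_budget:
        return -step_cost * step_budget  # ran out of budget before answering
    return (1.0 if correct else 0.0) - step_cost * steps_used
```

PPO would then optimize the master's policy against this scalar signal while the grounding and vision agents stay frozen.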

The whole pipeline runs on commodity GPUs; the grounding and vision modules can be swapped out for newer models without retraining the master.
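
One way to picture this modularity is a master loop that talks to the two agents through narrow interfaces, so either agent can be replaced without touching the planner. The sketch below is schematic, not the paper's code; the `plan`/`answer` methods and the action fields are assumptions:

```python
from typing import List, Protocol, Tuple

class GroundingAgent(Protocol):
    def locate(self, cue: str) -> Tuple[float, float]:
        """Return (start, end) seconds of the clip most relevant to the cue."""

class VisionAgent(Protocol):
    def describe(self, start: float, end: float) -> List[str]:
        """Return concise textual observations for the given clip."""

def answer_question(question: str, master_llm, grounder: GroundingAgent,
                    vision: VisionAgent, max_steps: int = 8) -> str:
    """Schematic master loop: ground, observe, and iterate within a step budget."""
    observations: List[str] = []
    for _ in range(max_steps):
        # The master decides the next action from the question and evidence so far
        # (plan/answer are hypothetical methods on an LLM wrapper).
        action = master_llm.plan(question, observations)
        if action.kind == "answer":
            return action.text
        start, end = grounder.locate(action.cue)      # temporal grounding
        observations += vision.describe(start, end)   # targeted visual evidence
    # Step budget exhausted: answer from the evidence gathered so far.
    return master_llm.answer(question, observations)
```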

Results & Findings

Model                             | Accuracy (LongTVQA) | Accuracy (LongTVQA+)
Baseline LLM + full‑video concat  | 42.3 %              | 38.7 %
Retrieval‑augmented LLM           | 48.9 %              | 45.1 %
LongVideoAgent (w/ RL)            | 57.4 %              | 53.2 %
LongVideoAgent (no RL)            | 54.1 %              | 50.8 %

  • RL fine‑tuning improves both correctness and step efficiency (average steps ↓ from 7.2 to 5.4).
  • Grounding reduces irrelevant context: 84 % of retrieved clips contain the answer span versus 61 % for naïve sliding‑window retrieval.
  • Vision observations add ~12 % absolute gain over subtitle‑only baselines, confirming the value of visual detail.

Practical Implications

  • Content‑aware assistants: Developers can build chatbots that answer user queries about full‑length movies, lectures, or surveillance footage without pre‑summarizing the media.
  • Efficient indexing: The grounding agent works on pre‑computed embeddings, enabling fast retrieval even on terabyte‑scale video archives; see the retrieval sketch after this list.
  • Modular upgrades: As better vision‑language models emerge (e.g., Flamingo‑2, GPT‑4V), they can replace the vision agent, instantly boosting performance.
  • Explainable AI: The explicit clip timestamps and observation logs make it easier to debug or comply with audit requirements in media‑analysis pipelines.
  • Reduced compute cost: By focusing computation on a handful of short clips instead of processing the entire video, inference cost drops dramatically (roughly 70 % fewer FLOPs than end‑to‑end video LLMs).
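
As a rough illustration of the "efficient indexing" point above, grounding over pre‑computed embeddings can be as simple as a cosine‑similarity lookup over per‑clip vectors. The function below is a sketch under that assumption, not the paper's retrieval code:

```python
import numpy as np

def locate_clip(query_emb: np.ndarray, clip_embs: np.ndarray,
                clip_starts: np.ndarray, window: float = 5.0) -> tuple:
    """Pick the clip whose pre-computed embedding is closest (cosine similarity)
    to the query embedding and return its (start, end) window in seconds.

    clip_embs:   (num_clips, dim) matrix computed offline for the whole video.
    clip_starts: (num_clips,) clip start times in seconds.
    """
    q = query_emb / (np.linalg.norm(query_emb) + 1e-8)
    c = clip_embs / (np.linalg.norm(clip_embs, axis=1, keepdims=True) + 1e-8)
    best = int(np.argmax(c @ q))  # index of the highest cosine similarity
    return float(clip_starts[best]), float(clip_starts[best]) + window
```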

Limitations & Future Work

  • Dependency on subtitles: The current grounding agent heavily leverages subtitle timestamps; videos lacking accurate subtitles may see degraded performance.
  • Fixed step budget: While a step limit keeps inference cheap, it can truncate complex multi‑hop reasoning; adaptive budgeting is an open direction.
  • Scalability of vision agent: Processing high‑resolution clips still incurs non‑trivial GPU load; future work could explore lightweight visual tokenizers or hierarchical attention.
  • Generalization beyond TV: The datasets focus on scripted TV episodes; applying the framework to documentaries, sports, or user‑generated content will require domain‑specific grounding cues.

LongVideoAgent demonstrates that a coordinated multi‑agent approach can finally make hour‑long video reasoning practical for developers, opening the door to richer, temporally aware AI applications.

Authors

  • Runtao Liu
  • Ziyi Liu
  • Jiaqi Tang
  • Yue Ma
  • Renjie Pi
  • Jipeng Zhang
  • Qifeng Chen

Paper Information

  • arXiv ID: 2512.20618v1
  • Categories: cs.AI, cs.CV, cs.LG, cs.MA
  • Published: December 23, 2025