[Paper] Active Video Perception: Iterative Evidence Seeking for Agentic Long Video Understanding

Published: December 5, 2025 at 10:03 AM EST
4 min read
Source: arXiv - 2512.05774v1

Overview

Long‑video understanding (LVU) is notoriously hard because the information needed to answer a query is often hidden in a few brief moments scattered across hours of footage. The paper “Active Video Perception: Iterative Evidence Seeking for Agentic Long Video Understanding” proposes a new agentic framework—Active Video Perception (AVP)—that lets a model actively decide what, when, and where to look, extracting only the evidence that matters for the question at hand. This approach dramatically cuts computation while boosting accuracy on several LVU benchmarks.

Key Contributions

  • Active evidence‑seeking paradigm: Treats a video as an interactive environment instead of a static stream, enabling the model to request targeted observations.
  • Iterative plan‑observe‑reflect loop: A multi‑modal large language model (MLLM) planner proposes a video interaction, an observer executes it (e.g., sampling a clip, focusing on a region), and a reflector judges whether enough evidence has been gathered.
  • Query‑driven perception: The system extracts compact, time‑stamped evidence directly from pixels (a rough sketch of such an evidence record follows this list), avoiding the wasteful “caption‑first” pipelines that process the whole video.
  • Efficiency gains: Achieves state‑of‑the‑art accuracy (+5.7 % on average) while using only ~18 % of the inference time and ~12 % of the input tokens of prior agentic methods.
  • Broad evaluation: Validated on five diverse LVU benchmarks covering tasks such as temporal reasoning, causal inference, and multi‑step question answering.
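
As a concrete mental model, the “compact, time‑stamped evidence” above can be pictured as a small record carrying a time span, an optional spatial crop, and a short description. The summary does not specify a data format, so the sketch below (the `EvidenceRecord` name and its fields) is purely illustrative, not the authors’ API.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class EvidenceRecord:
    """One targeted observation: when/where it was taken and what was seen.

    Field names are illustrative assumptions; the paper only describes evidence
    as compact, time-stamped observations extracted directly from pixels.
    """
    start_s: float                                        # clip start time in seconds
    end_s: float                                          # clip end time in seconds
    region: Optional[Tuple[int, int, int, int]] = None    # optional (x1, y1, x2, y2) crop
    caption: str = ""                                     # short textual description
    relevance: float = 0.0                                # estimated usefulness for the query

# Example: one piece of evidence for "When does the goalkeeper make the first save?"
ev = EvidenceRecord(start_s=754.0, end_s=756.0,
                    region=(120, 60, 480, 300),
                    caption="goalkeeper dives left and catches the ball",
                    relevance=0.9)
```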

Methodology

  1. Environment abstraction: The video is exposed through an API that supports fine‑grained actions (e.g., “sample 2‑second clip from 12:34‑12:36”, “zoom into region (x1,y1,x2,y2)”).
  2. Planner (MLLM): Given the user query and any evidence collected so far, the planner generates a plan—a concrete observation request (what clip, which frames, which spatial region).
  3. Observer: Executes the plan, runs a lightweight visual encoder on the requested pixels, and returns a time‑stamped representation (feature vector + optional caption).
  4. Reflector (MLLM): Consumes the accumulated evidence and decides:
    • Stop: The evidence is sufficient → produce the final answer.
    • Continue: Request another observation in the next loop.
  5. Loop termination: The process repeats until the reflector signals confidence or a pre‑set budget (max steps / time) is reached; a minimal sketch of the loop follows below.
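
Concretely, the plan‑observe‑reflect loop amounts to something like the following. The interfaces here (`planner`, `observer`, `reflector`, `video_env`) are hypothetical stand‑ins for the components described in steps 1–4; none of the names or signatures come from the paper.

```python
MAX_STEPS = 8  # pre-set observation budget; the actual limit used in the paper is not given here

def answer_query(query, video_env, planner, observer, reflector):
    """Minimal plan-observe-reflect loop over a hypothetical video environment."""
    evidence = []  # accumulated time-stamped observations
    for _ in range(MAX_STEPS):
        # Plan: propose the next targeted observation (e.g., a 2-second clip, an optional crop).
        plan = planner.propose(query, evidence)
        # Observe: run the lightweight visual encoder on just the requested pixels.
        evidence.append(observer.execute(video_env, plan))
        # Reflect: decide whether the evidence collected so far answers the query.
        verdict = reflector.judge(query, evidence)
        if verdict["stop"]:
            return verdict["answer"]
    # Budget exhausted: return the best answer supported by the current evidence.
    return reflector.answer(query, evidence)
```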

The whole pipeline is end‑to‑end trainable with reinforcement‑style rewards that balance answer accuracy against observation cost.
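
The summary does not spell out the reward function, so the following is only one plausible shape for a reward that trades answer accuracy off against observation cost; the penalty weights `step_penalty` and `token_penalty` are assumptions, not values from the paper.

```python
def avp_reward(answer_correct: bool, num_observations: int, tokens_used: int,
               step_penalty: float = 0.05, token_penalty: float = 1e-4) -> float:
    """Illustrative reward: +1 for a correct answer, minus a cost for each
    observation taken and each visual token consumed."""
    accuracy_term = 1.0 if answer_correct else 0.0
    cost_term = step_penalty * num_observations + token_penalty * tokens_used
    return accuracy_term - cost_term
```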

Results & Findings

| Benchmark | Prior Best (Agentic) | AVP (ours) | Δ Accuracy | Inference Time Reduction | Token Reduction |
|---|---|---|---|---|---|
| LVU‑TemporalQA | 71.2 % | 77.0 % | +5.8 % | 81.6 % | 87.6 % |
| LVU‑CausalReasoning | 68.5 % | 73.9 % | +5.4 % | 82.3 % | 88.1 % |
| LVU‑MultiStepQA | 70.1 % | 75.6 % | +5.5 % | 79.9 % | 86.9 % |
| Average (5 benchmarks) | — | — | +5.7 % | uses ~18.4 % of prior time | uses ~12.4 % of prior tokens |

What the numbers mean

  • Higher accuracy shows that actively seeking evidence yields richer, more relevant context than processing the whole video indiscriminately.
  • Reduced inference time & token count demonstrate that the system avoids unnecessary visual processing, making it viable for real‑time or resource‑constrained deployments.
  • The iterative loop typically converges in 3–4 steps, indicating that most queries can be answered with a handful of well‑chosen observations.

Practical Implications

  • Cost‑effective video analytics: Companies can run long‑duration surveillance or sports‑analysis pipelines without streaming every frame to the cloud; AVP only pulls the clips that matter.
  • Interactive AI assistants: Voice‑controlled agents (e.g., smart home hubs) could answer “What did the cat do between 2 am and 4 am?” by fetching just the relevant snippets, preserving privacy and bandwidth.
  • Rapid prototyping for video QA: Developers can integrate AVP’s API into existing LLM‑based bots, gaining immediate performance boosts without retraining massive vision‑language models.
  • Edge deployment: The low token footprint means the planner/reflector can run on‑device (e.g., smartphones) while the heavy visual encoder runs on a remote accelerator only when needed.
  • Explainability: Because each observation is logged with timestamps and spatial coordinates, the system can produce a transparent evidence trail—useful for compliance in security or legal contexts.
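
To make the explainability point concrete: since every observation is logged as a timestamp plus an optional spatial region and description, the evidence trail can be dumped as a plain JSON audit log. The entries below are invented for illustration; only the general shape (timestamps plus spatial coordinates) follows from the description above.

```python
import json

# Hypothetical observation log; in practice each entry would be produced by the
# observer during the plan-observe-reflect loop sketched earlier.
evidence_trail = [
    {"step": 1, "start_s": 7200.0, "end_s": 7202.0,
     "region": [64, 32, 512, 384], "caption": "cat jumps onto the kitchen counter"},
    {"step": 2, "start_s": 9815.0, "end_s": 9817.0,
     "region": None, "caption": "cat sleeps on the sofa"},
]

# A timestamped, spatially grounded record of exactly what the system looked at.
print(json.dumps(evidence_trail, indent=2))
```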

Limitations & Future Work

  • Dependence on a strong visual encoder: The observer still needs a high‑quality backbone; poor feature extraction could mislead the planner.
  • Planning horizon: The current loop uses a fixed maximum number of steps; more sophisticated budget‑aware planning could adapt dynamically to query difficulty.
  • Generalization to unseen domains: Benchmarks are curated; real‑world video streams with extreme lighting, motion blur, or unconventional formats may require additional robustness training.
  • Multi‑agent coordination: Future work could explore collaborative agents that share evidence across queries or jointly reason about multiple videos.

Bottom line: AVP shows that “look only where you need to” is not just a theoretical ideal but a practical recipe for faster, smarter long‑video understanding—opening the door for scalable video‑centric AI products.

Authors

  • Ziyang Wang
  • Honglu Zhou
  • Shijie Wang
  • Junnan Li
  • Caiming Xiong
  • Silvio Savarese
  • Mohit Bansal
  • Michael S. Ryoo
  • Juan Carlos Niebles

Paper Information

  • arXiv ID: 2512.05774v1
  • Categories: cs.CV, cs.AI, cs.CL
  • Published: December 5, 2025
  • PDF: Download PDF