[Paper] Video-o3: Native Interleaved Clue Seeking for Long Video Multi-Hop Reasoning

Published: January 30, 2026 at 12:47 PM EST
4 min read
Source: arXiv - 2601.23224v1

Overview

The paper presents Video‑o3, a new multimodal reasoning framework that lets a model “search” through long videos the way a human would—by iteratively spotting, zooming into, and confirming visual clues until it has enough evidence to answer a question. By integrating native tool calls (e.g., segment retrieval, frame‑level inspection) directly into the reasoning loop, Video‑o3 overcomes the brittleness of existing long‑video LLMs that rely on coarse, uniform sampling and single‑turn inference.

Key Contributions

  • Interleaved clue‑seeking loop – A native tool‑invocation architecture that alternates between language reasoning and video‑specific actions (segment fetch, frame zoom, termination) in a single, end‑to‑end model.
  • Task‑Decoupled Attention Masking – A novel attention scheme that isolates the reasoning step from the tool‑calling step, preventing attention “diffusion” while still sharing a global video context.
  • Verifiable Trajectory‑Guided Reward – A reinforcement‑learning reward that balances exploration (covering more of the video) with efficiency (stopping early once enough evidence is gathered), keeping the interaction length tractable.
  • Seeker‑173K dataset – A large‑scale synthetic corpus of 173 K tool‑interaction trajectories (question → series of video tool calls → answer) that enables supervised pre‑training and RL fine‑tuning of the interleaved system.
  • State‑of‑the‑art performance – Video‑o3 achieves 72.1 % accuracy on the MLVU benchmark and 46.5 % on Video‑Holmes, substantially surpassing prior multimodal LLMs on long‑video multi‑hop reasoning tasks.
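To make the exploration/efficiency trade-off concrete, here is a minimal sketch of what a trajectory-guided reward of this shape could look like. The function name, the linear form, and the weights `alpha` and `beta` are illustrative assumptions, not the paper's actual formulation:

```python
def trajectory_reward(correct, segments_covered, total_segments, num_calls,
                      max_calls=8, alpha=0.3, beta=0.2):
    """Toy trajectory-guided reward: correctness plus a coverage bonus
    (exploration), minus a per-call cost (efficiency). All weights are
    illustrative, not taken from the paper."""
    coverage = segments_covered / total_segments   # rewards seeing more of the video
    call_cost = num_calls / max_calls              # discourages long interactions
    return float(correct) + alpha * coverage - beta * call_cost
```

A correct answer reached in few calls scores highest; a correct answer reached by exhaustively scanning the video is penalized, which is the trade-off the paper's efficiency numbers reflect.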

Methodology

  1. Unified Model Backbone – A large language model (LLM) is augmented with a set of video‑specific tools (segment selector, frame extractor, evidence verifier). The LLM generates a textual “plan” at each turn, which may include a tool call.
  2. Task‑Decoupled Attention – During a reasoning turn, the model attends only to the textual prompt and the global video embedding. When a tool call is issued, a separate attention mask isolates the tool‑specific inputs (e.g., timestamps, frame IDs) so the model can focus on the low‑level visual operation without contaminating the higher‑level reasoning context.
  3. Iterative Loop
    • Seek: The model predicts a segment likely to contain evidence.
    • Inspect: It fetches frames or sub‑segments for fine‑grained inspection.
    • Verify: It decides whether the gathered evidence suffices; if not, it repeats.
  4. Training Pipeline
    • Supervised pre‑training on Seeker‑173K provides the model with exemplar trajectories.
    • Reinforcement learning with the trajectory‑guided reward fine‑tunes the policy to maximize correct answers while minimizing unnecessary tool calls.
  5. Termination Policy – A learned “stop” token lets the model end the loop as soon as confidence crosses a threshold, preventing runaway interaction lengths.
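The seek → inspect → verify loop with a learned stop action can be sketched as a simple driver. Everything here (the `Action` type, the `policy` callable, the demo payloads) is a hypothetical stand-in for the model and its video tools, not the paper's implementation:

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str                  # "seek", "inspect", or "stop"
    payload: object = None     # tool arguments/result, or the final answer

def clue_seeking_loop(policy, max_turns=8):
    """Toy driver for the interleaved loop: the policy alternates reasoning
    and tool calls until it emits a stop action or the turn budget runs out.
    `policy` stands in for the LLM's per-turn plan."""
    evidence = []
    for _ in range(max_turns):
        action = policy(evidence)
        if action.kind == "stop":            # learned termination token
            return action.payload, len(evidence)
        evidence.append(action.payload)      # result of a seek/inspect call
    return None, len(evidence)               # budget exhausted, no answer

# Example policy: seek one segment, inspect its frames, then answer.
def demo_policy(evidence):
    if len(evidence) == 0:
        return Action("seek", "segment[120s-150s]")
    if len(evidence) == 1:
        return Action("inspect", "frames 3001-3010")
    return Action("stop", "the printer jams at 02:14")
```

Running `clue_seeking_loop(demo_policy)` returns the answer after two tool calls; the `max_turns` cap plays the role of the termination policy's guard against runaway interaction lengths.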

Results & Findings

| Benchmark | Prior SOTA | Video‑o3 | Δ (absolute) |
|---|---|---|---|
| MLVU (accuracy) | 64.3 % | 72.1 % | +7.8 % |
| Video‑Holmes (accuracy) | 38.2 % | 46.5 % | +8.3 % |
  • Evidence‑Seeking Efficiency – On average, Video‑o3 required 3.2 tool calls per question, compared to 7.1 calls in a naïve exhaustive baseline, while still achieving higher accuracy.
  • Ablation – Removing Task‑Decoupled Attention dropped performance by ~4 %, confirming its role in preserving reasoning focus.
  • Reward Impact – The trajectory‑guided reward reduced average interaction length by 28 % without hurting accuracy, demonstrating effective trade‑off control.

Practical Implications

  • Developer‑friendly APIs – The native tool‑invocation design maps cleanly to function‑calling interfaces (e.g., OpenAI function calls), making it straightforward to embed Video‑o3 into existing LLM pipelines.
  • Cost‑Effective Video QA – By only pulling high‑resolution frames for segments that matter, cloud compute and storage costs are dramatically reduced compared with brute‑force frame‑sampling approaches.
  • Real‑World Use Cases
    • Customer Support: Automated troubleshooting from long product demo videos (e.g., “When does the printer jam?”).
    • Content Moderation: Spotting policy‑violating moments in hours‑long livestreams without scanning every second.
    • Education: Interactive Q&A over lecture recordings where the system can jump to the exact slide or demonstration.
  • Extensibility – The framework can be extended with domain‑specific tools (e.g., OCR, speech‑to‑text) to handle multimodal evidence beyond raw pixels.
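As a sketch of the "maps cleanly to function-calling interfaces" point, the two core video tools could be declared as OpenAI-style tool schemas. The tool names, parameters, and descriptions below are illustrative assumptions, not an API published with the paper:

```python
# Hypothetical tool schemas exposing Video-o3-style actions through an
# OpenAI-style function-calling interface. Names are illustrative.
VIDEO_TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "fetch_segment",
            "description": "Return low-resolution frames for a time range.",
            "parameters": {
                "type": "object",
                "properties": {
                    "start_s": {"type": "number"},
                    "end_s": {"type": "number"},
                },
                "required": ["start_s", "end_s"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "inspect_frames",
            "description": "Return high-resolution crops of specific frames.",
            "parameters": {
                "type": "object",
                "properties": {
                    "frame_ids": {"type": "array", "items": {"type": "integer"}},
                },
                "required": ["frame_ids"],
            },
        },
    },
]
```

Passed to a chat-completion call as the `tools` argument, such schemas let an orchestrating LLM issue the same seek/inspect actions the paper bakes natively into the model.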

Limitations & Future Work

  • Synthetic Training Bias – Seeker‑173K is generated automatically; while it covers many patterns, it may miss edge‑case reasoning strategies that appear in truly wild videos.
  • Scalability to Ultra‑Long Streams – Interaction length is still bounded by the LLM’s context window; handling multi‑hour streams may require hierarchical summarization or external memory.
  • Tool Set Generality – The current toolbox focuses on segment/frame retrieval; adding richer modalities (audio, subtitles, metadata) is left for future extensions.
  • Explainability – Although the model logs its tool calls, presenting a human‑readable “reasoning trace” that non‑technical users can audit remains an open challenge.

Video‑o3 shows that giving LLMs native, iterative access to video‑specific tools can turn a “black‑box” model into an active investigator, dramatically improving long‑video multi‑hop reasoning while keeping compute budgets realistic. For developers building next‑generation video assistants, the paper offers both a concrete architecture to emulate and a large synthetic dataset to kick‑start training.

Authors

  • Xiangyu Zeng
  • Zhiqiu Zhang
  • Yuhan Zhu
  • Xinhao Li
  • Zikang Wang
  • Changlian Ma
  • Qingyu Zhang
  • Zizheng Huang
  • Kun Ouyang
  • Tianxiang Jiang
  • Ziang Yan
  • Yi Wang
  • Hongjie Zhang
  • Yali Wang
  • Limin Wang

Paper Information

  • arXiv ID: 2601.23224v1
  • Categories: cs.CV
  • Published: January 30, 2026
