[Paper] Natural-Language Temporal Grounding in Hour-Long Videos is a Search Problem: A Benchmark and Empirical Decomposition

Published: June 10, 2026 at 12:35 PM EDT

Source: arXiv - 2606.12300v1

Overview

Temporal grounding, which returns the interval $[t_s, t_e]$ matching a natural-language query over a video, is the language interface to long-form video. Yet it has been studied on short videos, and the dynamics of hour-scale natural-language grounding remain underexplored. We take the position that at hour scale the binding constraint is search, not recognition: given a natural-language query, Video-LLMs are bottlenecked not by localizing a nearby event but by searching for the relevant region of a long video. To test this, we release ExtremeWhenBench, the first open hour-scale grounding benchmark (2,273 queries over 194 videos, mean 75.7 min, max 9 hr) with an open-form query distribution. Every open Video-LLM collapses, while a frame-level retrieval baseline outperforms them; a failure taxonomy attributes 85% of failures to search; and a retrieve-then-ground hybrid recovers 6.7x over the monolithic Video-LLM, mirroring retrieve-then-read in open-domain QA.
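To make the retrieve-then-ground idea concrete, here is a minimal Python sketch. It is not the authors' implementation: the function names (`retrieve_window`, `retrieve_then_ground`), the fixed-length window, the toy 2-D embeddings, and the stub grounder are all assumptions made for illustration. The search step picks the frame whose embedding is most similar to the query and cuts a short window around it; a local grounder (in practice this could be a Video-LLM) then only has to localize the event inside that window.

```python
import math
from typing import Callable, List, Sequence, Tuple

Interval = Tuple[float, float]  # (t_s, t_e) in seconds


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve_window(
    query_emb: Sequence[float],
    frame_embs: List[Sequence[float]],
    frame_times: List[float],
    window_sec: float,
) -> Interval:
    """Search step (hypothetical): centre a window on the best-matching frame."""
    scores = [cosine(query_emb, f) for f in frame_embs]
    best = max(range(len(scores)), key=scores.__getitem__)
    half = window_sec / 2
    start = max(frame_times[0], frame_times[best] - half)
    end = min(frame_times[-1], frame_times[best] + half)
    return (start, end)


def retrieve_then_ground(
    query_emb: Sequence[float],
    frame_embs: List[Sequence[float]],
    frame_times: List[float],
    ground_fn: Callable[[float, float], Interval],
    window_sec: float = 120.0,
) -> Interval:
    """Retrieve a candidate window, then ground only inside that window."""
    w_start, w_end = retrieve_window(query_emb, frame_embs, frame_times, window_sec)
    t_s, t_e = ground_fn(w_start, w_end)
    # Clamp the grounder's answer to the retrieved window.
    return (max(w_start, t_s), min(w_end, t_e))


# Toy data (assumed): one sampled frame every 10 minutes of a 1-hour video.
frame_times = [0.0, 600.0, 1200.0, 1800.0, 2400.0, 3000.0, 3600.0]
frame_embs = [[0.0, 1.0]] * 4 + [[1.0, 0.1]] + [[0.0, 1.0]] * 2
query_emb = [1.0, 0.0]


def stub_grounder(w_start: float, w_end: float) -> Interval:
    """Stand-in for a local grounding model: trims 30 s from each side."""
    return (w_start + 30.0, w_end - 30.0)


window = retrieve_window(query_emb, frame_embs, frame_times, window_sec=120.0)
interval = retrieve_then_ground(query_emb, frame_embs, frame_times, stub_grounder)
print(window, interval)
```

The design point the sketch tries to capture is the division of labour: a cheap similarity search narrows an hour-long video to a short window, so the expensive model never has to scan the whole video.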

Key Contributions

This paper presents research in the following arXiv subject categories:

cs.CV (Computer Vision and Pattern Recognition)
cs.AI (Artificial Intelligence)

Methodology

Please refer to the full paper for detailed methodology.

Practical Implications

This research contributes to the advancement of computer vision (cs.CV).

Authors

Sukmin Seo
Geewook Kim

Paper Information

arXiv ID: 2606.12300v1
Categories: cs.CV, cs.AI
Published: June 10, 2026
