[Paper] Towards One-to-Many Temporal Grounding

Published: June 4, 2026 at 11:31 AM EDT

Source: arXiv - 2606.06294v1

Overview

Temporal Grounding (TG) aims to localize video segments corresponding to a textual query. Prior research predominantly focuses on single‑segment retrieval. Real‑world scenarios, however, often require localizing multiple disjoint segments for a single query — a setting we term One-to-Many Temporal Grounding (OMTG). Previous state‑of‑the‑art MLLMs, optimized for one‑to‑one settings, struggle in this context, often yielding near‑zero scores due to a lack of event cardinality perception.

To bridge this gap, we present a systematic solution with three key contributions:

Benchmark: We establish the first comprehensive OMTG benchmark, introducing Count Accuracy (C‑Acc) and Effective Temporal F1 (EtF1) as evaluation metrics.
Dataset: We curate a high‑quality OMTG dataset comprising 56 k samples through a sophisticated construction pipeline.
Reward Functions: We develop novel temporal and caption reward functions specifically designed for OMTG. The caption reward leverages Chain‑of‑Thought reasoning over dense video captions to explicitly guide policy optimization toward both preciseness and completeness.
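To make the one-to-many setting concrete, here is a minimal Python sketch of how a prediction with several segments might be scored against several ground-truth segments. It is not the paper's definition of C-Acc or EtF1, which this summary does not reproduce. The names `temporal_iou`, `count_match`, and `segment_f1`, the greedy matching strategy, and the 0.5 IoU threshold are all illustrative assumptions.

```python
from typing import List, Tuple

# A segment is (start_sec, end_sec). This representation is an assumption.
Segment = Tuple[float, float]


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection-over-union of two time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def count_match(pred: List[Segment], gt: List[Segment]) -> bool:
    """True when the predicted number of segments equals the ground truth.

    Hypothetical stand-in for a per-sample count check; the paper's C-Acc
    may be defined differently.
    """
    return len(pred) == len(gt)


def segment_f1(pred: List[Segment], gt: List[Segment],
               iou_thresh: float = 0.5) -> float:
    """Generic segment-level F1 with greedy one-to-one matching.

    A predicted segment counts as a true positive if it can be paired with
    an unused ground-truth segment at IoU >= iou_thresh. This is a common
    detection-style F1, not the paper's EtF1.
    """
    if not pred or not gt:
        return 0.0
    # Consider all (pred, gt) pairs, best IoU first.
    pairs = sorted(
        ((temporal_iou(p, g), i, j)
         for i, p in enumerate(pred)
         for j, g in enumerate(gt)),
        reverse=True,
    )
    used_pred, used_gt = set(), set()
    true_pos = 0
    for iou, i, j in pairs:
        if iou < iou_thresh:
            break  # remaining pairs are all below the threshold
        if i in used_pred or j in used_gt:
            continue
        used_pred.add(i)
        used_gt.add(j)
        true_pos += 1
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(gt)
    return 2 * precision * recall / (precision + recall)


# Example: the query matches two disjoint segments, but a one-to-one style
# model returns only the first one.
ground_truth = [(5.0, 10.0), (30.0, 40.0)]
single_segment_prediction = [(5.0, 10.0)]
score = segment_f1(single_segment_prediction, ground_truth)
same_count = count_match(single_segment_prediction, ground_truth)
```

In this example the single prediction is perfectly localized, yet recall is only one half and the count check fails. That illustrates why a model tuned for one-to-one grounding can score poorly when a query has several matching segments.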

Extensive experiments show our model achieves a new state‑of‑the‑art EtF1 of 43.65 % on OMTG Bench, outperforming Gemini 2.5 Pro and Seed‑1.8 by 15.85 % and 15.61 %, respectively.

Methodology

Please refer to the full paper for detailed methodology.

Practical Implications

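This research targets video understanding tasks in computer vision (cs.CV) where a single text query can match several disjoint segments of a video, a case the authors note is common in real-world scenarios.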

Authors

Qi Xu
Yue Tan
Shihao Chen
Jiahao Meng
Anna Wang
Shunping Ji
Hao Fei
Jason Li

Paper Information

arXiv ID: 2606.06294v1
Categories: cs.CV, cs.AI
Published: June 4, 2026
