[Paper] Towards One-to-Many Temporal Grounding
Source: arXiv - 2606.06294v1
Overview
Temporal Grounding (TG) aims to localize the video segments that correspond to a textual query. Prior research predominantly focuses on single‑segment retrieval. Real‑world scenarios, however, often require localizing multiple disjoint segments for a single query, a setting we term One-to-Many Temporal Grounding (OMTG). Previous state‑of‑the‑art MLLMs, optimized for one‑to‑one settings, struggle in this context and often yield near‑zero scores because they lack event cardinality perception, that is, a sense of how many segments match the query.
To bridge this gap, we present a systematic solution with three key contributions:
- Benchmark: We establish the first comprehensive OMTG benchmark, introducing Count Accuracy (C‑Acc) and Effective Temporal F1 (EtF1) as evaluation metrics.
- Dataset: We curate a high‑quality OMTG dataset comprising 56k samples through a sophisticated construction pipeline.
- Reward Functions: We develop novel temporal and caption reward functions specifically designed for OMTG. The caption reward leverages Chain‑of‑Thought reasoning over dense video captions to explicitly guide policy optimization toward both precision and completeness.
Extensive experiments show our model achieves a new state‑of‑the‑art EtF1 of 43.65% on OMTG Bench, outperforming Gemini 2.5 Pro and Seed‑1.8 by 15.85% and 15.61%, respectively.
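This summary does not give the formulas for C‑Acc or EtF1, so the following is only a minimal sketch of how count‑aware, multi‑segment scoring can work in general. It assumes a count check that compares the number of predicted and ground‑truth segments, and a generic segment‑level F1 with greedy one‑to‑one matching at a temporal IoU threshold. The names `temporal_iou`, `count_match`, `segment_f1`, and the 0.5 threshold are hypothetical and are not taken from the paper; the paper's actual metric definitions may differ.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec); hypothetical representation


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection-over-union of two time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def count_match(pred: List[Segment], gt: List[Segment]) -> bool:
    """Assumed per-sample count check: same number of segments as ground truth."""
    return len(pred) == len(gt)


def segment_f1(pred: List[Segment], gt: List[Segment], iou_thresh: float = 0.5) -> float:
    """Generic segment-level F1 (an assumption, not the paper's EtF1).

    Predictions and ground-truth segments are matched one-to-one,
    greedily, taking the highest-IoU pairs first.
    """
    if not pred or not gt:
        return 0.0
    pairs = sorted(
        ((temporal_iou(p, g), i, j)
         for i, p in enumerate(pred)
         for j, g in enumerate(gt)),
        reverse=True,
    )
    used_pred, used_gt = set(), set()
    true_pos = 0
    for iou, i, j in pairs:
        if iou < iou_thresh:
            break  # remaining pairs overlap even less
        if i in used_pred or j in used_gt:
            continue  # each segment can be matched only once
        used_pred.add(i)
        used_gt.add(j)
        true_pos += 1
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(gt)
    return 2 * precision * recall / (precision + recall)


# A query with two ground-truth segments, answered with a single segment.
ground_truth = [(0.0, 10.0), (20.0, 30.0)]
one_segment_answer = [(0.0, 10.0)]
print(count_match(one_segment_answer, ground_truth))
print(segment_f1(one_segment_answer, ground_truth))
```

Under these assumptions, a model that returns only one of two matching segments fails the count check and reaches precision 1.0 but recall 0.5, which caps its F1 at about 0.667. This is the kind of incompleteness that a one‑to‑one model would exhibit in the one‑to‑many setting.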
Methodology
Please refer to the full paper for detailed methodology.
Practical Implications
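This research advances video understanding within computer vision (cs.CV): it targets queries that match several disjoint video segments rather than exactly one.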
Authors
- Qi Xu
- Yue Tan
- Shihao Chen
- Jiahao Meng
- Anna Wang
- Shunping Ji
- Hao Fei
- Jason Li
Paper Information
- arXiv ID: 2606.06294v1
- Categories: cs.CV, cs.AI
- Published: June 4, 2026