[Paper] Towards One-to-Many Temporal Grounding

Published: June 4, 2026 at 11:31 AM EDT
Source: arXiv - 2606.06294v1

Overview

Temporal Grounding (TG) aims to localize the video segments that correspond to a textual query. Prior research predominantly focuses on single-segment retrieval. Real-world scenarios, however, often require localizing multiple disjoint segments for a single query, a setting we term One-to-Many Temporal Grounding (OMTG). Previous state-of-the-art MLLMs, optimized for one-to-one settings, struggle in this context and often yield near-zero scores because they lack event cardinality perception, that is, a sense of how many segments match the query.

To bridge this gap, we present a systematic solution with three key contributions:

  1. Benchmark: We establish the first comprehensive OMTG benchmark, introducing Count Accuracy (C‑Acc) and Effective Temporal F1 (EtF1) as evaluation metrics.
  2. Dataset: We curate a high-quality OMTG dataset comprising 56k samples through a sophisticated construction pipeline.
  3. Reward Functions: We develop novel temporal and caption reward functions specifically designed for OMTG. The caption reward leverages Chain‑of‑Thought reasoning over dense video captions to explicitly guide policy optimization toward both preciseness and completeness.
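
To make the evaluation setting concrete, here is a minimal sketch of how a one-to-many prediction could be scored. This summary does not give the paper's definitions of C-Acc or EtF1, so everything below is an assumption for illustration: `count_correct` treats count accuracy as an exact match on the number of segments, and `segment_f1` is a generic greedy-matching F1 at a fixed IoU threshold, not the paper's EtF1. All function names and the 0.5 threshold are hypothetical.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec)


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def count_correct(pred: List[Segment], gt: List[Segment]) -> bool:
    """ASSUMED reading of count accuracy: predicted count equals true count."""
    return len(pred) == len(gt)


def segment_f1(pred: List[Segment], gt: List[Segment],
               iou_threshold: float = 0.5) -> float:
    """Generic F1 over segments (NOT the paper's EtF1).

    Pairs are matched greedily, highest IoU first, one-to-one,
    and only pairs at or above the threshold count as matches.
    """
    if not pred or not gt:
        return 0.0
    pairs = sorted(
        ((temporal_iou(p, g), i, j)
         for i, p in enumerate(pred)
         for j, g in enumerate(gt)),
        reverse=True,
    )
    used_pred, used_gt = set(), set()
    matches = 0
    for iou, i, j in pairs:
        if iou < iou_threshold:
            break  # remaining pairs overlap even less
        if i in used_pred or j in used_gt:
            continue
        used_pred.add(i)
        used_gt.add(j)
        matches += 1
    if matches == 0:
        return 0.0
    precision = matches / len(pred)
    recall = matches / len(gt)
    return 2 * precision * recall / (precision + recall)


# A query with three ground-truth segments; the model finds only two.
gt = [(0.0, 10.0), (20.0, 30.0), (50.0, 60.0)]
pred = [(1.0, 10.0), (21.0, 29.0)]
print(count_correct(pred, gt))  # False: 2 predicted vs 3 true segments
print(segment_f1(pred, gt))     # precision 1.0, recall 2/3
```

In this toy case both predicted segments are well localized, yet the score is penalized for the missed third segment. That is the kind of completeness failure a one-to-one model runs into when it returns a single span for a multi-segment query.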

Extensive experiments show our model achieves a new state-of-the-art EtF1 of 43.65% on OMTG Bench, outperforming Gemini 2.5 Pro and Seed-1.8 by 15.85% and 15.61%, respectively.

Methodology

Please refer to the full paper for detailed methodology.

Practical Implications

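This research targets video understanding systems that must return every segment matching a query rather than a single best match. The benchmark, dataset, and reward functions give practitioners a way to measure and train for that behavior in multimodal LLMs.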

Authors

  • Qi Xu
  • Yue Tan
  • Shihao Chen
  • Jiahao Meng
  • Anna Wang
  • Shunping Ji
  • Hao Fei
  • Jason Li

Paper Information

  • arXiv ID: 2606.06294v1
  • Categories: cs.CV, cs.AI
  • Published: June 4, 2026