[Paper] TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs
Source: arXiv - 2512.14698v1
Overview
The paper introduces TimeLens, a systematic baseline that shows how to turn modern multimodal large language models (MLLMs) into strong video‑temporal‑grounding (VTG) engines. By cleaning up benchmark data and refining training recipes, the authors demonstrate that off‑the‑shelf MLLMs can be coaxed into beating many proprietary systems on the task of pinpointing when a described event occurs in a video.
Key Contributions
- TimeLens‑Bench – Re‑annotated, high‑quality versions of three popular VTG benchmarks that expose serious labeling errors in the original datasets.
- TimeLens‑100K – An automated pipeline that produces a 100 K‑clip, high‑fidelity training set for VTG, dramatically reducing noise in existing corpora.
- Interleaved Textual Encoding – A lightweight encoding scheme that injects explicit time‑slot tokens into the language model’s input, improving temporal reasoning without heavy architectural changes.
- RLVR (Reinforcement Learning with Verifiable Rewards) – A “thinking‑free” RL framework that trains the model to output precise timestamps, using automatically verifiable reward signals instead of costly human feedback.
- State‑of‑the‑Art Open‑Source Performance – The resulting TimeLens family outperforms all publicly available VTG models and even surpasses closed‑source giants such as GPT‑5 and Gemini‑2.5‑Flash on the cleaned benchmarks.
- Open Release – All code, data, and pretrained checkpoints are made publicly available to accelerate reproducibility and downstream innovation.
Methodology
- Diagnosing Benchmark Quality – The authors audited three widely used VTG datasets (e.g., ActivityNet Captions, Charades‑STA) and found mismatched timestamps, ambiguous queries, and missing events. They re‑annotated these samples under strict guidelines, creating TimeLens‑Bench.
- Building a Clean Training Corpus – Using a combination of off‑the‑shelf video captioners, temporal segment detectors, and a rule‑based validator, they automatically generated 100 K video‑query‑timestamp triples (TimeLens‑100K); human spot‑checks confirmed > 95 % label accuracy. (A possible validator rule set is sketched after this list.)
- Model Architecture Tweaks – Instead of redesigning the vision encoder, they kept a frozen video backbone (e.g., CLIP‑ViT) and focused on the language side. Time tokens (`<t0>`, `<t1>`, …) are interleaved with the query text, allowing the LLM to treat temporal markers as first‑class symbols (see the encoding sketch after this list).
- Training via RLVR – After a short supervised warm‑up, the model is fine‑tuned with reinforcement learning. The reward is computed automatically: if the predicted interval overlaps the ground‑truth beyond a threshold (e.g., IoU > 0.5), the model receives a reward of 1, otherwise 0. This eliminates the need for expensive human‑in‑the‑loop reward models.
- Recipe Engineering – The authors experiment with curriculum learning (easy → hard queries), mixed‑precision training, and gradient‑accumulation schedules to keep compute modest while still achieving top performance.
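The summary does not spell out the validator's rules, so the following is a minimal sketch of what a rule‑based filter for auto‑generated triples could look like; every field name and threshold here is an assumption for illustration, not the authors' actual pipeline.

```python
# Hypothetical rule-based validator for auto-generated (video, query, timestamp)
# triples. Field names and thresholds are illustrative assumptions.

def is_valid_triple(sample: dict) -> bool:
    """Keep a triple only if basic temporal and textual sanity rules hold."""
    start, end = sample["start_sec"], sample["end_sec"]
    duration = sample["video_duration_sec"]
    query = sample["query"].strip()

    if not (0.0 <= start < end <= duration):  # interval must be ordered and in-bounds
        return False
    if (end - start) < 0.5:                   # drop degenerate sub-half-second segments
        return False
    if (end - start) > 0.9 * duration:        # drop near-whole-video "events"
        return False
    if len(query.split()) < 3:                # drop queries too short to ground
        return False
    return True

raw_samples = [
    {"start_sec": 4.0, "end_sec": 9.5, "video_duration_sec": 60.0,
     "query": "a man throws a frisbee to his dog"},
    {"start_sec": 12.0, "end_sec": 11.0, "video_duration_sec": 60.0,
     "query": "jump"},  # fails: inverted interval, query too short
]
print([is_valid_triple(s) for s in raw_samples])  # -> [True, False]
```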
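Likewise, the paper describes the interleaved textual encoding only at a high level; here is a minimal sketch of how such markers could be woven into the prompt, assuming `<frame_K>` stands in for the visual embedding at slot K. The function name and sampling rate are ours, not the paper's.

```python
# Minimal sketch of interleaved textual time encoding (names hypothetical).
# Each sampled frame is preceded by an explicit <tK> marker so the LLM can
# refer to moments in the video as ordinary tokens.

def build_interleaved_prompt(num_frames: int, fps: float, query: str) -> str:
    parts = []
    for k in range(num_frames):
        seconds = k / fps
        # <tK> is a plain-text time slot; <frame_K> stands in for the visual
        # embedding the MLLM would receive at this position.
        parts.append(f"<t{k}> ({seconds:.1f}s) <frame_{k}>")
    timeline = " ".join(parts)
    return (f"{timeline}\n"
            f"Query: {query}\n"
            f"Answer with the start and end time tokens, e.g. <t3> <t7>.")

print(build_interleaved_prompt(num_frames=4, fps=0.5,
                               query="the dog catches the frisbee"))
```

Because the time markers are ordinary text, the decoder can answer with tokens like `<t3> <t7>` without any new output head, which is what keeps the scheme lightweight.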
Results & Findings
| Model | mIoU (TimeLens‑Bench) | Relative Gain vs. Baseline |
|---|---|---|
| Baseline MLLM (no tweaks) | 31.2 % | — |
| + Interleaved Encoding | 38.7 % | + 24 % |
| + RLVR training | 44.5 % | + 43 % |
| TimeLens‑L (largest) | 52.1 % | + 67 % |
| Proprietary GPT‑5* | 48.3 % | — |
| Proprietary Gemini‑2.5‑Flash* | 49.0 % | — |
(*) Numbers for closed‑source models are taken from the authors’ reproduced evaluation on the cleaned benchmarks.
Key takeaways
- Cleaning the evaluation data alone reshuffles the leaderboard—models previously thought to be best drop dramatically.
- The interleaved time‑token trick yields a ~7.5‑point absolute mIoU boost (31.2 % → 38.7 %) with virtually no extra compute.
- RLVR provides the biggest jump, confirming that a simple, verifiable reward signal is sufficient for precise temporal grounding; a minimal sketch of such a reward follows below.
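To make that concrete, here is a minimal sketch of the verifiable reward described in the methodology, using the paper's IoU > 0.5 threshold; the function names are our own.

```python
def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """Intersection-over-union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def rlvr_reward(pred: tuple[float, float], gt: tuple[float, float],
                threshold: float = 0.5) -> float:
    """Verifiable binary reward: 1 if the predicted interval overlaps enough, else 0."""
    return 1.0 if temporal_iou(pred, gt) >= threshold else 0.0

print(rlvr_reward((4.0, 9.0), (5.0, 10.0)))  # IoU = 4/6 ≈ 0.67 -> 1.0
print(rlvr_reward((0.0, 2.0), (5.0, 10.0)))  # IoU = 0.0        -> 0.0
```

Because the reward is computable from the ground‑truth interval alone, no learned reward model or human feedback loop is needed.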
Practical Implications
- Developer Tooling – TimeLens can be wrapped as a plug‑and‑play API that takes a natural‑language query and returns start/end timestamps, enabling features like “search‑by‑scene” in video editors or automated highlight generation for sports streams (a hypothetical wrapper is sketched after this list).
- Content Moderation – Precise VTG helps flag specific moments (e.g., violent or copyrighted segments) without scanning the entire video, saving bandwidth and compute.
- E‑Learning & Accessibility – Automatic alignment of lecture transcripts to video timelines makes it trivial to generate chapter markers or caption‑synchronized navigation.
- Low‑Cost Deployment – Because the approach relies on frozen vision encoders and modest RL fine‑tuning, companies can fine‑tune a TimeLens model on their own domain data (e.g., product demos) within a single‑GPU budget.
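To illustrate the plug‑and‑play idea from the Developer Tooling bullet, here is a hypothetical wrapper interface; none of these class or method names come from the paper's released code.

```python
# Hypothetical wrapper illustrating the "search-by-scene" use case.
# All names are assumptions for illustration; consult the authors'
# released code for the real interface.

from dataclasses import dataclass

@dataclass
class Moment:
    start_sec: float
    end_sec: float

class TimeLensClient:
    def __init__(self, checkpoint: str):
        self.checkpoint = checkpoint  # e.g. a released TimeLens checkpoint path

    def ground(self, video_path: str, query: str) -> Moment:
        """Return the (start, end) interval where the queried event occurs."""
        # Placeholder: a real client would sample frames, build the
        # interleaved time-token prompt, run the MLLM, and decode the
        # emitted <tK> markers back into seconds.
        raise NotImplementedError

# Intended usage:
# client = TimeLensClient("timelens-l.ckpt")
# m = client.ground("match.mp4", "the striker scores the opening goal")
# print(f"Highlight: {m.start_sec:.1f}s - {m.end_sec:.1f}s")
```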
Limitations & Future Work
- Domain Shift – The current training set focuses on generic, open‑domain videos; performance may degrade on highly specialized domains (medical procedures, industrial inspections) without additional fine‑tuning.
- Temporal Granularity – The model predicts coarse intervals (seconds‑level). Sub‑second precision, required for some AR/VR applications, remains an open challenge.
- Reward Simplicity – RLVR uses a binary IoU threshold; richer reward shaping (e.g., penalizing early/late drift) could further improve accuracy.
- Scalability of Re‑annotation – While the automated pipeline scales, fully eliminating human oversight for edge cases is still an open research question.
The authors plan to extend TimeLens to multi‑event grounding (handling multiple queries per video) and to explore joint audio‑visual temporal reasoning in future releases.
Authors
- Jun Zhang
- Teng Wang
- Yuying Ge
- Yixiao Ge
- Xinhao Li
- Ying Shan
- Limin Wang
Paper Information
- arXiv ID: 2512.14698v1
- Categories: cs.CV, cs.AI, cs.CL, cs.MM
- Published: December 16, 2025
- PDF: https://arxiv.org/pdf/2512.14698v1