[Paper] TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs
Source: arXiv - 2512.14698v1
Overview
The paper introduces TimeLens, a systematic baseline that shows how to turn modern multimodal large language models (MLLMs) into strong video‑temporal‑grounding (VTG) engines. By cleaning up benchmark data and refining training recipes, the authors demonstrate that off‑the‑shelf MLLMs can be coaxed into beating many proprietary systems on the task of pinpointing when a described event occurs in a video.
Key Contributions
- TimeLens‑Bench – Re‑annotated, high‑quality versions of three popular VTG benchmarks that expose serious labeling errors in the original datasets.
- TimeLens‑100K – An automated pipeline that produces a 100 K‑clip, high‑fidelity training set for VTG, dramatically reducing noise in existing corpora.
- Interleaved Textual Encoding – A lightweight encoding scheme that injects explicit time‑slot tokens into the language model’s input, improving temporal reasoning without heavy architectural changes.
- RLVR (Reinforcement Learning with Verifiable Rewards) – A “thinking‑free” RL framework that trains the model to output precise timestamps, using automatically verifiable reward signals instead of costly human feedback.
- State‑of‑the‑Art Open‑Source Performance – The resulting TimeLens family outperforms all publicly available VTG models and even surpasses closed‑source giants such as GPT‑5 and Gemini‑2.5‑Flash on the cleaned benchmarks.
- Open Release – All code, data, and pretrained checkpoints are made publicly available to accelerate reproducibility and downstream innovation.
Methodology
- Diagnosing Benchmark Quality – The authors audited three widely used VTG datasets (e.g., ActivityNet Captions, Charades‑STA) and found mismatched timestamps, ambiguous queries, and missing events. They re‑annotated these samples under strict guidelines, creating TimeLens‑Bench.
- Building a Clean Training Corpus – Using a combination of off‑the‑shelf video captioners, temporal segment detectors, and a rule‑based validator, they automatically generated 100 K video‑query‑timestamp triples (TimeLens‑100K); human spot‑checks confirmed > 95 % label accuracy. (A possible validator rule set is sketched after this list.)
- Model Architecture Tweaks – Instead of redesigning the vision encoder, they kept a frozen video backbone (e.g., CLIP‑ViT) and focused on the language side. Time tokens (`<t0>`, `<t1>`, …) are interleaved with the query text, allowing the LLM to treat temporal markers as first‑class symbols (see the encoding sketch after this list).
- Training via RLVR – After a short supervised warm‑up, the model is fine‑tuned with reinforcement learning. The reward is computed automatically: if the predicted interval overlaps the ground‑truth beyond a threshold (e.g., IoU > 0.5), the model receives a reward of 1, otherwise 0. This eliminates the need for expensive human‑in‑the‑loop reward models.
- Recipe Engineering – The authors experiment with curriculum learning (easy → hard queries), mixed‑precision training, and gradient‑accumulation schedules to keep compute modest while still achieving top performance.
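The summary does not spell out the validator's rules, so the following is a minimal sketch of what a rule‑based filter for auto‑generated triples could look like; every field name and threshold here is an assumption for illustration, not the authors' actual pipeline.

```python
# Hypothetical rule-based validator for auto-generated (video, query, timestamp)
# triples. Field names and thresholds are illustrative assumptions.

def is_valid_triple(sample: dict) -> bool:
    """Keep a triple only if basic temporal and textual sanity rules hold."""
    start, end = sample["start_sec"], sample["end_sec"]
    duration = sample["video_duration_sec"]
    query = sample["query"].strip()

    if not (0.0 <= start < end <= duration):  # interval must be ordered and in-bounds
        return False
    if (end - start) < 0.5:                   # drop degenerate sub-half-second segments
        return False
    if (end - start) > 0.9 * duration:        # drop near-whole-video "events"
        return False
    if len(query.split()) < 3:                # drop queries too short to ground
        return False
    return True

raw_samples = [
    {"start_sec": 4.0, "end_sec": 9.5, "video_duration_sec": 60.0,
     "query": "a man throws a frisbee to his dog"},
    {"start_sec": 12.0, "end_sec": 11.0, "video_duration_sec": 60.0,
     "query": "jump"},  # fails: inverted interval, query too short
]
print([is_valid_triple(s) for s in raw_samples])  # -> [True, False]
```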
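Likewise, the paper describes the interleaved textual encoding only at a high level; here is a minimal sketch of how such markers could be woven into the prompt, assuming `<frame_K>` stands in for the visual embedding at slot K. The function name and sampling rate are ours, not the paper's.

```python
# Minimal sketch of interleaved textual time encoding (names hypothetical).
# Each sampled frame is preceded by an explicit <tK> marker so the LLM can
# refer to moments in the video as ordinary tokens.

def build_interleaved_prompt(num_frames: int, fps: float, query: str) -> str:
    parts = []
    for k in range(num_frames):
        seconds = k / fps
        # <tK> is a plain-text time slot; <frame_K> stands in for the visual
        # embedding the MLLM would receive at this position.
        parts.append(f"<t{k}> ({seconds:.1f}s) <frame_{k}>")
    timeline = " ".join(parts)
    return (f"{timeline}\n"
            f"Query: {query}\n"
            f"Answer with the start and end time tokens, e.g. <t3> <t7>.")

print(build_interleaved_prompt(num_frames=4, fps=0.5,
                               query="the dog catches the frisbee"))
```

Because the time markers are ordinary text, the decoder can answer with tokens like `<t3> <t7>` without any new output head, which is what keeps the scheme lightweight.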
Results & Findings
| Model | mIoU (TimeLens‑Bench) | Relative Gain vs. Baseline |
|---|---|---|
| Baseline MLLM (no tweaks) | 31.2 % | — |
| + Interleaved Encoding | 38.7 % | + 24 % |
| + RLVR training | 44.5 % | + 43 % |
| TimeLens‑L (largest) | 52.1 % | + 67 % |
| Proprietary GPT‑5* | 48.3 % | — |
| Proprietary Gemini‑2.5‑Flash* | 49.0 % | — |
(*) Numbers for closed‑source models are taken from the authors’ reproduced evaluation on the cleaned benchmarks.
Key takeaways
- Cleaning the evaluation data alone reshuffles the leaderboard—models previously thought to be best drop dramatically.
- The interleaved time‑token trick yields a ~7.5‑point absolute mIoU boost (31.2 % → 38.7 %) with virtually no extra compute.
- RLVR provides the biggest jump, confirming that a simple, verifiable reward signal is sufficient for precise temporal grounding; a minimal sketch of such a reward follows below.
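To make that concrete, here is a minimal sketch of the verifiable reward described in the methodology, using the paper's IoU > 0.5 threshold; the function names are our own.

```python
def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """Intersection-over-union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def rlvr_reward(pred: tuple[float, float], gt: tuple[float, float],
                threshold: float = 0.5) -> float:
    """Verifiable binary reward: 1 if the predicted interval overlaps enough, else 0."""
    return 1.0 if temporal_iou(pred, gt) >= threshold else 0.0

print(rlvr_reward((4.0, 9.0), (5.0, 10.0)))  # IoU = 4/6 ≈ 0.67 -> 1.0
print(rlvr_reward((0.0, 2.0), (5.0, 10.0)))  # IoU = 0.0        -> 0.0
```

Because the reward is computable from the ground‑truth interval alone, no learned reward model or human feedback loop is needed.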
Practical Implications
- Developer Tooling – TimeLens can be wrapped as a plug‑and‑play API that takes a natural‑language query and returns start/end timestamps, enabling features like “search‑by‑scene” in video editors or automated highlight generation for sports streams (a hypothetical wrapper is sketched after this list).
- Content Moderation – Precise VTG helps flag specific moments (e.g., violent or copyrighted segments) without scanning the entire video, saving bandwidth and compute.
- E‑Learning & Accessibility – Automatic alignment of lecture transcripts to video timelines makes it trivial to generate chapter markers or caption‑synchronized navigation.
- Low‑Cost Deployment – Because the approach relies on frozen vision encoders and modest RL fine‑tuning, companies can fine‑tune a TimeLens model on their own domain data (e.g., product demos) within a single‑GPU budget.
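To illustrate the plug‑and‑play idea from the Developer Tooling bullet, here is a hypothetical wrapper interface; none of these class or method names come from the paper's released code.

```python
# Hypothetical wrapper illustrating the "search-by-scene" use case.
# All names are assumptions for illustration; consult the authors'
# released code for the real interface.

from dataclasses import dataclass

@dataclass
class Moment:
    start_sec: float
    end_sec: float

class TimeLensClient:
    def __init__(self, checkpoint: str):
        self.checkpoint = checkpoint  # e.g. a released TimeLens checkpoint path

    def ground(self, video_path: str, query: str) -> Moment:
        """Return the (start, end) interval where the queried event occurs."""
        # Placeholder: a real client would sample frames, build the
        # interleaved time-token prompt, run the MLLM, and decode the
        # emitted <tK> markers back into seconds.
        raise NotImplementedError

# Intended usage:
# client = TimeLensClient("timelens-l.ckpt")
# m = client.ground("match.mp4", "the striker scores the opening goal")
# print(f"Highlight: {m.start_sec:.1f}s - {m.end_sec:.1f}s")
```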
Limitations & Future Work
- Domain Shift – The current training set focuses on generic, open‑domain videos; performance may degrade on highly specialized domains (medical procedures, industrial inspections) without additional fine‑tuning.
- Temporal Granularity – The model predicts coarse intervals (seconds‑level). Sub‑second precision, required for some AR/VR applications, remains an open challenge.
- Reward Simplicity – RLVR uses a binary IoU threshold; richer reward shaping (e.g., penalizing early/late drift) could further improve accuracy.
- Scalability of Re‑annotation – While the automated pipeline scales, fully eliminating human oversight for edge cases is still an open research question.
The authors plan to extend TimeLens to multi‑event grounding (handling multiple queries per video) and to explore joint audio‑visual temporal reasoning in future releases.
Authors
- Jun Zhang
- Teng Wang
- Yuying Ge
- Yixiao Ge
- Xinhao Li
- Ying Shan
- Limin Wang
Paper Information
- arXiv ID: 2512.14698v1
- Categories: cs.CV, cs.AI, cs.CL, cs.MM
- Published: December 16, 2025
- PDF: https://arxiv.org/pdf/2512.14698v1