[Paper] Thinking With Bounding Boxes: Enhancing Spatio-Temporal Video Grounding via Reinforcement Fine-Tuning

Published: November 26, 2025, 08:21 AM EST
4 min read
Source: arXiv - 2511.21375v1

Overview

The paper introduces STVG‑o1, a novel framework that lets off‑the‑shelf multimodal large language models (MLLMs) excel at spatio‑temporal video grounding (STVG) – the task of pinpointing when and where an object described in natural language appears in an untrimmed video. By adding a “bounding‑box chain‑of‑thought” reasoning step and multi‑dimensional reinforcement fine‑tuning, the authors achieve state‑of‑the‑art results without redesigning the underlying model architecture.

Key Contributions

  • Bounding‑box chain‑of‑thought: an explicit intermediate reasoning stage where the model predicts a sequence of bounding boxes before emitting the final grounding answer.
  • Reinforcement fine‑tuning: a custom reward function that jointly evaluates format correctness, temporal alignment, spatial overlap, consistency, and the quality of the chain‑of‑thought.
  • Zero‑modification integration: STVG‑o1 works with any pre‑trained MLLM (e.g., LLaVA, MiniGPT‑4) without architectural changes, turning them into high‑performing STVG systems.
  • Open‑vocabulary generalization: the approach transfers across datasets (HCSTVG‑v1/v2, VidSTG) and handles novel object categories not seen during training.
  • State‑of‑the‑art performance: beats the previous best task‑specific model by +7.3 % m_tIoU on HCSTVG‑v1 and matches specialized methods on VidSTG, while surpassing all prior MLLM‑based baselines.

Methodology

  1. Prompt Engineering – The video and the natural‑language query are fed to an existing MLLM together with a chain‑of‑thought template that asks the model to “think step‑by‑step” and output a series of bounding‑box coordinates (frame index + box).
  2. Bounding‑Box Generation – The model produces a temporal‑spatial trajectory as text (e.g., `frame 12: [x1,y1,x2,y2]`). This intermediate output is parsed back into numeric boxes.
  3. Reinforcement Fine‑Tuning – Using the parsed boxes, a multi‑dimensional reward is computed:
    • Format reward – penalizes malformed strings.
    • Consistency reward – encourages smooth motion across consecutive frames.
    • Temporal reward – aligns predicted start/end frames with ground‑truth.
    • Spatial reward – measures IoU (intersection‑over‑union) with the true boxes.
    • Think reward – rewards concise, logical chain‑of‑thought narratives.
      The model is then updated with a policy‑gradient algorithm (e.g., REINFORCE) to maximize the expected reward, effectively teaching it to “think” in bounding‑box terms.
  4. Final Prediction – After fine‑tuning, the model directly outputs the best‑scoring bounding‑box sequence, which can be used by downstream systems (e.g., video editors, surveillance analytics).
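The parsing and reward computation in steps 2 and 3 can be sketched in Python. The output format, regular expression, and reward weights below are illustrative assumptions, not the paper's exact implementation, and the consistency and think rewards are omitted for brevity:

```python
import re

# Parse chain-of-thought lines like "frame 12: [30, 40, 120, 200]"
# into {frame_index: (x1, y1, x2, y2)}.
BOX_PATTERN = re.compile(r"frame\s+(\d+):\s*\[(\d+),\s*(\d+),\s*(\d+),\s*(\d+)\]")

def parse_trajectory(text):
    boxes = {}
    for frame, x1, y1, x2, y2 in BOX_PATTERN.findall(text):
        boxes[int(frame)] = (int(x1), int(y1), int(x2), int(y2))
    return boxes

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def reward(pred_text, gt_boxes, weights=(0.1, 0.45, 0.45)):
    """Toy multi-dimensional reward: format + spatial + temporal terms."""
    w_fmt, w_spatial, w_temporal = weights
    pred = parse_trajectory(pred_text)
    if not pred:
        return 0.0  # malformed output earns no reward (format term)
    # Spatial: mean IoU over frames present in both prediction and ground truth.
    shared = [f for f in pred if f in gt_boxes]
    spatial = (sum(iou(pred[f], gt_boxes[f]) for f in shared) / len(shared)
               if shared else 0.0)
    # Temporal: IoU of the predicted and ground-truth frame intervals (inclusive).
    ps, pe = min(pred), max(pred)
    gs, ge = min(gt_boxes), max(gt_boxes)
    inter = max(0, min(pe, ge) - max(ps, gs) + 1)
    union = max(pe, ge) - min(ps, gs) + 1
    return w_fmt + w_spatial * spatial + w_temporal * temporal_frac(inter, union)

def temporal_frac(inter, union):
    return inter / union if union else 0.0
```

A perfectly formatted, perfectly localized trajectory scores 1.0 under these weights, while unparseable text scores 0, mirroring the format-reward penalty described above.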
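The policy-gradient update in step 3 can be illustrated with a minimal REINFORCE sketch on a toy discrete policy. In practice the update acts on the MLLM's token log-probabilities; the learning rate, baseline scheme, and toy rewards here are assumptions for illustration only:

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_step(logits, action, reward, baseline, lr=0.5):
    """One REINFORCE update: grad of log pi(a) w.r.t. logits is one_hot(a) - probs."""
    probs = softmax(logits)
    advantage = reward - baseline  # baseline reduces gradient variance
    return [
        l + lr * advantage * ((1.0 if i == action else 0.0) - p)
        for i, (l, p) in enumerate(zip(logits, probs))
    ]

random.seed(0)
# Toy setting: 3 candidate "trajectories"; trajectory 2 earns the highest reward.
true_rewards = [0.1, 0.3, 0.9]
logits = [0.0, 0.0, 0.0]
baseline = 0.0
for _ in range(500):
    probs = softmax(logits)
    action = random.choices(range(3), weights=probs)[0]
    r = true_rewards[action]
    logits = reinforce_step(logits, action, r, baseline)
    baseline = 0.9 * baseline + 0.1 * r  # running-average baseline

print(softmax(logits))  # the high-reward trajectory should dominate
```

The same maximize-expected-reward loop applies when "actions" are sampled bounding-box chains of thought and the reward is the multi-dimensional score above.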

Results & Findings

| Dataset | m_tIoU | Improvement vs. prior SOTA |
| --- | --- | --- |
| HCSTVG‑v1 | 71.2 (↑ 7.3) | Beats the best task‑specific model |
| HCSTVG‑v2 | 68.5 | Comparable to specialized methods |
| VidSTG | 44.1 | Matches dedicated VidSTG models |
  • Open‑vocabulary: When evaluated on a dataset with unseen object names, STVG‑o1 retained >80 % of its performance, demonstrating that the chain‑of‑thought reasoning generalizes beyond the training vocabulary.
  • Ablation: Removing the think‑reward drops m_tIoU by ~2 %, while skipping the chain‑of‑thought step reduces performance by >5 %, confirming both components are essential.
  • Speed: Because the approach reuses the base MLLM inference pipeline, runtime overhead is modest (~1.2× slower than vanilla MLLM inference) and still suitable for interactive applications.
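For reference, the m_tIoU reported above is the mean temporal IoU over the test set. A minimal sketch, assuming intervals are inclusive (start, end) frame pairs:

```python
def t_iou(pred, gt):
    """Temporal IoU between two inclusive (start_frame, end_frame) intervals."""
    ps, pe = pred
    gs, ge = gt
    inter = max(0, min(pe, ge) - max(ps, gs) + 1)
    union = (pe - ps + 1) + (ge - gs + 1) - inter
    return inter / union if union else 0.0

def m_t_iou(preds, gts):
    """Mean temporal IoU across a dataset of predicted/ground-truth intervals."""
    return sum(t_iou(p, g) for p, g in zip(preds, gts)) / len(preds)
```

For example, `t_iou((10, 30), (20, 40))` gives 11/31, roughly 0.355, since the intervals overlap on frames 20 through 30.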

Practical Implications

  • Developer‑friendly integration – Teams can plug STVG‑o1 into existing LLM‑powered products (e.g., chat‑based video assistants, AI video editors) without rewriting model code.
  • Enhanced video search – Precise spatio‑temporal grounding enables “find the moment when the red car passes the bridge” queries, improving content management systems and media archives.
  • Surveillance & robotics – Real‑time grounding of natural‑language commands (e.g., “track the person in the blue jacket for the next 10 seconds”) becomes feasible with off‑the‑shelf models.
  • Open‑vocabulary UI – Users can refer to arbitrary objects or actions, and the system will still locate them, reducing the need for exhaustive label taxonomies.
  • Reduced engineering effort – By avoiding custom vision‑language architectures, companies can leverage the rapid iteration cycles of MLLM ecosystems (updates, scaling, quantization) while still achieving high‑precision grounding.

Limitations & Future Work

  • Data‑efficiency – The reinforcement fine‑tuning still requires a modest amount of annotated video‑grounding data; scaling to truly zero‑shot scenarios remains open.
  • Temporal granularity – The current chain‑of‑thought predicts a box per frame; for very long videos this can be computationally heavy. Future work could explore hierarchical or key‑frame summarization.
  • Robustness to noisy language – Ambiguous or colloquial queries sometimes lead to divergent chain‑of‑thoughts; incorporating uncertainty estimation could improve reliability.
  • Cross‑modal consistency – While the think‑reward encourages logical reasoning, deeper integration of visual attention maps with the textual chain‑of‑thought could further tighten spatial accuracy.

Stay tuned – the authors promise to release code and pretrained checkpoints, which should make experimentation and adoption straightforward for the developer community.

Authors

  • Xin Gu
  • Haoji Zhang
  • Qihang Fan
  • Jingxuan Niu
  • Zhipeng Zhang
  • Libo Zhang
  • Guang Chen
  • Fan Chen
  • Longyin Wen
  • Sijie Zhu

Paper Information

  • arXiv ID: 2511.21375v1
  • Categories: cs.CV
  • Published: November 26, 2025
