[Paper] Thinking With Bounding Boxes: Enhancing Spatio-Temporal Video Grounding via Reinforcement Fine-Tuning
Source: arXiv - 2511.21375v1
Overview
The paper introduces STVG‑o1, a framework that enables off‑the‑shelf multimodal large language models (MLLMs) to excel at spatio‑temporal video grounding (STVG): the task of pinpointing when and where an object described in natural language appears in an untrimmed video. By adding a "bounding‑box chain‑of‑thought" reasoning stage and a multi‑dimensional reinforcement fine‑tuning step, the authors achieve state‑of‑the‑art results without redesigning the underlying model architecture.
Key Contributions
- Bounding‑box chain‑of‑thought: an explicit intermediate reasoning stage where the model predicts a sequence of bounding boxes before emitting the final grounding answer.
- Reinforcement fine‑tuning: a custom reward function that jointly evaluates format correctness, temporal alignment, spatial overlap, consistency, and the quality of the chain‑of‑thought.
- Zero‑modification integration: STVG‑o1 works with any pre‑trained MLLM (e.g., LLaVA, MiniGPT‑4) without architectural changes, turning them into high‑performing STVG systems.
- Open‑vocabulary generalization: the approach transfers across datasets (HCSTVG‑v1/v2, VidSTG) and handles novel object categories not seen during training.
- State‑of‑the‑art performance: beats the previous best task‑specific model by +7.3 % m_tIoU on HCSTVG‑v1 and matches specialized methods on VidSTG, while surpassing all prior MLLM‑based baselines.
Methodology
- Prompt Engineering – The video and the natural‑language query are fed to an existing MLLM together with a chain‑of‑thought template that asks the model to "think step‑by‑step" and output a series of bounding‑box coordinates (frame index plus box); a minimal prompt sketch follows.
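To make this concrete, here is a minimal sketch of what such a template could look like; the wording and the `frame <index>: [...]` output format are illustrative assumptions, not the authors' released prompt:

```python
# Hypothetical chain-of-thought prompt template for STVG; the wording and
# output format are assumptions for illustration, not the paper's prompt.
STVG_COT_TEMPLATE = """You are given a video and a query.
Query: {query}

Think step by step about which frames contain the described target
and where it is located. For each relevant frame, output one line:
frame <index>: [x1, y1, x2, y2]

Finish with the start frame, end frame, and the full box trajectory."""

prompt = STVG_COT_TEMPLATE.format(query="the person in the blue jacket")
```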
- Bounding‑Box Generation – The model produces a spatio‑temporal trajectory as text (e.g., `frame 12: [x1,y1,x2,y2]`); this intermediate output is parsed back into numeric boxes.
- Reinforcement Fine‑Tuning – Using the parsed boxes, a multi‑dimensional reward is computed:
- Format reward – penalizes malformed strings.
- Consistency reward – encourages smooth motion across consecutive frames.
- Temporal reward – aligns predicted start/end frames with ground‑truth.
- Spatial reward – measures IoU (intersection‑over‑union) with the true boxes.
- Think reward – rewards concise, logical chain‑of‑thought narratives.
The model is then updated with a policy‑gradient algorithm (e.g., REINFORCE) to maximize the expected reward, effectively teaching it to "think" in bounding‑box terms; illustrative sketches of the reward computation and the update step follow.
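As a minimal sketch of how the parsed trajectory and the per‑dimension rewards could be implemented (the regular expression, function names, and exact reward definitions are assumptions for illustration, not the paper's code):

```python
import re

# Matches lines like "frame 12: [10, 20, 110, 220]", with optional decimals.
BOX_PATTERN = re.compile(
    r"frame\s+(\d+):\s*\[\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*([\d.]+)\s*\]"
)

def parse_trajectory(text: str) -> dict[int, tuple[float, ...]]:
    """Parse the model's textual trajectory into {frame_index: (x1, y1, x2, y2)}."""
    return {
        int(m.group(1)): tuple(float(g) for g in m.groups()[1:])
        for m in BOX_PATTERN.finditer(text)
    }

def iou(a: tuple, b: tuple) -> float:
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def spatial_reward(pred: dict, gt: dict) -> float:
    """Mean box IoU over ground-truth frames (missing predictions count as 0)."""
    if not gt:
        return 0.0
    return sum(iou(pred[f], gt[f]) for f in gt if f in pred) / len(gt)

def temporal_reward(pred: dict, gt: dict) -> float:
    """Temporal IoU between the predicted and ground-truth frame spans."""
    if not pred or not gt:
        return 0.0
    inter = max(0, min(max(pred), max(gt)) - max(min(pred), min(gt)) + 1)
    union = max(max(pred), max(gt)) - min(min(pred), min(gt)) + 1
    return inter / union
```

In the same spirit, the format reward can simply check that parsing succeeded, and the consistency reward can penalize large IoU drops between boxes in consecutive frames.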
- Final Prediction – After fine‑tuning, the model directly outputs the best‑scoring bounding‑box sequence, which can be used by downstream systems (e.g., video editors, surveillance analytics).
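To close the loop, here is a sketch of the combined reward and a REINFORCE‑style update; the weighting coefficients and the running‑mean baseline are illustrative assumptions, not the paper's reported hyperparameters:

```python
import torch

def total_reward(fmt: float, consistency: float, temporal: float,
                 spatial: float, think: float) -> float:
    """Weighted sum of the five reward dimensions."""
    weights = (0.1, 0.2, 0.3, 0.3, 0.1)  # assumed values, not the paper's
    return sum(w * r for w, r in zip(weights, (fmt, consistency, temporal, spatial, think)))

def reinforce_step(log_probs: torch.Tensor, reward: float, baseline: float,
                   optimizer: torch.optim.Optimizer) -> float:
    """One REINFORCE update: ascend the gradient of (R - b) * sum(log-probs).

    log_probs: per-token log-probabilities of the sampled response.
    baseline:  e.g., a running mean of recent rewards, to reduce variance.
    """
    loss = -(reward - baseline) * log_probs.sum()  # negate: optimizers minimize
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```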
Results & Findings
| Dataset | m_tIoU (%) | Comparison vs. prior SOTA |
|---|---|---|
| HCSTVG‑v1 | 71.2 (↑ 7.3) | Beats the best task‑specific model |
| HCSTVG‑v2 | 68.5 | Comparable to specialized methods |
| VidSTG | 44.1 | Matches dedicated VidSTG models |
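For reference, m_tIoU is conventionally the temporal intersection‑over‑union between the predicted and ground‑truth segments, averaged over the test set; the paper's exact definition may differ in detail:

```latex
\mathrm{m\_tIoU} = \frac{1}{|\mathcal{D}|} \sum_{v \in \mathcal{D}}
  \frac{|\hat{T}_v \cap T_v|}{|\hat{T}_v \cup T_v|}
```

where \hat{T}_v and T_v are the predicted and ground‑truth temporal segments for video v in test set \mathcal{D}.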
- Open‑vocabulary: When evaluated on a dataset with unseen object names, STVG‑o1 retained >80 % of its performance, demonstrating that the chain‑of‑thought reasoning generalizes beyond the training vocabulary.
- Ablation: Removing the think‑reward drops m_tIoU by ~2 %, while skipping the chain‑of‑thought step reduces performance by >5 %, confirming both components are essential.
- Speed: Because the approach reuses the base MLLM inference pipeline, runtime overhead is modest (~1.2× slower than vanilla MLLM inference) and still suitable for interactive applications.
Practical Implications
- Developer‑friendly integration – Teams can plug STVG‑o1 into existing LLM‑powered products (e.g., chat‑based video assistants, AI video editors) without rewriting model code.
- Enhanced video search – Precise spatio‑temporal grounding enables “find the moment when the red car passes the bridge” queries, improving content management systems and media archives.
- Surveillance & robotics – Real‑time grounding of natural‑language commands (e.g., “track the person in the blue jacket for the next 10 seconds”) becomes feasible with off‑the‑shelf models.
- Open‑vocabulary UI – Users can refer to arbitrary objects or actions, and the system will still locate them, reducing the need for exhaustive label taxonomies.
- Reduced engineering effort – By avoiding custom vision‑language architectures, companies can leverage the rapid iteration cycles of MLLM ecosystems (updates, scaling, quantization) while still achieving high‑precision grounding.
Limitations & Future Work
- Data‑efficiency – The reinforcement fine‑tuning still requires a modest amount of annotated video‑grounding data; scaling to truly zero‑shot scenarios remains open.
- Temporal granularity – The current chain‑of‑thought predicts a box per frame; for very long videos this can be computationally heavy. Future work could explore hierarchical or key‑frame summarization.
- Robustness to noisy language – Ambiguous or colloquial queries sometimes lead to divergent chain‑of‑thoughts; incorporating uncertainty estimation could improve reliability.
- Cross‑modal consistency – While the think‑reward encourages logical reasoning, deeper integration of visual attention maps with the textual chain‑of‑thought could further tighten spatial accuracy.
Stay tuned – the authors promise to release code and pretrained checkpoints, which should make experimentation and adoption straightforward for the developer community.
Authors
- Xin Gu
- Haoji Zhang
- Qihang Fan
- Jingxuan Niu
- Zhipeng Zhang
- Libo Zhang
- Guang Chen
- Fan Chen
- Longyin Wen
- Sijie Zhu
Paper Information
- arXiv ID: 2511.21375v1
- Categories: cs.CV
- Published: November 26, 2025
- PDF: https://arxiv.org/pdf/2511.21375v1