[Paper] Thinking With Bounding Boxes: Enhancing Spatio-Temporal Video Grounding via Reinforcement Fine-Tuning
Source: arXiv - 2511.21375v1
Overview
The paper introduces STVG‑o1, a framework that enables off‑the‑shelf multimodal large language models (MLLMs) to excel at spatio‑temporal video grounding (STVG): the task of pinpointing when and where an object described in natural language appears in an untrimmed video. By adding a "bounding‑box chain‑of‑thought" reasoning stage and a multi‑dimensional reinforcement fine‑tuning step, the authors achieve state‑of‑the‑art results without redesigning the underlying model architecture.
Key Contributions
- Bounding‑box chain‑of‑thought: an explicit intermediate reasoning stage where the model predicts a sequence of bounding boxes before emitting the final grounding answer.
- Reinforcement fine‑tuning: a custom reward function that jointly evaluates format correctness, temporal alignment, spatial overlap, consistency, and the quality of the chain‑of‑thought.
- Zero‑modification integration: STVG‑o1 works with any pre‑trained MLLM (e.g., LLaVA, MiniGPT‑4) without architectural changes, turning them into high‑performing STVG systems.
- Open‑vocabulary generalization: the approach transfers across datasets (HCSTVG‑v1/v2, VidSTG) and handles novel object categories not seen during training.
- State‑of‑the‑art performance: beats the previous best task‑specific model by +7.3 % m_tIoU on HCSTVG‑v1 and matches specialized methods on VidSTG, while surpassing all prior MLLM‑based baselines.
Methodology
- Prompt Engineering – The video and the natural‑language query are fed to an existing MLLM together with a chain‑of‑thought template that asks the model to "think step‑by‑step" and output a series of bounding‑box coordinates (frame index plus box); a minimal prompt sketch follows.
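To make this concrete, here is a minimal sketch of what such a template could look like; the wording and the `frame <index>: [...]` output format are illustrative assumptions, not the authors' released prompt:

```python
# Hypothetical chain-of-thought prompt template for STVG; the wording and
# output format are assumptions for illustration, not the paper's prompt.
STVG_COT_TEMPLATE = """You are given a video and a query.
Query: {query}

Think step by step about which frames contain the described target
and where it is located. For each relevant frame, output one line:
frame <index>: [x1, y1, x2, y2]

Finish with the start frame, end frame, and the full box trajectory."""

prompt = STVG_COT_TEMPLATE.format(query="the person in the blue jacket")
```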
- Bounding‑Box Generation – The model produces a spatio‑temporal trajectory as text (e.g., `frame 12: [x1,y1,x2,y2]`); this intermediate output is parsed back into numeric boxes.
- Reinforcement Fine‑Tuning – Using the parsed boxes, a multi‑dimensional reward is computed:
- Format reward – penalizes malformed strings.
- Consistency reward – encourages smooth motion across consecutive frames.
- Temporal reward – aligns predicted start/end frames with ground‑truth.
- Spatial reward – measures IoU (intersection‑over‑union) with the true boxes.
- Think reward – rewards concise, logical chain‑of‑thought narratives.
The model is then updated with a policy‑gradient algorithm (e.g., REINFORCE) to maximize the expected reward, effectively teaching it to "think" in bounding‑box terms; illustrative sketches of the reward computation and the update step follow.
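As a minimal sketch of how the parsed trajectory and the per‑dimension rewards could be implemented (the regular expression, function names, and exact reward definitions are assumptions for illustration, not the paper's code):

```python
import re

# Matches lines like "frame 12: [10, 20, 110, 220]", with optional decimals.
BOX_PATTERN = re.compile(
    r"frame\s+(\d+):\s*\[\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*([\d.]+)\s*\]"
)

def parse_trajectory(text: str) -> dict[int, tuple[float, ...]]:
    """Parse the model's textual trajectory into {frame_index: (x1, y1, x2, y2)}."""
    return {
        int(m.group(1)): tuple(float(g) for g in m.groups()[1:])
        for m in BOX_PATTERN.finditer(text)
    }

def iou(a: tuple, b: tuple) -> float:
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def spatial_reward(pred: dict, gt: dict) -> float:
    """Mean box IoU over ground-truth frames (missing predictions count as 0)."""
    if not gt:
        return 0.0
    return sum(iou(pred[f], gt[f]) for f in gt if f in pred) / len(gt)

def temporal_reward(pred: dict, gt: dict) -> float:
    """Temporal IoU between the predicted and ground-truth frame spans."""
    if not pred or not gt:
        return 0.0
    inter = max(0, min(max(pred), max(gt)) - max(min(pred), min(gt)) + 1)
    union = max(max(pred), max(gt)) - min(min(pred), min(gt)) + 1
    return inter / union
```

In the same spirit, the format reward can simply check that parsing succeeded, and the consistency reward can penalize large IoU drops between boxes in consecutive frames.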
- Final Prediction – After fine‑tuning, the model directly outputs the best‑scoring bounding‑box sequence, which can be used by downstream systems (e.g., video editors, surveillance analytics).
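To close the loop, here is a sketch of the combined reward and a REINFORCE‑style update; the weighting coefficients and the running‑mean baseline are illustrative assumptions, not the paper's reported hyperparameters:

```python
import torch

def total_reward(fmt: float, consistency: float, temporal: float,
                 spatial: float, think: float) -> float:
    """Weighted sum of the five reward dimensions."""
    weights = (0.1, 0.2, 0.3, 0.3, 0.1)  # assumed values, not the paper's
    return sum(w * r for w, r in zip(weights, (fmt, consistency, temporal, spatial, think)))

def reinforce_step(log_probs: torch.Tensor, reward: float, baseline: float,
                   optimizer: torch.optim.Optimizer) -> float:
    """One REINFORCE update: ascend the gradient of (R - b) * sum(log-probs).

    log_probs: per-token log-probabilities of the sampled response.
    baseline:  e.g., a running mean of recent rewards, to reduce variance.
    """
    loss = -(reward - baseline) * log_probs.sum()  # negate: optimizers minimize
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```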
Results & Findings
| Dataset | m_tIoU (%) | Comparison vs. prior SOTA |
|---|---|---|
| HCSTVG‑v1 | 71.2 (↑ 7.3) | Beats the best task‑specific model |
| HCSTVG‑v2 | 68.5 | Comparable to specialized methods |
| VidSTG | 44.1 | Matches dedicated VidSTG models |
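For reference, m_tIoU is conventionally the temporal intersection‑over‑union between the predicted and ground‑truth segments, averaged over the test set; the paper's exact definition may differ in detail:

```latex
\mathrm{m\_tIoU} = \frac{1}{|\mathcal{D}|} \sum_{v \in \mathcal{D}}
  \frac{|\hat{T}_v \cap T_v|}{|\hat{T}_v \cup T_v|}
```

where \hat{T}_v and T_v are the predicted and ground‑truth temporal segments for video v in test set \mathcal{D}.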
- Open‑vocabulary: When evaluated on a dataset with unseen object names, STVG‑o1 retained >80 % of its performance, demonstrating that the chain‑of‑thought reasoning generalizes beyond the training vocabulary.
- Ablation: Removing the think‑reward drops m_tIoU by ~2 %, while skipping the chain‑of‑thought step reduces performance by >5 %, confirming both components are essential.
- Speed: Because the approach reuses the base MLLM inference pipeline, runtime overhead is modest (~1.2× slower than vanilla MLLM inference) and still suitable for interactive applications.
Practical Implications
- Developer‑friendly integration – Teams can plug STVG‑o1 into existing LLM‑powered products (e.g., chat‑based video assistants, AI video editors) without rewriting model code.
- Enhanced video search – Precise spatio‑temporal grounding enables “find the moment when the red car passes the bridge” queries, improving content management systems and media archives.
- Surveillance & robotics – Real‑time grounding of natural‑language commands (e.g., “track the person in the blue jacket for the next 10 seconds”) becomes feasible with off‑the‑shelf models.
- Open‑vocabulary UI – Users can refer to arbitrary objects or actions, and the system will still locate them, reducing the need for exhaustive label taxonomies.
- Reduced engineering effort – By avoiding custom vision‑language architectures, companies can leverage the rapid iteration cycles of MLLM ecosystems (updates, scaling, quantization) while still achieving high‑precision grounding.
Limitations & Future Work
- Data‑efficiency – The reinforcement fine‑tuning still requires a modest amount of annotated video‑grounding data; scaling to truly zero‑shot scenarios remains open.
- Temporal granularity – The current chain‑of‑thought predicts a box per frame; for very long videos this can be computationally heavy. Future work could explore hierarchical or key‑frame summarization.
- Robustness to noisy language – Ambiguous or colloquial queries sometimes lead to divergent chain‑of‑thoughts; incorporating uncertainty estimation could improve reliability.
- Cross‑modal consistency – While the think‑reward encourages logical reasoning, deeper integration of visual attention maps with the textual chain‑of‑thought could further tighten spatial accuracy.
Stay tuned – the authors promise to release code and pretrained checkpoints, which should make experimentation and adoption straightforward for the developer community.
Authors
- Xin Gu
- Haoji Zhang
- Qihang Fan
- Jingxuan Niu
- Zhipeng Zhang
- Libo Zhang
- Guang Chen
- Fan Chen
- Longyin Wen
- Sijie Zhu
Paper Information
- arXiv ID: 2511.21375v1
- Categories: cs.CV
- Published: November 26, 2025
- PDF: https://arxiv.org/pdf/2511.21375v1