[Paper] Video-CoM: Interactive Video Reasoning via Chain of Manipulations
Source: arXiv - 2511.23477v1
Overview
The paper introduces Video‑CoM, a new approach for “interactive video reasoning” that lets a model treat a video as an active workspace rather than a static snapshot. By iteratively performing visual manipulations—rewinding, zooming, focusing on regions, and extracting frames—the model can gather evidence step‑by‑step, leading to deeper spatio‑temporal understanding and higher accuracy on challenging video‑question‑answering tasks.
Key Contributions
- Interactive Reasoning Paradigm: Shifts from passive “think‑only‑once” video encoding to a loop where the model can re‑watch and refocus on video segments during inference.
- Chain of Manipulations (CoM): A structured sequence of visual actions (e.g., temporal cropping, spatial zoom, object tracking) that the model learns to execute to collect evidence.
- Video‑CoM‑Instruct Dataset: 18 K instruction‑tuned examples specifically designed for multi‑step manipulation reasoning (a hypothetical sample format is sketched after this list).
- GRPO Training: A Group Relative Policy Optimization reinforcement‑learning stage that supplies step‑level reasoning rewards, encouraging consistent and grounded manipulation policies.
- Strong Empirical Gains: Improves average performance by 3.6 % across nine video‑reasoning benchmarks while using only ~25 K supervised and 3 K RL samples—far fewer than competing large‑scale models.
- Interpretability: The manipulation chain is human‑readable, making it easier to debug and trust model decisions.
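To make the multi‑step format concrete, the sketch below shows what a single Video‑CoM‑Instruct record could look like. The field names, file name, and action syntax are illustrative assumptions, not the dataset's published schema.

```python
# Hypothetical sketch of one multi-step manipulation record.
# Field names, file names, and action syntax are assumptions for illustration;
# the actual Video-CoM-Instruct schema may differ.
sample = {
    "video": "kitchen_0423.mp4",  # hypothetical clip
    "question": "What does the person pick up after opening the fridge?",
    "steps": [
        {"action": "seek(t=12.0)",
         "rationale": "Rewind to just after the fridge is opened."},
        {"action": "crop(x=120, y=80, w=200, h=200)",
         "rationale": "Zoom in on the hand near the fridge shelf."},
        {"action": "sample_frame()",
         "rationale": "Extract the frame showing the grasped object."},
    ],
    "answer": "A carton of milk.",
}
```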
Methodology
- Video Workspace: The raw video is stored as a mutable buffer. The model can query this buffer with actions such as seek(t), crop(x, y, w, h), track(object), or sample_frame() (see the loop sketch after this list).
- Language‑Vision Loop:
  1. The LLM receives a textual prompt (question + instructions).
  2. It outputs a manipulation command plus a short textual rationale.
  3. The command is executed on the video buffer, producing a visual observation (e.g., a cropped frame).
  4. The observation is encoded and fed back to the LLM, which decides the next step.
  5. The loop continues until a termination token signals that enough evidence has been gathered to answer.
- Training Pipeline:
  - Supervised Fine‑Tuning (SFT) on the Video‑CoM‑Instruct dataset teaches the model the basic mapping from questions to manipulation sequences.
  - Reinforcement Learning (GRPO) refines the policy using two reward signals: (a) answer correctness (sparse) and (b) step‑level reasoning quality (dense), measured by alignment between generated rationales and ground‑truth evidence (a minimal reward sketch follows this list).
- Model Architecture: A frozen multimodal encoder (e.g., CLIP‑ViT) processes visual observations, while a decoder‑only LLM (e.g., LLaMA‑2) handles the language side and predicts the next action token sequence.
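The sketch below is a minimal, self‑contained rendering of the workspace actions and the language‑vision loop described above. The class and callback names (VideoBuffer, propose_step, encode_observation) are assumptions for illustration; the paper specifies the actions conceptually rather than as this exact interface.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class VideoBuffer:
    """Mutable video workspace the model can query with API-like actions."""
    frames: List[str]            # stand-in for decoded frames (e.g., arrays)
    fps: float = 1.0
    cursor: int = 0

    def seek(self, t: float) -> str:
        """Move the temporal cursor to time t (seconds) and return that frame."""
        self.cursor = min(int(t * self.fps), len(self.frames) - 1)
        return self.frames[self.cursor]

    def sample_frame(self) -> str:
        """Return the frame at the current cursor."""
        return self.frames[self.cursor]

    def crop(self, x: int, y: int, w: int, h: int) -> str:
        """Return a spatially cropped view of the current frame (stubbed)."""
        return f"{self.frames[self.cursor]}[crop {x},{y},{w},{h}]"

def interactive_reasoning(question: str, video: VideoBuffer,
                          propose_step: Callable, encode_observation: Callable,
                          max_steps: int = 8) -> Optional[str]:
    """Language-vision loop: the LLM proposes a manipulation, the buffer
    executes it, and the encoded observation is appended to the context,
    until a terminal 'answer' step (or the step budget) ends the loop."""
    context = [question]
    for _ in range(max_steps):
        step = propose_step(context)              # e.g. {"action": "seek", "args": {"t": 3.0}}
        if step["action"] == "answer":
            return step["text"]                   # termination token reached
        observation = getattr(video, step["action"])(**step.get("args", {}))
        context.append(encode_observation(observation))
    return None                                   # evidence budget exhausted
```

In the full system, propose_step would be the decoder‑only LLM emitting action tokens and encode_observation the frozen multimodal encoder from the architecture above.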
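The reward design is summarized here only at a high level; the following sketch shows one plausible way to combine the sparse answer reward with a dense step‑level alignment reward, plus the group‑relative normalization that gives GRPO its name. The overlap metric and weights are assumptions, not the paper's exact formulation.

```python
from statistics import mean, pstdev
from typing import List, Tuple

def step_reward(pred_spans: List[Tuple[float, float]],
                gt_spans: List[Tuple[float, float]]) -> float:
    """Dense reward: temporal IoU between the segments the model manipulated
    and the ground-truth evidence segments (an assumed alignment metric)."""
    def iou(a, b):
        inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
        union = max(a[1], b[1]) - min(a[0], b[0])
        return inter / union if union > 0 else 0.0
    if not pred_spans or not gt_spans:
        return 0.0
    return mean(max(iou(p, g) for g in gt_spans) for p in pred_spans)

def trajectory_reward(answer_correct: bool,
                      pred_spans, gt_spans,
                      w_answer: float = 1.0, w_step: float = 0.5) -> float:
    """Sparse answer-correctness reward plus dense step-level reasoning reward
    (the weights are illustrative assumptions)."""
    return w_answer * float(answer_correct) + w_step * step_reward(pred_spans, gt_spans)

def group_relative_advantages(rewards: List[float]) -> List[float]:
    """GRPO-style advantage: normalize each rollout's reward against the
    mean and standard deviation of its group of sampled rollouts."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]
```

Each training question would be answered by a small group of sampled manipulation chains, scored with trajectory_reward, and updated in proportion to their group‑relative advantages.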
Results & Findings
| Benchmark | Prior SOTA | Video‑CoM (ours) | Δ |
|---|---|---|---|
| MSVD‑QA | 71.2 % | 75.4 % | +4.2 |
| TGIF‑QA | 68.9 % | 73.1 % | +4.2 |
| ActivityNet‑QA | 63.5 % | 66.8 % | +3.3 |
| … (9 benchmarks in total) | — | — | +3.6 avg |
- Sample Efficiency: Achieves these gains with only ~28 K total training examples, compared to >1 M video‑text pairs used by many competing MLLMs.
- Ablation: Removing step‑level rewards drops accuracy by ~2 % and makes the manipulation chain noisier, confirming the importance of reasoning‑aware RL.
- Interpretability: Visualizing the manipulation chain reveals that the model often isolates the exact temporal window and spatial region needed to answer the question, something baseline models cannot do.
Practical Implications
- Debuggable Video AI: Developers can inspect the manipulation chain to understand why a model answered a certain way, simplifying troubleshooting in safety‑critical domains (e.g., autonomous driving video logs).
- Reduced Data Costs: The sample‑efficient training regime means companies can fine‑tune powerful video reasoners on proprietary video corpora without needing massive annotation budgets.
- Enhanced Interactive Apps: Voice‑controlled assistants, video editors, or surveillance analytics can ask follow‑up questions (“show me the moment the person turned left”) and receive concrete visual evidence generated on‑the‑fly.
- Modular Integration: Because the visual actions are defined as API‑like commands, Video‑CoM can be plugged into existing video pipelines (FFmpeg, OpenCV) without redesigning the whole model stack (see the OpenCV sketch after this list).
- Better Grounding for LLMs: The approach demonstrates a concrete path to give large language models active perception capabilities, a stepping stone toward more general AI assistants that can manipulate their sensory inputs.
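As an illustration of the modular‑integration point above, a single manipulation command can be executed with off‑the‑shelf video tooling. The sketch below maps a hypothetical {action, t, x, y, w, h} command onto OpenCV calls; the command schema is an assumption, not an interface defined by the paper.

```python
import cv2  # OpenCV (pip install opencv-python)

def execute_manipulation(video_path: str, command: dict):
    """Execute a hypothetical manipulation command against a video file.
    Only temporal seek and spatial crop are sketched here."""
    cap = cv2.VideoCapture(video_path)
    try:
        # Temporal seek: jump to the requested timestamp (in milliseconds).
        cap.set(cv2.CAP_PROP_POS_MSEC, command.get("t", 0.0) * 1000.0)
        ok, frame = cap.read()
        if not ok:
            return None
        if command.get("action") == "crop":
            x, y, w, h = command["x"], command["y"], command["w"], command["h"]
            frame = frame[y:y + h, x:x + w]    # spatial zoom on the region of interest
        return frame                           # numpy array to hand to the vision encoder
    finally:
        cap.release()

# Example: "show me the moment at 12.5 s, zoomed on the top-left region"
# frame = execute_manipulation("clip.mp4",
#                              {"action": "crop", "t": 12.5,
#                               "x": 0, "y": 0, "w": 320, "h": 240})
```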
Limitations & Future Work
- Action Space Simplicity: Current manipulations are limited to basic cropping, temporal seeking, and object tracking; richer actions (e.g., optical‑flow analysis, 3D pose estimation) could further boost reasoning depth.
- Scalability to Long‑Form Video: The workspace assumes the entire video can be randomly accessed; streaming or extremely long videos may require hierarchical buffering strategies.
- Reward Design: While step‑level reasoning rewards improve performance, they rely on heuristics (e.g., overlap with ground‑truth evidence) that may not generalize to completely novel domains.
- Generalization to Multi‑Modal Inputs: Extending the paradigm to audio, subtitles, or sensor data remains an open challenge.
The authors suggest exploring richer manipulation primitives, hierarchical video memory, and joint audio‑visual reasoning as next steps.
Authors
- Hanoona Rasheed
- Mohammed Zumri
- Muhammad Maaz
- Ming-Hsuan Yang
- Fahad Shahbaz Khan
- Salman Khan
Paper Information
- arXiv ID: 2511.23477v1
- Categories: cs.CV
- Published: November 28, 2025