[Paper] CoV: Chain-of-View Prompting for Spatial Reasoning
Source: arXiv - 2601.05172v1
Overview
The paper “CoV: Chain‑of‑View Prompting for Spatial Reasoning” tackles a core bottleneck in embodied question answering (EQA) – how a vision‑language model (VLM) can gather the right visual context when the answer is spread across many viewpoints in a 3‑D scene. By turning a static VLM into an active observer that decides where to look next, the authors achieve large, training‑free gains on several benchmark datasets.
Key Contributions
- Chain‑of‑View (CoV) prompting: a test‑time framework that lets any off‑the‑shelf VLM iteratively select and move to new camera viewpoints, mimicking a human’s “look‑around” behavior.
- View Selection agent: a lightweight module that filters out redundant frames and picks an initial “anchor” view aligned with the question, reducing unnecessary computation (a minimal scoring sketch follows this list).
- Fine‑grained view adjustment loop: interleaves LLM‑style reasoning with discrete camera actions, pulling fresh observations from the underlying 3‑D scene until enough evidence is collected or a step budget expires.
- Model‑agnostic performance boost: across four mainstream VLMs (e.g., Qwen‑3‑VL‑Flash, Gemini‑2.5‑Flash) the method adds an average +11.56 % in LLM‑Match accuracy on the OpenEQA benchmark, with up to +13.62 % for a single model.
- Scalable test‑time budget: increasing the allowed number of view‑shifts yields further improvements (up to +3.73 %), showing the approach can trade compute for accuracy.
- Strong cross‑dataset results: competitive CIDEr and exact‑match scores on ScanQA and SQA3D without any additional training data.
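The paper describes the View Selection agent only at a high level (a simple similarity scorer). The sketch below shows one plausible way to implement the duplicate filtering and anchor-view scoring with cosine similarity over precomputed embeddings (e.g., from a CLIP-style encoder). The function names, the similarity threshold, and the embedding assumption are illustrative, not taken from the paper.

```python
# Minimal, illustrative sketch of a similarity-based view selector.
# Assumes the question and each candidate frame have already been embedded
# into a shared space (e.g., with a CLIP-style encoder); not the paper's code.
import numpy as np

def dedup_frames(frame_embs: np.ndarray, sim_threshold: float = 0.95) -> list[int]:
    """Greedily drop frames that are near-duplicates of an already kept frame."""
    normed = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    kept: list[int] = []
    for i in range(len(normed)):
        if all(float(normed[i] @ normed[j]) < sim_threshold for j in kept):
            kept.append(i)
    return kept

def select_anchor_view(question_emb: np.ndarray, frame_embs: np.ndarray) -> int:
    """Return the index of the frame whose embedding best matches the question."""
    kept = dedup_frames(frame_embs)
    q = question_emb / np.linalg.norm(question_emb)
    f = frame_embs[kept] / np.linalg.norm(frame_embs[kept], axis=1, keepdims=True)
    scores = f @ q                      # cosine similarity per surviving frame
    return kept[int(np.argmax(scores))]
```

The selected index would then seed the fine-grained view-adjustment loop described in the Methodology section below.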
Methodology
- Input – A static VLM receives a set of pre‑rendered images from a 3‑D environment and a natural‑language question.
- Coarse view selection – A View Selection agent (implemented as a simple similarity scorer) evaluates all available frames, discarding duplicates and picking the most question‑relevant “anchor” view.
- Iterative fine‑grained search – Starting from the anchor, the system enters a loop (a minimal code sketch follows this section):
  - The VLM processes the current view together with the question and produces a short reasoning snippet.
  - Based on this snippet, a discrete camera policy decides the next action (e.g., rotate left, move forward).
  - The environment renders the new viewpoint, which is fed back into the VLM.
  - The loop stops when a confidence threshold is met or a pre‑defined step budget is exhausted.
- Answer extraction – The final reasoning output is parsed by the VLM’s language head to produce the answer.
The whole pipeline requires no gradient updates; it works as a plug‑in on top of any existing VLM.
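As a companion to the steps above, here is a minimal, hypothetical sketch of the coarse-to-fine loop. The `VLM` and `Environment` interfaces, the action names, and the `chain_of_view` function are illustrative assumptions rather than an API from the paper, and a self-issued "answer" action stands in for the confidence check.

```python
# Illustrative control loop for a CoV-style coarse-to-fine search.
# The interfaces below are hypothetical stand-ins; the paper does not
# prescribe a specific API or prompt format.
from typing import Protocol

ACTIONS = ["rotate_left", "rotate_right", "move_forward", "answer"]

class VLM(Protocol):
    def reason(self, image, question: str, history: list[str]) -> tuple[str, str]:
        """Return a (reasoning_snippet, action) pair; action is one of ACTIONS."""
        ...

    def answer(self, image, question: str, history: list[str]) -> str:
        """Produce the final answer from the current view and accumulated reasoning."""
        ...

class Environment(Protocol):
    def render(self, action: str):
        """Apply a discrete camera action and return the newly rendered view."""
        ...

def chain_of_view(vlm: VLM, env: Environment, anchor_view, question: str,
                  step_budget: int = 5) -> str:
    """Interleave VLM reasoning with discrete camera moves until the model
    signals it has enough evidence or the step budget is exhausted."""
    view, history = anchor_view, []
    for _ in range(step_budget):
        snippet, action = vlm.reason(view, question, history)
        history.append(snippet)
        if action == "answer":        # stand-in for the paper's confidence check
            break
        view = env.render(action)     # fresh observation from the 3-D scene
    return vlm.answer(view, question, history)
```

In this sketch, `anchor_view` would be the frame chosen by the view-selection step, and `step_budget` corresponds to the test-time view-shift budget whose scaling behavior is reported in the results.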
Results & Findings
| Benchmark | Baseline (no CoV) | With CoV | Best single-model gain |
|---|---|---|---|
| OpenEQA (LLM‑Match) | – | +11.56 % (avg. gain) | +13.62 % (Qwen‑3‑VL‑Flash) |
| OpenEQA (budget scaling) | – | +2.51 % (avg. gain) | +3.73 % (Gemini‑2.5‑Flash) |
| ScanQA (CIDEr / EM@1) | – | 116 CIDEr / 31.9 % (absolute) | – |
| SQA3D (EM@1) | – | 51.1 % (absolute) | – |
Key takeaways
- The improvement is consistent across models, confirming that CoV is truly model‑agnostic.
- Adding more view‑shifts yields diminishing but still positive returns, indicating a practical trade‑off between latency and accuracy.
- Even on datasets not used during development (ScanQA, SQA3D), CoV delivers strong absolute scores, suggesting good generalization.
Practical Implications
- Robotics & AR/VR – Developers building embodied agents (e.g., home robots, virtual assistants) can plug CoV into their perception stack to let the robot “look around” for missing clues without retraining the visual backbone.
- Zero‑shot deployment – Since CoV works at inference time only, companies can upgrade existing VLM‑powered products with better spatial reasoning simply by adding the view‑selection and action loop.
- Cost‑effective scaling – The method lets teams balance compute budget against answer quality: use a tighter step budget for latency‑critical applications, or a larger budget when accuracy is paramount (e.g., inspection drones).
- Cross‑modal research – The coarse‑to‑fine prompting paradigm can inspire similar active‑query techniques for audio, multimodal navigation, or even code‑base exploration where “views” are abstract states rather than camera angles.
Limitations & Future Work
- Discrete action space – The current camera policy uses a small set of predefined moves; finer or continuous motions could capture subtler context but would require more sophisticated planning.
- Step‑budget dependency – While performance scales with more steps, real‑time systems may be constrained by latency; adaptive budgeting strategies are an open question.
- Environment fidelity – Experiments rely on simulated 3‑D datasets; transferring to noisy, real‑world sensor streams (e.g., depth noise, lighting changes) may expose robustness gaps.
- View selection heuristics – The anchor‑view selector is a simple similarity filter; learning a more nuanced selector (perhaps via reinforcement learning) could further reduce unnecessary views.
The authors suggest exploring continuous camera controls, adaptive budgeting, and real‑world robot trials as next steps.
Authors
- Haoyu Zhao
- Akide Liu
- Zeyu Zhang
- Weijie Wang
- Feng Chen
- Ruihan Zhu
- Gholamreza Haffari
- Bohan Zhuang
Paper Information
- arXiv ID: 2601.05172v1
- Categories: cs.CV, cs.AI
- Published: January 8, 2026