[Paper] Chatting with Images for Introspective Visual Thinking
Source: arXiv - 2602.11073v1
Overview
The paper introduces ViLaVT, a new large vision‑language model that treats visual manipulation as a “chat” between language prompts and image features. By letting the model talk to the visual encoder and re‑encode image regions on the fly, it preserves fine‑grained visual details and improves reasoning over multiple images or video frames—something that traditional single‑pass LVLMs struggle with.
Key Contributions
- Chat‑style visual reasoning: Reframes image manipulation as language‑guided feature modulation, enabling dynamic, joint re‑encoding of image regions.
- Dynamic vision encoder: Designed to accept and act on textual prompts, updating visual representations iteratively rather than once.
- Two‑stage curriculum: Combines supervised fine‑tuning with reinforcement learning to teach the model when and how to request visual updates.
- Broad benchmark gains: Sets new state‑of‑the‑art results on eight vision‑language tasks, especially those requiring multi‑image or video spatial reasoning.
- Open‑source implementation: Provides code and pretrained weights, facilitating reproducibility and downstream integration.
Methodology
- Prompt‑driven feature modulation:
  - The language model generates a textual instruction (e.g., “focus on the leftmost object” or “compare the size of the two cars”).
  - This instruction is encoded and fed into a modulation network that adjusts the attention weights of the vision encoder, effectively “telling” it which regions to re‑process.
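The modulation step can be pictured as a simple re-weighting of region-level attention. The sketch below is illustrative only (the function name, mask representation, and `gain` parameter are our assumptions, not the paper's implementation): it assumes the instruction has already been mapped to a per-region relevance mask, then boosts attention on the flagged regions and renormalizes.

```python
# Schematic sketch of language-guided attention modulation (not ViLaVT's
# actual code): the instruction selects regions; we boost and renormalize.

def modulate_attention(attn_weights, region_mask, gain=2.0):
    """Upweight attention on regions flagged by the textual instruction.

    attn_weights: non-negative floats, one per image region.
    region_mask:  0/1 flags, 1 = region named by the instruction.
    gain:         multiplicative boost for flagged regions (assumed form).
    """
    boosted = [w * (gain if m else 1.0)
               for w, m in zip(attn_weights, region_mask)]
    total = sum(boosted)
    return [w / total for w in boosted]

# Example: four regions, instruction selects the leftmost (index 0).
weights = modulate_attention([0.25, 0.25, 0.25, 0.25], [1, 0, 0, 0])
# → [0.4, 0.2, 0.2, 0.2]: attention shifts toward the named region.
```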
- Dynamic Vision Encoder (DVE):
  - Built on a transformer backbone that can be invoked multiple times within a single inference pass.
  - Each invocation takes the current visual features plus the modulation signal and outputs an updated representation.
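The key property is statefulness across invocations: each call refines the previous features rather than re-encoding from scratch. A minimal sketch in plain Python, with a made-up blending update standing in for the transformer backbone:

```python
# Minimal sketch of an encoder invoked repeatedly in one inference pass
# (class name and update rule are illustrative, not the paper's design).

class DynamicVisionEncoder:
    def __init__(self, blend=0.5):
        self.blend = blend  # how strongly each update moves old features

    def encode(self, features, modulation):
        # One invocation: shift each feature toward its modulated target.
        return [f + self.blend * (m - f)
                for f, m in zip(features, modulation)]

dve = DynamicVisionEncoder()
feats = [0.0, 1.0]
feats = dve.encode(feats, [1.0, 1.0])  # first instruction
feats = dve.encode(feats, [0.0, 1.0])  # follow-up instruction refines again
# → [0.25, 1.0]: features accumulate the effect of successive instructions.
```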
- Training pipeline:
  - Stage 1 – Supervised fine‑tuning: The model learns to follow ground‑truth reasoning traces (e.g., step‑by‑step explanations) on a curated dataset of image‑question‑answer pairs.
  - Stage 2 – Reinforcement learning (RL): Using a reward model that scores answer correctness and reasoning coherence, the system learns when to request additional visual updates versus when to answer directly.
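One way to read the Stage 2 objective is as reward shaping that trades answer quality against the cost of extra visual updates. The weights and functional form below are invented for illustration; the paper's actual reward model is learned, not hand-coded:

```python
# Illustrative reward shaping for the RL stage (made-up weights): reward
# correctness and coherence, charge a small cost per visual update so the
# policy asks for re-encoding only when it actually helps.

def rl_reward(correct, coherence, num_visual_queries,
              w_correct=1.0, w_coherence=0.5, query_cost=0.05):
    return (w_correct * float(correct)
            + w_coherence * coherence
            - query_cost * num_visual_queries)

# A correct, coherent answer reached with one visual update...
r_lean = rl_reward(True, 0.9, 1)
# ...scores higher than the same answer padded with needless queries.
r_padded = rl_reward(True, 0.9, 5)
```

Under this reading, the policy learns "when to request additional visual updates versus when to answer directly" because each redundant query strictly lowers the return.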
- Inference loop:
  - The system alternates between generating a reasoning step (text) and, if needed, issuing a visual query that triggers the DVE to re‑encode targeted regions.
  - This loop continues until the model decides it has enough visual evidence to produce the final answer.
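The alternation described above can be sketched as plain control flow. Here `llm_step` and `dve_reencode` are hypothetical stand-ins for the language model and the dynamic encoder; the action names and step budget are assumptions:

```python
# Control-flow sketch of the inference loop: alternate between textual
# reasoning steps and visual queries until the model commits to an answer.

def answer(question, features, llm_step, dve_reencode, max_steps=8):
    trace = []
    for _ in range(max_steps):
        action, payload = llm_step(question, features, trace)
        if action == "answer":         # enough visual evidence gathered
            return payload, trace
        trace.append(payload)          # record the reasoning step
        if action == "visual_query":   # re-encode the targeted regions
            features = dve_reencode(features, payload)
    return None, trace                 # step budget exhausted
```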
Results & Findings
| Benchmark | Baseline LVLM | ViLaVT (ours) | Δ (absolute) |
|---|---|---|---|
| VQA‑2 (single image) | 71.3% | 73.8% | +2.5 |
| NLVR2 (multi‑image) | 68.1% | 74.6% | +6.5 |
| VideoQA (temporal reasoning) | 62.4% | 70.2% | +7.8 |
| RefCOCO (referring expression) | 78.9% | 81.5% | +2.6 |
| Average across 8 tasks | 70.2% | 77.1% | +6.9 |
- The biggest jumps appear on tasks that require spatial relationships across distant regions or multiple frames, confirming that the interactive re‑encoding effectively preserves fine‑grained visual cues.
- Ablation studies show that removing the RL stage drops performance by ~3 pts, while disabling the dynamic encoder (fallback to single‑pass) reduces gains to <1 pt, highlighting the synergy of both components.
Practical Implications
- Enhanced multimodal assistants: Developers can build chatbots that ask follow‑up visual questions (“show me the left side again”) without re‑sending the whole image, saving bandwidth and latency.
- Robust visual QA for surveillance & robotics: Systems can iteratively focus on regions of interest (e.g., “track the moving object across frames”) while maintaining a coherent textual narrative.
- Improved content moderation: By dynamically zooming into suspicious areas based on textual cues, models can better detect policy violations in images or short videos.
- Tool integration: The architecture fits neatly into existing LLM pipelines (e.g., OpenAI’s function calling) – the “function” simply triggers the DVE with a modulation payload.
- Developer-friendly API: The open‑source repo ships with a Python wrapper that abstracts the chat loop into a single `ask(image, question)` call, handling internal re‑encoding automatically.
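A wrapper like that might look roughly as follows. This is a hypothetical sketch, not the repo's actual API: the class name, constructor arguments, and callables are all invented to show how the chat loop can be hidden behind one method.

```python
# Hypothetical wrapper hiding the reasoning/re-encoding loop behind one
# ask() call (names are illustrative, not the released repo's API).

class ViLaVTWrapper:
    def __init__(self, model_step, encoder):
        self.model_step = model_step  # one LLM reasoning-or-query step
        self.encoder = encoder        # dynamic vision encoder callable

    def ask(self, image, question, max_steps=8):
        features = self.encoder(image, instruction=None)  # initial encode
        for _ in range(max_steps):
            done, output = self.model_step(question, features)
            if done:
                return output  # final textual answer
            # output is a visual instruction: re-encode and continue
            features = self.encoder(image, instruction=output)
        raise RuntimeError("no answer within step budget")
```

A caller then sees only `wrapper.ask(image, question)`; the alternation between reasoning steps and re-encoding stays internal.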
Limitations & Future Work
- Computation cost: Re‑encoding image regions multiple times increases GPU memory and inference latency compared to static encoders.
- Prompt design sensitivity: The quality of the language‑driven modulation depends on well‑crafted prompts; noisy or ambiguous instructions can lead to unnecessary visual updates.
- Scalability to long videos: Current experiments cap at short clips (≤5 s). Extending the approach to hour‑long streams will require smarter temporal summarization.
- Future directions: The authors suggest integrating learned prompt generators to automate modulation signals, exploring sparse attention tricks to reduce compute, and applying the framework to 3‑D point‑cloud reasoning.
Authors
- Junfei Wu
- Jian Guan
- Qiang Liu
- Shu Wu
- Liang Wang
- Wei Wu
- Tieniu Tan
Paper Information
- arXiv ID: 2602.11073v1
- Categories: cs.CV, cs.AI, cs.CL
- Published: February 11, 2026
- PDF: Download PDF