[Paper] No Labels, No Problem: Training Visual Reasoners with Multimodal Verifiers

Published: December 9, 2025 at 01:30 PM EST
4 min read
Source: arXiv - 2512.08889v1

Overview

The paper introduces Valor, a new training framework that teaches visual reasoning systems to answer spatial queries without any human‑annotated labels. A large language model (LLM) and a vision‑language model (VLM) act as “verifiers”: their automated feedback critiques and refines the system’s reasoning and grounding, turning training into a self‑supervised loop. The result is a system that grounds objects more accurately and reasons about their relationships more reliably than existing open‑source models and even many proprietary ones.

Key Contributions

  • Annotation‑free training pipeline that jointly improves reasoning (via an LLM verifier) and visual grounding (via a VLM verifier).
  • Reinforcement‑learning loop in which the reasoning model’s chain of thought is refined using reward signals from the LLM verifier.
  • Automated hard‑negative mining for the VLM verifier, which generates challenging false visual matches to sharpen grounding without any labeled bounding boxes (see the sketch after this list).
  • Unified architecture that leverages the strengths of language‑only reasoning models and specialist vision models, avoiding the brittle program‑synthesis approaches of prior work.
  • State‑of‑the‑art performance on several benchmark spatial reasoning tasks, surpassing both open‑source and commercial baselines.
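
To make the hard‑negative idea concrete, here is a minimal sketch of how relation‑flipping negatives could be mined from a spatial query and a set of proposed boxes. The perturbation rules, function names, and data layout below are assumptions for illustration, not the paper’s actual mining procedure.

```python
# Minimal sketch of automated hard-negative mining for a grounding verifier.
# All function names, data layouts, and rules here are illustrative only.
import random

SPATIAL_FLIPS = {
    "left of": "right of",
    "right of": "left of",
    "above": "below",
    "below": "above",
}

def make_hard_negatives(query: str, boxes: list[dict]) -> list[dict]:
    """Create challenging false matches from a spatial query and proposed boxes.

    Two simple strategies:
      1. Flip the spatial relation in the query text (e.g. "left of" -> "right of").
      2. Swap the subject/object boxes so the regions no longer match the phrase.
    """
    negatives = []
    for rel, flipped in SPATIAL_FLIPS.items():
        if rel in query:
            # Text no longer matches the regions -> label 0 (mismatch).
            negatives.append({"query": query.replace(rel, flipped),
                              "boxes": boxes, "label": 0})
    if len(boxes) >= 2:
        swapped = [boxes[1], boxes[0], *boxes[2:]]
        negatives.append({"query": query, "boxes": swapped, "label": 0})
    return negatives

def verifier_training_pairs(query: str, boxes: list[dict]) -> list[dict]:
    """Pair the model's own (assumed-correct) grounding with mined negatives."""
    positives = [{"query": query, "boxes": boxes, "label": 1}]
    hard_negs = make_hard_negatives(query, boxes)
    random.shuffle(hard_negs)
    return positives + hard_negs[:2]  # cap the negatives per example
```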

Methodology

  1. Query Decomposition – An LLM receives a natural‑language spatial question (e.g., “Is the red ball left of the blue cube?”) and produces a step‑by‑step chain of thought that breaks the problem into sub‑tasks such as object detection, relation extraction, and logical aggregation.
  2. LLM Verifier (RL feedback) – A separate LLM evaluates the generated reasoning trace, scoring its logical consistency and relevance. The reasoning LLM is then fine‑tuned with reinforcement learning (RL) to maximize the verifier’s reward, encouraging clearer, more correct reasoning steps.
  3. Visual Grounding via VLM Verifier – The VLM predicts region proposals for the objects mentioned in the chain of thought. A VLM‑based critic automatically creates hard‑negative examples (e.g., swapping “left” with “right”) and trains the VLM to discriminate correct from incorrect grounding, all without ground‑truth boxes.
  4. Joint Optimization – The two verifiers operate in tandem: improved grounding feeds better visual evidence to the LLM’s reasoning, while cleaner reasoning guides the VLM to focus on the right regions. The loop iterates until convergence; a simplified sketch of one iteration appears below.
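
Putting the four steps together, the sketch below shows what one training iteration of such a verifier‑driven loop could look like. The model objects and method names (`generate`, `localize`, `score`, `rl_update`) are placeholders assumed for exposition; the paper’s concrete interfaces and choice of RL algorithm are not reproduced here.

```python
# Illustrative outer loop pairing an LLM verifier (reasoning reward) with a
# VLM verifier (grounding reward). Names and interfaces are placeholders;
# the paper's actual implementation and RL algorithm may differ.

def training_step(reasoner, grounder, llm_verifier, vlm_verifier, query, image):
    # 1. Query decomposition: the reasoner emits a chain of thought that
    #    names the objects and relations it needs grounded.
    trace = reasoner.generate(query)

    # 2. Visual grounding: the grounder proposes boxes for the named objects.
    boxes = grounder.localize(image, trace.object_phrases)

    # 3. LLM verifier scores the reasoning trace for consistency and
    #    relevance; this scalar becomes the RL reward for the reasoner.
    reasoning_reward = llm_verifier.score(query, trace)

    # 4. VLM verifier scores the proposed grounding against mined hard
    #    negatives (see the earlier sketch); this supervises the grounder.
    grounding_reward = vlm_verifier.score(image, trace.object_phrases, boxes)

    # 5. Joint optimization: each component is updated from its verifier's
    #    signal, with no human-annotated labels involved.
    reasoner.rl_update(trace, reward=reasoning_reward)
    grounder.update(boxes, reward=grounding_reward)
    return reasoning_reward, grounding_reward
```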

Results & Findings

  • Benchmark Gains: Valor outperforms the leading open‑source visual reasoning models (e.g., LLaVA, MiniGPT‑4) by 8–12% absolute accuracy on standard spatial reasoning datasets such as CLEVR‑Rel and GQA‑Spatial.
  • Grounding Improvement: The VLM verifier reduces average IoU error by ~15%, demonstrating that hard‑negative mining can replace manual annotation for training robust object detectors (a reference IoU computation is sketched after this list).
  • Efficiency: Because no human labels are required, the training cost is comparable to fine‑tuning a single model, yet the final system matches or exceeds the performance of multi‑stage pipelines that rely on large labeled corpora.
  • Generalization: Valor maintains its edge when transferred to out‑of‑distribution queries (e.g., novel object categories or unseen spatial configurations), indicating that the self‑supervised feedback loop learns transferable reasoning patterns.
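
For reference, the grounding metric mentioned above is the standard intersection‑over‑union; the snippet below is generic background code, not taken from the paper.

```python
# Standard intersection-over-union (IoU) for axis-aligned boxes (x1, y1, x2, y2).
def iou(box_a: tuple[float, float, float, float],
        box_b: tuple[float, float, float, float]) -> float:
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap rectangle (zero area if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

# Example: two partially overlapping 2x2 boxes share 1/7 of their union.
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # ≈ 0.143
```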

Practical Implications

  • Rapid Prototyping: Developers can build visual QA or robotics perception modules without spending weeks on dataset annotation—just feed the system a set of example queries.
  • Edge Deployment: Since the VLM verifier can be swapped for a lightweight vision model, Valor can be adapted to run on resource‑constrained devices while still benefiting from the LLM’s reasoning (see the interface sketch after this list).
  • Improved Human‑AI Interaction: Applications like visual assistants, AR navigation, or inventory management can ask “Where is the nearest fire extinguisher?” and receive answers that are both logically sound and correctly grounded, boosting user trust.
  • Open‑Source Ecosystem: The authors release code and pretrained checkpoints, enabling the community to extend the framework to other reasoning domains (temporal, causal) or to integrate proprietary LLMs/VLMs as needed.
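
One way the verifier swap mentioned under Edge Deployment could be organized is behind a small interface, so a lighter grounding model can replace the full VLM without touching the rest of the pipeline. The class and method names below are hypothetical and not part of the released code.

```python
# Hypothetical verifier interface for swapping the heavy VLM verifier with a
# lighter grounding model on edge devices. Names are illustrative only.
from typing import Protocol

class GroundingVerifier(Protocol):
    def score(self, image, phrase: str,
              box: tuple[float, float, float, float]) -> float:
        """Return a confidence that `box` depicts `phrase` in `image`."""
        ...

class HeavyVLMVerifier:
    """Full vision-language model verifier used during training."""
    def score(self, image, phrase, box) -> float:
        raise NotImplementedError("wrap the VLM of your choice here")

class LiteDetectorVerifier:
    """Lightweight open-vocabulary detector for resource-constrained inference."""
    def score(self, image, phrase, box) -> float:
        raise NotImplementedError("wrap a small detector here")

def build_verifier(edge: bool) -> GroundingVerifier:
    # Both classes satisfy the same protocol, so the caller never changes.
    return LiteDetectorVerifier() if edge else HeavyVLMVerifier()
```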

Limitations & Future Work

  • Reliance on Strong Pre‑Trained Models: The quality of the final system hinges on the baseline LLM and VLM; weaker models may not benefit as much from the verifier feedback.
  • Scalability of Hard‑Negative Mining: Although the pipeline is annotation‑free, generating and evaluating a large pool of hard negatives can become computationally expensive for very high‑resolution images.
  • Reasoning Scope: The current focus is on spatial relations; extending the approach to more abstract reasoning (e.g., causality, intent) will require richer verifier designs.
  • Future Directions: The authors plan to explore multi‑modal verifiers that incorporate audio or depth sensors, and to investigate curriculum‑style training where the difficulty of generated negatives adapts over time.

Authors

  • Damiano Marsili
  • Georgia Gkioxari

Paper Information

  • arXiv ID: 2512.08889v1
  • Categories: cs.CV, cs.AI
  • Published: December 9, 2025