[Paper] ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning
Source: arXiv - 2512.05111v1
Overview
The paper introduces ARM‑Thinker, a new kind of multimodal reward model that can actively use external tools (such as image‑cropping utilities or document‑search APIs) to verify its own judgments. By turning reward scoring from a static, “black‑box” operation into an interactive, evidence‑driven process, ARM‑Thinker improves the visual grounding of its judgments, reduces hallucinations, and boosts performance on complex vision‑language tasks.
Key Contributions
- Agentic Reward Modeling – First reward model that autonomously decides when to call external tools, and which ones to use, during evaluation.
- Tool‑Integrated Training Pipeline – Multi‑stage reinforcement learning that jointly optimizes tool‑selection policies and reward accuracy.
- ARMBench‑VL Suite – New benchmark covering fine‑grained visual grounding, multi‑page document reasoning, and instruction‑following verification.
- Significant Performance Gains – +16.2 % average improvement on standard reward‑model benchmarks and +9.6 % on tool‑use tasks; state‑of‑the‑art results on multimodal math and logical reasoning datasets.
- Interpretability Boost – The model produces explicit tool‑call logs, giving developers a traceable “why” behind each reward score.
Methodology
- Agentic Architecture – ARM‑Thinker pairs a vision‑language encoder with a tool controller. Given an input (e.g., an image plus a question), the controller predicts whether a tool is needed and, if so, which one to invoke (a sketch of this decision loop follows the list).
- Tool Set – The authors integrate lightweight utilities such as:
  - Image cropping / zoom for inspecting small regions.
  - Document page retrieval for multi‑page PDFs or scanned books.
  - Textual verification APIs (e.g., spell‑check, fact‑check).
- Reinforcement Learning Loop – Training proceeds in three stages:
  1. Supervised pre‑training on human‑annotated reward scores.
  2. Tool‑policy fine‑tuning, where the model learns to call tools that maximize a downstream reward (e.g., correct answer verification).
  3. Joint RL that updates both the reward‑scoring head and the tool‑selection policy, using a reward signal that penalizes unnecessary tool calls and rewards correct, evidence‑based judgments (a reward‑shaping sketch also follows the list).
- Evaluation Protocol – For each benchmark item, ARM‑Thinker outputs a reward score and a tool‑call trace; both are compared against ground‑truth evidence to compute accuracy and interpretability metrics.
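The sketch below makes the agentic evaluation loop concrete. It is a minimal illustration under assumed interfaces: the `Decision`, `ToolCall`, and `RewardJudgment` classes, the `judge` function, and the `max_calls` cap are hypothetical names introduced here, not the paper's implementation.

```python
# Minimal sketch of the agentic evaluation loop described above.
# All names (Decision, ToolCall, RewardJudgment, judge, max_calls) are
# hypothetical; the paper does not publish this interface.
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

@dataclass
class Decision:
    tool_name: Optional[str]                    # None means "no tool needed"
    arguments: dict = field(default_factory=dict)

@dataclass
class ToolCall:
    name: str
    arguments: dict
    result: str

@dataclass
class RewardJudgment:
    score: float                                # scalar reward for the candidate answer
    trace: List[ToolCall] = field(default_factory=list)  # explicit tool-call log

def judge(image, question, candidate_answer,
          controller: Callable[..., Decision],  # decides whether/which tool to invoke
          score_head: Callable[..., float],     # maps inputs + evidence to a score
          tools: Dict[str, Callable[..., str]], # available utilities by name
          max_calls: int = 3) -> RewardJudgment:
    """Score one candidate answer, optionally gathering tool-based evidence first."""
    judgment = RewardJudgment(score=0.0)
    evidence: List[str] = []
    for _ in range(max_calls):
        decision = controller(image, question, candidate_answer, evidence)
        if decision.tool_name is None:          # controller deems evidence sufficient
            break
        result = tools[decision.tool_name](**decision.arguments)
        judgment.trace.append(ToolCall(decision.tool_name, decision.arguments, result))
        evidence.append(result)
    judgment.score = score_head(image, question, candidate_answer, evidence)
    return judgment
```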
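Stage 3 balances judgment accuracy against the cost of needless tool calls. Below is one way such a shaped reward could be written; the ±1 base reward, the notion of a "useful" call, and the penalty coefficient are assumptions for illustration, not values from the paper.

```python
# Hypothetical reward shaping for the joint RL stage: reward correct,
# evidence-based judgments and penalize tool calls that did not contribute.
def shaped_reward(judgment_correct: bool,
                  num_tool_calls: int,
                  num_useful_calls: int,
                  call_penalty: float = 0.1) -> float:
    """Scalar training reward for one evaluated item (illustrative only)."""
    base = 1.0 if judgment_correct else -1.0              # accuracy term
    unnecessary = max(num_tool_calls - num_useful_calls, 0)
    return base - call_penalty * unnecessary              # cost on wasted calls
```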
Results & Findings
| Benchmark | Baseline (static RM) | ARM‑Thinker | Δ (percentage points) |
|---|---|---|---|
| Fine‑grained visual grounding (image tool) | 68.4 % | 84.6 % | +16.2 |
| Multi‑page document reasoning (retrieval tool) | 71.1 % | 80.7 % | +9.6 |
| Instruction‑following verification (text tool) | 73.5 % | 79.2 % | +5.7 |
| Multimodal math & logic (MM‑Math) | 61.3 % | 70.8 % | +9.5 |
- Tool usage is selective: on average the model calls a tool for only 27 % of inputs, showing it learns to invoke tools only when needed.
- Interpretability: The tool‑call logs align with human reasoning in 84 % of cases, offering a clear audit trail.
- Robustness: When visual noise or ambiguous phrasing is introduced, ARM‑Thinker’s performance degrades far less than that of static reward models, confirming the benefit of on‑the‑fly verification.
Practical Implications
- More Reliable Vision‑Language APIs – Deploying ARM‑Thinker as a scoring layer can catch hallucinations before they reach end‑users, especially in high‑stakes domains like medical imaging or legal document analysis.
- Plug‑and‑Play Tool Integration – Developers can extend the tool library (e.g., OCR, GIS lookup) without retraining the entire model; the RL controller learns to incorporate new utilities with minimal data (see the registry sketch after this list).
- Audit‑Ready AI Systems – The explicit tool‑call trace satisfies compliance requirements for explainability, making it easier to certify AI services for regulated industries.
- Cost‑Effective Scaling – Because the model only calls expensive tools when necessary, inference budgets stay low while still achieving high accuracy on difficult cases.
- Foundation for Agentic LLMs – The architecture demonstrates a practical pathway to embed tool‑use capabilities directly into reward models, paving the way for more autonomous multimodal assistants.
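To make the extension path concrete, here is a minimal registry pattern a deployment might use to expose additional utilities (e.g., an OCR stub) to the tool controller. The `ToolRegistry` class and its methods are hypothetical, not the authors' API.

```python
# Hypothetical plug-and-play tool registry. Nothing here is the authors' API;
# it only illustrates how a deployment could expose new utilities (e.g., OCR)
# to the tool controller without touching the reward model's weights.
from typing import Callable, Dict

class ToolRegistry:
    def __init__(self) -> None:
        self._tools: Dict[str, Callable[..., str]] = {}

    def register(self, name: str, fn: Callable[..., str], description: str = "") -> None:
        """Add a utility; the description could be surfaced to the controller."""
        self._tools[name] = fn

    def call(self, name: str, **kwargs) -> str:
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        return self._tools[name](**kwargs)

# Example: adding an OCR stub alongside a built-in crop tool.
registry = ToolRegistry()
registry.register("crop", lambda image, box: f"cropped {box} from {image}",
                  "zoom into a region of the input image")
registry.register("ocr", lambda image: f"text extracted from {image}",
                  "read text out of an image or scanned page")
```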
Limitations & Future Work
- Tool Dependency – Performance hinges on the quality and availability of external tools; missing or poorly performing utilities can bottleneck the system.
- Training Complexity – Multi‑stage RL adds engineering overhead and requires careful tuning of the trade‑off between tool usage cost and reward gain.
- Generalization to Unseen Tools – While the controller can learn to select among known tools, extending to completely novel tool types still requires additional fine‑tuning.
- Scalability of Evidence Logs – For large‑scale deployments, storing and processing detailed tool‑call traces may become costly.
Future research directions include: expanding the tool repertoire (e.g., 3‑D model viewers, real‑time sensor feeds), investigating meta‑learning approaches for rapid adaptation to new tools, and integrating cost‑aware scheduling to further optimize inference budgets.
Authors
- Shengyuan Ding
- Xinyu Fang
- Ziyu Liu
- Yuhang Zang
- Yuhang Cao
- Xiangyu Zhao
- Haodong Duan
- Xiaoyi Dong
- Jianze Liang
- Bin Wang
- Conghui He
- Dahua Lin
- Jiaqi Wang
Paper Information
- arXiv ID: 2512.05111v1
- Categories: cs.CV
- Published: December 4, 2025