[Paper] ARM-Thinker: Reinforcing Multimodal Generative Reward Models with Agentic Tool Use and Visual Reasoning

Published: December 4, 2025 at 01:59 PM EST
4 min read

Source: arXiv - 2512.05111v1

Overview

The paper introduces ARM‑Thinker, a new kind of multimodal reward model that can actively use external tools (like image‑cropping utilities or document‑search APIs) to verify its own judgments. By turning reward scoring from a static, “black‑box” operation into an interactive, evidence‑driven process, the authors dramatically improve visual grounding, reduce hallucinations, and boost performance on complex vision‑language tasks.

Key Contributions

  • Agentic Reward Modeling – First reward model that autonomously decides when and which external tools to call during evaluation.
  • Tool‑Integrated Training Pipeline – Multi‑stage reinforcement learning that jointly optimizes tool‑selection policies and reward accuracy.
  • ARMBench‑VL Suite – New benchmark covering fine‑grained visual grounding, multi‑page document reasoning, and instruction‑following verification.
  • Significant Performance Gains – +16.2 % average improvement on standard reward‑model benchmarks and +9.6 % on tool‑use tasks; state‑of‑the‑art results on multimodal math and logical reasoning datasets.
  • Interpretability Boost – The model produces explicit tool‑call logs, giving developers a traceable “why” behind each reward score.

Methodology

  1. Agentic Architecture – ARM‑Thinker consists of a vision‑language encoder paired with a tool controller. Given an input (e.g., an image + question), the controller predicts whether a tool is needed and which one to invoke (a minimal sketch of this loop appears after this list).
  2. Tool Set – The authors integrate lightweight utilities such as:
    • Image cropping / zoom for inspecting small regions.
    • Document page retrieval for multi‑page PDFs or scanned books.
    • Textual verification APIs (e.g., spell‑check, fact‑check).
  3. Reinforcement Learning Loop – Training proceeds in three stages:
    • Supervised pre‑training on human‑annotated reward scores.
    • Tool‑policy fine‑tuning where the model learns to call tools that maximize a downstream reward (e.g., correct answer verification).
    • Joint RL that updates both the reward‑scoring head and the tool‑selection policy using a reward signal that penalizes unnecessary tool calls and rewards correct evidence‑based judgments (a simplified sketch of this shaped reward also follows the list).
  4. Evaluation Protocol – For each benchmark item, ARM‑Thinker outputs a reward score and a tool‑call trace, which is then compared against ground‑truth evidence to compute accuracy and interpretability metrics.
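
As a rough illustration of how these pieces could fit together, the sketch below implements a minimal agentic scoring loop: the controller repeatedly decides whether more evidence is needed and which tool to call, each tool result is appended to the evidence, and the reward head then scores the candidate response and returns the tool‑call trace. The controller/scorer interfaces, the tool registry, and all names here are assumptions for illustration only; the paper does not publish this API.

```python
# Minimal sketch of an agentic reward-scoring loop in the spirit of ARM-Thinker.
# All class and function names are hypothetical, not the paper's actual API.
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ToolCall:
    name: str
    args: dict
    result: str


@dataclass
class RewardOutput:
    score: float                                          # scalar reward for the candidate response
    trace: list[ToolCall] = field(default_factory=list)   # audit trail of tool calls


# Hypothetical tool registry: name -> callable returning textual evidence.
TOOLS: dict[str, Callable[..., str]] = {
    "crop_image": lambda image, box: f"cropped region {box} of {image}",
    "retrieve_page": lambda doc, page: f"text of page {page} in {doc}",
    "verify_text": lambda claim: f"verification result for: {claim}",
}


def score_with_tools(controller, scorer, sample, max_tool_calls: int = 3) -> RewardOutput:
    """Iteratively gather evidence via tools, then emit a reward score and a trace."""
    evidence: list[str] = []
    trace: list[ToolCall] = []
    for _ in range(max_tool_calls):
        # Controller decides whether more evidence is needed and which tool to invoke.
        # Assumed to return e.g. {"tool": "crop_image", "args": {...}} or None to stop.
        decision = controller.decide(sample, evidence)
        if decision is None:
            break
        result = TOOLS[decision["tool"]](**decision["args"])
        evidence.append(result)
        trace.append(ToolCall(decision["tool"], decision["args"], result))
    # Reward head scores the candidate response conditioned on the gathered evidence.
    score = scorer.score(sample, evidence)
    return RewardOutput(score=score, trace=trace)
```

The returned trace is what the evaluation protocol in step 4 compares against ground‑truth evidence.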
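The joint RL stage in step 3 shapes its reward to favor correct, evidence‑based judgments while discouraging gratuitous tool use. The paper's exact formulation is not reproduced in this summary, so the function below is a deliberately simple linear version of that idea; the penalty coefficient is an assumed value, not one reported by the authors.

```python
def shaped_reward(judgment_correct: bool, num_tool_calls: int,
                  tool_calls_needed: int, call_penalty: float = 0.05) -> float:
    """Hypothetical shaped reward for the joint RL stage.

    +1 for an evidence-based judgment that matches ground truth, 0 otherwise,
    minus a small penalty for each tool call beyond what the item required.
    """
    base = 1.0 if judgment_correct else 0.0
    unnecessary_calls = max(0, num_tool_calls - tool_calls_needed)
    return base - call_penalty * unnecessary_calls
```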

Results & Findings

| Benchmark | Baseline (static RM) | ARM‑Thinker | Δ Improvement |
| --- | --- | --- | --- |
| Fine‑grained visual grounding (image‑tool) | 68.4 % | 84.6 % | +16.2 % |
| Multi‑page document reasoning (retrieval‑tool) | 71.1 % | 80.7 % | +9.6 % |
| Instruction‑following verification (text‑tool) | 73.5 % | 79.2 % | +5.7 % |
| Multimodal math & logic (MM‑Math) | 61.3 % | 70.8 % | +9.5 % |
  • Tool usage is selective: on average the model calls a tool for only 27 % of inputs, showing it learns to invoke tools only when needed.
  • Interpretability: The tool‑call logs align with human reasoning in 84 % of cases, offering a clear audit trail.
  • Robustness: When visual noise or ambiguous phrasing is introduced, ARM‑Thinker’s performance degrades far less than static reward models, confirming the benefit of on‑the‑fly verification.

Practical Implications

  • More Reliable Vision‑Language APIs – Deploying ARM‑Thinker as a scoring layer can catch hallucinations before they reach end‑users, especially in high‑stakes domains like medical imaging or legal document analysis.
  • Plug‑and‑Play Tool Integration – Developers can extend the tool library (e.g., OCR, GIS lookup) without retraining the entire model; the RL controller learns to incorporate new utilities with minimal data (see the sketch after this list).
  • Audit‑Ready AI Systems – The explicit tool‑call trace satisfies compliance requirements for explainability, making it easier to certify AI services for regulated industries.
  • Cost‑Effective Scaling – Because the model only calls expensive tools when necessary, inference budgets stay low while still achieving high accuracy on difficult cases.
  • Foundation for Agentic LLMs – The architecture demonstrates a practical pathway to embed tool‑use capabilities directly into reward models, paving the way for more autonomous multimodal assistants.
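
As a concrete, hypothetical illustration of the plug‑and‑play point above: a new utility such as OCR could be registered in a tool library like the one in the earlier sketch without touching the reward model's weights; only the controller's tool‑selection policy would need light fine‑tuning to learn when the new tool is worth calling. The registry and OCR wrapper below are assumptions for illustration.

```python
from typing import Callable

# Hypothetical tool registry keyed by name (mirrors the earlier sketch).
TOOLS: dict[str, Callable[..., str]] = {}


def run_ocr(image: str) -> str:
    """Placeholder OCR wrapper; a real deployment would call an OCR engine here."""
    return f"OCR text extracted from {image}"


# Registering the new utility requires no change to the reward-model weights;
# only the controller's tool-selection policy is fine-tuned to use it.
TOOLS["run_ocr"] = run_ocr
```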

Limitations & Future Work

  • Tool Dependency – Performance hinges on the quality and availability of external tools; missing or poorly performing utilities can bottleneck the system.
  • Training Complexity – Multi‑stage RL adds engineering overhead and requires careful tuning of the trade‑off between tool usage cost and reward gain.
  • Generalization to Unseen Tools – While the controller can learn to select among known tools, extending to completely novel tool types still requires additional fine‑tuning.
  • Scalability of Evidence Logs – For large‑scale deployments, storing and processing detailed tool‑call traces may become storage‑intensive.

Future research directions include: expanding the tool repertoire (e.g., 3‑D model viewers, real‑time sensor feeds), investigating meta‑learning approaches for rapid adaptation to new tools, and integrating cost‑aware scheduling to further optimize inference budgets.

Authors

  • Shengyuan Ding
  • Xinyu Fang
  • Ziyu Liu
  • Yuhang Zang
  • Yuhang Cao
  • Xiangyu Zhao
  • Haodong Duan
  • Xiaoyi Dong
  • Jianze Liang
  • Bin Wang
  • Conghui He
  • Dahua Lin
  • Jiaqi Wang

Paper Information

  • arXiv ID: 2512.05111v1
  • Categories: cs.CV
  • Published: December 4, 2025
  • PDF: Download PDF