[Paper] AdaTooler-V: Adaptive Tool-Use for Images and Videos
Source: arXiv - 2512.16918v1
Overview
AdaTooler‑V is a new multimodal large language model (MLLM) that learns when to call visual analysis tools (e.g., OCR, object detectors) instead of invoking them indiscriminately. By pairing a reinforcement‑learning objective whose rewards scale with measured tool benefit with large‑scale curated training data, the model decides on the fly whether a visual question truly needs extra processing, cutting inference cost while improving accuracy on image‑ and video‑based reasoning tasks.
Key Contributions
- Adaptive tool‑use policy: Introduces AT‑GRPO, a reinforcement‑learning algorithm that scales rewards according to a Tool Benefit Score, encouraging the model to call vision tools only when they add measurable value.
- Two‑stage training data pipeline:
  - AdaTooler‑V‑CoT‑100k: a 100k‑example chain‑of‑thought (CoT) dataset for supervised fine‑tuning (SFT) that seeds the model with basic visual reasoning patterns.
  - AdaTooler‑V‑300k: a 300k‑example RL dataset with verified tool‑use outcomes across single‑image, multi‑image, and video scenarios.
- Broad benchmark coverage: Evaluated on 12 diverse visual‑reasoning benchmarks (including high‑resolution V* and video QA), consistently outperforming open‑source and commercial baselines.
- Open‑source release: Model weights (7B), training code, and datasets are publicly available, enabling reproducibility and downstream extensions.
Methodology
- Base MLLM – Starts from a standard 7B‑parameter multimodal backbone and augments it with a tool‑calling interface that can invoke external vision modules (OCR, object detection, frame‑level feature extractors).
- Tool Benefit Score (TBS) – For each training sample, a lightweight heuristic (e.g., improvement in answer confidence when a tool is used) quantifies how much a tool helps.
- AT‑GRPO – A GRPO‑style reinforcement‑learning loop (a minimal reward sketch follows this list) that:
  - Computes a reward = base correctness + α·TBS, where the scaling factor α is adjusted dynamically per sample.
  - Updates the policy so that high‑TBS samples receive stronger incentives to call the tool, while low‑TBS samples are penalized for unnecessary calls.
- Two‑phase training –
  1. Supervised fine‑tuning on the CoT‑100k set teaches the model to generate step‑by‑step reasoning and to emit a “use‑tool?” token.
  2. RL fine‑tuning on the AdaTooler‑V‑300k set refines the decision policy using the AT‑GRPO rewards.
- Inference – At runtime the model predicts a binary “tool‑needed?” flag before any heavy vision processing; if the flag is false, it proceeds with pure language reasoning, saving GPU time and reducing latency (the second sketch after this list illustrates this gate).
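To make the reward scaling concrete, here is a minimal Python sketch of the TBS heuristic and the AT‑GRPO‑style reward described above. The function names, the clamping, and the constant α and penalty values are illustrative assumptions; the paper adjusts the scaling per sample and computes TBS from its own heuristics.

```python
# Minimal sketch of the TBS-weighted reward. Names, clamping, and the fixed
# alpha/penalty values are assumptions for illustration, not the paper's code.

def tool_benefit_score(conf_with_tool: float, conf_without_tool: float) -> float:
    """Heuristic Tool Benefit Score: gain in answer confidence from using the
    tool, clamped to [0, 1] so an unhelpful tool contributes nothing."""
    return max(0.0, min(1.0, conf_with_tool - conf_without_tool))


def at_grpo_reward(correct: bool, called_tool: bool, tbs: float,
                   alpha: float = 0.5, unnecessary_call_penalty: float = 0.2) -> float:
    """Reward = base correctness + alpha * TBS for helpful tool calls,
    minus a penalty when the tool was called but added essentially no value."""
    reward = 1.0 if correct else 0.0
    if called_tool:
        reward += alpha * tbs              # stronger incentive when the tool helps
        if tbs < 0.05:                     # near-zero benefit: discourage the call
            reward -= unnecessary_call_penalty
    return reward
```

In the actual training loop these per‑sample rewards drive the policy update; the constant α here merely stands in for the paper's dynamic per‑sample scaling.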
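The inference‑time gate can be sketched in the same spirit. `model`, `vision_tools`, and their methods below are hypothetical placeholders rather than the released interface; the point is simply that the cheap binary decision happens before any heavy vision call.

```python
# Illustrative inference-time gate; `model` and `vision_tools` are hypothetical
# placeholders standing in for the released model and its vision modules.

def answer(model, vision_tools: dict, image, question: str) -> str:
    # Cheap binary decision emitted before any heavy vision processing.
    if not model.predict_tool_flag(image, question):
        # Pure language reasoning: no OCR/detector call, lower latency and memory.
        return model.generate(image, question)

    # Tool path: pick a module, run it, and condition the answer on its output.
    tool_name, tool_args = model.select_tool(image, question)
    observation = vision_tools[tool_name](image, **tool_args)
    return model.generate(image, question, tool_observation=observation)
```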
Results & Findings
| Benchmark | AdaTooler‑V‑7B | GPT‑4o | Gemini 1.5 Pro | Avg. Open‑Source |
|---|---|---|---|---|
| V* (high‑res) | 89.8 % | 86.4 % | 87.1 % | 78.3 % |
| Multi‑Image QA | 84.2 % | 80.1 % | 81.5 % | 72.9 % |
| Video QA (AVQA) | 81.7 % | 78.0 % | 79.3 % | 70.4 % |
| Avg. across 12 tasks | 86.5 % | 82.3 % | 83.0 % | 73.1 % |
- Inference efficiency: On average, AdaTooler‑V skips tool calls for ~38% of queries, cutting GPU memory usage by roughly 1.2× and latency by roughly 30% compared with a naïve “always‑call‑tool” baseline.
- Robustness: The adaptive policy remains stable across modalities (static images vs. video frames) and scales to higher resolutions without degradation.
Practical Implications
- Cost‑effective AI services – SaaS platforms that expose visual QA (e.g., document processing, visual search) can lower cloud compute bills by avoiding unnecessary OCR or detection calls.
- Edge deployment – On devices with limited compute (mobile, IoT), the model can decide locally whether to offload a heavy vision module to the cloud, optimizing bandwidth and battery life.
- Developer ergonomics – The open‑source tool‑calling API mirrors popular frameworks (LangChain, LlamaIndex), making it straightforward to plug in custom vision modules or replace the defaults with domain‑specific detectors (a hypothetical registration sketch follows this list).
- Rapid prototyping – The released CoT‑100k and RL‑300k datasets provide a ready‑made curriculum for fine‑tuning other LLMs on adaptive multimodal reasoning, accelerating research cycles.
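For illustration only, a plug‑in registry for custom vision tools might look like the sketch below; `register_tool` and the tool signature are assumptions, not the project's actual API.

```python
# Hypothetical plug-in registry for custom vision tools; `register_tool` and
# the tool signature are illustrative assumptions, not the released API.
from typing import Any, Callable, Dict

TOOLS: Dict[str, Callable[..., Any]] = {}

def register_tool(name: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Decorator that adds a vision callable to the tool registry under `name`."""
    def wrap(fn: Callable[..., Any]) -> Callable[..., Any]:
        TOOLS[name] = fn
        return fn
    return wrap

@register_tool("invoice_field_extractor")
def extract_invoice_fields(image: Any) -> Dict[str, Any]:
    """Stub for a domain-specific detector; swap in a real model here."""
    return {"vendor": None, "total": None}
```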
Limitations & Future Work
- Tool repertoire limited to pre‑defined vision modules – The current implementation only supports a fixed set of OCR, object detection, and frame‑level feature extractors. Extending to more specialized tools (e.g., medical imaging analysis) will require additional reward‑calibration work.
- Reward estimation relies on heuristics – The Tool Benefit Score is approximated using confidence gains; noisy or biased heuristics could misguide the RL signal in edge cases.
- Scalability to larger LLM backbones – Experiments were limited to a 7B model; it remains to be seen whether the adaptive policy transfers unchanged to 30B+ backbones.
- Real‑time video streams – While the model handles short video clips, continuous streaming scenarios (e.g., live surveillance) need a more sophisticated temporal budgeting strategy.
AdaTooler‑V demonstrates that smarter, context‑aware tool usage can close the performance gap with proprietary giants while keeping inference lean—a promising direction for the next generation of multimodal AI systems.
Authors
- Chaoyang Wang
- Kaituo Feng
- Dongyang Chen
- Zhongyu Wang
- Zhixun Li
- Sicheng Gao
- Meng Meng
- Xu Zhou
- Manyuan Zhang
- Yuzhang Shang
- Xiangyu Yue
Paper Information
- arXiv ID: 2512.16918v1
- Categories: cs.CV
- Published: December 18, 2025
- PDF: https://arxiv.org/pdf/2512.16918v1