[Paper] AdaTooler-V: Adaptive Tool-Use for Images and Videos
Source: arXiv - 2512.16918v1
Overview
AdaTooler‑V is a new multimodal large language model (MLLM) that learns when to call visual analysis tools (e.g., OCR, object detectors) instead of invoking them indiscriminately. By pairing a reinforcement‑learning objective whose rewards scale with measured tool benefit with large‑scale curated training data, the model decides on the fly whether a visual question truly needs extra processing, cutting inference cost while improving accuracy on image‑ and video‑based reasoning tasks.
Key Contributions
- Adaptive tool‑use policy: Introduces AT‑GRPO, a reinforcement‑learning algorithm that scales rewards according to a Tool Benefit Score, encouraging the model to call vision tools only when they add measurable value.
- Two‑stage training data pipeline:
  - AdaTooler‑V‑CoT‑100k: a 100k‑example chain‑of‑thought (CoT) dataset for supervised fine‑tuning (SFT) that seeds the model with basic visual reasoning patterns.
  - AdaTooler‑V‑300k: a 300k‑example RL dataset with verified tool‑use outcomes across single‑image, multi‑image, and video scenarios.
- Broad benchmark coverage: Evaluated on 12 diverse visual‑reasoning benchmarks (including high‑resolution V* and video QA), consistently outperforming open‑source and commercial baselines.
- Open‑source release: Model weights (7B), training code, and datasets are publicly available, enabling reproducibility and downstream extensions.
Methodology
- Base MLLM – Starts from a standard 7B‑parameter multimodal backbone and augments it with a tool‑calling interface that can invoke external vision modules (OCR, object detection, frame‑level feature extractors).
- Tool Benefit Score (TBS) – For each training sample, a lightweight heuristic (e.g., improvement in answer confidence when a tool is used) quantifies how much a tool helps.
- AT‑GRPO – A GRPO‑style reinforcement‑learning loop (a minimal reward sketch follows this list) that:
  - Computes a reward = base correctness + α·TBS, where the scaling factor α is adjusted dynamically per sample.
  - Updates the policy so that high‑TBS samples receive stronger incentives to call the tool, while low‑TBS samples are penalized for unnecessary calls.
- Two‑phase training –
  1. Supervised fine‑tuning on the CoT‑100k set teaches the model to generate step‑by‑step reasoning and to emit a “use‑tool?” token.
  2. RL fine‑tuning on the AdaTooler‑V‑300k set refines the decision policy using the AT‑GRPO rewards.
- Inference – At runtime the model predicts a binary “tool‑needed?” flag before any heavy vision processing; if the flag is false, it proceeds with pure language reasoning, saving GPU time and reducing latency (the second sketch after this list illustrates this gate).
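To make the reward scaling concrete, here is a minimal Python sketch of the TBS heuristic and the AT‑GRPO‑style reward described above. The function names, the clamping, and the constant α and penalty values are illustrative assumptions; the paper adjusts the scaling per sample and computes TBS from its own heuristics.

```python
# Minimal sketch of the TBS-weighted reward. Names, clamping, and the fixed
# alpha/penalty values are assumptions for illustration, not the paper's code.

def tool_benefit_score(conf_with_tool: float, conf_without_tool: float) -> float:
    """Heuristic Tool Benefit Score: gain in answer confidence from using the
    tool, clamped to [0, 1] so an unhelpful tool contributes nothing."""
    return max(0.0, min(1.0, conf_with_tool - conf_without_tool))


def at_grpo_reward(correct: bool, called_tool: bool, tbs: float,
                   alpha: float = 0.5, unnecessary_call_penalty: float = 0.2) -> float:
    """Reward = base correctness + alpha * TBS for helpful tool calls,
    minus a penalty when the tool was called but added essentially no value."""
    reward = 1.0 if correct else 0.0
    if called_tool:
        reward += alpha * tbs              # stronger incentive when the tool helps
        if tbs < 0.05:                     # near-zero benefit: discourage the call
            reward -= unnecessary_call_penalty
    return reward
```

In the actual training loop these per‑sample rewards drive the policy update; the constant α here merely stands in for the paper's dynamic per‑sample scaling.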
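The inference‑time gate can be sketched in the same spirit. `model`, `vision_tools`, and their methods below are hypothetical placeholders rather than the released interface; the point is simply that the cheap binary decision happens before any heavy vision call.

```python
# Illustrative inference-time gate; `model` and `vision_tools` are hypothetical
# placeholders standing in for the released model and its vision modules.

def answer(model, vision_tools: dict, image, question: str) -> str:
    # Cheap binary decision emitted before any heavy vision processing.
    if not model.predict_tool_flag(image, question):
        # Pure language reasoning: no OCR/detector call, lower latency and memory.
        return model.generate(image, question)

    # Tool path: pick a module, run it, and condition the answer on its output.
    tool_name, tool_args = model.select_tool(image, question)
    observation = vision_tools[tool_name](image, **tool_args)
    return model.generate(image, question, tool_observation=observation)
```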
Results & Findings
| Benchmark | AdaTooler‑V‑7B | GPT‑4o | Gemini 1.5 Pro | Avg. Open‑Source |
|---|---|---|---|---|
| V* (high‑res) | 89.8 % | 86.4 % | 87.1 % | 78.3 % |
| Multi‑Image QA | 84.2 % | 80.1 % | 81.5 % | 72.9 % |
| Video QA (AVQA) | 81.7 % | 78.0 % | 79.3 % | 70.4 % |
| Avg. across 12 tasks | 86.5 % | 82.3 % | 83.0 % | 73.1 % |
- Inference efficiency: On average, AdaTooler‑V skips tool calls for ~38% of queries, cutting GPU memory usage by roughly 1.2× and latency by roughly 30% compared with a naïve “always‑call‑tool” baseline.
- Robustness: The adaptive policy remains stable across modalities (static images vs. video frames) and scales to higher resolutions without degradation.
Practical Implications
- Cost‑effective AI services – SaaS platforms that expose visual QA (e.g., document processing, visual search) can lower cloud compute bills by avoiding unnecessary OCR or detection calls.
- Edge deployment – On devices with limited compute (mobile, IoT), the model can decide locally whether to offload a heavy vision module to the cloud, optimizing bandwidth and battery life.
- Developer ergonomics – The open‑source tool‑calling API mirrors popular frameworks (LangChain, LlamaIndex), making it straightforward to plug in custom vision modules or replace the defaults with domain‑specific detectors (a hypothetical registration sketch follows this list).
- Rapid prototyping – The released CoT‑100k and RL‑300k datasets provide a ready‑made curriculum for fine‑tuning other LLMs on adaptive multimodal reasoning, accelerating research cycles.
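For illustration only, a plug‑in registry for custom vision tools might look like the sketch below; `register_tool` and the tool signature are assumptions, not the project's actual API.

```python
# Hypothetical plug-in registry for custom vision tools; `register_tool` and
# the tool signature are illustrative assumptions, not the released API.
from typing import Any, Callable, Dict

TOOLS: Dict[str, Callable[..., Any]] = {}

def register_tool(name: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Decorator that adds a vision callable to the tool registry under `name`."""
    def wrap(fn: Callable[..., Any]) -> Callable[..., Any]:
        TOOLS[name] = fn
        return fn
    return wrap

@register_tool("invoice_field_extractor")
def extract_invoice_fields(image: Any) -> Dict[str, Any]:
    """Stub for a domain-specific detector; swap in a real model here."""
    return {"vendor": None, "total": None}
```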
Limitations & Future Work
- Tool repertoire limited to pre‑defined vision modules – The current implementation only supports a fixed set of OCR, object detection, and frame‑level feature extractors. Extending to more specialized tools (e.g., medical imaging analysis) will require additional reward‑calibration work.
- Reward estimation relies on heuristics – The Tool Benefit Score is approximated using confidence gains; noisy or biased heuristics could misguide the RL signal in edge cases.
- Scalability to larger LLM backbones – Experiments were limited to a 7B model; it remains to be seen whether the adaptive policy transfers unchanged to 30B+ backbones.
- Real‑time video streams – While the model handles short video clips, continuous streaming scenarios (e.g., live surveillance) need a more sophisticated temporal budgeting strategy.
AdaTooler‑V demonstrates that smarter, context‑aware tool usage can close the performance gap with proprietary giants while keeping inference lean—a promising direction for the next generation of multimodal AI systems.
Authors
- Chaoyang Wang
- Kaituo Feng
- Dongyang Chen
- Zhongyu Wang
- Zhixun Li
- Sicheng Gao
- Meng Meng
- Xu Zhou
- Manyuan Zhang
- Yuzhang Shang
- Xiangyu Yue
Paper Information
- arXiv ID: 2512.16918v1
- Categories: cs.CV
- Published: December 18, 2025
- PDF: https://arxiv.org/pdf/2512.16918v1