[Paper] Picking the Right Specialist: Attentive Neural Process-based Selection of Task-Specialized Models as Tools for Agentic Healthcare Systems

Published: February 16, 2026 (11:36 AM EST)
Source: arXiv - 2602.14901v1

Overview

The paper introduces ToolSelect, a learning‑based system that lets an AI “agent” pick the most suitable specialist model (or “tool”) for a given clinical query. By treating model selection as a learned task and leveraging an Attentive Neural Process, the authors show how to automatically route each request to the specialist that will perform best—crucial for complex, multi‑task healthcare AI that must juggle diagnosis, image localization, report generation, and visual‑question‑answering.

Key Contributions

  • ToolSelect framework: a novel selector that conditions on both the input query and concise behavioral summaries of each candidate model, using an Attentive Neural Process to predict the optimal tool.
  • Consistent surrogate loss: formulation of a population‑risk minimization objective that approximates the true task‑conditional selection loss, enabling stable training.
  • First agentic chest‑X‑ray testbed: a comprehensive environment containing 55 heterogeneous specialist models (disease detection, report generation, visual grounding, VQA).
  • ToolSelectBench: a benchmark of 1,448 realistic clinical queries spanning four task families, with ground‑truth “best‑tool” labels.
  • Empirical superiority: ToolSelect outperforms ten state‑of‑the‑art baselines (including ensemble methods, meta‑learners, and reinforcement‑learning selectors) across all tasks.

Methodology

  1. Tool pool & summaries: Each specialist model is pre‑trained on a specific task (e.g., detecting pneumonia, generating radiology reports). For every model, a lightweight “behavioral summary” is computed—statistics such as confidence distributions, past performance on similar inputs, and feature embeddings.
  2. Attentive Neural Process (ANP) selector:
    • Context: The query (e.g., a chest X‑ray image plus a textual prompt) is encoded with a CNN + Transformer backbone.
    • Target: The set of model summaries acts as target points.
    • Attention: The ANP attends to the most relevant summaries given the query, producing a distribution over tools.
  3. Training objective: The selector is trained to minimize a surrogate loss that approximates the expected task loss if the chosen tool were used. This surrogate is consistent—optimizing it provably drives the selector toward the true optimal tool selection policy.
  4. Evaluation pipeline: On the new Chest X‑ray environment, each query is passed through ToolSelect, which selects a tool; the chosen tool’s output is then scored against the ground‑truth answer.
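The core of steps 1–3 is scaled dot‑product attention from the query embedding to the tool summaries, producing a distribution over tools. The sketch below is not the authors' implementation; it is a minimal NumPy illustration in which the summary embeddings, their dimensionality, and the one‑hot toy pool are all assumptions made for clarity.

```python
import numpy as np

def select_tool(query_emb, tool_summaries, temperature=1.0):
    """Attend from a query embedding to behavioral-summary embeddings.

    query_emb:      (d,) embedding of the clinical query
    tool_summaries: (n_tools, d) one summary embedding per specialist
    Returns (softmax weights over tools, index of the selected tool).
    """
    d = query_emb.size
    # Scaled dot-product attention scores, one per candidate tool
    scores = tool_summaries @ query_emb / (temperature * np.sqrt(d))
    scores = scores - scores.max()            # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights, int(np.argmax(weights))

# Toy pool: 5 specialists with orthogonal (one-hot) summary embeddings
n_tools, d = 5, 16
summaries = np.eye(n_tools, d)
# A query whose embedding resembles specialist 2's behavioral summary
query = summaries[2] + 0.05 * np.ones(d)

weights, choice = select_tool(query, summaries)
```

In the full ANP, the attention output parameterizes a latent distribution rather than a plain softmax, and training minimizes the consistent surrogate loss described in step 3; this sketch only shows the routing mechanics.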

Results & Findings

| Task Family | Baseline Avg. Accuracy | ToolSelect Accuracy |
|---|---|---|
| Disease Detection (17 models) | 71.2 % | 78.9 % |
| Report Generation (19 models) | 62.5 % | 70.3 % |
| Visual Grounding (6 models) | 68.0 % | 75.4 % |
| VQA (13 models) | 64.1 % | 71.8 % |
  • ToolSelect consistently beats the strongest baseline by 6–9 percentage points across all families.
  • Ablation studies show that removing the attention mechanism or the behavioral summaries drops performance by ~4 pp, confirming their importance.
  • The selector remains lightweight (≈ 2 M parameters) and adds < 15 ms latency per query, making it viable for real‑time clinical pipelines.

Practical Implications

  • Dynamic tool orchestration: Healthcare AI platforms can now automatically delegate each request to the model that is empirically best for that specific case, improving diagnostic accuracy without manual model management.
  • Scalable multi‑task systems: As new specialist models (e.g., for emerging diseases) are added, ToolSelect can incorporate them simply by generating their summaries—no retraining of the entire system is required.
  • Reduced inference cost: By selecting a single optimal tool rather than running an ensemble of all models, computational load and cloud costs drop dramatically.
  • Regulatory compliance: Transparent selection logic (the attention weights over model summaries) can be logged for audit trails, helping meet medical AI governance standards.
  • Developer workflow: Engineers can plug any PyTorch/TensorFlow model into the pool, expose its summary API, and immediately benefit from the selector, accelerating prototyping of agentic health assistants.
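The plug‑in workflow above can be pictured as a small registry: each specialist is wrapped with an inference callable and its behavioral summary, and the selector only ever consumes the summaries. The interface names below (`ToolEntry`, `ToolPool`, `register`) are hypothetical, not an API from the paper.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class ToolEntry:
    name: str
    predict: Callable[[str], str]   # wraps any framework's inference call
    summary: Dict[str, float]       # behavioral summary the selector attends to

@dataclass
class ToolPool:
    tools: List[ToolEntry] = field(default_factory=list)

    def register(self, name: str, predict: Callable[[str], str],
                 summary: Dict[str, float]) -> None:
        """Add a specialist; only its summary is needed, no selector retraining."""
        self.tools.append(ToolEntry(name, predict, summary))

    def summaries(self) -> List[Dict[str, float]]:
        return [t.summary for t in self.tools]

pool = ToolPool()
pool.register("pneumonia-detector",
              predict=lambda image_path: "pneumonia: 0.91",
              summary={"mean_confidence": 0.88, "val_accuracy": 0.79})
pool.register("report-generator",
              predict=lambda image_path: "No acute findings.",
              summary={"mean_confidence": 0.72, "val_accuracy": 0.70})
```

Keeping the selector decoupled from each tool's framework (PyTorch, TensorFlow, or a remote API) is what makes adding a new specialist a summary‑generation step rather than a system retrain.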

Limitations & Future Work

  • Dependence on summary quality: The selector’s performance hinges on informative behavioral summaries; poorly calibrated summaries can mislead the attention mechanism.
  • Static pool assumption: The current setup assumes a fixed set of specialist models during training; handling truly online addition/removal of tools remains an open challenge.
  • Domain specificity: Benchmarks are limited to chest X‑ray tasks; extending to other imaging modalities (CT, MRI) or non‑visual data (EHR notes) will test generality.
  • Explainability: While attention weights provide some insight, deeper interpretability of why a particular tool was chosen is still needed for high‑stakes clinical decisions.

Overall, ToolSelect offers a practical, data‑driven solution for orchestrating heterogeneous AI specialists in agentic healthcare systems, paving the way for more reliable and efficient clinical AI assistants.

Authors

  • Pramit Saha
  • Joshua Strong
  • Mohammad Alsharid
  • Divyanshu Mishra
  • J. Alison Noble

Paper Information

  • arXiv ID: 2602.14901v1
  • Categories: cs.LG, cs.AI, cs.CV, cs.MA
  • Published: February 16, 2026
