[Paper] MedAI: Evaluating TxAgent's Therapeutic Agentic Reasoning in the NeurIPS CURE-Bench Competition
Source: arXiv - 2512.11682v1
Overview
The paper “MedAI: Evaluating TxAgent’s Therapeutic Agentic Reasoning in the NeurIPS CURE‑Bench Competition” describes how the authors built and rigorously tested TxAgent, an AI system that can reason step‑by‑step about medical treatment decisions. By coupling a fine‑tuned Llama‑3.1‑8B model with a suite of live biomedical tools (FDA Drug API, OpenTargets, Monarch), TxAgent demonstrates that agentic AI—models that can call external functions on the fly—can meet the high safety and accuracy demands of clinical decision support.
Key Contributions
- Agentic RAG architecture: Introduced TxAgent, which generates and executes function calls to a unified “ToolUniverse” for up‑to‑date therapeutic data.
- Fine‑tuned Llama‑3.1‑8B: Adapted a compact 8‑billion‑parameter model for multi‑step medical reasoning, keeping inference costs manageable for real‑world deployment.
- Novel evaluation protocol: Treated token‑level reasoning traces and tool‑usage sequences as explicit supervision signals, enabling fine‑grained metrics for correctness, tool selection, and reasoning quality.
- Retrieval‑quality analysis: Showed that the precision of tool‑retrieval (i.e., picking the right API call) directly correlates with overall task performance, and proposed a lightweight retrieval‑enhancement that boosted scores on the CURE‑Bench leaderboard.
- Open‑science award: Earned the Excellence Award in the NeurIPS 2025 CURE‑Bench competition and released code, data, and evaluation scripts for community reuse.
Methodology
- Prompt‑driven agentic loop – The model receives a clinical query (e.g., “Suggest a regimen for a patient with hypertension and chronic kidney disease”), first generates a textual plan, then decides which external tool to call (e.g., “search the FDA Drug API for ACE inhibitors”); a minimal code sketch of this loop appears after this list.
- ToolUniverse – A thin abstraction layer that normalizes three public biomedical services:
  - FDA Drug API for approved indications, dosages, and contraindications.
  - OpenTargets for disease‑gene‑drug associations.
  - Monarch for phenotype‑gene‑disease ontologies.
  The agent sends a JSON‑formatted request, receives structured results, and feeds them back into the next reasoning step.
- Fine‑tuning – The base Llama‑3.1‑8B model was further trained on a curated corpus of 200,000 synthetic doctor‑patient dialogues, each annotated with the correct sequence of tool calls. This supervision teaches the model when and how to invoke tools rather than just generating text.
- Evaluation on CURE‑Bench – The competition provides three benchmark tasks (drug recommendation, treatment planning, adverse‑effect prediction). The authors measured three signals (a scoring sketch follows this list):
  - Exact‑match correctness of the final answer.
  - Tool‑usage accuracy (did the model call the right API at the right time?).
  - Reasoning‑trace quality (alignment of intermediate steps with a gold‑standard chain‑of‑thought).
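To make the agentic loop concrete, here is a minimal Python sketch of a plan → tool call → observe cycle of this kind. The names `call_llm` and `dispatch_tool` and the JSON action schema are illustrative assumptions, not the authors' actual interfaces; the released code defines the real prompt and tool schemas.

```python
# Minimal sketch of the prompt-driven agentic loop described above.
# call_llm, dispatch_tool, and the JSON action schema are illustrative
# assumptions, not the authors' actual interfaces.
import json

def call_llm(messages: list) -> str:
    """Stand-in for the fine-tuned Llama-3.1-8B model. Returns either a
    JSON tool call (e.g. {"tool": ..., "arguments": ...}) or a final answer."""
    raise NotImplementedError("plug in your inference backend here")

def dispatch_tool(name: str, arguments: dict) -> dict:
    """Stand-in for ToolUniverse: routes the call to the FDA Drug API,
    OpenTargets, or Monarch and returns structured results."""
    raise NotImplementedError("plug in the tool layer here")

def agentic_loop(query: str, max_steps: int = 8) -> str:
    messages = [{"role": "user", "content": query}]
    for _ in range(max_steps):
        reply = call_llm(messages)
        try:
            action = json.loads(reply)
        except json.JSONDecodeError:
            action = None
        if isinstance(action, dict) and "tool" in action:
            # The model chose a tool: execute it and feed the result back.
            result = dispatch_tool(action["tool"], action.get("arguments", {}))
            messages.append({"role": "assistant", "content": reply})
            messages.append({"role": "tool", "content": json.dumps(result)})
        else:
            return reply  # plain text is treated as the final answer
    return "Step budget exhausted without a final answer."
```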
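Because the paper treats reasoning traces and tool sequences as explicit supervision, the three signals must be computable from logged traces. The sketch below shows one plausible set of definitions; the competition's exact scoring rules may differ (for instance, in how reasoning steps are aligned), so treat these as approximations rather than the official metrics.

```python
# Hedged sketch of the three evaluation signals. The competition's exact
# scoring rules may differ (e.g. in how steps are aligned), so treat these
# definitions as approximations, not the official metrics.

def exact_match(pred: str, gold: str) -> bool:
    """Exact-match correctness of the final answer."""
    return pred.strip().lower() == gold.strip().lower()

def tool_usage_accuracy(pred_calls: list, gold_calls: list) -> float:
    """Fraction of positions where the predicted tool matches the gold
    sequence (one plausible reading of 'right API at the right time')."""
    hits = sum(p == g for p, g in zip(pred_calls, gold_calls))
    return hits / max(len(gold_calls), 1)

def trace_f1(pred_steps: list, gold_steps: list) -> float:
    """Set-overlap F1 between predicted and gold chain-of-thought steps."""
    pred, gold = set(pred_steps), set(gold_steps)
    if not pred or not gold:
        return 0.0
    precision = len(pred & gold) / len(pred)
    recall = len(pred & gold) / len(gold)
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0
```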
Results & Findings
| Task | Exact‑match ↑ | Tool‑usage ↑ | Reasoning‑trace F1 ↑ |
|---|---|---|---|
| Drug Recommendation | 78.4 % | 92.1 % | 0.84 |
| Treatment Planning | 71.2 % | 89.5 % | 0.81 |
| Adverse‑Effect Prediction | 74.6 % | 90.3 % | 0.83 |
- Retrieval boost: Adding a lightweight BM25 pre‑filter before calling the APIs raised tool‑usage accuracy by ~3 points and overall exact‑match scores by 4–5 points (a minimal sketch of such a pre‑filter appears after this list).
- Error analysis: Most failures stemmed from incorrect tool sequencing (e.g., querying a drug database before confirming the disease indication). When the tool order matched the gold trace, correctness jumped >10 pts.
- Compute efficiency: Despite the iterative calls, average latency per query stayed under 1.2 seconds on an RTX 4090, making the system viable for interactive clinical decision support.
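A pre‑filter of this kind is straightforward to reproduce with the third‑party rank_bm25 package: rank the tool descriptions against the query and keep only the top candidates before the model chooses. The tool names and descriptions below are invented for illustration, not the paper's actual retrieval index.

```python
# Sketch of a lightweight BM25 pre-filter over tool descriptions: narrow the
# candidate tools before the model picks one. Uses the third-party rank_bm25
# package; the tool names and descriptions are invented examples.
from rank_bm25 import BM25Okapi

TOOLS = {
    "fda_drug_lookup": "FDA Drug API: approved indications, dosage, contraindications",
    "opentargets_assoc": "OpenTargets: disease-gene-drug association scores",
    "monarch_phenotype": "Monarch: phenotype-gene-disease ontology lookup",
}

corpus = [desc.lower().split() for desc in TOOLS.values()]
bm25 = BM25Okapi(corpus)

def prefilter_tools(query: str, k: int = 2) -> list:
    """Return the top-k tool names ranked by BM25 score against the query."""
    scores = bm25.get_scores(query.lower().split())
    ranked = sorted(zip(TOOLS, scores), key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in ranked[:k]]

print(prefilter_tools("contraindications for ACE inhibitors in kidney disease"))
```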
Practical Implications
- Clinical decision support (CDS) integration – TxAgent’s modular tool calls can be wrapped into existing EHR workflows, providing up‑to‑date drug information without hard‑coding static knowledge bases.
- Regulatory‑ready AI – By exposing every reasoning step and tool invocation, auditors can trace how a recommendation was derived, satisfying emerging AI‑in‑healthcare governance frameworks.
- Developer‑friendly SDK – The open‑source `tooluniverse` Python package abstracts API keys, rate limiting, and response parsing, letting developers plug TxAgent into telemedicine bots, pharmacy automation, or research pipelines with a few lines of code (an illustrative wrapper sketch follows this list).
- Scalable to other domains – The same agentic pattern (LLM + function calls + retrieval‑enhanced selection) can be repurposed for finance (regulatory compliance), cybersecurity (threat‑intel lookup), or any high‑stakes field where up‑to‑date external data is essential.
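As a sketch of what such an integration could look like, the snippet below wraps an agent behind a single function and keeps the full audit trail alongside the answer, in the spirit of the regulatory traceability discussed above. The `agent.run` streaming interface and the per‑step dictionaries are assumptions for illustration, not the actual `tooluniverse` API.

```python
# Hypothetical integration sketch: wrap TxAgent behind one function and keep
# the full audit trail next to the answer. The agent.run streaming interface
# and the step dictionaries are assumptions, not the tooluniverse API.
from dataclasses import dataclass, field

@dataclass
class AuditedRecommendation:
    answer: str
    trace: list = field(default_factory=list)  # every plan step and tool call

def recommend(agent, patient_query: str) -> AuditedRecommendation:
    rec = AuditedRecommendation(answer="")
    for step in agent.run(patient_query):      # assumed: yields a dict per step
        rec.trace.append(step)                 # retained for auditability
        if step.get("type") == "final_answer":
            rec.answer = step["content"]
    return rec
```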
Limitations & Future Work
- Scope of knowledge bases – TxAgent currently relies on three public APIs; coverage gaps (e.g., rare orphan drugs) can lead to incomplete recommendations.
- Hallucination risk in intermediate reasoning – Although tool calls ground the final answer, the model sometimes generates plausible but incorrect rationales before the API response arrives.
- Evaluation bias – CURE‑Bench uses synthetic patient cases; real‑world clinical validation (prospective trials, clinician usability studies) is still pending.
- Future directions – The authors outline expanding ToolUniverse to include pharmacogenomics databases, integrating a reinforcement‑learning loop that rewards correct tool sequencing, and conducting a multi‑center clinical pilot to measure the impact on prescribing safety.
Authors
- Tim Cofala
- Christian Kalfar
- Jingge Xiao
- Johanna Schrader
- Michelle Tang
- Wolfgang Nejdl
Paper Information
- arXiv ID: 2512.11682v1
- Categories: cs.AI, cs.LG
- Published: December 12, 2025
- PDF: https://arxiv.org/pdf/2512.11682v1