[Paper] MedAI: Evaluating TxAgent's Therapeutic Agentic Reasoning in the NeurIPS CURE-Bench Competition
Source: arXiv - 2512.11682v1
Overview
The paper “MedAI: Evaluating TxAgent’s Therapeutic Agentic Reasoning in the NeurIPS CURE‑Bench Competition” describes how the authors built and rigorously tested TxAgent, an AI system that can reason step‑by‑step about medical treatment decisions. By coupling a fine‑tuned Llama‑3.1‑8B model with a suite of live biomedical tools (FDA Drug API, OpenTargets, Monarch), TxAgent demonstrates that agentic AI—models that can call external functions on the fly—can meet the high safety and accuracy demands of clinical decision support.
Key Contributions
- Agentic RAG architecture: Introduced TxAgent, which generates and executes function calls to a unified “ToolUniverse” for up‑to‑date therapeutic data.
- Fine‑tuned Llama‑3.1‑8B: Adapted a compact 8‑billion‑parameter model for multi‑step medical reasoning, keeping inference costs manageable for real‑world deployment.
- Novel evaluation protocol: Treated token‑level reasoning traces and tool‑usage sequences as explicit supervision signals, enabling fine‑grained metrics for correctness, tool selection, and reasoning quality.
- Retrieval‑quality analysis: Showed that the precision of tool‑retrieval (i.e., picking the right API call) directly correlates with overall task performance, and proposed a lightweight retrieval‑enhancement that boosted scores on the CURE‑Bench leaderboard.
- Open‑science award: Earned the Excellence Award in the NeurIPS 2025 CURE‑Bench competition and released code, data, and evaluation scripts for community reuse.
Methodology
- Prompt‑driven agentic loop – The model receives a clinical query (e.g., “Suggest a regimen for a patient with hypertension and chronic kidney disease”), first generates a textual plan, then decides which external tool to call (e.g., “search the FDA Drug API for ACE inhibitors”); a minimal code sketch of this loop appears after this list.
- ToolUniverse – A thin abstraction layer that normalizes three public biomedical services:
  - FDA Drug API for approved indications, dosages, and contraindications.
  - OpenTargets for disease‑gene‑drug associations.
  - Monarch for phenotype‑gene‑disease ontologies.
  The agent sends a JSON‑formatted request, receives structured results, and feeds them back into the next reasoning step.
- Fine‑tuning – The base Llama‑3.1‑8B model was further trained on a curated corpus of 200,000 synthetic doctor‑patient dialogues, each annotated with the correct sequence of tool calls. This supervision teaches the model when and how to invoke tools rather than just generating text.
- Evaluation on CURE‑Bench – The competition provides three benchmark tasks (drug recommendation, treatment planning, adverse‑effect prediction). The authors measured three signals (a scoring sketch follows this list):
  - Exact‑match correctness of the final answer.
  - Tool‑usage accuracy (did the model call the right API at the right time?).
  - Reasoning‑trace quality (alignment of intermediate steps with a gold‑standard chain‑of‑thought).
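To make the agentic loop concrete, here is a minimal Python sketch of a plan → tool call → observe cycle of this kind. The names `call_llm` and `dispatch_tool` and the JSON action schema are illustrative assumptions, not the authors' actual interfaces; the released code defines the real prompt and tool schemas.

```python
# Minimal sketch of the prompt-driven agentic loop described above.
# call_llm, dispatch_tool, and the JSON action schema are illustrative
# assumptions, not the authors' actual interfaces.
import json

def call_llm(messages: list) -> str:
    """Stand-in for the fine-tuned Llama-3.1-8B model. Returns either a
    JSON tool call (e.g. {"tool": ..., "arguments": ...}) or a final answer."""
    raise NotImplementedError("plug in your inference backend here")

def dispatch_tool(name: str, arguments: dict) -> dict:
    """Stand-in for ToolUniverse: routes the call to the FDA Drug API,
    OpenTargets, or Monarch and returns structured results."""
    raise NotImplementedError("plug in the tool layer here")

def agentic_loop(query: str, max_steps: int = 8) -> str:
    messages = [{"role": "user", "content": query}]
    for _ in range(max_steps):
        reply = call_llm(messages)
        try:
            action = json.loads(reply)
        except json.JSONDecodeError:
            action = None
        if isinstance(action, dict) and "tool" in action:
            # The model chose a tool: execute it and feed the result back.
            result = dispatch_tool(action["tool"], action.get("arguments", {}))
            messages.append({"role": "assistant", "content": reply})
            messages.append({"role": "tool", "content": json.dumps(result)})
        else:
            return reply  # plain text is treated as the final answer
    return "Step budget exhausted without a final answer."
```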
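Because the paper treats reasoning traces and tool sequences as explicit supervision, the three signals must be computable from logged traces. The sketch below shows one plausible set of definitions; the competition's exact scoring rules may differ (for instance, in how reasoning steps are aligned), so treat these as approximations rather than the official metrics.

```python
# Hedged sketch of the three evaluation signals. The competition's exact
# scoring rules may differ (e.g. in how steps are aligned), so treat these
# definitions as approximations, not the official metrics.

def exact_match(pred: str, gold: str) -> bool:
    """Exact-match correctness of the final answer."""
    return pred.strip().lower() == gold.strip().lower()

def tool_usage_accuracy(pred_calls: list, gold_calls: list) -> float:
    """Fraction of positions where the predicted tool matches the gold
    sequence (one plausible reading of 'right API at the right time')."""
    hits = sum(p == g for p, g in zip(pred_calls, gold_calls))
    return hits / max(len(gold_calls), 1)

def trace_f1(pred_steps: list, gold_steps: list) -> float:
    """Set-overlap F1 between predicted and gold chain-of-thought steps."""
    pred, gold = set(pred_steps), set(gold_steps)
    if not pred or not gold:
        return 0.0
    precision = len(pred & gold) / len(pred)
    recall = len(pred & gold) / len(gold)
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0
```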
Results & Findings
| Task | Exact‑match ↑ | Tool‑usage ↑ | Reasoning‑trace F1 ↑ |
|---|---|---|---|
| Drug Recommendation | 78.4 % | 92.1 % | 0.84 |
| Treatment Planning | 71.2 % | 89.5 % | 0.81 |
| Adverse‑Effect Prediction | 74.6 % | 90.3 % | 0.83 |
- Retrieval boost: Adding a lightweight BM25 pre‑filter before calling the APIs raised tool‑usage accuracy by ~3 points and overall exact‑match scores by 4–5 points (a minimal sketch of such a pre‑filter appears after this list).
- Error analysis: Most failures stemmed from incorrect tool sequencing (e.g., querying a drug database before confirming the disease indication). When the tool order matched the gold trace, correctness jumped >10 pts.
- Compute efficiency: Despite the iterative calls, average latency per query stayed under 1.2 seconds on an RTX 4090, making the system viable for interactive clinical decision support.
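A pre‑filter of this kind is straightforward to reproduce with the third‑party rank_bm25 package: rank the tool descriptions against the query and keep only the top candidates before the model chooses. The tool names and descriptions below are invented for illustration, not the paper's actual retrieval index.

```python
# Sketch of a lightweight BM25 pre-filter over tool descriptions: narrow the
# candidate tools before the model picks one. Uses the third-party rank_bm25
# package; the tool names and descriptions are invented examples.
from rank_bm25 import BM25Okapi

TOOLS = {
    "fda_drug_lookup": "FDA Drug API: approved indications, dosage, contraindications",
    "opentargets_assoc": "OpenTargets: disease-gene-drug association scores",
    "monarch_phenotype": "Monarch: phenotype-gene-disease ontology lookup",
}

corpus = [desc.lower().split() for desc in TOOLS.values()]
bm25 = BM25Okapi(corpus)

def prefilter_tools(query: str, k: int = 2) -> list:
    """Return the top-k tool names ranked by BM25 score against the query."""
    scores = bm25.get_scores(query.lower().split())
    ranked = sorted(zip(TOOLS, scores), key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in ranked[:k]]

print(prefilter_tools("contraindications for ACE inhibitors in kidney disease"))
```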
Practical Implications
- Clinical decision support (CDS) integration – TxAgent’s modular tool calls can be wrapped into existing EHR workflows, providing up‑to‑date drug information without hard‑coding static knowledge bases.
- Regulatory‑ready AI – By exposing every reasoning step and tool invocation, auditors can trace how a recommendation was derived, satisfying emerging AI‑in‑healthcare governance frameworks.
- Developer‑friendly SDK – The open‑source `tooluniverse` Python package abstracts API keys, rate limiting, and response parsing, letting developers plug TxAgent into telemedicine bots, pharmacy automation, or research pipelines with a few lines of code (an illustrative wrapper sketch follows this list).
- Scalable to other domains – The same agentic pattern (LLM + function calls + retrieval‑enhanced selection) can be repurposed for finance (regulatory compliance), cybersecurity (threat‑intel lookup), or any high‑stakes field where up‑to‑date external data is essential.
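As a sketch of what such an integration could look like, the snippet below wraps an agent behind a single function and keeps the full audit trail alongside the answer, in the spirit of the regulatory traceability discussed above. The `agent.run` streaming interface and the per‑step dictionaries are assumptions for illustration, not the actual `tooluniverse` API.

```python
# Hypothetical integration sketch: wrap TxAgent behind one function and keep
# the full audit trail next to the answer. The agent.run streaming interface
# and the step dictionaries are assumptions, not the tooluniverse API.
from dataclasses import dataclass, field

@dataclass
class AuditedRecommendation:
    answer: str
    trace: list = field(default_factory=list)  # every plan step and tool call

def recommend(agent, patient_query: str) -> AuditedRecommendation:
    rec = AuditedRecommendation(answer="")
    for step in agent.run(patient_query):      # assumed: yields a dict per step
        rec.trace.append(step)                 # retained for auditability
        if step.get("type") == "final_answer":
            rec.answer = step["content"]
    return rec
```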
Limitations & Future Work
- Scope of knowledge bases – TxAgent currently relies on three public APIs; coverage gaps (e.g., rare orphan drugs) can lead to incomplete recommendations.
- Hallucination risk in intermediate reasoning – Although tool calls ground the final answer, the model sometimes generates plausible but incorrect rationales before the API response arrives.
- Evaluation bias – CURE‑Bench uses synthetic patient cases; real‑world clinical validation (prospective trials, clinician usability studies) is still pending.
- Future directions – The authors outline expanding ToolUniverse to include pharmacogenomics databases, integrating a reinforcement‑learning loop that rewards correct tool sequencing, and conducting a multi‑center clinical pilot to measure the impact on prescribing safety.
Authors
- Tim Cofala
- Christian Kalfar
- Jingge Xiao
- Johanna Schrader
- Michelle Tang
- Wolfgang Nejdl
Paper Information
- arXiv ID: 2512.11682v1
- Categories: cs.AI, cs.LG
- Published: December 12, 2025
- PDF: https://arxiv.org/pdf/2512.11682v1