[Paper] A prospective clinical feasibility study of a conversational diagnostic AI in an ambulatory primary care clinic

Published: March 9, 2026 at 10:43 AM EDT
5 min read
Source: arXiv - 2603.08448v1

Overview

A new prospective study puts an LLM‑driven conversational assistant, Articulate Medical Intelligence Explorer (AMIE), into the waiting room of an academic urgent‑care clinic. One hundred adult patients used a text‑chat interface through which the system took their history and offered a list of possible diagnoses before they saw their primary‑care provider (PCP). The trial shows that, with real‑time human safety monitoring, the system can operate safely, earn high patient satisfaction, and produce differential diagnoses that closely match those of clinicians.

Key Contributions

  • First real‑world feasibility trial of a large‑language‑model (LLM) diagnostic chatbot integrated into an ambulatory primary‑care workflow.
  • Safety supervision framework that monitors every patient‑AI interaction in real time, with zero forced terminations.
  • Quantitative performance metrics: 90 % inclusion of the eventual diagnosis in AMIE’s differential, 75 % top‑3 accuracy, and comparable overall diagnostic quality to PCPs.
  • Positive user experience: patients reported high satisfaction and a statistically significant increase in trust toward AI after the chat.
  • Clinician impact: PCPs found the AI’s output useful for pre‑visit preparation, though they rated the AI lower on practicality and cost‑effectiveness of management plans.

Methodology

  1. Study Design – Prospective, single‑arm feasibility study at a leading academic medical center’s urgent‑care clinic.
  2. Participants – 100 adult patients scheduled for an in‑person appointment; each completed a text‑based chat with AMIE up to five days before the visit.
  3. AI System – AMIE is built on a state‑of‑the‑art LLM fine‑tuned for medical dialogue, capable of eliciting history, suggesting a differential diagnosis (DDx), and proposing management (Mx) steps.
  4. Safety Oversight – Dedicated human supervisors watched all chats live, ready to intervene if predefined safety triggers (e.g., unsafe advice, missed red flags) occurred. No interventions were needed.
  5. Evaluation
    • Safety & Quality – Post‑chat surveys, chart review 8 weeks later, and blinded expert rating of DDx and Mx plans.
    • User Experience – Patient satisfaction questionnaires (pre‑ and post‑interaction) and clinician feedback on usefulness.
    • Performance Metrics – Inclusion of final diagnosis in AI DDx, top‑k accuracy, and statistical comparison to PCP‑generated DDx/Mx.
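The two headline diagnostic metrics above — whether the final diagnosis appears anywhere in the AI's differential, and whether it appears in the top k entries — can be computed directly from ranked DDx lists. A minimal sketch (the case data below is hypothetical, not from the study):

```python
def ddx_metrics(cases, k=3):
    """cases: list of (final_diagnosis, ranked_ddx_list) pairs.

    Returns (inclusion_rate, top_k_accuracy): the fraction of cases
    where the final diagnosis appears anywhere in the differential,
    and the fraction where it appears among the first k entries.
    """
    n = len(cases)
    included = sum(final in ddx for final, ddx in cases)
    top_k = sum(final in ddx[:k] for final, ddx in cases)
    return included / n, top_k / n

# Hypothetical cases for illustration only
cases = [
    ("viral pharyngitis", ["viral pharyngitis", "strep throat", "mononucleosis"]),
    ("ankle sprain", ["ankle fracture", "ankle sprain", "tendon injury"]),
    ("migraine", ["tension headache", "sinusitis", "cluster headache", "migraine"]),
    ("urinary tract infection", ["urinary tract infection", "pyelonephritis", "vaginitis"]),
]

inclusion, top3 = ddx_metrics(cases)
print(f"Inclusion rate: {inclusion:.0%}, top-3 accuracy: {top3:.0%}")
# → Inclusion rate: 100%, top-3 accuracy: 75%
```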

Results & Findings

| Metric | Result (AMIE vs. PCP reference) |
| --- | --- |
| Final diagnosis appears in DDx | 90% (AMIE) |
| Top‑3 DDx accuracy | 75% (AMIE) |
| Overall DDx quality (blinded rating) | No significant difference (p = 0.6) |
| Management plan safety & appropriateness | No significant difference (p = 0.1, p = 1.0) |
| Practicality of Mx plan | PCP superior (p = 0.003) |
| Cost‑effectiveness of Mx plan | PCP superior (p = 0.004) |
| Patient satisfaction (post‑chat) | High (significant uplift in attitude toward AI, p < 0.001) |
| Clinician perceived usefulness | Positive impact on preparedness |

Interpretation: AMIE can reliably generate a clinically relevant differential diagnosis and safe management suggestions comparable to a human PCP, while being well‑received by patients. Clinicians still view the AI‑generated management plans as less practical and cost‑effective, indicating room for refinement.
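The blinded comparisons above are paired ratings of the same cases under two conditions (AI vs. PCP). The paper's exact statistical method is not detailed in this summary, so as an illustrative stand‑in only, here is an exact two‑sided sign test on hypothetical paired quality ratings:

```python
from math import comb

def sign_test(ai_scores, pcp_scores):
    """Exact two-sided sign test on paired ratings; ties are dropped."""
    diffs = [a - b for a, b in zip(ai_scores, pcp_scores) if a != b]
    n = len(diffs)
    wins = sum(d > 0 for d in diffs)          # pairs where AI rated higher
    k = min(wins, n - wins)                   # tail count for two-sided test
    # P(X <= k) under Binomial(n, 0.5), doubled for two-sidedness
    p = sum(comb(n, i) for i in range(k + 1)) * 2 / 2**n
    return min(p, 1.0)

# Hypothetical 1-5 quality ratings for ten cases (not study data)
ai  = [4, 5, 3, 4, 4, 5, 3, 4, 4, 3]
pcp = [4, 4, 4, 4, 5, 5, 3, 4, 3, 4]
print(f"p = {sign_test(ai, pcp):.3f}")
# → p = 1.000  (no detectable difference in this toy sample)
```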

Practical Implications

  • Pre‑visit triage augmentation – Clinics could deploy a chatbot to collect structured histories, freeing up clinician time for physical exams and complex decision‑making.
  • Decision‑support for busy providers – The AI’s DDx list can serve as a “second opinion,” helping clinicians consider less obvious conditions early.
  • Patient engagement & education – Interactive symptom collection may improve health literacy and patient confidence, especially for tech‑savvy populations.
  • Safety‑by‑design workflow – The study’s real‑time supervision model offers a blueprint for regulatory‑compliant AI rollouts in healthcare settings.
  • Potential cost savings – Automating routine history taking could reduce administrative overhead, though the current management recommendations need further optimization to realize full economic benefit.

Limitations & Future Work

  • Single‑site, single‑arm design limits generalizability; multi‑center trials are needed.
  • Scope of interaction was limited to text chat; voice or multimodal interfaces may affect usability.
  • Management plan practicality lagged behind PCPs, suggesting the AI needs better integration of resource constraints and patient preferences.
  • Long‑term outcomes (e.g., diagnostic error rates, downstream healthcare utilization) were not measured.
  • Future research should explore continuous learning pipelines, integration with electronic health records, and robust post‑deployment monitoring to ensure safety at scale.

Authors

  • Peter Brodeur
  • Jacob M. Koshy
  • Anil Palepu
  • Khaled Saab
  • Ava Homiar
  • Roma Ruparel
  • Charles Wu
  • Ryutaro Tanno
  • Joseph Xu
  • Amy Wang
  • David Stutz
  • Hannah M. Ferrera
  • David Barrett
  • Lindsey Crowley
  • Jihyeon Lee
  • Spencer E. Rittner
  • Ellery Wulczyn
  • Selena K. Zhang
  • Elahe Vedadi
  • Christine G. Kohn
  • Kavita Kulkarni
  • Vinay Kadiyala
  • Sara Mahdavi
  • Wendy Du
  • Jessica Williams
  • David Feinbloom
  • Renee Wong
  • Tao Tu
  • Petar Sirkovic
  • Alessio Orlandi
  • Christopher Semturs
  • Yun Liu
  • Juraj Gottweis
  • Dale R. Webster
  • Joëlle Barral
  • Katherine Chou
  • Pushmeet Kohli
  • Avinatan Hassidim
  • Yossi Matias
  • James Manyika
  • Rob Fields
  • Jonathan X. Li
  • Marc L. Cohen
  • Vivek Natarajan
  • Mike Schaekermann
  • Alan Karthikesalingam
  • Adam Rodman

Paper Information

  • arXiv ID: 2603.08448v1
  • Categories: cs.HC, cs.AI, cs.CL, cs.LG
  • Published: March 9, 2026
