[Paper] A prospective clinical feasibility study of a conversational diagnostic AI in an ambulatory primary care clinic
Source: arXiv - 2603.08448v1
Overview
A new prospective study places an LLM‑driven conversational assistant, Articulate Medical Intelligence Explorer (AMIE), in the waiting room of an academic urgent‑care clinic. One hundred adult patients used a text‑chat interface through which AMIE collected their history and generated a list of possible diagnoses before they saw their primary‑care provider (PCP). The trial shows that, with real‑time human safety monitoring, the system can operate safely, earn high patient satisfaction, and produce differential diagnoses that closely match those of clinicians.
Key Contributions
- First real‑world feasibility trial of a large‑language‑model (LLM) diagnostic chatbot integrated into an ambulatory primary‑care workflow.
- Safety supervision framework that monitors every patient‑AI interaction in real time, with zero forced terminations.
- Quantitative performance metrics: 90 % inclusion of the eventual diagnosis in AMIE’s differential, 75 % top‑3 accuracy, and comparable overall diagnostic quality to PCPs.
- Positive user experience: patients reported high satisfaction and a statistically significant increase in trust toward AI after the chat.
- Clinician impact: PCPs found the AI’s output useful for pre‑visit preparation, though they rated the AI lower on practicality and cost‑effectiveness of management plans.
Methodology
- Study Design – Prospective, single‑arm feasibility study at a leading academic medical center’s urgent‑care clinic.
- Participants – 100 adult patients scheduled for an in‑person appointment; each completed a text‑based chat with AMIE up to five days before the visit.
- AI System – AMIE is built on a state‑of‑the‑art LLM fine‑tuned for medical dialogue, capable of eliciting history, suggesting a differential diagnosis (DDx), and proposing management (Mx) steps.
- Safety Oversight – Dedicated human supervisors watched all chats live, ready to intervene if predefined safety triggers (e.g., unsafe advice, missed red flags) occurred. No interventions were needed.
- Evaluation –
  - Safety & Quality – Post‑chat surveys, chart review 8 weeks later, and blinded expert rating of DDx and Mx plans.
  - User Experience – Patient satisfaction questionnaires (pre‑ and post‑interaction) and clinician feedback on usefulness.
  - Performance Metrics – Inclusion of the final diagnosis in the AI's DDx, top‑k accuracy, and statistical comparison to PCP‑generated DDx/Mx.
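The inclusion and top‑k metrics are simple to compute once each case is paired with its chart‑confirmed final diagnosis. A minimal sketch (the function name, data layout, and toy diagnoses below are illustrative assumptions, not from the paper):

```python
def ddx_metrics(cases, k=3):
    """Compute inclusion-anywhere and top-k accuracy over a set of cases.

    Each case is a (ddx_list, final_diagnosis) pair, where ddx_list is the
    AI's ranked differential and final_diagnosis is the confirmed diagnosis.
    """
    n = len(cases)
    # Final diagnosis appears anywhere in the ranked differential.
    included = sum(1 for ddx, final in cases if final in ddx)
    # Final diagnosis appears among the first k entries.
    top_k = sum(1 for ddx, final in cases if final in ddx[:k])
    return included / n, top_k / n

# Hypothetical toy data: two cases with ranked differentials.
cases = [
    (["viral URI", "strep pharyngitis", "mononucleosis"], "strep pharyngitis"),
    (["GERD", "gastritis", "peptic ulcer", "biliary colic"], "biliary colic"),
]
inclusion, top3 = ddx_metrics(cases, k=3)
print(inclusion, top3)  # → 1.0 0.5
```

In the second toy case the correct diagnosis is ranked fourth, so it counts toward inclusion but not toward top‑3 accuracy, mirroring the gap between the study's 90 % inclusion and 75 % top‑3 figures.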
Results & Findings
| Metric | Result (AMIE vs. PCP reference) |
|---|---|
| Final diagnosis included in AMIE's DDx | 90 % |
| Top‑3 DDx accuracy | 75 % |
| Overall DDx quality (blinded rating) | No significant difference (p = 0.6) |
| Management plan safety & appropriateness | No significant difference (p = 0.1 and p = 1.0) |
| Practicality of Mx plan | PCP superior (p = 0.003) |
| Cost‑effectiveness of Mx plan | PCP superior (p = 0.004) |
| Patient satisfaction (post‑chat) | High; significant improvement in attitude toward AI (p < 0.001) |
| Clinician perceived usefulness | Positive impact on pre‑visit preparedness |
Interpretation: AMIE can reliably generate a clinically relevant differential diagnosis and safe management suggestions comparable to a human PCP, while being well‑received by patients. Clinicians still view the AI‑generated management plans as less practical and cost‑effective, indicating room for refinement.
Practical Implications
- Pre‑visit triage augmentation – Clinics could deploy a chatbot to collect structured histories, freeing up clinician time for physical exams and complex decision‑making.
- Decision‑support for busy providers – The AI’s DDx list can serve as a “second opinion,” helping clinicians consider less obvious conditions early.
- Patient engagement & education – Interactive symptom collection may improve health literacy and patient confidence, especially for tech‑savvy populations.
- Safety‑by‑design workflow – The study’s real‑time supervision model offers a blueprint for regulatory‑compliant AI rollouts in healthcare settings.
- Potential cost savings – Automating routine history taking could reduce administrative overhead, though the current management recommendations need further optimization to realize full economic benefit.
Limitations & Future Work
- Single‑site, single‑arm design limits generalizability; multi‑center trials are needed.
- Scope of interaction was limited to text chat; voice or multimodal interfaces may affect usability.
- Management plan practicality lagged behind PCPs, suggesting the AI needs better integration of resource constraints and patient preferences.
- Long‑term outcomes (e.g., diagnostic error rates, downstream healthcare utilization) were not measured.
- Future research should explore continuous learning pipelines, integration with electronic health records, and robust post‑deployment monitoring to ensure safety at scale.
Authors
- Peter Brodeur
- Jacob M. Koshy
- Anil Palepu
- Khaled Saab
- Ava Homiar
- Roma Ruparel
- Charles Wu
- Ryutaro Tanno
- Joseph Xu
- Amy Wang
- David Stutz
- Hannah M. Ferrera
- David Barrett
- Lindsey Crowley
- Jihyeon Lee
- Spencer E. Rittner
- Ellery Wulczyn
- Selena K. Zhang
- Elahe Vedadi
- Christine G. Kohn
- Kavita Kulkarni
- Vinay Kadiyala
- Sara Mahdavi
- Wendy Du
- Jessica Williams
- David Feinbloom
- Renee Wong
- Tao Tu
- Petar Sirkovic
- Alessio Orlandi
- Christopher Semturs
- Yun Liu
- Juraj Gottweis
- Dale R. Webster
- Joëlle Barral
- Katherine Chou
- Pushmeet Kohli
- Avinatan Hassidim
- Yossi Matias
- James Manyika
- Rob Fields
- Jonathan X. Li
- Marc L. Cohen
- Vivek Natarajan
- Mike Schaekermann
- Alan Karthikesalingam
- Adam Rodman
Paper Information
- arXiv ID: 2603.08448v1
- Categories: cs.HC, cs.AI, cs.CL, cs.LG
- Published: March 9, 2026