[Paper] A prospective clinical feasibility study of a conversational diagnostic AI in an ambulatory primary care clinic
Source: arXiv - 2603.08448v1
Overview
A new prospective study places an LLM‑driven conversational assistant, Articulate Medical Intelligence Explorer (AMIE), in the waiting room of an academic urgent‑care clinic. One hundred adult patients used a text‑chat interface through which AMIE collected their history and generated a list of possible diagnoses before they saw their primary‑care provider (PCP). The trial shows that, with real‑time human safety monitoring, the system can operate safely, earn high patient satisfaction, and produce differential diagnoses that closely match those of clinicians.
Key Contributions
- First real‑world feasibility trial of a large‑language‑model (LLM) diagnostic chatbot integrated into an ambulatory primary‑care workflow.
- Safety supervision framework that monitors every patient‑AI interaction in real time, with zero forced terminations.
- Quantitative performance metrics: 90 % inclusion of the eventual diagnosis in AMIE’s differential, 75 % top‑3 accuracy, and comparable overall diagnostic quality to PCPs.
- Positive user experience: patients reported high satisfaction and a statistically significant increase in trust toward AI after the chat.
- Clinician impact: PCPs found the AI’s output useful for pre‑visit preparation, though they rated the AI lower on practicality and cost‑effectiveness of management plans.
Methodology
- Study Design – Prospective, single‑arm feasibility study at a leading academic medical center’s urgent‑care clinic.
- Participants – 100 adult patients scheduled for an in‑person appointment; each completed a text‑based chat with AMIE up to five days before the visit.
- AI System – AMIE is built on a state‑of‑the‑art LLM fine‑tuned for medical dialogue, capable of eliciting history, suggesting a differential diagnosis (DDx), and proposing management (Mx) steps.
- Safety Oversight – Dedicated human supervisors watched all chats live, ready to intervene if predefined safety triggers (e.g., unsafe advice, missed red flags) occurred. No interventions were needed.
- Evaluation –
  - Safety & Quality – Post‑chat surveys, chart review 8 weeks later, and blinded expert rating of DDx and Mx plans.
  - User Experience – Patient satisfaction questionnaires (pre‑ and post‑interaction) and clinician feedback on usefulness.
  - Performance Metrics – Inclusion of the final diagnosis in the AI's DDx, top‑k accuracy, and statistical comparison to PCP‑generated DDx/Mx.
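The inclusion and top‑k metrics are simple to compute once each case is paired with its chart‑confirmed final diagnosis. A minimal sketch (the function name, data layout, and toy diagnoses below are illustrative assumptions, not from the paper):

```python
def ddx_metrics(cases, k=3):
    """Compute inclusion-anywhere and top-k accuracy over a set of cases.

    Each case is a (ddx_list, final_diagnosis) pair, where ddx_list is the
    AI's ranked differential and final_diagnosis is the confirmed diagnosis.
    """
    n = len(cases)
    # Final diagnosis appears anywhere in the ranked differential.
    included = sum(1 for ddx, final in cases if final in ddx)
    # Final diagnosis appears among the first k entries.
    top_k = sum(1 for ddx, final in cases if final in ddx[:k])
    return included / n, top_k / n

# Hypothetical toy data: two cases with ranked differentials.
cases = [
    (["viral URI", "strep pharyngitis", "mononucleosis"], "strep pharyngitis"),
    (["GERD", "gastritis", "peptic ulcer", "biliary colic"], "biliary colic"),
]
inclusion, top3 = ddx_metrics(cases, k=3)
print(inclusion, top3)  # → 1.0 0.5
```

In the second toy case the correct diagnosis is ranked fourth, so it counts toward inclusion but not toward top‑3 accuracy, mirroring the gap between the study's 90 % inclusion and 75 % top‑3 figures.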
Results & Findings
| Metric | Result (AMIE vs. PCP reference) |
|---|---|
| Final diagnosis included in AMIE's DDx | 90 % |
| Top‑3 DDx accuracy | 75 % |
| Overall DDx quality (blinded rating) | No significant difference (p = 0.6) |
| Management plan safety & appropriateness | No significant difference (p = 0.1 and p = 1.0) |
| Practicality of Mx plan | PCP superior (p = 0.003) |
| Cost‑effectiveness of Mx plan | PCP superior (p = 0.004) |
| Patient satisfaction (post‑chat) | High; significant improvement in attitude toward AI (p < 0.001) |
| Clinician perceived usefulness | Positive impact on pre‑visit preparedness |
Interpretation: AMIE can reliably generate a clinically relevant differential diagnosis and safe management suggestions comparable to a human PCP, while being well‑received by patients. Clinicians still view the AI‑generated management plans as less practical and cost‑effective, indicating room for refinement.
Practical Implications
- Pre‑visit triage augmentation – Clinics could deploy a chatbot to collect structured histories, freeing up clinician time for physical exams and complex decision‑making.
- Decision‑support for busy providers – The AI’s DDx list can serve as a “second opinion,” helping clinicians consider less obvious conditions early.
- Patient engagement & education – Interactive symptom collection may improve health literacy and patient confidence, especially for tech‑savvy populations.
- Safety‑by‑design workflow – The study’s real‑time supervision model offers a blueprint for regulatory‑compliant AI rollouts in healthcare settings.
- Potential cost savings – Automating routine history taking could reduce administrative overhead, though the current management recommendations need further optimization to realize full economic benefit.
Limitations & Future Work
- Single‑site, single‑arm design limits generalizability; multi‑center trials are needed.
- Scope of interaction was limited to text chat; voice or multimodal interfaces may affect usability.
- Management plan practicality lagged behind PCPs, suggesting the AI needs better integration of resource constraints and patient preferences.
- Long‑term outcomes (e.g., diagnostic error rates, downstream healthcare utilization) were not measured.
- Future research should explore continuous learning pipelines, integration with electronic health records, and robust post‑deployment monitoring to ensure safety at scale.
Authors
- Peter Brodeur
- Jacob M. Koshy
- Anil Palepu
- Khaled Saab
- Ava Homiar
- Roma Ruparel
- Charles Wu
- Ryutaro Tanno
- Joseph Xu
- Amy Wang
- David Stutz
- Hannah M. Ferrera
- David Barrett
- Lindsey Crowley
- Jihyeon Lee
- Spencer E. Rittner
- Ellery Wulczyn
- Selena K. Zhang
- Elahe Vedadi
- Christine G. Kohn
- Kavita Kulkarni
- Vinay Kadiyala
- Sara Mahdavi
- Wendy Du
- Jessica Williams
- David Feinbloom
- Renee Wong
- Tao Tu
- Petar Sirkovic
- Alessio Orlandi
- Christopher Semturs
- Yun Liu
- Juraj Gottweis
- Dale R. Webster
- Joëlle Barral
- Katherine Chou
- Pushmeet Kohli
- Avinatan Hassidim
- Yossi Matias
- James Manyika
- Rob Fields
- Jonathan X. Li
- Marc L. Cohen
- Vivek Natarajan
- Mike Schaekermann
- Alan Karthikesalingam
- Adam Rodman
Paper Information
- arXiv ID: 2603.08448v1
- Categories: cs.HC, cs.AI, cs.CL, cs.LG
- Published: March 9, 2026