[Paper] SymptomAI: Towards a Conversational AI Agent for Everyday Symptom Assessment
Source: arXiv - 2605.04012v1
Overview
A new study introduces SymptomAI, a suite of conversational agents embedded in the Fitbit app that interview users about their everyday health symptoms and generate differential diagnoses. Testing the agents with nearly 14,000 real‑world participants, the researchers report that a structured, symptom‑focused interview improves diagnostic accuracy by roughly 15 percentage points over the more casual, user‑driven chats that most consumer LLMs employ today.
Key Contributions
- Large‑scale real‑world deployment: 13,917 participants interacted with five distinct AI agents via a popular wearable platform.
- Rigorous clinical evaluation: 1,228 users supplied a clinician‑provided diagnosis (self‑reported); for 517 of these cases, an independent panel of physicians also reviewed the dialogue and gave their own differential diagnosis (250+ hours of annotation).
- Demonstrated diagnostic superiority: SymptomAI’s differential diagnoses had 2.47× the odds of matching the clinician’s label compared with differentials produced by clinicians who saw only the raw dialogue (p < 0.001).
- Agentic interview design matters: Agents that first conduct a systematic symptom interview before offering a diagnosis outperform “user‑guided” agents that let the conversation flow freely (p < 0.001).
- Physiological validation: Using the AI‑generated labels, the team linked more than 500,000 days of wearable data to ~400 conditions, uncovering strong physiological signatures (e.g., odds ratio [OR] > 7 for influenza).
- General‑population robustness: An auxiliary analysis on 1,509 conversations from a broader US panel confirmed that findings extend beyond Fitbit users.
Methodology
- Agent variants – Five conversational bots were built on top of large language models (LLMs). Two were “agentic”: they followed a scripted, evidence‑based symptom interview (ask about onset, severity, associated features, etc.) before proposing a diagnosis. The other three were “user‑guided”: they responded directly to whatever the user typed, mimicking typical consumer chatbots.
- Deployment – The agents were integrated into the Fitbit mobile app. Participants were randomly assigned to one of the five bots and asked to describe any health concerns they were experiencing.
- Ground‑truth collection – After the AI interview, users could optionally upload a clinician‑provided diagnosis (e.g., from a recent doctor visit). This yielded 1,228 self‑reported clinical labels.
- Clinical adjudication – A separate panel of physicians reviewed the full AI‑user dialogue (blinded to the AI’s output) and supplied their own differential diagnosis for 517 of the cases.
- Statistical analysis – Diagnostic agreement was measured using odds ratios and significance testing. Wearable sensor streams (heart rate, temperature, activity) were aligned with the AI‑derived condition labels to explore physiological correlates.
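The difference between the two agent styles in the first bullet can be sketched as control flow. This is a minimal illustration, not the authors' implementation: the slot names come from the examples above (onset, severity, associated features), while `QUESTIONS`, `InterviewState`, `run_agentic_interview`, `run_user_guided`, and the `diagnose` callback (a stand-in for an LLM call) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical interview script; the paper's actual question set is not reproduced here.
QUESTIONS: Dict[str, str] = {
    "onset": "When did this start?",
    "severity": "How severe is it, from 1 (mild) to 10 (worst)?",
    "associated_features": "Have you noticed any other symptoms alongside it?",
}


@dataclass
class InterviewState:
    """Everything the agent knows about the user's complaint so far."""
    complaint: str
    answers: Dict[str, str] = field(default_factory=dict)


# Stand-in type for an LLM call that turns collected evidence into a ranked differential.
Diagnose = Callable[[InterviewState], List[str]]


def run_agentic_interview(
    complaint: str, ask: Callable[[str], str], diagnose: Diagnose
) -> List[str]:
    """Agentic style: ask every scripted question first, then diagnose."""
    state = InterviewState(complaint)
    for slot, question in QUESTIONS.items():
        state.answers[slot] = ask(question)
    return diagnose(state)


def run_user_guided(complaint: str, diagnose: Diagnose) -> List[str]:
    """User-guided style: diagnose from what the user volunteered, with no follow-up."""
    return diagnose(InterviewState(complaint))
```

In a deployed agent, `ask` would send a chat message and wait for the reply. The point of the sketch is only that `diagnose` receives a fully populated `answers` dictionary on the agentic path and an empty one on the user‑guided path.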
Results & Findings
- Diagnostic accuracy: SymptomAI’s agentic bots matched the clinician’s diagnosis in 42 % of adjudicated cases versus 23 % for the clinician‑only baseline (OR = 2.47, p < 0.001).
- Interview style effect: Structured symptom interviews boosted accuracy by ~15 percentage points over user‑guided chats (p < 0.001).
- Physiological signatures: Acute infections (influenza, COVID‑19) showed the strongest wearable changes—elevated resting heart rate and reduced activity—yielding odds ratios > 7 when compared to healthy periods.
- Generalizability: The same performance gap between agentic and user‑guided bots appeared in the external US panel, indicating the effect is not limited to Fitbit’s user base.
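As a sanity check on the headline figure, the odds ratio can be recomputed from the two rounded percentages in the first bullet. This is a back‑of‑the‑envelope sketch with hypothetical helper names (`odds`, `odds_ratio`, `risk_ratio`). It assumes the reported OR is a plain ratio of odds; the paper may instead derive it from exact counts or a fitted model, which would explain a small gap from 2.47.

```python
def odds(p: float) -> float:
    """Convert a probability into odds, p / (1 - p)."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    return p / (1.0 - p)


def odds_ratio(p_a: float, p_b: float) -> float:
    """Ratio of the odds of success in group A to the odds in group B."""
    return odds(p_a) / odds(p_b)


def risk_ratio(p_a: float, p_b: float) -> float:
    """Ratio of the raw success probabilities (how many times 'more likely')."""
    return p_a / p_b


# Rounded match rates quoted in the summary above.
agentic_match = 0.42
clinician_only_match = 0.23

print(f"odds ratio: {odds_ratio(agentic_match, clinician_only_match):.2f}")
print(f"risk ratio: {risk_ratio(agentic_match, clinician_only_match):.2f}")
```

With the rounded inputs, this gives an odds ratio of about 2.42, close to the reported 2.47, and a risk ratio of about 1.83. The two measures answer different questions: an odds ratio of 2.47 does not mean a match was 2.47 times as probable.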
Practical Implications
- Better consumer health assistants: Embedding a brief, evidence‑based symptom interview into any LLM‑powered health chatbot can raise diagnostic relevance, making the tool more trustworthy for users seeking triage advice.
- Integration with wearables: Linking AI‑generated condition labels to continuous sensor data could support earlier detection of disease patterns (e.g., spotting an influenza outbreak from aggregated heart‑rate spikes).
- Clinical decision support: Front‑line clinicians could receive a pre‑populated symptom checklist from the AI, reducing interview time and standardizing data capture.
- Regulatory pathways: Demonstrating a measurable improvement over clinician‑only interpretation may help satisfy FDA or other health‑technology regulators when positioning such agents as “clinical decision‑support” rather than pure consumer chatbots.
- Product roadmap for health apps: Companies can differentiate their offerings by moving from open‑ended chat to a guided interview flow, potentially unlocking new revenue streams (e.g., premium symptom‑tracking subscriptions).
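The wearable linkage mentioned above amounts to comparing sensor readings on days labeled with a condition against readings from healthy days. The sketch below is one plausible way to do that, not the paper's pipeline: the z‑score threshold, the function names, and the heart‑rate values are all invented for illustration, and the 0.5 continuity correction (Haldane‑Anscombe) is a common choice rather than something the summary specifies.

```python
from statistics import mean, stdev
from typing import List


def elevated_flags(baseline: List[float], days: List[float], z: float = 2.0) -> List[bool]:
    """Flag days whose resting heart rate exceeds baseline mean + z * baseline std."""
    threshold = mean(baseline) + z * stdev(baseline)
    return [hr > threshold for hr in days]


def odds_ratio_2x2(ill_elevated: int, ill_normal: int,
                   healthy_elevated: int, healthy_normal: int) -> float:
    """Odds ratio from a 2x2 table, adding 0.5 to each cell to avoid division by zero."""
    a, b, c, d = (n + 0.5 for n in (ill_elevated, ill_normal,
                                    healthy_elevated, healthy_normal))
    return (a / b) / (c / d)


# Synthetic resting heart rates (bpm) for one hypothetical user; not study data.
baseline_days = [60, 61, 59, 62, 60, 58, 61]   # used only to set the threshold
illness_days = [66, 68, 63, 61]                # days labeled with an acute infection
healthy_days = [60, 62, 64, 59, 61, 60]        # unlabeled comparison days

ill = elevated_flags(baseline_days, illness_days)
healthy = elevated_flags(baseline_days, healthy_days)

result = odds_ratio_2x2(
    ill_elevated=sum(ill), ill_normal=len(ill) - sum(ill),
    healthy_elevated=sum(healthy), healthy_normal=len(healthy) - sum(healthy),
)
print(f"odds of elevated resting HR, illness vs. healthy days: {result:.2f}")
```

A population‑level analysis would pool such tables across many users and conditions and would need confidence intervals; the sketch only shows the shape of the comparison.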
Limitations & Future Work
- Self‑reported ground truth: The “clinician diagnosis” used for labeling relies on users uploading their own records, which may be incomplete or inaccurate.
- Population bias: Although an external panel was added, the primary cohort consists of Fitbit users who may be more health‑conscious and tech‑savvy than the general public.
- Scope of conditions: The study focused on common acute illnesses; performance on chronic, multi‑system diseases remains untested.
- Explainability: The agents provide a diagnosis but limited rationale; future work should surface reasoning to improve user trust and clinician acceptance.
- Regulatory compliance: Further validation under controlled clinical trials will be needed before deployment as a medical device or diagnostic aid.
SymptomAI suggests that a modest change in conversation design, asking the right questions first, can make a generic LLM a more accurate health assistant. For developers integrating AI into health‑tech products, the takeaway is that interview structure matters, and that pairing conversational AI with wearable data is a promising route to earlier disease detection.
Authors
- Joseph Breda
- Fadi Yousif
- Beszel Hawkins
- Marinela Cotoi
- Miao Liu
- Ray Luo
- Po-Hsuan Cameron Chen
- Mike Schaekermann
- Samuel Schmidgall
- Xin Liu
- Girish Narayanswamy
- Samuel Solomon
- Maxwell A. Xu
- Xiaoran Fan
- Longfei Shangguan
- Anran Wang
- Bhavna Daryani
- Buddy Herkenham
- Cara Tan
- Mark Malhotra
- Shwetak Patel
- John B. Hernandez
- Quang Duong
- Yun Liu
- Zach Wasson
- Dimitrios Antos
- Bob Lou
- Matthew Thompson
- Jonathan Richina
- Anupam Pathak
- Nichole Young-Lin
- Jake Sunshine
- Daniel McDuff
Paper Information
- arXiv ID: 2605.04012v1
- Categories: cs.AI
- Published: May 5, 2026