[Paper] SymptomAI: Towards a Conversational AI Agent for Everyday Symptom Assessment

Published: (May 5, 2026 at 01:36 PM EDT)

Source: arXiv - 2605.04012v1

Overview

A new study introduces SymptomAI, a suite of conversational agents embedded in the Fitbit app that interview users about their everyday health symptoms and generate differential diagnoses. Testing the agents with 13,917 real‑world participants, the researchers show that a structured, symptom‑focused interview improves diagnostic accuracy by roughly 15 percentage points over the casual, user‑driven chats typical of today's consumer LLM chatbots.

Key Contributions

  • Large‑scale real‑world deployment: 13,917 participants interacted with five distinct AI agents via a popular wearable platform.
  • Rigorous clinical evaluation: 1,228 users supplied a diagnosis they had received from a clinician; 517 of these cases were independently reviewed by a panel of physicians (250+ hours of annotation).
  • Demonstrated diagnostic superiority: SymptomAI’s differential diagnoses had 2.47 times the odds of matching the clinician’s label compared with differentials written by clinicians who saw only the raw dialogue (OR = 2.47, p < 0.001).
  • Agentic interview design matters: Agents that first conduct a systematic symptom interview before offering a diagnosis outperform “user‑guided” agents that let the conversation flow freely (p < 0.001).
  • Physiological validation: Using the AI‑generated labels, the team linked more than 500,000 days of wearable data to ~400 conditions, uncovering strong physiological signatures (e.g., OR > 7 for influenza).
  • General‑population robustness: An auxiliary analysis on 1,509 conversations from a broader US panel confirmed that findings extend beyond Fitbit users.

Methodology

  1. Agent variants – Five conversational bots were built on top of large language models (LLMs). Two were “agentic”: they followed a scripted, evidence‑based symptom interview (ask about onset, severity, associated features, etc.) before proposing a diagnosis. The other three were “user‑guided”: they responded directly to whatever the user typed, mimicking typical consumer chatbots.
  2. Deployment – The agents were integrated into the Fitbit mobile app. Participants were randomly assigned to one of the five bots and asked to describe any health concerns they were experiencing.
  3. Ground‑truth collection – After the AI interview, users could optionally upload a clinician‑provided diagnosis (e.g., from a recent doctor visit). This yielded 1,228 self‑reported clinical labels.
  4. Clinical adjudication – A separate panel of physicians reviewed the full AI‑user dialogue (blinded to the AI’s output) and supplied their own differential diagnosis for 517 of the cases.
  5. Statistical analysis – Diagnostic agreement was measured using odds ratios and significance testing. Wearable sensor streams (heart rate, temperature, activity) were aligned with the AI‑derived condition labels to explore physiological correlates.
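
The "agentic" interview style from step 1 can be pictured as a small slot‑filling loop: the bot keeps asking until a fixed set of questions is answered, and only then moves on to a differential. The sketch below is a minimal illustration under assumptions, not the authors' implementation. The slot names come from the examples in step 1 (onset, severity, associated features); the question wording, `InterviewState`, and `run_interview` are hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical interview slots, based on the examples given in step 1.
# The real agents' question set is not described in this summary.
REQUIRED_SLOTS = ("onset", "severity", "associated_features")

QUESTIONS = {
    "onset": "When did the symptom start?",
    "severity": "How severe is it on a scale of 1 to 10?",
    "associated_features": "Have you noticed any other symptoms alongside it?",
}


@dataclass
class InterviewState:
    chief_complaint: str
    answers: dict = field(default_factory=dict)

    def next_question(self):
        """Return (slot, question) for the first unanswered slot, or None."""
        for slot in REQUIRED_SLOTS:
            if slot not in self.answers:
                return slot, QUESTIONS[slot]
        return None

    def record(self, slot, answer):
        if slot not in REQUIRED_SLOTS:
            raise KeyError(f"unknown slot: {slot}")
        self.answers[slot] = answer

    def ready_for_differential(self):
        """True only once every required slot has an answer."""
        return all(slot in self.answers for slot in REQUIRED_SLOTS)


def run_interview(state, answer_fn):
    """Ask questions until none remain; answer_fn stands in for the user."""
    while (step := state.next_question()) is not None:
        slot, question = step
        state.record(slot, answer_fn(question))
    return state


# Usage with scripted answers in place of a live user.
scripted = iter(["two days ago", "6", "mild fever"])
state = run_interview(InterviewState("sore throat"), lambda q: next(scripted))
print(state.ready_for_differential())
```

A user‑guided bot, by contrast, would skip the `ready_for_differential` gate and answer whatever the user typed first.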

Results & Findings

  • Diagnostic accuracy: SymptomAI’s agentic bots matched the user‑reported clinician diagnosis in 42 % of adjudicated cases, versus 23 % for panel physicians working from the dialogue alone (OR = 2.47, p < 0.001).
  • Interview style effect: Structured symptom interviews boosted accuracy by ~15 percentage points over user‑guided chats (p < 0.001).
  • Physiological signatures: Acute infections (influenza, COVID‑19) showed the strongest wearable changes—elevated resting heart rate and reduced activity—yielding odds ratios > 7 when compared to healthy periods.
  • Generalizability: The same performance gap between agentic and user‑guided bots appeared in the external US panel, indicating the effect is not limited to Fitbit’s user base.
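
For readers less familiar with odds ratios, the short sketch below shows how one is computed from two match rates and from a 2x2 table of counts. It is a generic illustration, not the paper's analysis code. The function names are made up for this example, the counts passed to `odds_ratio_with_ci` are hypothetical (42 and 23 matches per 100 cases), and the Wald interval treats the two groups as independent, which is a simplification if both read the same cases.

```python
import math


def odds_ratio_from_rates(p_a, p_b):
    """Odds ratio of group A versus group B from two match rates in (0, 1)."""
    return (p_a / (1 - p_a)) / (p_b / (1 - p_b))


def odds_ratio_with_ci(a_hit, a_miss, b_hit, b_miss, z=1.96):
    """Odds ratio and Wald confidence interval from a 2x2 table of counts."""
    or_ = (a_hit * b_miss) / (a_miss * b_hit)
    # Standard error of log(OR) for an unpaired 2x2 table.
    se = math.sqrt(1 / a_hit + 1 / a_miss + 1 / b_hit + 1 / b_miss)
    lo = math.exp(math.log(or_) - z * se)
    hi = math.exp(math.log(or_) + z * se)
    return or_, lo, hi


# Rounded rates reported above: 42 % versus 23 %.
print(round(odds_ratio_from_rates(0.42, 0.23), 2))  # 2.42

# Hypothetical counts, for illustration only.
print(odds_ratio_with_ci(42, 58, 23, 77))
```

The rounded percentages give about 2.42 rather than the reported 2.47. That gap is within what rounding of the two rates alone could produce; this summary does not say how the reported figure was estimated.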

Practical Implications

  • Better consumer health assistants: Embedding a brief, evidence‑based symptom interview into any LLM‑powered health chatbot can raise diagnostic relevance, making the tool more trustworthy for users seeking triage advice.
  • Integration with wearables: Linking AI‑generated condition labels to continuous sensor data could support earlier detection of disease patterns (e.g., spotting an influenza outbreak from aggregated heart‑rate spikes).
  • Clinical decision support: Front‑line clinicians could receive a pre‑populated symptom checklist from the AI, reducing interview time and standardizing data capture.
  • Regulatory pathways: Demonstrating a measurable improvement over clinician‑only interpretation may help satisfy FDA or other health‑technology regulators when positioning such agents as “clinical decision‑support” rather than pure consumer chatbots.
  • Product roadmap for health apps: Companies can differentiate their offerings by moving from open‑ended chat to a guided interview flow, potentially unlocking new revenue streams (e.g., premium symptom‑tracking subscriptions).
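
As a rough picture of what "heart‑rate spikes" could mean in a product, the sketch below flags days whose resting heart rate sits well above a personal baseline. It is an assumption‑laden illustration, not the study's method: the function name, the 14‑day baseline window, the z‑score threshold of 2, and the sample values are all invented for this example.

```python
from statistics import mean, stdev


def flag_elevated_days(resting_hr, baseline_days=14, z_threshold=2.0):
    """Return indices of days whose resting heart rate exceeds the baseline.

    resting_hr: daily resting heart rates in bpm, oldest first.
    The first `baseline_days` values define the personal baseline; the
    window length and z-score rule are illustrative assumptions.
    """
    if len(resting_hr) <= baseline_days:
        raise ValueError("need more days than the baseline window")
    baseline = resting_hr[:baseline_days]
    mu, sigma = mean(baseline), stdev(baseline)
    if sigma == 0:
        raise ValueError("baseline has no variation")
    return [
        day
        for day, bpm in enumerate(resting_hr[baseline_days:], start=baseline_days)
        if (bpm - mu) / sigma > z_threshold
    ]


# Made-up series: 14 ordinary days, then a two-day rise.
series = [60, 61, 59, 60, 62, 61, 60, 59, 61, 60, 62, 60, 61, 59,
          60, 68, 70, 61]
print(flag_elevated_days(series))  # [15, 16]
```

A population‑level signal would aggregate such flags across many users; how the study aligned sensor streams with condition labels is not detailed in this summary.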

Limitations & Future Work

  • Self‑reported ground truth: The “clinician diagnosis” used for labeling relies on users uploading their own records, which may be incomplete or inaccurate.
  • Population bias: Although an external panel was added, the primary cohort consists of Fitbit users who may be more health‑conscious and tech‑savvy than the general public.
  • Scope of conditions: The study focused on common acute illnesses; performance on chronic, multi‑system diseases remains untested.
  • Explainability: The agents provide a diagnosis but limited rationale; future work should surface reasoning to improve user trust and clinician acceptance.
  • Regulatory compliance: Further validation under controlled clinical trials will be needed before deployment as a medical device or diagnostic aid.

SymptomAI shows that a modest change in conversation design, asking the right questions first, can make an LLM‑based health assistant measurably more accurate. For developers integrating AI into health‑tech products, the lesson is that structure matters, and that pairing conversational AI with wearable data is a promising route to early disease detection.

Authors

  • Joseph Breda
  • Fadi Yousif
  • Beszel Hawkins
  • Marinela Cotoi
  • Miao Liu
  • Ray Luo
  • Po-Hsuan Cameron Chen
  • Mike Schaekermann
  • Samuel Schmidgall
  • Xin Liu
  • Girish Narayanswamy
  • Samuel Solomon
  • Maxwell A. Xu
  • Xiaoran Fan
  • Longfei Shangguan
  • Anran Wang
  • Bhavna Daryani
  • Buddy Herkenham
  • Cara Tan
  • Mark Malhotra
  • Shwetak Patel
  • John B. Hernandez
  • Quang Duong
  • Yun Liu
  • Zach Wasson
  • Dimitrios Antos
  • Bob Lou
  • Matthew Thompson
  • Jonathan Richina
  • Anupam Pathak
  • Nichole Young-Lin
  • Jake Sunshine
  • Daniel McDuff

Paper Information

  • arXiv ID: 2605.04012v1
  • Categories: cs.AI
  • Published: May 5, 2026