[Paper] Empathy Applicability Modeling for General Health Queries
Source: arXiv - 2601.09696v1
Overview
Large language models (LLMs) are being rolled out as virtual assistants in clinical settings, but they still struggle to convey the kind of empathy that patients expect from human doctors. This paper introduces the Empathy Applicability Framework (EAF), a theory‑driven system that predicts when a patient’s health query calls for an empathetic response, allowing downstream models to generate more caring replies up front.
Key Contributions
- EAF taxonomy: A structured classification that maps each patient query to one of three empathy‑applicability labels (emotion‑reaction applicable, interpretation applicable, or not applicable) based on clinical severity, contextual cues, and linguistic signals.
- Benchmark dataset: 2,500 real‑world health questions annotated by both domain experts and GPT‑4o, with a high‑agreement human subset for reliable evaluation.
- Empathy‑applicability classifiers: Supervised models trained on the human‑only and GPT‑only labels that outperform heuristic rules and zero‑shot LLM baselines.
- Error analysis & insights: Identification of three persistent failure modes—implicit distress, ambiguous clinical severity, and culturally specific hardship—that guide future annotation and model design.
- Open‑source release: Code, data, and evaluation scripts made publicly available to spur research on anticipatory empathy in healthcare AI.
Methodology
- Framework design – The authors distilled clinical communication theory into a three‑tier label set:
  - Emotion‑reaction applicable – the query warrants an empathetic reaction.
  - Interpretation applicable – the query needs empathetic framing or clarification.
  - Not applicable – purely informational; no empathy needed.
- Data collection – Over 2,500 de‑identified patient questions were scraped from public health forums. Each query was annotated independently by:
  - Human clinicians (n=3 per item)
  - GPT‑4o (prompted with the same rubric).
- Label consolidation – For the “human‑consensus” subset (≈ 70 % of the data), at least two of the three clinicians agreed on the label. GPT‑4o’s predictions were also compared against this gold standard to measure alignment.
- Model training – Two families of classifiers were built (a fine‑tuning sketch follows this list):
  - Traditional ML (Logistic Regression, SVM) using handcrafted linguistic features (e.g., sentiment scores, medical entity density).
  - Fine‑tuned LLMs (DistilBERT, LLaMA‑7B) that ingest the raw query text.
- Baselines – Simple rule‑based heuristics (e.g., presence of “I feel” → empathy) and zero‑shot prompting of GPT‑4o were used for comparison.
- Evaluation – Accuracy, macro F1, and Cohen’s κ were reported on a held‑out test split, with separate results for the human‑consensus subset and the full dual‑annotated set (minimal sketches of the consolidation, baseline, and fine‑tuning steps follow this list).
Results & Findings
| Model | Accuracy (Human‑Consensus) | F1 (macro) | Notes |
|---|---|---|---|
| Rule‑based heuristic | 62 % | 0.58 | Misses subtle distress |
| Zero‑shot GPT‑4o | 71 % | 0.66 | Better but inconsistent on ambiguous cases |
| Logistic Regression (hand‑crafted) | 78 % | 0.74 | Gains from medical entity features |
| Fine‑tuned DistilBERT | 84 % | 0.81 | Strongest overall performance |
| Fine‑tuned LLaMA‑7B | 86 % | 0.84 | Outperforms all baselines |
- Human‑GPT alignment: On the consensus subset, GPT‑4o agreed with clinicians 78 % of the time, indicating that LLMs can approximate expert judgment when guided by a clear rubric.
- Error hotspots: The models most often faltered on queries that hinted at distress without explicit affect words (e.g., “my blood pressure has been rising”), on questions where clinical severity was unclear, and on culturally specific expressions of hardship.
Practical Implications
- Pre‑screening for empathy: Integrating an EAF classifier into a health‑chatbot pipeline lets the system flag queries that need an empathetic tone before a response is generated, so the downstream language model can select an appropriate style template (a minimal pipeline sketch follows this list).
- Asynchronous care platforms: Tele‑triage services and patient portals can route empathy‑applicable messages to human clinicians or to higher‑fidelity LLMs, improving patient satisfaction without sacrificing scalability.
- Developer tooling: The released benchmark can be used to fine‑tune custom models or to evaluate existing chat‑LLMs for empathy awareness, giving product teams a concrete metric beyond generic accuracy.
- Regulatory compliance: Demonstrating that an AI system actively assesses empathy needs may help satisfy emerging guidelines around “human‑centred” AI in healthcare.
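As a rough illustration of the pre‑screening idea in the first bullet above, the sketch below classifies an incoming query and picks a style template before generation. The checkpoint path, template texts, and the `generate_reply` placeholder are hypothetical, not artifacts released with the paper.

```python
# Minimal sketch: gate the downstream reply style on the EAF label of the query.
from transformers import pipeline

# Assumed: a fine-tuned EAF checkpoint saved locally or on a model hub.
eaf_classifier = pipeline("text-classification", model="path/to/eaf-distilbert")

STYLE_TEMPLATES = {
    "emotion_reaction": "Acknowledge the patient's feelings first, then answer: {query}",
    "interpretation":   "Gently reframe the concern before answering: {query}",
    "not_applicable":   "Answer concisely and factually: {query}",
}


def pre_screen(query: str) -> str:
    """Return a style-conditioned prompt for the query, chosen by its predicted EAF label."""
    label = eaf_classifier(query)[0]["label"]
    template = STYLE_TEMPLATES.get(label, STYLE_TEMPLATES["not_applicable"])
    return template.format(query=query)


def generate_reply(prompt: str) -> str:
    """Placeholder for the downstream LLM call (e.g., a hosted chat model)."""
    raise NotImplementedError


if __name__ == "__main__":
    prompt = pre_screen("I feel terrified about my biopsy results. What should I do?")
    print(prompt)  # the downstream model would receive this style-conditioned prompt
```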
Limitations & Future Work
- Annotation diversity: The human annotators were primarily English‑speaking clinicians from a single geographic region, limiting cultural generalizability.
- Scope of queries: The dataset focuses on general health questions; specialty‑specific or emergency‑level queries may require a different empathy taxonomy.
- Model interpretability: While fine‑tuned LLMs perform best, their decision logic remains opaque, which could hinder trust in high‑stakes settings.
- Next steps: The authors propose multi‑annotator pipelines that include patients, cross‑cultural clinicians, and continuous clinician‑in‑the‑loop calibration to refine the framework and broaden its applicability.
Bottom line: By shifting empathy detection from a post‑hoc label to a proactive classification step, the Empathy Applicability Framework equips developers with a practical lever to make AI‑driven health assistants feel more human—without sacrificing the speed and scale that make LLMs attractive in the first place.
Authors
- Shan Randhawa
- Agha Ali Raza
- Kentaro Toyama
- Julie Hui
- Mustafa Naseem
Paper Information
- arXiv ID: 2601.09696v1
- Categories: cs.CL
- Published: January 14, 2026