[Paper] Developing and evaluating a chatbot to support maternal health care
Source: arXiv:2603.13168v1
Overview
A multidisciplinary team built and rigorously evaluated a phone‑based chatbot that delivers maternal‑health information to women in India—especially those with low health literacy and limited access to care. The work tackles the technical hurdles of short, code‑mixed queries, region‑specific medical knowledge, and safe triage, offering a blueprint for trustworthy AI assistants in high‑stakes, low‑resource environments.
Key Contributions
- Stage‑aware triage engine – flags high‑risk (emergency) queries and routes them to expert‑crafted response templates.
- Hybrid retrieval system – pulls relevant passages from curated maternal and newborn care guidelines, handling multilingual and noisy user input.
- Evidence‑conditioned generation – uses a large language model (LLM) that grounds its answers in the retrieved medical evidence.
- Comprehensive evaluation workflow covering:
  - A labeled triage benchmark (150 real queries) achieving 86.7 % emergency recall and an explicit analysis of missed‑emergency vs. over‑escalation trade‑offs.
  - A synthetic multi‑evidence retrieval benchmark (100 queries) with chunk‑level relevance labels.
  - LLM‑as‑judge assessments on 781 real user questions, employing clinician‑designed scoring criteria.
  - End‑to‑end expert validation of the deployed system.
- Design principle – “defense‑in‑depth”: combine multiple safeguards (triage, retrieval, grounding) and multi‑method evaluation rather than relying on a single model or metric.
Methodology
Data Collection & Curation
- Partnered with a public‑health nonprofit and a hospital.
- Gathered Indian maternal‑health guidelines, FAQs, and real user queries (short, mixed Hindi‑English).
Triaging Layer
- A lightweight classifier predicts the pregnancy stage and urgency level.
- High‑risk cases are automatically escalated to pre‑written, clinician‑reviewed templates.
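The routing described above can be sketched as follows. This is a minimal illustration, not the paper's classifier: the keyword set stands in for the trained triage model, and `EMERGENCY_TERMS`, `TEMPLATES`, and `TriageResult` are hypothetical names. The template text is a placeholder, not clinical guidance.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical stand-in for a trained urgency classifier (assumption).
EMERGENCY_TERMS = {"bleeding", "seizure", "unconscious"}

# Placeholder only; a real system would store clinician-reviewed text here.
TEMPLATES = {"emergency": "<clinician-reviewed emergency template>"}


@dataclass
class TriageResult:
    urgency: str                # "emergency" or "routine"
    response: Optional[str]     # template text when escalated, else None


def triage(query: str) -> TriageResult:
    """Escalate to a fixed template if any emergency term appears in the query."""
    tokens = set(query.lower().split())
    if tokens & EMERGENCY_TERMS:
        return TriageResult("emergency", TEMPLATES["emergency"])
    # Routine queries carry no template and continue to retrieval and generation.
    return TriageResult("routine", None)
```

Returning the template directly for escalated queries means the LLM is never consulted on the high-risk path, which is the point of the design.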
Hybrid Retrieval
- Stage 1 – Lexical Search: Searches the curated guideline corpus.
- Stage 2 – Neural Re‑ranking: Adjusts results for code‑mixing and noisy spelling.
- Retrieved chunks are tagged with evidence IDs.
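A rough sketch of the two-stage flow is shown below. It makes several assumptions: token overlap stands in for the lexical search, and character-trigram Jaccard similarity stands in for the neural re-ranker (it is not a neural model, only a cheap proxy that tolerates spelling noise). The corpus, evidence IDs, and function names are hypothetical.

```python
# Hypothetical evidence chunks keyed by evidence ID (assumption).
CORPUS = {
    "E1": "iron tablets should be taken daily during pregnancy",
    "E2": "newborn should be breastfed within one hour of birth",
    "E3": "take rest and drink water during pregnancy",
}


def lexical_search(query, corpus, k=3):
    """Stage 1: rank chunks by the number of query tokens they share."""
    q = set(query.lower().split())
    scored = [(len(q & set(text.lower().split())), eid) for eid, text in corpus.items()]
    scored = [(s, eid) for s, eid in scored if s > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [eid for _, eid in scored[:k]]


def trigrams(text):
    """Set of character trigrams of the lowercased text."""
    t = text.lower()
    return {t[i:i + 3] for i in range(len(t) - 2)}


def rerank(query, candidates, corpus):
    """Stage 2 (stand-in for a neural re-ranker): character-trigram Jaccard."""
    q = trigrams(query)

    def sim(eid):
        c = trigrams(corpus[eid])
        union = q | c
        return len(q & c) / len(union) if union else 0.0

    return sorted(candidates, key=lambda eid: (-sim(eid), eid))


def retrieve(query, corpus, k=3):
    """Run both stages and return evidence IDs, best first."""
    return rerank(query, lexical_search(query, corpus, k), corpus)
```

Because the second stage compares character n-grams rather than whole tokens, a misspelled query such as "iron tablat during pregnency" still ranks the iron-tablet chunk first, provided at least one token survives the lexical stage.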
Evidence‑Conditioned Generation
- An LLM (e.g., GPT‑4‑style) receives the user query plus the top‑k retrieved evidence chunks.
- The prompt forces the model to produce an answer that explicitly cites the evidence.
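One way such a prompt might be assembled is sketched here. The wording of the instructions and the `build_prompt` helper are assumptions for illustration; the paper's actual prompt is not reproduced.

```python
from typing import Dict


def build_prompt(query: str, evidence: Dict[str, str]) -> str:
    """Combine the user query with ID-tagged evidence chunks into one prompt."""
    # Each chunk is prefixed with its evidence ID so the model can cite it.
    evidence_lines = [f"[{eid}] {text}" for eid, text in evidence.items()]
    return (
        "Answer the question using ONLY the evidence below.\n"
        "Cite the evidence ID in square brackets after each claim.\n"
        "If the evidence is insufficient, say so.\n\n"
        "Evidence:\n"
        + "\n".join(evidence_lines)
        + f"\n\nQuestion: {query}\nAnswer:"
    )
```

Tagging every chunk with its ID in the prompt is what makes the citations in the answer checkable afterwards.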
Evaluation Suite
| Benchmark | Description | Metrics |
|---|---|---|
| Triage Benchmark | Human annotators label urgency; model predictions are compared against these labels. | Recall, precision, cost of false negatives vs. false positives |
| Retrieval Benchmark | Synthetic queries with known relevant evidence. | Chunk‑level recall, precision |
| LLM‑as‑Judge | Clinicians design a rubric (accuracy, safety, completeness, empathy); an LLM judge scores system outputs against it. | Composite rubric score |
| Expert Review | Clinicians audit a random sample of end‑to‑end interactions. | Qualitative safety and correctness assessment |
All components are designed to ensure that the system delivers safe, accurate, and empathetic advice while providing transparent evidence citations.
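The triage and retrieval metrics named in the evaluation suite can be computed roughly as below. This is a sketch on toy labels, not the paper's evaluation code; the function names and the example data are hypothetical.

```python
def recall_precision(y_true, y_pred, positive="emergency"):
    """Recall and precision for the positive (emergency) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)  # missed emergencies
    fp = sum(t != positive and p == positive for t, p in pairs)  # over-escalations
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision


def chunk_recall_at_k(retrieved, relevant, k=5):
    """Fraction of relevant evidence chunks found in the top-k retrieved list."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)


if __name__ == "__main__":
    # Toy labels for illustration only (not data from the paper).
    y_true = ["emergency", "emergency", "routine", "routine"]
    y_pred = ["emergency", "routine", "emergency", "routine"]
    print(recall_precision(y_true, y_pred))
    print(chunk_recall_at_k(["E1", "E4", "E2"], ["E1", "E2", "E3"], k=2))
```

Reporting missed emergencies (false negatives) separately from over-escalations (false positives) keeps the two error costs visible instead of folding them into a single accuracy figure.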
Results & Findings
| Component | Metric | Result |
|---|---|---|
| Triage | Emergency recall | 86.7 % (missed‑emergency rate ≈ 13 %) |
| Triage | Over‑escalation (false positives) | Controlled to keep clinician workload manageable |
| Retrieval | Chunk‑level recall (synthetic benchmark) | > 80 % of relevant evidence retrieved in top‑5 |
| Generation | LLM‑as‑judge overall safety score | Comparable to clinician‑crafted templates on 78 % of cases |
| End‑to‑end | Expert validation pass rate | > 90 % of sampled dialogues deemed safe and accurate |
The results show that a layered approach can achieve high safety recall while keeping the chatbot useful for routine queries. Purely generative models without grounding performed noticeably worse on the safety rubric.
Practical Implications
- Scalable maternal‑health support – Health NGOs and tele‑medicine providers can deploy a similar stack to reach underserved pregnant women via basic mobile phones, reducing unnecessary clinic visits.
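- Template‑driven escalation – The triage‑to‑template pipeline offers a low‑cost way to answer flagged emergencies with clinician‑reviewed responses without needing a 24/7 human call‑center. Emergencies the classifier misses (about 13 % on the triage benchmark) do not receive this safeguard.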
- Multilingual robustness – The hybrid retrieval design tolerates code‑mixed Hindi‑English input, a common pattern in Indian vernacular communication, making the approach reusable for other multilingual contexts.
- Evaluation blueprint – The multi‑method benchmark suite can be adopted by AI product teams building high‑stakes conversational agents (e.g., mental‑health bots, legal assistants) to satisfy regulatory and safety requirements.
- Open‑source potential – Curated guideline corpora and the triage benchmark could be released as community resources, accelerating research on trustworthy health chatbots.
Limitations & Future Work
- Limited expert supervision – Evaluation relied on relatively small sets (150 labeled triage cases and 781 LLM‑judged questions), which may not capture the full diversity of real‑world emergencies.
- Geographic specificity – Guidelines and evidence are India‑centric; adapting the system to other regions will require new curated corpora and retraining of the triage model.
- LLM hallucination risk – Although evidence grounding reduces hallucinations, occasional mismatches between cited evidence and generated text were observed, necessitating tighter verification mechanisms.
- User‑experience study – The paper focuses on technical performance; longitudinal user studies to assess trust, adherence, and health outcomes remain an open avenue.
- Automation of evidence labeling – Future work could explore semi‑supervised methods to expand the evidence‑chunk annotations without exhaustive manual labeling.
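As one simple example of the tighter verification mentioned under hallucination risk, a post-generation check could confirm that every cited evidence ID was actually among the retrieved chunks. The sketch below assumes citations look like `[E1]`; that format and the `check_citations` helper are hypothetical, and the check does not verify that the cited text supports the claim.

```python
import re


def check_citations(answer, retrieved_ids):
    """Report which cited evidence IDs were not part of the retrieved set."""
    # Assumed citation format: a letter followed by digits inside square brackets.
    cited = set(re.findall(r"\[([A-Za-z]\d+)\]", answer))
    unknown = cited - set(retrieved_ids)
    return {
        "cited": sorted(cited),
        "unknown": sorted(unknown),      # IDs the model cited but never received
        "has_citation": bool(cited),     # False means the answer is uncited
    }
```

An answer with unknown or missing citations could be regenerated or routed to a fallback response rather than shown to the user.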
Authors
- Benjamin Bellows
- Gretchen Chapman
- Siddhartha Goyal
- Vidhi Jain
- Smriti Jha
- Grace Liu
- Jitender Nagpal
- Sowmya Ramesh
- Aarti Singh
- Bryan Wilder
- Jianyu Xu
Paper Information
| Field | Details |
|---|---|
| arXiv ID | 2603.13168v1 |
| Categories | cs.AI, cs.CL, cs.IR |
| Published | March 13, 2026 |