[Paper] Developing and evaluating a chatbot to support maternal health care

Published: March 13, 2026 at 01:02 PM EDT

Source: arXiv

Overview

A multidisciplinary team built and rigorously evaluated a phone‑based chatbot that delivers maternal‑health information to women in India—especially those with low health literacy and limited access to care. The work tackles the technical hurdles of short, code‑mixed queries, region‑specific medical knowledge, and safe triage, offering a blueprint for trustworthy AI assistants in high‑stakes, low‑resource environments.

Key Contributions

Stage‑aware triage engine – flags high‑risk (emergency) queries and routes them to expert‑crafted response templates.
Hybrid retrieval system – pulls relevant passages from curated maternal and newborn care guidelines, handling multilingual and noisy user input.
Evidence‑conditioned generation – uses a large language model (LLM) that grounds its answers in the retrieved medical evidence.
Comprehensive evaluation workflow covering:
1. A labeled triage benchmark (150 real queries) achieving 86.7 % emergency recall and an explicit analysis of missed‑emergency vs. over‑escalation trade‑offs.
2. A synthetic multi‑evidence retrieval benchmark (100 queries) with chunk‑level relevance labels.
3. LLM‑as‑judge assessments on 781 real user questions, employing clinician‑designed scoring criteria.
4. End‑to‑end expert validation of the deployed system.
Design principle – “defense‑in‑depth”: combine multiple safeguards (triage, retrieval, grounding) and multi‑method evaluation rather than relying on a single model or metric.

Methodology

Data Collection & Curation
- Partnered with a public‑health nonprofit and a hospital.
- Gathered Indian maternal‑health guidelines, FAQs, and real user queries (short, mixed Hindi‑English).
Triaging Layer
- A lightweight classifier predicts the pregnancy stage and urgency level.
- High‑risk cases are automatically escalated to pre‑written, clinician‑reviewed templates.
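The routing logic of the triaging layer can be illustrated with a minimal sketch. The paper summary does not describe the classifier's internals, so a keyword lookup stands in for it here; the term lists, the `RETRIEVE_AND_GENERATE` sentinel, and the placeholder template are all hypothetical.

```python
import re
from dataclasses import dataclass

# Hypothetical term lists; a real system would use a trained classifier.
EMERGENCY_TERMS = {"bleeding", "khoon", "seizure", "unconscious", "behosh"}
STAGE_TERMS = {
    "pregnancy": {"pregnant", "pregnancy", "trimester"},
    "postpartum": {"delivery", "newborn", "baby"},
}
# Placeholder only; the deployed templates are written and reviewed by clinicians.
EMERGENCY_TEMPLATE = "PLACEHOLDER: clinician-reviewed emergency message goes here."


@dataclass
class TriageResult:
    stage: str    # "pregnancy", "postpartum", or "unknown"
    urgent: bool  # True when any emergency term appears


def triage(query: str) -> TriageResult:
    """Predict stage and urgency from the lowercase word tokens of the query."""
    tokens = set(re.findall(r"[a-z]+", query.lower()))
    stage = next((s for s, terms in STAGE_TERMS.items() if tokens & terms), "unknown")
    return TriageResult(stage=stage, urgent=bool(tokens & EMERGENCY_TERMS))


def route(query: str) -> str:
    """Escalate urgent queries to the template; send the rest to retrieval."""
    if triage(query).urgent:
        return EMERGENCY_TEMPLATE
    return "RETRIEVE_AND_GENERATE"
```

The point of the design is that urgent queries never reach the generative model: the escalation path returns fixed, pre-approved text.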
Hybrid Retrieval
- Stage 1 – Lexical Search: Searches the curated guideline corpus.
- Stage 2 – Neural Re‑ranking: Adjusts results for code‑mixing and noisy spelling.
- Retrieved chunks are tagged with evidence IDs.
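The two retrieval stages can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: stage 1 uses a simple IDF-weighted term match, and stage 2 uses character-trigram overlap as a stand-in for the neural re-ranker, chosen only because it tolerates misspellings without needing a model. The corpus texts and evidence IDs are made up.

```python
import math
import re
from collections import Counter

# Toy corpus keyed by evidence ID (hypothetical content).
CORPUS = {
    "E1": "iron and folic acid tablets are recommended during pregnancy",
    "E2": "exclusive breastfeeding is recommended for the first six months",
    "E3": "severe bleeding during pregnancy needs urgent care at a facility",
}


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def lexical_search(query, corpus, k=2):
    """Stage 1: rank chunks by the summed IDF of the query terms they contain."""
    n = len(corpus)
    doc_tokens = {cid: set(tokenize(text)) for cid, text in corpus.items()}
    df = Counter(tok for toks in doc_tokens.values() for tok in toks)
    query_terms = set(tokenize(query))
    scores = {
        cid: sum(math.log(1 + n / df[q]) for q in query_terms if q in toks)
        for cid, toks in doc_tokens.items()
    }
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [cid for cid in ranked if scores[cid] > 0][:k]


def char_ngrams(text, n=3):
    s = " ".join(tokenize(text))
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def rerank(query, candidates, corpus):
    """Stage 2 stand-in: fraction of query trigrams found in each chunk."""
    q = char_ngrams(query)

    def score(cid):
        return len(q & char_ngrams(corpus[cid])) / max(len(q), 1)

    return sorted(candidates, key=score, reverse=True)


def retrieve(query, corpus=CORPUS, k=2):
    """Return evidence IDs: lexical candidates, then re-ranked."""
    return rerank(query, lexical_search(query, corpus, k), corpus)
```

For the code-mixed, misspelled query `"pregnancy me bleding"`, the lexical stage can only match on `pregnancy` and so cannot separate `E1` from `E3`; the re-ranking stage moves `E3` to the top because `bleding` shares most of its trigrams with `bleeding`.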
Evidence‑Conditioned Generation
- An LLM (e.g., GPT‑4‑style) receives the user query plus the top‑k retrieved evidence chunks.
- The prompt forces the model to produce an answer that explicitly cites the evidence.
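A prompt of this kind might be assembled roughly as below. The wording of the instructions and the `build_prompt` helper are assumptions for illustration; the summary does not reproduce the actual prompt.

```python
def build_prompt(query: str, evidence: list[tuple[str, str]], k: int = 3) -> str:
    """Build an evidence-conditioned prompt from (evidence_id, text) pairs.

    Only the top-k chunks are included, each prefixed by its ID so the
    model can cite it.
    """
    lines = [
        "Answer the question using ONLY the evidence below.",
        "Cite the evidence ID in square brackets after each claim, e.g. [E1].",
        "If the evidence does not answer the question, say so.",
        "",
        "Evidence:",
    ]
    for evidence_id, text in evidence[:k]:
        lines.append(f"[{evidence_id}] {text}")
    lines += ["", f"Question: {query}", "Answer:"]
    return "\n".join(lines)


prompt = build_prompt(
    "iron tablet kab lena hai?",
    [("E1", "iron and folic acid tablets are recommended during pregnancy")],
)
```

Keeping the evidence IDs in the prompt is what makes the citations in the answer checkable afterwards.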

Evaluation Suite

Benchmark	Description	Metrics
Triage Benchmark	Human annotators label urgency; model predictions are compared.	Recall, precision, cost of false negatives vs. false positives
Retrieval Benchmark	Synthetic queries with known relevant evidence.	Chunk‑level recall, precision
LLM‑as‑Judge	A judge LLM scores system outputs against a clinician‑designed rubric (accuracy, safety, completeness, empathy).	Composite rubric score
Expert Review	Clinicians audit a random sample of end‑to‑end interactions.	Qualitative safety and correctness assessment

All components are designed to ensure that the system delivers safe, accurate, and empathetic advice while providing transparent evidence citations.
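For readers unfamiliar with the metrics in the table, the sketch below computes emergency recall and precision from binary labels and chunk-level recall within the top-k retrieved chunks. These are the standard definitions; the function names and the toy inputs in the usage lines are illustrative and are not data from the paper.

```python
def emergency_recall_precision(gold, pred):
    """gold/pred are parallel lists of booleans (True = emergency)."""
    tp = sum(g and p for g, p in zip(gold, pred))
    fn = sum(g and not p for g, p in zip(gold, pred))  # missed emergencies
    fp = sum(p and not g for g, p in zip(gold, pred))  # over-escalations
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision


def chunk_recall_at_k(retrieved, relevant, k=5):
    """Fraction of the relevant chunk IDs that appear in the top-k retrieved."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


# Toy usage (made-up labels):
recall, precision = emergency_recall_precision(
    gold=[True, True, True, False, False],
    pred=[True, True, False, True, False],
)
coverage = chunk_recall_at_k(["E3", "E7", "E1"], {"E1", "E2"}, k=5)
```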

Results & Findings

Component	Metric	Result
Triage	Emergency recall	86.7 % (missed‑emergency rate ≈ 13.3 %)
	Over‑escalation (false positives)	Controlled to keep clinician workload manageable
Retrieval	Chunk‑level recall (synthetic benchmark)	> 80 % of relevant evidence retrieved in top‑5
Generation	LLM‑as‑judge overall safety score	Comparable to clinician‑crafted templates on 78 % of cases
End‑to‑end	Expert validation pass rate	> 90 % of sampled dialogues deemed safe and accurate

The results show that a layered approach can achieve high safety recall while keeping the chatbot useful for routine queries. Purely generative models without grounding performed noticeably worse on the safety rubric.
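The trade-off between missed emergencies and over-escalation can be made concrete with a threshold sweep. The sketch assumes the triage classifier emits an urgency score that is compared against a threshold, which the summary does not state; the scores, labels, and cost weights are made up for illustration.

```python
def sweep(scores, gold, thresholds, cost_fn=10.0, cost_fp=1.0):
    """For each threshold, count missed emergencies and over-escalations.

    cost_fn and cost_fp are hypothetical weights expressing that a missed
    emergency is worse than an unnecessary escalation.
    """
    rows = []
    for t in thresholds:
        pred = [s >= t for s in scores]
        missed = sum(g and not p for g, p in zip(gold, pred))
        over = sum(p and not g for g, p in zip(gold, pred))
        rows.append((t, missed, over, cost_fn * missed + cost_fp * over))
    return rows


def best_threshold(rows):
    """Pick the threshold with the lowest weighted cost."""
    return min(rows, key=lambda r: r[3])[0]


rows = sweep(
    scores=[0.9, 0.6, 0.4, 0.7, 0.2, 0.1],
    gold=[True, True, True, False, False, False],
    thresholds=[0.3, 0.5, 0.8],
)
```

With these toy inputs, lowering the threshold removes every missed emergency at the price of one over-escalation, so the weighted cost favors the lowest threshold. Changing the cost weights shifts that choice.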

Practical Implications

Scalable maternal‑health support – Health NGOs and tele‑medicine providers can deploy a similar stack to reach underserved pregnant women via basic mobile phones, reducing unnecessary clinic visits.
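Template‑driven escalation – The triage‑to‑template pipeline offers a low‑cost way to give flagged emergencies a clinician‑reviewed response without needing a 24/7 human call‑center. It is not a guarantee: at 86.7 % emergency recall, some emergencies are still missed.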
Multilingual robustness – The hybrid retrieval design tolerates code‑mixed Hindi‑English input, a common pattern in Indian vernacular communication, making the approach reusable for other multilingual contexts.
Evaluation blueprint – The multi‑method benchmark suite can be adopted by AI product teams building high‑stakes conversational agents (e.g., mental‑health bots, legal assistants) to satisfy regulatory and safety requirements.
Open‑source potential – Curated guideline corpora and the triage benchmark could be released as community resources, accelerating research on trustworthy health chatbots.

Limitations & Future Work

Limited expert supervision – Evaluation relied on a relatively small labeled triage set (150 queries) and on 781 LLM‑judged user questions scored against clinician‑designed criteria, which may not capture the full diversity of real‑world emergencies.
Geographic specificity – Guidelines and evidence are India‑centric; adapting the system to other regions will require new curated corpora and retraining of the triage model.
LLM hallucination risk – Although evidence grounding reduces hallucinations, occasional mismatches between cited evidence and generated text were observed, necessitating tighter verification mechanisms.
User‑experience study – The paper focuses on technical performance; longitudinal user studies to assess trust, adherence, and health outcomes remain an open avenue.
Automation of evidence labeling – Future work could explore semi‑supervised methods to expand the evidence‑chunk annotations without exhaustive manual labeling.
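One simple form the "tighter verification" mentioned under hallucination risk could take is a structural citation check, sketched below. It is an assumption of this write-up, not something the paper describes: it only confirms that every cited ID was actually retrieved and that every sentence carries a citation. It does not test whether the cited evidence supports the sentence. The `[E<number>]` citation format is also hypothetical.

```python
import re


def check_citations(answer: str, retrieved_ids: set[str]) -> dict:
    """Flag cited IDs that were never retrieved and sentences with no citation."""
    cited = set(re.findall(r"\[(E\d+)\]", answer))
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    uncited = [s for s in sentences if not re.search(r"\[E\d+\]", s)]
    return {
        "unknown_ids": sorted(cited - retrieved_ids),
        "uncited_sentences": uncited,
    }


report = check_citations(
    "Iron tablets are recommended [E1]. Drink warm milk at night. "
    "See a doctor for bleeding [E9].",
    retrieved_ids={"E1", "E3"},
)
```

Here `report` flags `E9` as an ID that was never retrieved and the middle sentence as uncited; either signal could trigger regeneration or a fallback response.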

Authors

Benjamin Bellows
Gretchen Chapman
Siddhartha Goyal
Vidhi Jain
Smriti Jha
Grace Liu
Jitender Nagpal
Sowmya Ramesh
Aarti Singh
Bryan Wilder
Jianyu Xu

Paper Information

Field	Details
arXiv ID	`2603.13168v1`
Categories	`cs.AI`, `cs.CL`, `cs.IR`
Published	March 13, 2026
