[Paper] Developing and evaluating a chatbot to support maternal health care

Published: March 13, 2026 at 01:02 PM EDT
Source: arXiv

Source: arXiv:2603.13168v1

Overview

A multidisciplinary team built and evaluated a phone‑based chatbot that delivers maternal‑health information to women in India, especially those with low health literacy and limited access to care. The work tackles three technical hurdles (short, code‑mixed queries; region‑specific medical knowledge; and safe triage) and offers a blueprint for trustworthy AI assistants in high‑stakes, low‑resource environments.

Key Contributions

  • Stage‑aware triage engine – flags high‑risk (emergency) queries and routes them to expert‑crafted response templates.
  • Hybrid retrieval system – pulls relevant passages from curated maternal and newborn care guidelines, handling multilingual and noisy user input.
  • Evidence‑conditioned generation – uses a large language model (LLM) that grounds its answers in the retrieved medical evidence.
  • Comprehensive evaluation workflow covering:
    1. A labeled triage benchmark (150 real queries) achieving 86.7 % emergency recall and an explicit analysis of missed‑emergency vs. over‑escalation trade‑offs.
    2. A synthetic multi‑evidence retrieval benchmark (100 queries) with chunk‑level relevance labels.
    3. LLM‑as‑judge assessments on 781 real user questions, employing clinician‑designed scoring criteria.
    4. End‑to‑end expert validation of the deployed system.
  • Design principle – “defense‑in‑depth”: combine multiple safeguards (triage, retrieval, grounding) and multi‑method evaluation rather than relying on a single model or metric.
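
The triage‑to‑template routing described in the first contribution can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the cue words, the template text, and the `triage` and `answer` functions are hypothetical, and a keyword match stands in for the paper's trained classifier (the pregnancy‑stage prediction is omitted for brevity).

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical cue words and template text, for illustration only.
# A deployed system would use a trained classifier and clinician-reviewed templates.
EMERGENCY_CUES = {"bleeding", "seizure", "unconscious"}
TEMPLATES = {
    "emergency": "This may be an emergency. Please go to the nearest health facility now.",
}


@dataclass
class Route:
    layer: str      # "template" for escalated queries, "rag" for routine ones
    response: str


def triage(query: str) -> str:
    """Return "emergency" if any cue word appears in the query, else "routine"."""
    words = {w.strip(".,!?") for w in query.lower().split()}
    return "emergency" if words & EMERGENCY_CUES else "routine"


def answer(query: str, rag_answer: Callable[[str], str]) -> Route:
    """First safeguard: triage. Only routine queries reach retrieval and generation."""
    if triage(query) == "emergency":
        return Route("template", TEMPLATES["emergency"])
    return Route("rag", rag_answer(query))
```

Because escalated queries never reach the LLM, a generation failure cannot affect them; the remaining risk sits in emergencies that the triage step fails to flag.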

Methodology

  1. Data Collection & Curation

    • Partnered with a public‑health nonprofit and a hospital.
    • Gathered Indian maternal‑health guidelines, FAQs, and real user queries (short, mixed Hindi‑English).
  2. Triaging Layer

    • A lightweight classifier predicts the pregnancy stage and urgency level.
    • High‑risk cases are automatically escalated to pre‑written, clinician‑reviewed templates.
  3. Hybrid Retrieval

    • Stage 1 – Lexical Search: Searches the curated guideline corpus.
    • Stage 2 – Neural Re‑ranking: Adjusts results for code‑mixing and noisy spelling.
    • Retrieved chunks are tagged with evidence IDs.
  4. Evidence‑Conditioned Generation

    • An LLM (e.g., GPT‑4‑style) receives the user query plus the top‑k retrieved evidence chunks.
    • The prompt instructs the model to answer only from the retrieved evidence and to cite it explicitly.
  5. Evaluation Suite

    | Benchmark | Description | Metrics |
    | --- | --- | --- |
    | Triage Benchmark | Human annotators label urgency; model predictions are compared against the labels. | Recall, precision, cost of false negatives vs. false positives |
    | Retrieval Benchmark | Synthetic queries with known relevant evidence. | Chunk‑level recall, precision |
    | LLM‑as‑Judge | Clinicians design a rubric (accuracy, safety, completeness, empathy); an LLM judge then scores the outputs automatically against it. | Composite rubric score |
    | Expert Review | Clinicians audit a random sample of end‑to‑end interactions. | Qualitative safety and correctness assessment |
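
The chunk‑level metrics listed for the retrieval benchmark can be computed roughly as follows. This is a sketch under assumptions: the function name `chunk_metrics`, the chunk IDs, and the cutoff `k` are illustrative and not taken from the paper.

```python
def chunk_metrics(retrieved: list[str], relevant: set[str], k: int = 5) -> tuple[float, float]:
    """Return (recall@k, precision@k) for one query.

    retrieved: ranked chunk IDs returned by the retriever (assumed unique).
    relevant:  chunk IDs labeled relevant for this query.
    """
    top_k = retrieved[:k]
    hits = len(set(top_k) & relevant)
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(top_k) if top_k else 0.0
    return recall, precision


# Made-up example: 2 of the 3 relevant chunks appear in the top 5.
recall, precision = chunk_metrics(["c1", "c7", "c3", "c9", "c2"], {"c1", "c2", "c4"})
```

Averaging these per‑query values over all benchmark queries gives the aggregate figures; multi‑evidence queries are the reason recall is measured per chunk rather than per query.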

All components are designed to ensure that the system delivers safe, accurate, and empathetic advice while providing transparent evidence citations.
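
The retrieval and generation steps above can be sketched end to end. This is a toy illustration, not the authors' implementation: token overlap stands in for the lexical search, character‑trigram similarity stands in for the neural re‑ranker (chosen only because it tolerates spelling noise), and the corpus, evidence IDs, and prompt wording are all hypothetical.

```python
def lexical_search(query: str, corpus: dict[str, str], n: int = 10) -> list[str]:
    """Stage 1: rank chunks by the number of query tokens they share."""
    q = set(query.lower().split())
    scored = [(len(q & set(text.lower().split())), cid) for cid, text in corpus.items()]
    ranked = sorted(scored, key=lambda s: (-s[0], s[1]))
    return [cid for score, cid in ranked if score > 0][:n]


def trigrams(text: str) -> set[str]:
    s = text.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}


def rerank(query: str, candidates: list[str], corpus: dict[str, str], k: int = 2) -> list[str]:
    """Stage 2 (stand-in for a neural re-ranker): Jaccard similarity of character trigrams."""
    q = trigrams(query)

    def sim(cid: str) -> float:
        c = trigrams(corpus[cid])
        union = q | c
        return len(q & c) / len(union) if union else 0.0

    return sorted(candidates, key=lambda cid: (-sim(cid), cid))[:k]


def build_prompt(query: str, evidence_ids: list[str], corpus: dict[str, str]) -> str:
    """Assemble an evidence-conditioned prompt; each chunk keeps its evidence ID."""
    evidence = "\n".join(f"[{cid}] {corpus[cid]}" for cid in evidence_ids)
    return (
        "Answer using ONLY the evidence below and cite evidence IDs in brackets.\n"
        f"Evidence:\n{evidence}\n"
        f"Question: {query}\nAnswer:"
    )


# Hypothetical guideline chunks, for illustration only.
corpus = {
    "E1": "Take iron and folic acid tablets daily during pregnancy",
    "E2": "Newborn should be breastfed within one hour of birth",
    "E3": "Iron rich foods include spinach and lentils",
}
query = "iron tablet kab lena hai pregnancy me"
top = rerank(query, lexical_search(query, corpus), corpus)
prompt = build_prompt(query, top, corpus)
```

The prompt string would then be sent to whichever LLM the deployment uses; that call is left out because the summary does not name a specific model or API.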

Results & Findings

| Component | Metric | Result |
| --- | --- | --- |
| Triage | Emergency recall | 86.7 % (missed‑emergency rate ≈ 13 %) |
| Triage | Over‑escalation (false positives) | Controlled to keep clinician workload manageable |
| Retrieval | Chunk‑level recall (synthetic benchmark) | > 80 % of relevant evidence retrieved in top‑5 |
| Generation | LLM‑as‑judge overall safety score | Comparable to clinician‑crafted templates on 78 % of cases |
| End‑to‑end | Expert validation pass rate | > 90 % of sampled dialogues deemed safe and accurate |

The results show that a layered approach can achieve high safety recall while keeping the chatbot useful for routine queries. Purely generative models without grounding performed noticeably worse on the safety rubric.
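
The relationship between emergency recall and the missed‑emergency rate, and the trade‑off against over‑escalation, can be made concrete with a short calculation. The confusion counts and cost weights below are hypothetical, chosen only so that recall lands near the reported 86.7 %; this summary does not give the paper's actual counts.

```python
def triage_metrics(tp: int, fn: int, fp: int,
                   cost_fn: float = 10.0, cost_fp: float = 1.0) -> dict[str, float]:
    """Recall, miss rate, and precision for the emergency class, plus a weighted error cost.

    tp: emergencies correctly flagged; fn: emergencies missed;
    fp: routine queries escalated unnecessarily.
    cost_fn and cost_fp are illustrative weights, not values from the paper.
    """
    emergencies = tp + fn
    flagged = tp + fp
    recall = tp / emergencies if emergencies else 0.0
    precision = tp / flagged if flagged else 0.0
    return {
        "recall": recall,
        "miss_rate": 1.0 - recall if emergencies else 0.0,
        "precision": precision,
        "weighted_cost": cost_fn * fn + cost_fp * fp,
    }


# Hypothetical counts: 30 true emergencies, 26 caught, 10 routine queries over-escalated.
m = triage_metrics(tp=26, fn=4, fp=10)
print(f"recall={m['recall']:.3f} miss_rate={m['miss_rate']:.3f} precision={m['precision']:.3f}")
# prints: recall=0.867 miss_rate=0.133 precision=0.722
```

Weighting a missed emergency more heavily than an unnecessary escalation is one simple way to express that the two error types are not equally costly; lowering the classifier threshold trades more false positives for fewer false negatives.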

Practical Implications

  • Scalable maternal‑health support – Health NGOs and tele‑medicine providers can deploy a similar stack to reach underserved pregnant women via basic mobile phones, reducing unnecessary clinic visits.
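  • Template‑driven escalation – The triage‑to‑template pipeline offers a low‑cost way to give flagged emergencies a clinician‑reviewed response without a 24/7 human call center, although emergencies the classifier misses (about 13 % on the triage benchmark) do not receive this safeguard.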
  • Multilingual robustness – The hybrid retrieval design tolerates code‑mixed Hindi‑English input, a common pattern in Indian vernacular communication, making the approach reusable for other multilingual contexts.
  • Evaluation blueprint – The multi‑method benchmark suite can be adopted by AI product teams building high‑stakes conversational agents (e.g., mental‑health bots, legal assistants) to satisfy regulatory and safety requirements.
  • Open‑source potential – Curated guideline corpora and the triage benchmark could be released as community resources, accelerating research on trustworthy health chatbots.

Limitations & Future Work

  • Limited expert supervision – Evaluation relied on relatively small sets (150 labeled triage queries and 781 LLM‑judged user questions), which may not capture the full diversity of real‑world emergencies.
  • Geographic specificity – Guidelines and evidence are India‑centric; adapting the system to other regions will require new curated corpora and retraining of the triage model.
  • LLM hallucination risk – Although evidence grounding reduces hallucinations, occasional mismatches between cited evidence and generated text were observed, necessitating tighter verification mechanisms.
  • User‑experience study – The paper focuses on technical performance; longitudinal user studies to assess trust, adherence, and health outcomes remain an open avenue.
  • Automation of evidence labeling – Future work could explore semi‑supervised methods to expand the evidence‑chunk annotations without exhaustive manual labeling.

Authors

  • Benjamin Bellows
  • Gretchen Chapman
  • Siddhartha Goyal
  • Vidhi Jain
  • Smriti Jha
  • Grace Liu
  • Jitender Nagpal
  • Sowmya Ramesh
  • Aarti Singh
  • Bryan Wilder
  • Jianyu Xu

Paper Information

| Field | Details |
| --- | --- |
| arXiv ID | 2603.13168v1 |
| Categories | cs.AI, cs.CL, cs.IR |
| Published | March 13, 2026 |