[Paper] HealthNLP_Retrievers at ArchEHR-QA 2026: Cascaded LLM Pipeline for Grounded Clinical Question Answering

Published: April 29, 2026 at 12:47 PM EDT
Source: arXiv


Overview

The HealthNLP_Retrievers team tackled the ArchEHR‑QA 2026 shared task, which asks systems to answer patient‑authored questions using evidence drawn directly from electronic health records (EHRs). Their solution is a cascaded pipeline that stitches together a few‑shot LLM (Gemini 2.5 Pro) with classic heuristic ranking, yielding answers that are both clinically accurate and tightly grounded in the source notes.

Key Contributions

  • Four‑stage cascaded architecture that cleanly separates query reformulation, evidence retrieval, answer generation, and answer‑evidence alignment.
  • Few‑shot query reformulation module that condenses verbose patient questions into concise, LLM‑friendly prompts.
  • Heuristic evidence scorer that prioritizes recall when ranking individual clinical sentences, ensuring the downstream generator sees all plausible facts.
  • Grounded response generator that is restricted to citing only the retrieved evidence, producing professional‑grade answers while minimizing hallucination.
  • Many‑to‑many alignment framework that maps each generated answer sentence to its supporting clinical sentences, enabling transparent traceability.
  • Open‑source release of the full pipeline, facilitating reproducibility and community extensions.

Methodology

  1. Query Reformulation – Using Gemini 2.5 Pro in a few‑shot setting, the system rewrites a patient’s often‑long, lay‑language question into a short, medically precise query (e.g., “What is my latest HbA1c value?”).
  2. Evidence Scoring – All sentences from the target EHR note are scored with a lightweight heuristic (keyword overlap, semantic similarity via embeddings, and section‑type weighting). The top‑k sentences are passed forward, maximizing recall.
  3. Grounded Answer Generation – The same LLM receives the reformulated query plus the selected evidence block and is instructed to use only that evidence when forming the answer. A “ground‑only” prompt template suppresses free‑form hallucination.
  4. Answer‑Evidence Alignment – After generation, a many‑to‑many alignment step matches each answer sentence to one or more source sentences using cosine similarity on sentence embeddings, producing a transparent evidence map that can be displayed in patient portals.
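Stages 2 and 3 can be sketched in a few lines of Python. This is a dependency‑free illustration, not the team's actual code: the section weights and the prompt wording are hypothetical, and the real scorer also mixes in embedding‑based semantic similarity, which plain lexical overlap stands in for here.

```python
from collections import Counter

# Hypothetical section weights -- the paper does not publish its exact values.
SECTION_WEIGHTS = {"results": 1.3, "assessment": 1.2, "plan": 1.1, "history": 1.0}

def tokenize(text):
    return [t.lower().strip(".,;:?") for t in text.split()]

def keyword_overlap(query, sentence):
    """Fraction of query tokens that also appear in the sentence."""
    q, s = Counter(tokenize(query)), Counter(tokenize(sentence))
    return sum((q & s).values()) / max(1, len(tokenize(query)))

def score_sentence(query, sentence, section):
    # The submitted pipeline also adds embedding cosine similarity; lexical
    # overlap stands in here to keep the sketch dependency-free.
    return keyword_overlap(query, sentence) * SECTION_WEIGHTS.get(section, 1.0)

def top_k_evidence(query, note_sentences, k=5):
    """note_sentences: list of (sentence, section) pairs; returns top-k sentences."""
    ranked = sorted(note_sentences,
                    key=lambda pair: score_sentence(query, *pair),
                    reverse=True)
    return [sentence for sentence, _ in ranked[:k]]

# A "ground-only" template in the spirit of stage 3 (wording is illustrative):
GROUND_ONLY_PROMPT = """Answer the question using ONLY the evidence sentences below.
If the evidence does not contain the answer, say so explicitly.

Question: {query}
Evidence:
{evidence}
"""
```

Keeping k generous at this stage reflects the paper's recall‑first design: a false positive costs the generator a little context, while a false negative removes a fact it can never recover.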

The pipeline is orchestrated with simple Python wrappers and runs end‑to‑end on a single GPU, making it practical for integration into existing health‑tech stacks.
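The final alignment stage can be sketched similarly. The threshold value is illustrative, and the bag‑of‑words vectors below are a stand‑in for the learned sentence embeddings the system actually uses; any encoder can replace `bow_vector`.

```python
import math
from collections import Counter

def bow_vector(text):
    # Stand-in embedding: a bag-of-words count vector. The submitted system
    # uses learned sentence embeddings instead.
    return Counter(w.lower().strip(".,;:?") for w in text.split())

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def align(answer_sentences, evidence_sentences, threshold=0.3):
    """Many-to-many map: each answer sentence -> every evidence sentence
    whose similarity clears the (illustrative) threshold."""
    evidence_vecs = [bow_vector(e) for e in evidence_sentences]
    return {
        ans: [evidence_sentences[i]
              for i, vec in enumerate(evidence_vecs)
              if cosine(bow_vector(ans), vec) >= threshold]
        for ans in answer_sentences
    }
```

Because one answer sentence may draw on several note sentences (and one note sentence may support several answers), the output is a dictionary of lists rather than a one‑to‑one pairing.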

Results & Findings

Track                       | System Rank
Question Interpretation     | 1st
Answer Generation           | 5th
Evidence Identification     | 7th
Answer‑Evidence Alignment   | 9th

Key takeaways

  • The query reformulation stage was the strongest differentiator, achieving top performance in interpreting patient language.
  • Grounding the LLM with a strict evidence‑only constraint dramatically reduced hallucinations, though it modestly impacted answer fluency (reflected in the 5th place ranking).
  • The heuristic scorer, while simple, delivered high recall, ensuring the generator never missed critical facts.
  • The alignment module, though lower‑ranked, provided valuable auditability—a must‑have for clinical deployments.

Practical Implications

  • Patient Portals: Deploying this pipeline can turn raw EHR data into patient‑friendly Q&A widgets that explain lab results, medication changes, or care plans without exposing raw clinical jargon.
  • Clinical Decision Support: Clinicians can use the same system to quickly retrieve evidence‑backed answers to peer questions, reducing time spent scrolling through notes.
  • Regulatory Compliance: The explicit evidence‑answer mapping satisfies audit requirements (e.g., FDA’s “explainability” guidelines) and can be logged for medico‑legal records.
  • Scalable Integration: Because the heavy lifting is done by a single LLM call per query and the rest is lightweight heuristics, the solution can be containerized and scaled on cloud GPU instances or even on‑premise inference servers.

Limitations & Future Work

  • Evidence Scorer Simplicity: The heuristic ranking may miss nuanced context (e.g., temporal relations) that a learned retriever could capture.
  • Answer Fluency vs. Grounding Trade‑off: Enforcing strict grounding sometimes yields terse or stilted responses; future work could explore soft grounding penalties or hybrid generation strategies.
  • Domain Generalization: The system was tuned on the ArchEHR‑QA dataset; broader validation across diverse EHR systems (different note structures, coding standards) is needed.
  • Privacy & Security: While the pipeline runs on de‑identified data in the shared task, real‑world deployment must incorporate robust PHI safeguards and audit trails.

The HealthNLP_Retrievers team’s code is publicly available on GitHub, inviting developers to experiment, extend, and bring grounded clinical QA to the next generation of health applications.

Authors

  • Md Biplob Hosen
  • Md Alomgeer Hussein
  • Md Akmol Masud
  • Omar Faruque
  • Tera L Reynolds
  • Lujie Karen Chen

Paper Information

  • arXiv ID: 2604.26880v1
  • Categories: cs.CL, cs.LG
  • Published: April 29, 2026