[Paper] HealthNLP_Retrievers at ArchEHR-QA 2026: Cascaded LLM Pipeline for Grounded Clinical Question Answering
Source: arXiv - 2604.26880v1
Overview
The HealthNLP_Retrievers team tackled the ArchEHR‑QA 2026 shared task, which asks systems to answer patient‑authored questions using evidence drawn directly from electronic health records (EHRs). Their solution is a cascaded pipeline that stitches together a few‑shot LLM (Gemini 2.5 Pro) with classic heuristic ranking, yielding answers that are both clinically accurate and tightly grounded in the source notes.
Key Contributions
- Four‑stage cascaded architecture that cleanly separates query reformulation, evidence retrieval, answer generation, and answer‑evidence alignment.
- Few‑shot query reformulation module that condenses verbose patient questions into concise, LLM‑friendly prompts.
- Heuristic evidence scorer that prioritizes recall when ranking individual clinical sentences, ensuring the downstream generator sees all plausible facts.
- Grounded response generator that is constrained to cite only the retrieved evidence, producing professional‑grade answers with markedly fewer hallucinations.
- Many‑to‑many alignment framework that maps each generated answer sentence to its supporting clinical sentences, enabling transparent traceability.
- Open‑source release of the full pipeline, facilitating reproducibility and community extensions.
Methodology
- Query Reformulation – Using Gemini 2.5 Pro in a few‑shot setting, the system rewrites a patient’s often‑long, lay‑language question into a short, medically precise query (e.g., “What is my latest HbA1c value?”).
- Evidence Scoring – All sentences from the target EHR note are scored with a lightweight heuristic (keyword overlap, semantic similarity via embeddings, and section‑type weighting). The top‑k sentences are passed forward, maximizing recall.
- Grounded Answer Generation – The same LLM receives the reformulated query plus the selected evidence block and is instructed to use only that evidence when forming the answer. A “ground‑only” prompt template blocks free‑form hallucination.
- Answer‑Evidence Alignment – After generation, a many‑to‑many alignment step matches each answer sentence to one or more source sentences using cosine similarity on sentence embeddings, producing a transparent evidence map that can be displayed in patient portals.
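The evidence‑scoring heuristic described above can be sketched as follows. The weights, the section bonuses, and the bag‑of‑words cosine (standing in for learned sentence embeddings) are illustrative assumptions, not the authors' exact implementation:

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical section weights: lab and assessment sections are favored.
SECTION_WEIGHT = {"labs": 1.2, "assessment": 1.1, "history": 1.0, "admin": 0.8}

def score_sentence(query, sentence, section):
    """Blend keyword overlap and a semantic proxy, then apply the section weight."""
    q, s = Counter(tokenize(query)), Counter(tokenize(sentence))
    overlap = len(set(q) & set(s)) / max(len(set(q)), 1)  # keyword overlap
    sim = cosine(q, s)                                    # embedding stand-in
    return (0.5 * overlap + 0.5 * sim) * SECTION_WEIGHT.get(section, 1.0)

def top_k_evidence(query, note_sentences, k=5):
    """Rank (sentence, section) pairs and keep the top-k for the generator."""
    ranked = sorted(note_sentences,
                    key=lambda pair: score_sentence(query, *pair),
                    reverse=True)
    return [sent for sent, _ in ranked[:k]]
```

Because the blend is recall‑oriented, a generous `k` is chosen so that borderline sentences still reach the generator rather than being filtered out early.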
The pipeline is orchestrated with simple Python wrappers and runs end‑to‑end on a single GPU, making it practical for integration into existing health‑tech stacks.
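End to end, the four stages can be wired together with a thin wrapper like the sketch below. Here `call_llm` is a placeholder for the Gemini 2.5 Pro API call, `retrieve` is the evidence scorer from stage two, and the prompt wording and alignment threshold are illustrative assumptions:

```python
import math
import re
from collections import Counter

# Hypothetical "ground-only" prompt template (stage 3).
GROUND_ONLY_TEMPLATE = (
    "Answer the question using ONLY the numbered evidence sentences below.\n"
    "Cite the sentence numbers you rely on. If the evidence is insufficient, "
    "say so.\n\nQuestion: {query}\nEvidence:\n{evidence}\nAnswer:"
)

def bow(text):
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def align(answer_sentences, evidence_sentences, threshold=0.3):
    """Many-to-many map: answer sentence index -> supporting evidence indices."""
    return {
        i: [j for j, ev in enumerate(evidence_sentences)
            if cosine(bow(ans), bow(ev)) >= threshold]
        for i, ans in enumerate(answer_sentences)
    }

def answer_question(patient_question, note_sentences, call_llm, retrieve):
    """Stages 1-4: reformulate, retrieve, generate grounded answer, align."""
    query = call_llm(f"Rewrite as a concise clinical query: {patient_question}")
    evidence = retrieve(query, note_sentences)
    numbered = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(evidence))
    answer = call_llm(GROUND_ONLY_TEMPLATE.format(query=query, evidence=numbered))
    answer_sents = [s.strip() for s in answer.split(".") if s.strip()]
    return answer, align(answer_sents, evidence)
```

The alignment step is what makes the output auditable: the returned map can be rendered in a patient portal so each answer sentence links back to the note sentences that support it.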
Results & Findings
| Track | System Rank |
|---|---|
| Question Interpretation | 1st |
| Answer Generation | 5th |
| Evidence Identification | 7th |
| Answer‑Evidence Alignment | 9th |
Key takeaways
- The query reformulation stage was the strongest differentiator, achieving top performance in interpreting patient language.
- Grounding the LLM with a strict evidence‑only constraint dramatically reduced hallucinations, though it modestly impacted answer fluency (reflected in the 5th place ranking).
- The heuristic scorer, while simple, delivered high recall, greatly reducing the risk that the generator missed critical facts.
- The alignment module, though lower‑ranked, provided valuable auditability—a must‑have for clinical deployments.
Practical Implications
- Patient Portals: Deploying this pipeline can turn raw EHR data into patient‑friendly Q&A widgets that explain lab results, medication changes, or care plans without exposing raw clinical jargon.
- Clinical Decision Support: Clinicians can use the same system to quickly retrieve evidence‑backed answers to peer questions, reducing time spent scrolling through notes.
- Regulatory Compliance: The explicit evidence‑answer mapping supports the explainability and auditability expectations of regulators such as the FDA, and can be logged for medico‑legal records.
- Scalable Integration: Because the heavy lifting is done by a single LLM call per query and the rest is lightweight heuristics, the solution can be containerized and scaled on cloud GPU instances or even on‑premise inference servers.
Limitations & Future Work
- Evidence Scorer Simplicity: The heuristic ranking may miss nuanced context (e.g., temporal relations) that a learned retriever could capture.
- Answer Fluency vs. Grounding Trade‑off: Enforcing strict grounding sometimes yields terse or stilted responses; future work could explore soft grounding penalties or hybrid generation strategies.
- Domain Generalization: The system was tuned on the ArchEHR‑QA dataset; broader validation across diverse EHR systems (different note structures, coding standards) is needed.
- Privacy & Security: While the pipeline runs on de‑identified data in the shared task, real‑world deployment must incorporate robust PHI safeguards and audit trails.
The HealthNLP_Retrievers team’s code is publicly available on GitHub, inviting developers to experiment, extend, and bring grounded clinical QA to the next generation of health applications.
Authors
- Md Biplob Hosen
- Md Alomgeer Hussein
- Md Akmol Masud
- Omar Faruque
- Tera L Reynolds
- Lujie Karen Chen
Paper Information
- arXiv ID: 2604.26880v1
- Categories: cs.CL, cs.LG
- Published: April 29, 2026