[Paper] HealthNLP_Retrievers at ArchEHR-QA 2026: Cascaded LLM Pipeline for Grounded Clinical Question Answering
Source: arXiv - 2604.26880v1
Overview
The HealthNLP_Retrievers team tackled the ArchEHR‑QA 2026 shared task, which asks systems to answer patient‑authored questions using evidence drawn directly from electronic health records (EHRs). Their solution is a cascaded pipeline that stitches together a few‑shot LLM (Gemini 2.5 Pro) with classic heuristic ranking, yielding answers that are both clinically accurate and tightly grounded in the source notes.
Key Contributions
- Four‑stage cascaded architecture that cleanly separates query reformulation, evidence retrieval, answer generation, and answer‑evidence alignment.
- Few‑shot query reformulation module that condenses verbose patient questions into concise, LLM‑friendly prompts.
- Heuristic evidence scorer that prioritizes recall when ranking individual clinical sentences, ensuring the downstream generator sees all plausible facts.
- Grounded response generator that is constrained to cite only the retrieved evidence, producing professional‑grade answers with markedly fewer hallucinations.
- Many‑to‑many alignment framework that maps each generated answer sentence to its supporting clinical sentences, enabling transparent traceability.
- Open‑source release of the full pipeline, facilitating reproducibility and community extensions.
Methodology
- Query Reformulation – Using Gemini 2.5 Pro in a few‑shot setting, the system rewrites a patient’s often‑long, lay‑language question into a short, medically precise query (e.g., “What is my latest HbA1c value?”).
- Evidence Scoring – All sentences from the target EHR note are scored with a lightweight heuristic (keyword overlap, semantic similarity via embeddings, and section‑type weighting). The top‑k sentences are passed forward, maximizing recall.
- Grounded Answer Generation – The same LLM receives the reformulated query plus the selected evidence block and is instructed to use only that evidence when forming the answer. A “ground‑only” prompt template blocks free‑form hallucination.
- Answer‑Evidence Alignment – After generation, a many‑to‑many alignment step matches each answer sentence to one or more source sentences using cosine similarity on sentence embeddings, producing a transparent evidence map that can be displayed in patient portals.
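The evidence‑scoring heuristic described above can be sketched as follows. The weights, the section bonuses, and the bag‑of‑words cosine (standing in for learned sentence embeddings) are illustrative assumptions, not the authors' exact implementation:

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical section weights: lab and assessment sections are favored.
SECTION_WEIGHT = {"labs": 1.2, "assessment": 1.1, "history": 1.0, "admin": 0.8}

def score_sentence(query, sentence, section):
    """Blend keyword overlap and a semantic proxy, then apply the section weight."""
    q, s = Counter(tokenize(query)), Counter(tokenize(sentence))
    overlap = len(set(q) & set(s)) / max(len(set(q)), 1)  # keyword overlap
    sim = cosine(q, s)                                    # embedding stand-in
    return (0.5 * overlap + 0.5 * sim) * SECTION_WEIGHT.get(section, 1.0)

def top_k_evidence(query, note_sentences, k=5):
    """Rank (sentence, section) pairs and keep the top-k for the generator."""
    ranked = sorted(note_sentences,
                    key=lambda pair: score_sentence(query, *pair),
                    reverse=True)
    return [sent for sent, _ in ranked[:k]]
```

Because the blend is recall‑oriented, a generous `k` is chosen so that borderline sentences still reach the generator rather than being filtered out early.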
The pipeline is orchestrated with simple Python wrappers and runs end‑to‑end on a single GPU, making it practical for integration into existing health‑tech stacks.
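End to end, the four stages can be wired together with a thin wrapper like the sketch below. Here `call_llm` is a placeholder for the Gemini 2.5 Pro API call, `retrieve` is the evidence scorer from stage two, and the prompt wording and alignment threshold are illustrative assumptions:

```python
import math
import re
from collections import Counter

# Hypothetical "ground-only" prompt template (stage 3).
GROUND_ONLY_TEMPLATE = (
    "Answer the question using ONLY the numbered evidence sentences below.\n"
    "Cite the sentence numbers you rely on. If the evidence is insufficient, "
    "say so.\n\nQuestion: {query}\nEvidence:\n{evidence}\nAnswer:"
)

def bow(text):
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def align(answer_sentences, evidence_sentences, threshold=0.3):
    """Many-to-many map: answer sentence index -> supporting evidence indices."""
    return {
        i: [j for j, ev in enumerate(evidence_sentences)
            if cosine(bow(ans), bow(ev)) >= threshold]
        for i, ans in enumerate(answer_sentences)
    }

def answer_question(patient_question, note_sentences, call_llm, retrieve):
    """Stages 1-4: reformulate, retrieve, generate grounded answer, align."""
    query = call_llm(f"Rewrite as a concise clinical query: {patient_question}")
    evidence = retrieve(query, note_sentences)
    numbered = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(evidence))
    answer = call_llm(GROUND_ONLY_TEMPLATE.format(query=query, evidence=numbered))
    answer_sents = [s.strip() for s in answer.split(".") if s.strip()]
    return answer, align(answer_sents, evidence)
```

The alignment step is what makes the output auditable: the returned map can be rendered in a patient portal so each answer sentence links back to the note sentences that support it.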
Results & Findings
| Track | System Rank |
|---|---|
| Question Interpretation | 1st |
| Answer Generation | 5th |
| Evidence Identification | 7th |
| Answer‑Evidence Alignment | 9th |
Key takeaways
- The query reformulation stage was the strongest differentiator, achieving top performance in interpreting patient language.
- Grounding the LLM with a strict evidence‑only constraint dramatically reduced hallucinations, though it modestly impacted answer fluency (reflected in the 5th place ranking).
- The heuristic scorer, while simple, delivered high recall, greatly reducing the risk that the generator missed critical facts.
- The alignment module, though lower‑ranked, provided valuable auditability—a must‑have for clinical deployments.
Practical Implications
- Patient Portals: Deploying this pipeline can turn raw EHR data into patient‑friendly Q&A widgets that explain lab results, medication changes, or care plans without exposing raw clinical jargon.
- Clinical Decision Support: Clinicians can use the same system to quickly retrieve evidence‑backed answers to peer questions, reducing time spent scrolling through notes.
- Regulatory Compliance: The explicit evidence‑answer mapping supports the explainability and auditability expectations of regulators such as the FDA, and can be logged for medico‑legal records.
- Scalable Integration: Because the heavy lifting is done by a single LLM call per query and the rest is lightweight heuristics, the solution can be containerized and scaled on cloud GPU instances or even on‑premise inference servers.
Limitations & Future Work
- Evidence Scorer Simplicity: The heuristic ranking may miss nuanced context (e.g., temporal relations) that a learned retriever could capture.
- Answer Fluency vs. Grounding Trade‑off: Enforcing strict grounding sometimes yields terse or stilted responses; future work could explore soft grounding penalties or hybrid generation strategies.
- Domain Generalization: The system was tuned on the ArchEHR‑QA dataset; broader validation across diverse EHR systems (different note structures, coding standards) is needed.
- Privacy & Security: While the pipeline runs on de‑identified data in the shared task, real‑world deployment must incorporate robust PHI safeguards and audit trails.
The HealthNLP_Retrievers team’s code is publicly available on GitHub, inviting developers to experiment, extend, and bring grounded clinical QA to the next generation of health applications.
Authors
- Md Biplob Hosen
- Md Alomgeer Hussein
- Md Akmol Masud
- Omar Faruque
- Tera L Reynolds
- Lujie Karen Chen
Paper Information
- arXiv ID: 2604.26880v1
- Categories: cs.CL, cs.LG
- Published: April 29, 2026