[Paper] ClinEnv: An Interactive Multi-Stage Long Horizon EHR Environment for Agents

Published: June 1, 2026 at 01:56 PM EDT

Source: arXiv - 2606.02568v1

Overview

The paper introduces ClinEnv, a new interactive benchmark that lets large language models (LLMs) act like attending physicians navigating real inpatient electronic health records (EHRs). By simulating the step‑by‑step information‑gathering and decision‑making that clinicians perform, ClinEnv exposes gaps that static, single‑shot benchmarks miss, revealing how well (or poorly) current models handle long‑horizon, multi‑stage clinical reasoning.

Key Contributions

Longitudinal Inpatient Simulation: A framework that automatically converts real hospital admissions into ordered decision stages, mirroring the chronological flow of a patient’s stay.
Four Specialized Query Agents: At each stage the model can request lab results, imaging reports, medication histories, or clinical notes, forcing it to actively acquire information before acting.
Dual Scoring System:
1. Decision Accuracy – deterministic, ontology‑grounded matching of prescribed meds, procedures, and diagnoses (F1 score).
2. Process Quality – evaluation of the relevance and efficiency of the model’s queries.
Comprehensive Baseline Evaluation: Seven state‑of‑the‑art LLMs (including GPT‑4‑style models) are benchmarked, showing a maximum decision F1 of only 0.31.
Insight into Information‑Acquisition Gap: Shows that the best model can often reach the correct discharge diagnosis (0.51 F1) yet performs poorly on management actions (0.17 F1), and that models over‑query as cases progress.
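
To make the Decision Accuracy metric concrete, here is a minimal sketch of a set‑based F1 over ontology codes. It assumes exact string matching between predicted and reference codes. The summary does not describe how the paper normalizes codes, so the function name `set_f1` and the choice to score two empty sets as 1.0 are illustrative assumptions, not the authors' implementation.

```python
def set_f1(predicted: set[str], reference: set[str]) -> float:
    """F1 between two sets of ontology codes, using exact matching (assumption)."""
    if not predicted and not reference:
        return 1.0  # assumed convention: nothing to predict, nothing predicted
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


# Illustrative ICD-10 codes: the model adds one diagnosis the chart does not contain.
predicted_codes = {"I50.9", "N17.9", "E11.9"}
reference_codes = {"I50.9", "N17.9"}
example_f1 = set_f1(predicted_codes, reference_codes)
```

In this toy case precision is 2/3 and recall is 1, so the F1 comes out at roughly 0.8. A single extra code is enough to pull the score down noticeably, which is part of why exact‑match scoring is strict.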

Methodology

Data Construction – Real inpatient admissions from a large hospital system are extracted and segmented into a timeline of clinical events (e.g., admission, labs ordered, medication changes).
Stage Definition – Each timeline is split into decision stages (e.g., “initial assessment”, “mid‑stay management”, “pre‑discharge”).
Interactive Loop – For every stage, the LLM can issue up to four typed queries to dedicated agents that return the requested EHR slice (labs, imaging, meds, notes). After gathering information, the model outputs its clinical decisions (diagnoses, meds, procedures).
Scoring –
- Decision F1: compares model output against the ground‑truth ontology (ICD‑10, RxNorm, CPT).
- Process Metrics: precision/recall of queries (did the model ask for needed data?) and redundancy (unnecessary repeats).
Baseline Models – Open‑source and commercial LLMs are evaluated under identical prompts and temperature settings to ensure a fair comparison.
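
The stage, query, decision loop described above can be sketched as follows. This is a hedged illustration rather than ClinEnv's actual interface: `Stage`, `Decision`, `run_episode`, and the two toy policy functions are hypothetical names, and the per‑stage budget of four queries is taken from the description above.

```python
from dataclasses import dataclass, field
from typing import Callable

QUERY_TYPES = ("labs", "imaging", "meds", "notes")  # the four agents named above


@dataclass
class Stage:
    name: str
    records: dict[str, str]  # hypothetical: query type -> EHR slice for this stage


@dataclass
class Decision:
    diagnoses: set[str] = field(default_factory=set)
    medications: set[str] = field(default_factory=set)
    procedures: set[str] = field(default_factory=set)


def run_episode(
    stages: list[Stage],
    choose_queries: Callable[[str], list[str]],
    decide: Callable[[str, dict[str, str]], Decision],
    max_queries: int = 4,
) -> list[tuple[str, list[str], Decision]]:
    """Walk the stages in order: gather information first, then commit to a decision."""
    log = []
    for stage in stages:
        # Drop unknown query types and enforce the per-stage query budget.
        requested = [q for q in choose_queries(stage.name) if q in QUERY_TYPES][:max_queries]
        evidence = {q: stage.records.get(q, "") for q in requested}
        log.append((stage.name, requested, decide(stage.name, evidence)))
    return log


# Toy stand-ins for the model: always ask for labs and notes, then record
# heart failure (ICD-10 I50.9) if the notes mention dyspnea.
def toy_queries(stage_name: str) -> list[str]:
    return ["labs", "notes"]


def toy_decide(stage_name: str, evidence: dict[str, str]) -> Decision:
    if "dyspnea" in evidence.get("notes", ""):
        return Decision(diagnoses={"I50.9"})
    return Decision()


stages = [
    Stage("initial assessment", {"labs": "BNP elevated", "notes": "dyspnea on exertion"}),
    Stage("pre-discharge", {"labs": "BNP improving", "notes": "ambulating well"}),
]
episode_log = run_episode(stages, toy_queries, toy_decide)
```

Keeping the query policy and the decision policy as separate callables means the returned log contains both what was asked and what was decided at each stage, which is the information the two scoring tracks need.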

Results & Findings

Model	Decision F1 (overall)	Diagnosis F1	Management F1 (meds + procedures)
Best (GPT‑4‑style)	0.31	0.51	0.17
Other models (range; 7 evaluated in total)	0.12 – 0.28	0.38 – 0.55	0.09 – 0.22

Management actions are the bottleneck: models consistently underperform on prescribing meds or ordering procedures, even when they correctly infer the final diagnosis.
Process quality diverges from outcome quality: high diagnosis scores do not guarantee efficient or relevant queries; many models keep asking for the same lab values late in the timeline.
Stage‑wise difficulty: Early stages (admission) are relatively easier; performance drops sharply in later stages where decisions become more nuanced and data volume grows.
Redundant querying: On average, models issue 1.8× as many queries as needed, indicating poor information‑selection strategies.
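
As a rough illustration of how query‑level process metrics could be computed, the sketch below derives precision, recall, repeat count, and an over‑query ratio from a query log. The query keys (such as `labs:creatinine`), the notion of a "needed" set, and the function name `query_metrics` are assumptions made for this example. The summary does not specify how the paper defines needed queries.

```python
from collections import Counter


def query_metrics(issued: list[str], needed: set[str]) -> dict[str, float]:
    """Summarize how relevant and how redundant a sequence of queries was."""
    unique_issued = set(issued)
    hits = unique_issued & needed
    return {
        # Share of distinct queries that were actually needed.
        "precision": len(hits) / len(unique_issued) if unique_issued else 0.0,
        # Share of needed data that the model asked for at least once.
        "recall": len(hits) / len(needed) if needed else 1.0,
        # Number of queries that repeated an earlier, identical request.
        "repeats": float(sum(count - 1 for count in Counter(issued).values())),
        # Total queries issued relative to the number actually needed.
        "over_query_ratio": len(issued) / len(needed) if needed else float("inf"),
    }


# Hypothetical log: nine queries issued where five distinct items were needed.
issued_queries = [
    "labs:creatinine", "labs:creatinine", "labs:potassium",
    "notes:progress", "imaging:cxr", "labs:creatinine",
    "meds:active", "labs:troponin", "labs:potassium",
]
needed_queries = {
    "labs:creatinine", "labs:potassium", "notes:progress",
    "imaging:cxr", "meds:active",
}
metrics = query_metrics(issued_queries, needed_queries)
```

The toy log is sized so that nine queries against five needed items gives a ratio of 1.8, the same magnitude as the average reported above. Here recall is perfect while three of the nine queries are pure repeats, the pattern the findings describe.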

Practical Implications

Tooling for Clinical Decision Support (CDS) – ClinEnv provides a realistic testbed for building LLM‑powered assistants that must ask the right questions before suggesting actions, a prerequisite for safe CDS integration.
Benchmark for Retrieval‑Augmented Generation (RAG) – The four‑agent setup mirrors RAG pipelines (retriever + generator). Developers can plug in custom retrievers (e.g., vector stores, knowledge graphs) and directly measure gains in both decision accuracy and query efficiency.
Regulatory & Safety Testing – By exposing the “information‑acquisition gap,” ClinEnv can be used to generate evidence for FDA/EMA‑style risk assessments, showing where a model might hallucinate or over‑order tests.
Training Data Design – The stark drop in management performance suggests that fine‑tuning on longitudinal, multi‑turn clinical dialogues (rather than single‑shot Q&A) could be a more fruitful path for LLM developers.
Developer Workflow – The benchmark’s API‑like interface (stage → query → decision) can be wrapped into CI pipelines, enabling continuous evaluation as models evolve.
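
One way such a CI check could look is a simple regression gate that compares a new run against stored baseline scores. This is a sketch under stated assumptions: the metric names, the `check_regression` helper, and the 0.01 tolerance are hypothetical, and the baseline values simply reuse the best‑model figures reported above.

```python
def check_regression(
    current: dict[str, float],
    baseline: dict[str, float],
    tolerance: float = 0.01,
) -> list[str]:
    """Return one message per metric that is missing or fell below baseline."""
    failures = []
    for metric, base_score in baseline.items():
        score = current.get(metric)
        if score is None:
            failures.append(f"{metric}: missing from current run")
        elif score < base_score - tolerance:
            failures.append(f"{metric}: {score:.2f} < baseline {base_score:.2f}")
    return failures


# Baseline taken from the best-model row above; the current run is made up.
baseline_scores = {"decision_f1": 0.31, "diagnosis_f1": 0.51, "management_f1": 0.17}
current_scores = {"decision_f1": 0.33, "diagnosis_f1": 0.52, "management_f1": 0.12}
failures = check_regression(current_scores, baseline_scores)
```

In a pipeline, a non‑empty `failures` list would typically be printed and turned into a non‑zero exit code so the build fails. Tracking management F1 separately matters here because an overall gain can hide a regression on the weakest metric.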

Limitations & Future Work

Single‑Institution Data – All admissions come from one health system, which may limit generalizability across different EHR schemas or practice patterns.
Ontology Dependence – Scoring relies on exact matches to ICD‑10/RxNorm/CPT; partial correctness (e.g., clinically appropriate but differently coded) is penalized.
No Real‑Time Constraints – The simulation does not enforce latency or compute budgets that would be present in a bedside assistant.
Future Directions suggested by the authors include: expanding to multi‑center datasets, incorporating cost‑aware query penalties, adding patient‑outcome simulations (e.g., mortality risk), and exploring reinforcement‑learning‑based agents that can learn optimal query policies.
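
The cost‑aware query penalty mentioned among the future directions could take many forms. A minimal sketch, assuming a linear penalty, is shown below. The per‑query costs, the `penalty_weight`, and the function name are all invented for illustration. The summary does not say which formulation the authors have in mind.

```python
# Hypothetical relative costs per query type (arbitrary units).
QUERY_COST = {"labs": 1.0, "imaging": 5.0, "meds": 0.5, "notes": 0.5}


def cost_adjusted_score(
    decision_f1: float,
    queries: list[str],
    penalty_weight: float = 0.01,
) -> float:
    """Subtract a linear penalty for the total cost of the queries issued."""
    total_cost = sum(QUERY_COST[query] for query in queries)
    return decision_f1 - penalty_weight * total_cost


# Same decision quality, but the second agent repeats an expensive imaging request.
lean_score = cost_adjusted_score(0.31, ["labs", "notes"])
wasteful_score = cost_adjusted_score(0.31, ["labs", "labs", "imaging", "imaging", "notes"])
```

Under this assumed scheme two agents with identical decision F1 are ranked by how cheaply they reached it, which would reward the selective querying that current models appear to lack.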

Authors

Yuxing Lu
Yushuhong Lin
Wenqi Shi
J. Ben Tamo
Xukai Zhao
Jinzhuo Wang
May Dongmei Wang

Paper Information

arXiv ID: 2606.02568v1
Categories: cs.AI, cs.CL, cs.ET, cs.MA
Published: June 1, 2026
