[Paper] From Data Lifting to Continuous Risk Estimation: A Process-Aware Pipeline for Predictive Monitoring of Clinical Pathways

Published: May 5, 2026 at 11:51 AM EDT

Source: arXiv - 2605.03895v1

Overview

The authors introduce a process‑aware predictive‑monitoring pipeline that continuously estimates patient risk as clinical pathways unfold. The pipeline turns raw electronic health records into temporally ordered event logs and feeds them to standard machine‑learning models, so risk scores can be updated each time a new event arrives. The approach is demonstrated on COVID‑19 ICU admission prediction.

Key Contributions

End‑to‑end reproducible pipeline that bridges raw health data (“data lifting”) to process‑aware predictive models.
Temporal reconstruction of patient journeys, turning irregular timestamps into ordered event prefixes suitable for incremental prediction.
Prefix‑based representation that captures the “what‑has‑happened‑so‑far” state of each case, enabling continuous risk estimation.
Empirical evaluation on a large COVID‑19 cohort (4,479 patients, 46,804 prefixes) showing strong early‑warning performance (AUC ≈ 0.90).
Analysis of signal emergence, showing that predictive power grows as more clinical events become available.
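The prefix idea in the contributions above can be illustrated with a minimal sketch. It is not the authors' code: the toy event log, the activity names, and the helpers `build_traces` and `generate_prefixes` are all assumptions made for illustration, using only the Python standard library.

```python
from collections import defaultdict
from datetime import datetime

# Toy event log: (case_id, activity, timestamp) rows in arbitrary order.
# Case ids, activity names and timestamps are invented for illustration.
EVENT_LOG = [
    ("p1", "oxygen therapy started", "2020-03-02T09:30"),
    ("p1", "admission", "2020-03-01T08:00"),
    ("p2", "admission", "2020-03-05T14:00"),
    ("p1", "blood test", "2020-03-01T10:15"),
    ("p2", "blood test", "2020-03-05T16:45"),
]


def build_traces(event_log):
    """Group events by case and sort each case's events by timestamp."""
    traces = defaultdict(list)
    for case_id, activity, ts in event_log:
        traces[case_id].append((datetime.fromisoformat(ts), activity))
    return {case: sorted(events) for case, events in traces.items()}


def generate_prefixes(trace):
    """Yield every prefix of a trace: first event, first two events, and so on."""
    for k in range(1, len(trace) + 1):
        yield trace[:k]


traces = build_traces(EVENT_LOG)
prefixes = {case: list(generate_prefixes(trace)) for case, trace in traces.items()}

for case, case_prefixes in prefixes.items():
    for prefix in case_prefixes:
        print(case, [activity for _, activity in prefix])
```

A case with n events yields n prefixes, so each prefix corresponds to one point in time at which a risk score could be produced.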

Methodology

Data Lifting – Raw EHR tables (lab results, procedures, vitals) are flattened into a unified schema of events (e.g., “oxygen therapy started”).
Temporal Reconstruction – Each patient’s timestamps are sorted, gaps are filled, and a case timeline is built.
Event Log Construction – The timeline is transformed into an event log (a standard artifact in process mining) in which each row is a (case‑id, activity, timestamp) triple.
Prefix Generation – For every case, all possible prefixes are extracted (e.g., after 1st event, after 2nd event, …). Each prefix represents the patient’s state at a given moment.
Feature Engineering – Prefixes are encoded using a mix of:
- Activity counts (frequency encoding: how many times each clinical activity has occurred)
- Temporal features (time since admission, time since last event)
- Aggregated clinical measurements (latest lab values, moving averages)
Predictive Modeling – Conventional classifiers (Logistic Regression, Random Forest, XGBoost) are trained on the prefix features to predict the binary target, ICU admission. A case‑level split keeps all prefixes of a patient in either the training set or the test set, which avoids leakage between the two.
Evaluation – Metrics (AUC, F1‑score) are computed per‑prefix length to assess how early reliable predictions can be made.
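Two of the steps above, prefix encoding and the case‑level split, can be sketched in a few lines. This is a hedged illustration rather than the paper's implementation: the activity vocabulary, the helper names `encode_prefix` and `case_level_split`, and the assumption that the first event of a case is the admission are all hypothetical. Clinical measurements are left out to keep the sketch short.

```python
import random
from collections import Counter
from datetime import datetime

# Hypothetical activity vocabulary; a real pipeline would derive it from the log.
ACTIVITIES = ["admission", "blood test", "oxygen therapy started"]


def encode_prefix(prefix):
    """Encode a time-sorted prefix of (timestamp, activity) pairs as
    activity counts plus two temporal features measured in hours.
    Assumes the first event of the case marks the admission."""
    counts = Counter(activity for _, activity in prefix)
    first_ts, last_ts = prefix[0][0], prefix[-1][0]
    prev_ts = prefix[-2][0] if len(prefix) > 1 else last_ts
    hours_since_admission = (last_ts - first_ts).total_seconds() / 3600
    hours_since_last_event = (last_ts - prev_ts).total_seconds() / 3600
    return [counts[a] for a in ACTIVITIES] + [hours_since_admission, hours_since_last_event]


def case_level_split(case_ids, test_fraction=0.2, seed=0):
    """Split case ids (not prefixes) so no patient lands on both sides."""
    cases = sorted(set(case_ids))
    random.Random(seed).shuffle(cases)
    n_test = max(1, round(len(cases) * test_fraction))
    return set(cases[n_test:]), set(cases[:n_test])


trace = [
    (datetime(2020, 3, 1, 8, 0), "admission"),
    (datetime(2020, 3, 1, 10, 0), "blood test"),
    (datetime(2020, 3, 2, 8, 0), "oxygen therapy started"),
]
features = encode_prefix(trace[:2])
print(features)  # [1, 1, 0, 2.0, 2.0]

train_cases, test_cases = case_level_split(["p1", "p2", "p3", "p4", "p5"])
print(sorted(train_cases), sorted(test_cases))
```

Splitting on case ids first and assigning each prefix to the side its case belongs to is what keeps near‑duplicate prefixes of the same patient out of both sets at once.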

Results & Findings

Model	Overall AUC	Overall F1
Logistic Regression	0.906	0.835
Random Forest	0.889	0.812
XGBoost	0.902	0.828

Early‑stage performance: With only the first few events, AUC ≈ 0.64 – still better than random, indicating that even minimal information carries signal.
Mid‑stage performance: After ~5 events, AUC climbs to ≈ 0.80.
Late‑stage performance: Near the end of the pathway, AUC reaches 0.94, showing that the model discriminates much better once more of the pathway has been observed.
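The per‑stage numbers above come from scoring predictions separately for each prefix length. A minimal sketch of that bookkeeping follows, assuming predictions are available as (prefix_length, true_label, predicted_score) rows. The rows below are invented and do not reproduce the paper's results, and the helper names `auc` and `auc_by_prefix_length` are hypothetical. AUC is computed directly from its pairwise definition, which is quadratic in the bucket size and only suitable for small examples.

```python
from collections import defaultdict


def auc(labels, scores):
    """Area under the ROC curve as the probability that a random positive
    is scored above a random negative (ties count as half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # AUC is undefined without both classes
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_by_prefix_length(rows):
    """Group (prefix_length, label, score) rows by length and score each group."""
    buckets = defaultdict(lambda: ([], []))
    for length, label, score in rows:
        buckets[length][0].append(label)
        buckets[length][1].append(score)
    return {length: auc(*buckets[length]) for length in sorted(buckets)}


# Invented predictions, for illustration only.
ROWS = [
    (1, 1, 0.55), (1, 0, 0.60), (1, 1, 0.70), (1, 0, 0.40),
    (2, 1, 0.80), (2, 0, 0.30), (2, 1, 0.65), (2, 0, 0.50),
]
result = auc_by_prefix_length(ROWS)
print(result)  # {1: 0.75, 2: 1.0}
```

Plotting such a dictionary against prefix length gives the kind of curve the early, mid and late‑stage figures summarize.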

The analysis confirms two central observations:

Predictive signals emerge gradually; the more of the patient journey we observe, the sharper the risk estimate.
Process‑aware representations (prefixes) are crucial for capturing the evolving context, outperforming naïve “snapshot” models that ignore temporal ordering.

Practical Implications

Real‑time clinical decision support – Hospitals can embed the pipeline into their EHR systems to flag high‑risk patients as soon as relevant events occur, enabling earlier interventions (e.g., proactive ICU preparation).
Modular, reusable architecture – Because the pipeline relies on standard event‑log formats and off‑the‑shelf ML libraries, developers can adapt it to other pathways (sepsis, stroke, post‑operative care) with minimal code changes.
Scalable monitoring – Prefix generation is linear in the number of events, and the models (especially Logistic Regression) are lightweight, making the approach feasible for large hospital networks or cloud‑based health analytics platforms.
Explainability – Linear models provide clear coefficient interpretations (e.g., “starting oxygen therapy doubles the odds of ICU admission”), which is valuable for clinicians and compliance teams.
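For the explainability point, it may help to recall how a logistic regression coefficient is read. The derivation below is the standard textbook result, not something taken from the paper; the symbols $p$, $x_j$ and $\beta_j$ are notation introduced here. A coefficient acts on the odds of the outcome rather than on its probability directly.

```latex
\log\frac{p}{1-p} = \beta_0 + \sum_j \beta_j x_j
\quad\Longrightarrow\quad
\frac{\operatorname{odds}(x_j + 1)}{\operatorname{odds}(x_j)} = e^{\beta_j}
```

Under this reading, a feature would double the odds of ICU admission when $e^{\beta_j} = 2$, that is, when $\beta_j = \ln 2 \approx 0.693$, with all other features held fixed.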

Limitations & Future Work

Single‑center COVID‑19 data – Results may not generalize to other diseases, hospitals, or geographic regions without re‑training.
Static feature set – The current encoding does not exploit deep sequential models (e.g., LSTMs) that could capture richer temporal dependencies.
Outcome focus – Only ICU admission is predicted; extending to multi‑label outcomes (mortality, length of stay) would broaden utility.
Operational integration – The paper stops short of a live deployment study; future work could evaluate latency, user acceptance, and impact on patient outcomes in a production setting.

Bottom line for developers: This paper delivers a plug‑and‑play pipeline that turns messy health data into a continuously updated risk score, using familiar ML tools and a process‑mining mindset. If you’re building AI‑driven health dashboards, alerting systems, or any application that needs to “listen” to a patient’s journey in real time, the methodology and open‑source artifacts presented here are a solid foundation to start from.

Authors

Pasquale Ardimento
Mario Luca Bernardi
Marta Cimitile
Samuele Latorre

Paper Information

arXiv ID: 2605.03895v1
Categories: cs.LG, cs.SE
Published: May 5, 2026
