[Paper] From Data Lifting to Continuous Risk Estimation: A Process-Aware Pipeline for Predictive Monitoring of Clinical Pathways

Published: May 5, 2026 at 11:51 AM EDT

Source: arXiv - 2605.03895v1

Overview

The authors introduce a process‑aware predictive‑monitoring pipeline that continuously estimates patient risk as clinical pathways unfold. The pipeline turns raw electronic health records into temporally ordered event logs and feeds them to standard machine‑learning models, so that risk scores can be updated as new events arrive. It is demonstrated on COVID‑19 ICU admission prediction.

Key Contributions

  • End‑to‑end reproducible pipeline that takes raw health data through “data lifting” to process‑aware predictive models.
  • Temporal reconstruction of patient journeys, turning irregular timestamps into ordered event prefixes suitable for incremental prediction.
  • Prefix‑based representation that captures the “what‑has‑happened‑so‑far” state of each case, enabling continuous risk estimation.
  • Empirical evaluation on a large COVID‑19 cohort (4,479 patients, 46,804 prefixes) showing strong early‑warning performance (AUC ≈ 0.90).
  • Analysis of signal emergence, showing that predictive power grows as more clinical events become available.

Methodology

  1. Data Lifting – Raw EHR tables (lab results, procedures, vitals) are flattened into a unified schema of events (e.g., “oxygen therapy started”).
  2. Temporal Reconstruction – Each patient’s timestamps are sorted, gaps are filled, and a case timeline is built.
  3. Event Log Construction – The timeline is transformed into an event log (a standard artifact in process mining) where each row is a (case‑id, activity, timestamp) triple.
  4. Prefix Generation – For every case, all possible prefixes are extracted (e.g., after 1st event, after 2nd event, …). Each prefix represents the patient’s state at a given moment.
  5. Feature Engineering – Prefixes are encoded using a mix of:
    • Activity frequency counts (how many times each clinical activity has occurred)
    • Temporal features (time since admission, time since last event)
    • Aggregated clinical measurements (latest lab values, moving averages)
  6. Predictive Modeling – Conventional classifiers (Logistic Regression, Random Forest, XGBoost) are trained on the prefix features to predict the binary target, ICU admission. A case‑level split ensures that all prefixes of a patient fall in either the training set or the test set, never both, which avoids leakage.
  7. Evaluation – Metrics (AUC, F1‑score) are computed per prefix length to assess how early reliable predictions can be made.
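
Steps 3 to 6 can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: the toy event log, the activity names, the helper names (build_timelines, encode_prefix, generate_prefixes, case_level_split), and the choice to treat the first event of a case as the admission are all assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime
import random

# Hypothetical toy event log: one (case_id, activity, timestamp) triple per row.
EVENT_LOG = [
    ("p1", "admission", "2021-01-01T08:00"),
    ("p1", "lab_test", "2021-01-01T09:30"),
    ("p1", "oxygen_therapy", "2021-01-01T12:00"),
    ("p2", "admission", "2021-01-02T10:00"),
    ("p2", "lab_test", "2021-01-02T11:00"),
]
ACTIVITIES = ["admission", "lab_test", "oxygen_therapy"]  # assumed vocabulary


def build_timelines(event_log):
    """Group events by case and sort each case by timestamp."""
    timelines = {}
    for case_id, activity, ts in event_log:
        timelines.setdefault(case_id, []).append((datetime.fromisoformat(ts), activity))
    for events in timelines.values():
        events.sort(key=lambda e: e[0])
    return timelines


def encode_prefix(prefix):
    """Encode a prefix as per-activity counts plus two temporal features (in hours).

    Assumption: the first event of the case marks the admission.
    """
    counts = Counter(activity for _, activity in prefix)
    first_ts, last_ts = prefix[0][0], prefix[-1][0]
    prev_ts = prefix[-2][0] if len(prefix) > 1 else last_ts
    hours_since_admission = (last_ts - first_ts).total_seconds() / 3600
    hours_since_last_event = (last_ts - prev_ts).total_seconds() / 3600
    return [counts[a] for a in ACTIVITIES] + [hours_since_admission, hours_since_last_event]


def generate_prefixes(timelines):
    """Return (case_id, prefix_length, feature_vector) for every prefix of every case."""
    rows = []
    for case_id, events in timelines.items():
        for k in range(1, len(events) + 1):
            rows.append((case_id, k, encode_prefix(events[:k])))
    return rows


def case_level_split(case_ids, test_fraction=0.5, seed=0):
    """Split by case, so all prefixes of one patient land on the same side."""
    ids = sorted(set(case_ids))
    random.Random(seed).shuffle(ids)
    n_test = max(1, int(len(ids) * test_fraction))
    return set(ids[n_test:]), set(ids[:n_test])


timelines = build_timelines(EVENT_LOG)
prefix_rows = generate_prefixes(timelines)
train_cases, test_cases = case_level_split([row[0] for row in prefix_rows])
train_rows = [row for row in prefix_rows if row[0] in train_cases]
test_rows = [row for row in prefix_rows if row[0] in test_cases]
```

Each feature vector in train_rows could then be passed to any conventional classifier. Splitting on case identifiers rather than on individual prefixes is what keeps near-duplicate prefixes of the same patient out of both sets.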

Results & Findings

Model                 Overall AUC   Overall F1
Logistic Regression   0.906         0.835
Random Forest         0.889         0.812
XGBoost               0.902         0.828

  • Early‑stage performance: With only the first few events, AUC ≈ 0.64 – still better than random, indicating that even minimal information carries signal.
  • Mid‑stage performance: After ~5 events, AUC climbs to ≈ 0.80.
  • Late‑stage performance: Near the end of the pathway, AUC reaches 0.94, showing that the model can become highly confident when more data is available.
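
A per-prefix-length breakdown like the one above can be computed by bucketing test predictions by prefix length and scoring each bucket separately. The following is a rough sketch, not the paper's evaluation code: the function names are hypothetical, the AUC is a simple pairwise (Mann-Whitney) formulation, and the records are made-up predictions rather than results from the paper.

```python
from collections import defaultdict


def roc_auc(labels, scores):
    """Pairwise AUC: chance that a random positive outscores a random negative.

    Ties count as 0.5. Returns None if only one class is present.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_by_prefix_length(records):
    """records: iterable of (prefix_length, true_label, predicted_score)."""
    buckets = defaultdict(lambda: ([], []))
    for length, label, score in records:
        buckets[length][0].append(label)
        buckets[length][1].append(score)
    return {length: roc_auc(*buckets[length]) for length in sorted(buckets)}


# Hypothetical predictions for illustration only.
records = [
    (1, 1, 0.6), (1, 0, 0.7), (1, 1, 0.8), (1, 0, 0.4),
    (2, 1, 0.9), (2, 0, 0.3), (2, 1, 0.7), (2, 0, 0.2),
]
per_length = auc_by_prefix_length(records)
print(per_length)  # {1: 0.75, 2: 1.0}
```

The pairwise formulation is quadratic in bucket size, so for a cohort with tens of thousands of prefixes a sort-based AUC from an ML library would be the more practical choice.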

The analysis confirms two central observations:

  1. Predictive signals emerge gradually; the more of the patient journey we observe, the sharper the risk estimate.
  2. Process‑aware representations (prefixes) are crucial for capturing the evolving context, outperforming naïve “snapshot” models that ignore temporal ordering.

Practical Implications

  • Real‑time clinical decision support – Hospitals can embed the pipeline into their EHR systems to flag high‑risk patients as soon as relevant events occur, enabling earlier interventions (e.g., proactive ICU preparation).
  • Modular, reusable architecture – Because the pipeline relies on standard event‑log formats and off‑the‑shelf ML libraries, developers can adapt it to other pathways (sepsis, stroke, post‑operative care) with minimal code changes.
  • Scalable monitoring – Prefix generation is linear in the number of events, and the models (especially Logistic Regression) are lightweight, making the approach feasible for large hospital networks or cloud‑based health analytics platforms.
  • Explainability – Linear models provide clear coefficient interpretations (e.g., a statement of the form “starting oxygen therapy doubles the odds of ICU admission”), which is valuable for clinicians and compliance teams.
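
To make the coefficient interpretation concrete, the standard logistic regression identity is worked out below. The symbols (p for the predicted ICU probability, x_j for feature j, β_j for its coefficient) are notation introduced here, not taken from the paper, and the value ln 2 is purely illustrative.

```latex
\log\frac{p}{1-p} = \beta_0 + \sum_j \beta_j x_j
\quad\Longrightarrow\quad
\frac{\text{odds}(x_j + 1)}{\text{odds}(x_j)} = e^{\beta_j},
\qquad
e^{\beta_j} = 2 \iff \beta_j = \ln 2 \approx 0.693 .
```

A coefficient of about 0.693 therefore corresponds to a doubling of the odds for each one-unit increase in that feature, with all other features held fixed. Doubling the odds is not the same as doubling the probability: odds of 1:4 (p = 0.20) become 1:2 (p ≈ 0.33).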

Limitations & Future Work

  • Single‑center COVID‑19 data – Results may not generalize to other diseases, hospitals, or geographic regions without re‑training.
  • Static feature set – The current encoding does not exploit deep sequential models (e.g., LSTMs) that could capture richer temporal dependencies.
  • Outcome focus – Only ICU admission is predicted; extending to multi‑label outcomes (mortality, length of stay) would broaden utility.
  • Operational integration – The paper stops short of a live deployment study; future work could evaluate latency, user acceptance, and impact on patient outcomes in a production setting.

Bottom line for developers: This paper delivers a plug‑and‑play pipeline that turns messy health data into a continuously updated risk score, using familiar ML tools and a process‑mining mindset. If you’re building AI‑driven health dashboards, alerting systems, or any application that needs to “listen” to a patient’s journey in real time, the methodology and open‑source artifacts presented here are a solid foundation to start from.

Authors

  • Pasquale Ardimento
  • Mario Luca Bernardi
  • Marta Cimitile
  • Samuele Latorre

Paper Information

  • arXiv ID: 2605.03895v1
  • Categories: cs.LG, cs.SE
  • Published: May 5, 2026