[Paper] Exploring LLM Features in Predictive Process Monitoring for Small-Scale Event-Logs

Published: January 16, 2026 at 12:54 PM EST
3 min read
Source: arXiv - 2601.11468v1

Overview

The paper investigates how large language models (LLMs) can be used for Predictive Process Monitoring—the task of forecasting future outcomes (e.g., remaining time, next activity) of a business process while it is running. By extending a prior LLM‑based framework, the authors show that even with tiny event logs (≈100 traces), an LLM can outperform traditional machine‑learning baselines across multiple key performance indicators (KPIs).

Key Contributions

  • Generalized LLM framework that handles both total‑time prediction and activity‑occurrence prediction via natural‑language prompting (illustrated by the sketch after this list).
  • Empirical evidence that LLMs beat state‑of‑the‑art baselines in data‑scarce scenarios on three real‑world event logs.
  • Analysis of semantic leverage, demonstrating that the model draws on its pre‑trained world knowledge (process semantics, temporal reasoning) in addition to patterns in the limited training data.
  • Interpretation of reasoning strategies, showing the LLM performs higher‑order reasoning rather than merely memorizing or copying existing predictive methods.
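The paper's exact prompt wording is not reproduced here; as a rough illustration of how one natural-language framing could cover both KPIs, consider templates like the following (process name, placeholders, and phrasing are assumptions, not the authors' prompts):

```python
# Hypothetical prompt templates for the two KPIs; wording and placeholders are
# illustrative assumptions, not the paper's actual prompts.
TOTAL_TIME_PROMPT = (
    "The following is a partial execution of a {process_name} process:\n"
    "{trace_description}\n"
    "Predict the total remaining time until this case completes, in hours. "
    "Answer with a single number."
)

ACTIVITY_OCCURRENCE_PROMPT = (
    "The following is a partial execution of a {process_name} process:\n"
    "{trace_description}\n"
    "Will the activity '{target_activity}' occur before this case completes? "
    "Answer 'yes' or 'no'."
)

# Example usage with made-up values.
print(TOTAL_TIME_PROMPT.format(
    process_name="loan-approval",
    trace_description="1. Submit application at 2024-03-01 09:00",
))
```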

Methodology

  1. Dataset preparation – Three publicly available event logs (e.g., BPI Challenge logs) were truncated to 100 traces to simulate a low‑data environment. Each trace contains a sequence of activities with timestamps.
  2. Prompt design – For each KPI, a concise natural‑language prompt was crafted (e.g., “Given the following partial execution of a loan‑approval process, predict the total remaining time”). The trace data are embedded directly into the prompt as a short textual description, as sketched after this list.
  3. Model fine‑tuning vs. zero‑shot – The authors compared (a) few‑shot fine‑tuning of a GPT‑style LLM on the 100‑trace training set and (b) pure zero‑shot prompting with the pre‑trained model.
  4. Baselines – Classical process‑mining predictors (e.g., transition‑system‑based models, random forest, LSTM) trained on the same limited data served as benchmarks.
  5. Evaluation metrics – Mean Absolute Error (MAE) for total‑time prediction and F1‑score for activity‑occurrence prediction. Statistical significance was assessed via paired t‑tests.
  6. Reasoning analysis – Prompt‑engineering experiments (e.g., “Explain your reasoning”) and attention‑weight inspection were used to infer whether the LLM was leveraging prior knowledge or merely fitting the training traces.
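To make step 2 concrete, here is a minimal sketch of how a partial trace might be serialized into the textual description embedded in the prompt; the activity names, timestamp format, and wording are illustrative assumptions rather than the authors' exact setup:

```python
from datetime import datetime

# A toy partial trace: (activity, timestamp) pairs in execution order.
# Activity names are made up; real logs (e.g., BPI Challenge) have their own vocabularies.
partial_trace = [
    ("Submit application", datetime(2024, 3, 1, 9, 0)),
    ("Check documents",    datetime(2024, 3, 1, 11, 30)),
    ("Request assessment", datetime(2024, 3, 2, 10, 15)),
]

def trace_to_text(trace):
    """Serialize a partial trace into the short textual description that is
    embedded in the prompt (format assumed, not taken from the paper)."""
    lines = []
    for i, (activity, ts) in enumerate(trace, start=1):
        lines.append(f"{i}. {activity} at {ts:%Y-%m-%d %H:%M}")
    return "\n".join(lines)

prompt = (
    "Given the following partial execution of a loan-approval process, "
    "predict the total remaining time in hours:\n" + trace_to_text(partial_trace)
)
print(prompt)
```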

Results & Findings

| KPI | LLM (few‑shot) | LLM (zero‑shot) | Best baseline | Relative gain |
| --- | --- | --- | --- | --- |
| Total time (MAE) | 3.2 h | 3.5 h | 4.8 h (LSTM) | ≈30 % lower error |
| Activity occurrence (F1) | 0.78 | 0.74 | 0.66 (Random Forest) | ≈12 % higher F1 |
  • The LLM consistently outperformed all baselines when only 100 traces were available.
  • Zero‑shot performance was already competitive, confirming that pre‑trained knowledge (e.g., typical process durations, causal relations) contributes meaningfully.
  • Fine‑tuning added a modest boost, indicating that the model can quickly adapt to domain‑specific quirks.
  • Qualitative probing showed the LLM often cites logical constraints (“activity X cannot follow Y”) that are not explicitly encoded in the training data, evidencing higher‑order reasoning.
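As a quick sanity check on the "Relative gain" column, assuming the MAE gain is measured relative to the best baseline and the F1 gain is reported in absolute points:

```python
# Numbers taken from the table above.
mae_llm, mae_baseline = 3.2, 4.8   # hours
f1_llm, f1_baseline = 0.78, 0.66

# Relative error reduction: (4.8 - 3.2) / 4.8 ≈ 0.33, i.e. roughly 30 % lower MAE.
mae_gain = (mae_baseline - mae_llm) / mae_baseline

# Absolute F1 improvement: 0.78 - 0.66 = 0.12, i.e. roughly 12 points higher F1.
f1_gain = f1_llm - f1_baseline

print(f"MAE reduction: {mae_gain:.0%}, F1 improvement: {f1_gain:.2f}")
```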

Practical Implications

  • Rapid deployment: Companies can start predictive monitoring with minimal historical data, reducing the “cold‑start” problem that plagues traditional ML pipelines.
  • Lower engineering overhead: Instead of building custom feature‑extraction pipelines for each process, developers can feed raw event logs into a prompt and obtain predictions, leveraging the LLM as a “plug‑and‑play” predictor (see the sketch after this list).
  • Explainability: The ability to request natural‑language rationales from the model can aid compliance teams and process analysts who need to justify predictions.
  • Cross‑process transfer: Because the LLM carries generic process semantics, it can be reused across different domains (e.g., finance, healthcare) with only a few examples, accelerating time‑to‑value.
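A minimal sketch of such plug-and-play usage, assuming an OpenAI-style chat API; the model name, prompt wording, and answer parsing are placeholders, not the authors' configuration:

```python
from openai import OpenAI  # assumes the OpenAI Python SDK is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def predict_remaining_time(trace_text: str) -> float:
    """Ask an LLM for the remaining time of a running case, in hours.
    Prompt wording and response parsing are illustrative assumptions."""
    prompt = (
        "Given the following partial execution of a business process, "
        "predict the total remaining time in hours. Answer with a single number.\n"
        + trace_text
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    # Naive parsing: assumes the model really does answer with a single number.
    return float(response.choices[0].message.content.strip())

# Example: a running case described as plain text.
print(predict_remaining_time(
    "1. Submit application at 2024-03-01 09:00\n"
    "2. Check documents at 2024-03-01 11:30"
))
```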

Limitations & Future Work

  • Scalability: Prompt length limits mean very long traces must be truncated or summarized, which could discard useful context (a simple truncation sketch follows this list).
  • Cost & latency: Running large LLMs (especially fine‑tuned versions) incurs higher compute costs compared to lightweight classifiers.
  • Robustness to noisy logs: The study used clean, well‑structured logs; real‑world event data often contain missing timestamps or mislabeled activities.
  • Future directions proposed by the authors include: exploring retrieval‑augmented prompting to handle longer histories, integrating domain‑specific ontologies to improve reasoning fidelity, and benchmarking on larger, noisier datasets to assess robustness.
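One naive way to work around the prompt-length limit is to keep only the most recent events of a long trace before serializing it; the retrieval-augmented prompting mentioned in the future directions would be a more principled alternative. A sketch under that assumption:

```python
def truncate_trace(trace, max_events=50):
    """Keep only the most recent events of a long trace so its serialized
    description fits the model's context window. Naive strategy: recency is
    assumed to matter most, so earlier events are simply dropped."""
    if len(trace) <= max_events:
        return trace
    return trace[-max_events:]

# Example: a 300-event trace reduced to its last 50 events before prompting.
long_trace = [(f"Activity {i}", None) for i in range(300)]
print(len(truncate_trace(long_trace)))  # -> 50
```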

Authors

  • Alessandro Padella
  • Massimiliano de Leoni
  • Marlon Dumas

Paper Information

  • arXiv ID: 2601.11468v1
  • Categories: cs.AI, cs.IT
  • Published: January 16, 2026