[Paper] Exploring LLM Features in Predictive Process Monitoring for Small-Scale Event-Logs
Source: arXiv - 2601.11468v1
Overview
The paper investigates how large language models (LLMs) can be used for Predictive Process Monitoring—the task of forecasting future outcomes (e.g., remaining time, next activity) of a business process while it is running. By extending a prior LLM‑based framework, the authors show that even with tiny event logs (≈100 traces), an LLM can outperform traditional machine‑learning baselines across multiple key performance indicators (KPIs).
Key Contributions
- Generalized LLM framework that handles both total‑time prediction and activity‑occurrence prediction via natural‑language prompting.
- Empirical evidence that LLMs beat state‑of‑the‑art baselines in data‑scarce scenarios on three real‑world event logs.
- Analysis of semantic leverage, demonstrating that the model draws on its pre‑trained world knowledge (process semantics, temporal reasoning) in addition to patterns in the limited training data.
- Interpretation of reasoning strategies, showing the LLM performs higher‑order reasoning rather than merely memorizing or copying existing predictive methods.
Methodology
- Dataset preparation – Three publicly available event logs (e.g., BPI Challenge logs) were truncated to 100 traces each to simulate a low‑data environment. Each trace contains a sequence of activities with timestamps.
- Prompt design – For each KPI, a concise natural‑language prompt was crafted (e.g., “Given the following partial execution of a loan‑approval process, predict the total remaining time”). The trace data are embedded directly into the prompt as a short textual description; a minimal sketch of both steps appears after this list.
- Model fine‑tuning vs. zero‑shot – The authors experimented with (a) a few‑shot fine‑tuning of a GPT‑style LLM on the 100‑trace training set, and (b) pure zero‑shot prompting using the pre‑trained model.
- Baselines – Classical process‑mining predictors (e.g., transition‑system based, random forest, LSTM) trained on the same limited data served as benchmarks.
- Evaluation metrics – Mean Absolute Error (MAE) for total‑time prediction and F1‑score for activity‑occurrence prediction. Statistical significance was assessed via paired t‑tests (see the metrics sketch after this list).
- Reasoning analysis – Prompt‑engineering experiments (e.g., “Explain your reasoning”) and attention‑weight inspection were used to infer whether the LLM was leveraging prior knowledge or merely fitting the training traces.
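The paper's exact preprocessing and prompt wording are not reproduced here; the following is a minimal sketch, assuming a CSV event log with `case_id`, `activity`, and `timestamp` columns, of how a 100‑trace subset could be drawn and a running trace serialized into a natural‑language prompt. The schema, the `build_prompt` helper, and the prompt text are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (illustrative, not the paper's exact pipeline):
# sample a 100-trace event log with pandas and serialize one
# running trace into a natural-language prompt for an LLM.
import pandas as pd

# Assumed columns: case_id, activity, timestamp (hypothetical schema).
log = pd.read_csv("event_log.csv", parse_dates=["timestamp"])

# Keep only the first 100 cases to simulate the low-data setting.
kept_cases = log["case_id"].drop_duplicates().head(100)
small_log = log[log["case_id"].isin(kept_cases)].sort_values(["case_id", "timestamp"])

def build_prompt(prefix: pd.DataFrame) -> str:
    """Turn a partial trace (prefix) into a textual prediction prompt."""
    steps = "; ".join(
        f"{row.activity} at {row.timestamp:%Y-%m-%d %H:%M}"
        for row in prefix.itertuples()
    )
    return (
        "Given the following partial execution of a business process:\n"
        f"{steps}\n"
        "Predict the total remaining time until the case completes, in hours."
    )

# Example: prompt for the first three events of one case.
example_case = small_log[small_log["case_id"] == kept_cases.iloc[0]].head(3)
print(build_prompt(example_case))
```

The same prompt string can be sent either to the pre‑trained model (zero‑shot) or to a model fine‑tuned on the 100‑trace training set; only the model endpoint changes, not the serialization.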
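For the evaluation step, a hedged sketch of how the reported metrics could be computed, assuming per‑trace predictions are already available as arrays; the scikit‑learn/SciPy tooling and the toy numbers below are my assumptions, as the paper's code is not reproduced here.

```python
# Sketch of the evaluation metrics described above (assumed tooling:
# scikit-learn for MAE / F1, SciPy for the paired t-test).
import numpy as np
from sklearn.metrics import mean_absolute_error, f1_score
from scipy import stats

# Hypothetical per-trace outputs (hours for total time, 0/1 for occurrence).
y_true_time = np.array([10.0, 4.5, 7.2, 12.0])
llm_pred_time = np.array([9.0, 5.0, 6.8, 11.5])
baseline_pred_time = np.array([12.5, 3.0, 9.0, 15.0])

# Mean Absolute Error for total-time prediction.
mae_llm = mean_absolute_error(y_true_time, llm_pred_time)
mae_base = mean_absolute_error(y_true_time, baseline_pred_time)

# Paired t-test on per-trace absolute errors (significance check).
t_stat, p_value = stats.ttest_rel(
    np.abs(y_true_time - llm_pred_time),
    np.abs(y_true_time - baseline_pred_time),
)

# F1-score for activity-occurrence prediction (binary label per trace).
y_true_occ = np.array([1, 0, 1, 1])
llm_pred_occ = np.array([1, 0, 1, 0])
f1_llm = f1_score(y_true_occ, llm_pred_occ)

print(f"MAE LLM={mae_llm:.2f}h, baseline={mae_base:.2f}h, p={p_value:.3f}, F1={f1_llm:.2f}")
```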
Results & Findings
| KPI | LLM (few‑shot) | LLM (zero‑shot) | Best baseline | Gain over best baseline |
|---|---|---|---|---|
| Total time (MAE) | 3.2 h | 3.5 h | 4.8 h (LSTM) | ≈30 % lower error |
| Activity occurrence (F1) | 0.78 | 0.74 | 0.66 (Random Forest) | ≈12 points higher F1 |
- The LLM consistently outperformed all baselines when only 100 traces were available.
- Zero‑shot performance was already competitive, confirming that pre‑trained knowledge (e.g., typical process durations, causal relations) contributes meaningfully.
- Fine‑tuning added a modest boost, indicating that the model can quickly adapt to domain‑specific quirks.
- Qualitative probing showed the LLM often cites logical constraints (“activity X cannot follow Y”) that are not explicitly encoded in the training data, evidencing higher‑order reasoning.
Practical Implications
- Rapid deployment: Companies can start predictive monitoring with minimal historical data, reducing the “cold‑start” problem that plagues traditional ML pipelines.
- Lower engineering overhead: Instead of building custom feature extraction pipelines for each process, developers can feed raw event logs into a prompt and obtain predictions, leveraging the LLM as a “plug‑and‑play” predictor.
- Explainability: The ability to request natural‑language rationales from the model can aid compliance teams and process analysts who need to justify predictions.
- Cross‑process transfer: Because the LLM carries generic process semantics, it can be reused across different domains (e.g., finance, healthcare) with only a few examples, accelerating time‑to‑value.
Limitations & Future Work
- Scalability: Prompt length limits mean very long traces must be truncated or summarized, which could discard useful context.
- Cost & latency: Running large LLMs (especially fine‑tuned versions) incurs higher compute costs compared to lightweight classifiers.
- Robustness to noisy logs: The study used clean, well‑structured logs; real‑world event data often contain missing timestamps or mislabeled activities.
- Future directions proposed by the authors include: exploring retrieval‑augmented prompting to handle longer histories (a toy sketch follows below), integrating domain‑specific ontologies to improve reasoning fidelity, and benchmarking on larger, noisier datasets to assess robustness.
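To make the retrieval‑augmented direction concrete, here is a toy sketch of my own, not the authors' proposal: it retrieves the k completed traces most similar to the running prefix (Jaccard overlap on activity sets, an assumed similarity measure) and prepends them to the prompt as in‑context examples.

```python
# Hypothetical sketch of retrieval-augmented prompting: select the k
# historical traces most similar to the running prefix and prepend
# them as in-context examples before asking for a prediction.
from typing import List

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two activity sets."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

def retrieve_examples(prefix: List[str], history: List[List[str]], k: int = 3) -> List[List[str]]:
    """Return the k completed traces whose activities overlap most with the prefix."""
    scored = sorted(history, key=lambda trace: jaccard(set(prefix), set(trace)), reverse=True)
    return scored[:k]

def augmented_prompt(prefix: List[str], history: List[List[str]]) -> str:
    """Build a prompt that shows similar completed cases before the running one."""
    examples = retrieve_examples(prefix, history)
    shots = "\n".join("Completed case: " + " -> ".join(t) for t in examples)
    return (
        "Here are similar completed cases:\n"
        f"{shots}\n"
        "Current partial case: " + " -> ".join(prefix) + "\n"
        "Predict the next activity."
    )

history = [
    ["Register", "Check", "Approve"],
    ["Register", "Reject"],
    ["Register", "Check", "Escalate", "Approve"],
]
print(augmented_prompt(["Register", "Check"], history))
```

Retrieving only the most relevant histories keeps the prompt within the model's context window, which is the scalability concern noted above.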
Authors
- Alessandro Padella
- Massimiliano de Leoni
- Marlon Dumas
Paper Information
- arXiv ID: 2601.11468v1
- Categories: cs.AI, cs.IT
- Published: January 16, 2026