[Paper] Exploring LLM Features in Predictive Process Monitoring for Small-Scale Event-Logs
Source: arXiv - 2601.11468v1
Overview
The paper investigates how large language models (LLMs) can be used for Predictive Process Monitoring—the task of forecasting future outcomes (e.g., remaining time, next activity) of a business process while it is running. By extending a prior LLM‑based framework, the authors show that even with tiny event logs (≈100 traces), an LLM can outperform traditional machine‑learning baselines across multiple key performance indicators (KPIs).
Key Contributions
- Generalized LLM framework that handles both total‑time prediction and activity‑occurrence prediction via natural‑language prompting.
- Empirical evidence that LLMs beat state‑of‑the‑art baselines in data‑scarce scenarios on three real‑world event logs.
- Analysis of semantic leverage, demonstrating that the model draws on its pre‑trained world knowledge (process semantics, temporal reasoning) in addition to patterns in the limited training data.
- Interpretation of reasoning strategies, showing the LLM performs higher‑order reasoning rather than merely memorizing or copying existing predictive methods.
Methodology
- Dataset preparation – Three publicly available event logs (e.g., BPI Challenge logs) were truncated to 100 traces each to simulate a low‑data environment. Each trace contains a sequence of activities with timestamps.
- Prompt design – For each KPI, a concise natural‑language prompt was crafted (e.g., “Given the following partial execution of a loan‑approval process, predict the total remaining time”). The trace data are embedded directly into the prompt as a short textual description; a minimal sketch of both steps appears after this list.
- Model fine‑tuning vs. zero‑shot – The authors experimented with (a) a few‑shot fine‑tuning of a GPT‑style LLM on the 100‑trace training set, and (b) pure zero‑shot prompting using the pre‑trained model.
- Baselines – Classical process‑mining predictors (e.g., transition‑system based, random forest, LSTM) trained on the same limited data served as benchmarks.
- Evaluation metrics – Mean Absolute Error (MAE) for total‑time prediction and F1‑score for activity‑occurrence prediction. Statistical significance was assessed via paired t‑tests (see the metrics sketch after this list).
- Reasoning analysis – Prompt‑engineering experiments (e.g., “Explain your reasoning”) and attention‑weight inspection were used to infer whether the LLM was leveraging prior knowledge or merely fitting the training traces.
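The paper's exact preprocessing and prompt wording are not reproduced here; the following is a minimal sketch, assuming a CSV event log with `case_id`, `activity`, and `timestamp` columns, of how a 100‑trace subset could be drawn and a running trace serialized into a natural‑language prompt. The schema, the `build_prompt` helper, and the prompt text are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (illustrative, not the paper's exact pipeline):
# sample a 100-trace event log with pandas and serialize one
# running trace into a natural-language prompt for an LLM.
import pandas as pd

# Assumed columns: case_id, activity, timestamp (hypothetical schema).
log = pd.read_csv("event_log.csv", parse_dates=["timestamp"])

# Keep only the first 100 cases to simulate the low-data setting.
kept_cases = log["case_id"].drop_duplicates().head(100)
small_log = log[log["case_id"].isin(kept_cases)].sort_values(["case_id", "timestamp"])

def build_prompt(prefix: pd.DataFrame) -> str:
    """Turn a partial trace (prefix) into a textual prediction prompt."""
    steps = "; ".join(
        f"{row.activity} at {row.timestamp:%Y-%m-%d %H:%M}"
        for row in prefix.itertuples()
    )
    return (
        "Given the following partial execution of a business process:\n"
        f"{steps}\n"
        "Predict the total remaining time until the case completes, in hours."
    )

# Example: prompt for the first three events of one case.
example_case = small_log[small_log["case_id"] == kept_cases.iloc[0]].head(3)
print(build_prompt(example_case))
```

The same prompt string can be sent either to the pre‑trained model (zero‑shot) or to a model fine‑tuned on the 100‑trace training set; only the model endpoint changes, not the serialization.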
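For the evaluation step, a hedged sketch of how the reported metrics could be computed, assuming per‑trace predictions are already available as arrays; the scikit‑learn/SciPy tooling and the toy numbers below are my assumptions, as the paper's code is not reproduced here.

```python
# Sketch of the evaluation metrics described above (assumed tooling:
# scikit-learn for MAE / F1, SciPy for the paired t-test).
import numpy as np
from sklearn.metrics import mean_absolute_error, f1_score
from scipy import stats

# Hypothetical per-trace outputs (hours for total time, 0/1 for occurrence).
y_true_time = np.array([10.0, 4.5, 7.2, 12.0])
llm_pred_time = np.array([9.0, 5.0, 6.8, 11.5])
baseline_pred_time = np.array([12.5, 3.0, 9.0, 15.0])

# Mean Absolute Error for total-time prediction.
mae_llm = mean_absolute_error(y_true_time, llm_pred_time)
mae_base = mean_absolute_error(y_true_time, baseline_pred_time)

# Paired t-test on per-trace absolute errors (significance check).
t_stat, p_value = stats.ttest_rel(
    np.abs(y_true_time - llm_pred_time),
    np.abs(y_true_time - baseline_pred_time),
)

# F1-score for activity-occurrence prediction (binary label per trace).
y_true_occ = np.array([1, 0, 1, 1])
llm_pred_occ = np.array([1, 0, 1, 0])
f1_llm = f1_score(y_true_occ, llm_pred_occ)

print(f"MAE LLM={mae_llm:.2f}h, baseline={mae_base:.2f}h, p={p_value:.3f}, F1={f1_llm:.2f}")
```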
Results & Findings
| KPI | LLM (few‑shot) | LLM (zero‑shot) | Best baseline | Gain over best baseline |
|---|---|---|---|---|
| Total time (MAE) | 3.2 h | 3.5 h | 4.8 h (LSTM) | ≈30 % lower error |
| Activity occurrence (F1) | 0.78 | 0.74 | 0.66 (Random Forest) | ≈12 points higher F1 |
- The LLM consistently outperformed all baselines when only 100 traces were available.
- Zero‑shot performance was already competitive, confirming that pre‑trained knowledge (e.g., typical process durations, causal relations) contributes meaningfully.
- Fine‑tuning added a modest boost, indicating that the model can quickly adapt to domain‑specific quirks.
- Qualitative probing showed the LLM often cites logical constraints (“activity X cannot follow Y”) that are not explicitly encoded in the training data, evidencing higher‑order reasoning.
Practical Implications
- Rapid deployment: Companies can start predictive monitoring with minimal historical data, reducing the “cold‑start” problem that plagues traditional ML pipelines.
- Lower engineering overhead: Instead of building custom feature extraction pipelines for each process, developers can feed raw event logs into a prompt and obtain predictions, leveraging the LLM as a “plug‑and‑play” predictor.
- Explainability: The ability to request natural‑language rationales from the model can aid compliance teams and process analysts who need to justify predictions.
- Cross‑process transfer: Because the LLM carries generic process semantics, it can be reused across different domains (e.g., finance, healthcare) with only a few examples, accelerating time‑to‑value.
Limitations & Future Work
- Scalability: Prompt length limits mean very long traces must be truncated or summarized, which could discard useful context.
- Cost & latency: Running large LLMs (especially fine‑tuned versions) incurs higher compute costs compared to lightweight classifiers.
- Robustness to noisy logs: The study used clean, well‑structured logs; real‑world event data often contain missing timestamps or mislabeled activities.
- Future directions proposed by the authors include: exploring retrieval‑augmented prompting to handle longer histories (a toy sketch follows below), integrating domain‑specific ontologies to improve reasoning fidelity, and benchmarking on larger, noisier datasets to assess robustness.
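To make the retrieval‑augmented direction concrete, here is a toy sketch of my own, not the authors' proposal: it retrieves the k completed traces most similar to the running prefix (Jaccard overlap on activity sets, an assumed similarity measure) and prepends them to the prompt as in‑context examples.

```python
# Hypothetical sketch of retrieval-augmented prompting: select the k
# historical traces most similar to the running prefix and prepend
# them as in-context examples before asking for a prediction.
from typing import List

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two activity sets."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

def retrieve_examples(prefix: List[str], history: List[List[str]], k: int = 3) -> List[List[str]]:
    """Return the k completed traces whose activities overlap most with the prefix."""
    scored = sorted(history, key=lambda trace: jaccard(set(prefix), set(trace)), reverse=True)
    return scored[:k]

def augmented_prompt(prefix: List[str], history: List[List[str]]) -> str:
    """Build a prompt that shows similar completed cases before the running one."""
    examples = retrieve_examples(prefix, history)
    shots = "\n".join("Completed case: " + " -> ".join(t) for t in examples)
    return (
        "Here are similar completed cases:\n"
        f"{shots}\n"
        "Current partial case: " + " -> ".join(prefix) + "\n"
        "Predict the next activity."
    )

history = [
    ["Register", "Check", "Approve"],
    ["Register", "Reject"],
    ["Register", "Check", "Escalate", "Approve"],
]
print(augmented_prompt(["Register", "Check"], history))
```

Retrieving only the most relevant histories keeps the prompt within the model's context window, which is the scalability concern noted above.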
Authors
- Alessandro Padella
- Massimiliano de Leoni
- Marlon Dumas
Paper Information
- arXiv ID: 2601.11468v1
- Categories: cs.AI, cs.IT
- Published: January 16, 2026