[Paper] A TEE-based Approach for Preserving Data Secrecy in Process Mining with Decentralized Sources
Source: arXiv - 2602.04697v1
Overview
Process mining is becoming a go‑to technique for turning raw event logs into actionable process insights. When those logs are scattered across multiple independent companies, however, sharing data can expose sensitive business information. The paper introduces CONFINE, a framework that uses Trusted Execution Environments (TEEs) to let several parties collaboratively mine their logs while keeping each organization’s raw data secret.
Key Contributions
- TEE‑based secrecy preservation – Deploys a trusted application inside a TEE (e.g., Intel SGX) that can ingest and process multi‑party logs without ever exposing the clear‑text data to the host OS or other participants.
- Four‑stage secure protocol – Defines a complete end‑to‑end workflow (provisioning, secure transfer, aggregation, result release) that guarantees confidentiality and integrity of the exchanged logs.
- Segmentation strategy for limited enclave memory – Breaks large logs into small batches that fit inside the enclave, preventing out‑of‑memory crashes while preserving the semantics of the mining algorithm.
- Formal verification & security analysis – Uses model‑checking to prove protocol correctness and evaluates the TEE’s threat model to show that data leakage is infeasible under realistic attacker capabilities.
- Scalable prototype evaluation – Demonstrates, on both synthetic and real‑world datasets, that enclave memory grows logarithmically with log size while runtime grows linearly with the number of participating organizations.
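The segmentation idea can be sketched as a simple batching generator that never materializes the whole log at once. This is a minimal illustration, not the paper's code: the names `segment_log` and `batch_size` are hypothetical, and a real deployment would size batches to the enclave's memory budget.

```python
from itertools import islice
from typing import Iterable, Iterator, List

def segment_log(events: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Split an event stream into batches small enough to fit in enclave memory."""
    it = iter(events)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# Example: 10 events in batches of 4 yield batch sizes 4, 4, 2.
log = [{"case": i % 3, "activity": f"A{i}"} for i in range(10)]
print([len(b) for b in segment_log(log, batch_size=4)])  # [4, 4, 2]
```

Because each batch is consumed and discarded before the next is fetched, peak memory is bounded by the batch size regardless of total log length.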
Methodology
- Architecture – A central orchestrator coordinates the mining job, but the actual computation runs inside a TEE on a cloud node. Each organization runs a lightweight client that encrypts its log and streams it to the enclave.
- Secure Data Exchange – The protocol uses mutual attestation (both parties prove they run genuine TEEs) followed by a Diffie‑Hellman key exchange to derive a session key. Logs are then sent in encrypted chunks.
- Batch Processing – Inside the enclave, the log is reconstructed chunk‑by‑chunk. The mining algorithm (e.g., discovery of a process model) works on the incremental view, updating internal data structures without ever storing the full log in memory.
- Result Release – Once processing finishes, the enclave signs the mined model and sends it back. The signature proves that the result was produced inside a verified TEE and that no raw data was leaked.
- Verification – The authors model the protocol in the TLA+ language and automatically check safety properties (no data leakage) and liveness (the protocol eventually terminates).
The whole pipeline is implemented in Python/C++ with SGX SDK bindings, making it relatively easy to integrate into existing process‑mining stacks.
Results & Findings
| Scenario | Log Size | # Orgs | Enclave Memory | Runtime |
|---|---|---|---|---|
| Synthetic (linear) | 10 M events | 3 | 12 MB (≈ log₂ of event count) | 2.3 min |
| Real‑world (order‑to‑cash) | 2.4 M events | 5 | 8 MB | 1.1 min |
| Stress test | 50 M events | 2 | 18 MB | 7.9 min |
- Memory grows logarithmically with the total number of events thanks to the batch‑wise processing.
- Runtime scales linearly with the number of participating organizations, as each adds an extra encrypted transfer step.
- The approach successfully mined standard process‑discovery models (e.g., BPMN, Petri nets) that were identical to those obtained from a non‑secure, centralized run, confirming functional equivalence.
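One concrete way to see why batch-wise processing keeps memory flat is incremental directly-follows counting, a standard building block of discovery algorithms. This sketch is an assumption about the general technique, not the paper's implementation; the names `update_dfg` and `last_seen` are hypothetical.

```python
from collections import Counter

def update_dfg(dfg: Counter, last_seen: dict, batch):
    """Fold one batch of (case_id, activity) events into a directly-follows graph.

    Only the pair counts and the last activity per open case stay in memory,
    so the footprint depends on distinct activities and cases, not on the
    total number of events processed.
    """
    for case, activity in batch:
        prev = last_seen.get(case)
        if prev is not None:
            dfg[(prev, activity)] += 1
        last_seen[case] = activity

# Two batches, e.g. arriving from different organizations' encrypted chunks.
batch1 = [(1, "Create Order"), (2, "Create Order"), (1, "Approve")]
batch2 = [(2, "Approve"), (1, "Ship"), (2, "Ship")]

dfg, last_seen = Counter(), {}
for batch in (batch1, batch2):
    update_dfg(dfg, last_seen, batch)

print(dfg[("Create Order", "Approve")])  # 2
print(dfg[("Approve", "Ship")])          # 2
```

Because the counts are identical whether events arrive in one pass or in many batches, the incremental view yields the same discovered model as a centralized run, matching the functional-equivalence result reported above.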
Practical Implications
- Secure SaaS Process‑Mining – Vendors can now offer cloud‑based mining services without requiring clients to hand over raw logs, opening doors to cross‑company analytics (supply‑chain, finance, healthcare).
- Compliance‑by‑Design – The TEE guarantees that data never leaves the enclave in clear text, helping organizations meet GDPR, CCPA, or industry‑specific confidentiality clauses.
- Plug‑and‑Play Integration – Because the client side is just a thin encryption wrapper, existing log exporters (e.g., from ERP or BPM systems) can be retrofitted with minimal code changes.
- Cost‑Effective Scaling – The logarithmic memory footprint means a single modest‑size cloud VM can handle multi‑gigabyte logs, reducing infrastructure spend compared to heavyweight homomorphic‑encryption alternatives.
Developers can start experimenting by pulling the open‑source CONFINE prototype, swapping the SGX enclave for any TEE that supports remote attestation (e.g., AMD SEV, ARM TrustZone), and plugging in their favorite process‑mining library.
Limitations & Future Work
- TEE Trust Assumptions – Security hinges on the integrity of the underlying hardware and firmware; side‑channel attacks (e.g., cache‑timing) are not fully mitigated.
- Network Overhead – Encrypting and transmitting logs in many small batches adds latency, especially over high‑latency links.
- Algorithm Scope – The current implementation focuses on discovery algorithms; conformance checking, predictive analytics, or deep‑learning‑based mining are not yet supported.
- Future Directions – The authors plan to (1) integrate side‑channel hardened enclaves, (2) explore adaptive batch sizing to reduce round‑trips, and (3) extend the framework to support federated learning‑style process‑model refinement across many more participants.
Authors
- Davide Basile
- Valerio Goretti
- Luca Barbaro
- Hajo A. Reijers
- Claudio Di Ciccio
Paper Information
- arXiv ID: 2602.04697v1
- Categories: cs.DC
- Published: February 4, 2026