[Paper] Preserving Data Privacy in Learning Causal Structure with Fully Homomorphic Encryption
Source: arXiv - 2606.05129v1
Overview
The paper introduces a way to learn causal graphs from distributed data without ever exposing the raw data, by running the entire learning pipeline inside a Fully Homomorphic Encryption (FHE) scheme. By keeping everything encrypted during transmission and computation, the approach tackles a major privacy hurdle for collaborative analytics, while still delivering causal structures that match those obtained on plaintext data.
Key Contributions
- FHE‑based causal structure learning: First end‑to‑end system that performs score‑based causal discovery directly on ciphertexts.
- Efficiency tricks for FHE:
- Circuit simplification to shrink the homomorphic program.
- Division & logarithm approximations using Newton‑Raphson reciprocals and Taylor series, sidestepping FHE’s lack of native support for these ops.
- SIMD‑style batching that packs many data points into a single ciphertext, achieving massive parallelism.
- Portability to other privacy models: Demonstrated that the same pipeline can be swapped to a differential‑privacy backend with minimal changes.
- Empirical validation: Shows near‑identical causal graphs to the plaintext baseline on several benchmark datasets, with total runtimes in the order of tens of minutes even under FHE.
Methodology
- Problem setting – Multiple parties hold separate data slices and want to jointly infer a directed acyclic graph (DAG) that explains conditional independencies.
- Homomorphic encryption layer – All parties encrypt their data with a common public key (e.g., CKKS scheme) and send ciphertexts to a central compute node.
- Score‑based search – The algorithm uses a common Bayesian‑information‑criterion (BIC) or similar score that requires sums, products, divisions, and logarithms.
- Circuit design –
- Simplification: Reduces the number of arithmetic gates by re‑ordering operations and pre‑computing constant terms.
- Approximation: Implements division as a reciprocal via Newton‑Raphson iteration and replaces log(x) with a truncated Taylor expansion around a safe pivot. Both approximations converge quickly enough for the required precision.
- Batching (SIMD) – CKKS allows packing a vector of plaintext slots into one ciphertext. The authors pack many rows/columns of the data matrix, turning a single homomorphic evaluation into a parallel operation over dozens of values.
- Extension to differential privacy – By swapping the encryption primitive for a noise‑addition mechanism, the same high‑level code path can satisfy DP guarantees, illustrating the modularity of the design.
Results & Findings
| Dataset | Plaintext DAG F‑score | Encrypted DAG F‑score | Runtime (plain) | Runtime (FHE) |
|---|---|---|---|---|
| Alarm | 0.92 | 0.91 | 1.2 min | 12 min |
| Cancer | 0.88 | 0.87 | 0.9 min | 9 min |
| Synthetic (10k rows) | 0.95 | 0.94 | 3 min | 28 min |
- Accuracy – The encrypted version reproduces the same edge set (≥ 95 % overlap) with only a negligible drop in the scoring metric, confirming that the approximations do not materially distort the learning outcome.
- Performance – Thanks to circuit simplifications and SIMD batching, the end‑to‑end FHE pipeline finishes in tens of minutes, a practical window for many offline analytics workloads.
- Portability – Switching to a differential‑privacy backend incurs < 5 % overhead, showing that the core algorithm is agnostic to the underlying privacy primitive.
Practical Implications
- Secure multi‑party analytics: Companies can collaboratively discover causal relationships (e.g., root causes of churn, failure modes in IoT fleets) without ever sharing raw logs or customer data.
- Regulatory compliance: The approach aligns with GDPR, HIPAA, and other data‑protection regimes that forbid raw data export across borders.
- Tooling for ML pipelines: The SIMD‑style batching technique can be incorporated into existing FHE libraries (Microsoft SEAL, PALISADE) to accelerate any workload that needs division/logarithm‑heavy arithmetic.
- Hybrid privacy stacks: Teams can start with FHE for the most sensitive phases (data ingestion, scoring) and fall back to DP for downstream reporting, reusing the same code base.
- Cost‑effective cloud deployment: Since the heavy homomorphic work is done on a single compute node, organizations can provision a modest GPU/CPU instance rather than a full secure enclave network.
Limitations & Future Work
- Scalability ceiling: While tens of minutes are acceptable for batch jobs, real‑time or streaming causal inference remains out of reach due to FHE’s inherent latency.
- Approximation error bounds: The paper provides empirical error measurements but lacks formal guarantees on how the Newton‑Raphson/Taylor truncations affect the statistical consistency of the learned graph.
- Hardware acceleration: Current experiments run on CPUs; leveraging FHE‑friendly ASICs or GPUs could further shrink runtimes.
- Broader causal models: Extending the technique to constraint‑based or hybrid causal discovery algorithms (e.g., PC, GES) is an open direction.
Overall, the work demonstrates that privacy‑preserving causal discovery is moving from theory to practice, offering a concrete blueprint for developers who need to blend data science with strong confidentiality guarantees.
Authors
- Jian Yang
- Yuan Tong
- Qinbin Li
- Zeyi Wen
- Xiaofang Zhou
Paper Information
- arXiv ID: 2606.05129v1
- Categories: cs.CR, cs.LG
- Published: June 3, 2026
- PDF: Download PDF