[Paper] Preserving Data Privacy in Learning Causal Structure with Fully Homomorphic Encryption

Published: 1 day ago (June 3, 2026 at 01:33 PM EDT)

5 min read

Source: arXiv

Source: arXiv - 2606.05129v1

Overview

The paper introduces a way to learn causal graphs from distributed data without ever exposing the raw data, by running the entire learning pipeline inside a Fully Homomorphic Encryption (FHE) scheme. By keeping everything encrypted during transmission and computation, the approach tackles a major privacy hurdle for collaborative analytics, while still delivering causal structures that match those obtained on plaintext data.

Key Contributions

FHE‑based causal structure learning: First end‑to‑end system that performs score‑based causal discovery directly on ciphertexts.
Efficiency tricks for FHE:
1. Circuit simplification to shrink the homomorphic program.
2. Division & logarithm approximations using Newton‑Raphson reciprocals and Taylor series, sidestepping FHE’s lack of native support for these ops.
3. SIMD‑style batching that packs many data points into a single ciphertext, achieving massive parallelism.
Portability to other privacy models: Demonstrated that the same pipeline can be swapped to a differential‑privacy backend with minimal changes.
Empirical validation: Shows near‑identical causal graphs to the plaintext baseline on several benchmark datasets, with total runtimes in the order of tens of minutes even under FHE.

Methodology

Problem setting – Multiple parties hold separate data slices and want to jointly infer a directed acyclic graph (DAG) that explains conditional independencies.
Homomorphic encryption layer – All parties encrypt their data with a common public key (e.g., CKKS scheme) and send ciphertexts to a central compute node.
Score‑based search – The algorithm uses a common Bayesian‑information‑criterion (BIC) or similar score that requires sums, products, divisions, and logarithms.
Circuit design –
- Simplification: Reduces the number of arithmetic gates by re‑ordering operations and pre‑computing constant terms.
- Approximation: Implements division as a reciprocal via Newton‑Raphson iteration and replaces log(x) with a truncated Taylor expansion around a safe pivot. Both approximations converge quickly enough for the required precision.
Batching (SIMD) – CKKS allows packing a vector of plaintext slots into one ciphertext. The authors pack many rows/columns of the data matrix, turning a single homomorphic evaluation into a parallel operation over dozens of values.
Extension to differential privacy – By swapping the encryption primitive for a noise‑addition mechanism, the same high‑level code path can satisfy DP guarantees, illustrating the modularity of the design.

Results & Findings

Dataset	Plaintext DAG F‑score	Encrypted DAG F‑score	Runtime (plain)	Runtime (FHE)
Alarm	0.92	0.91	1.2 min	12 min
Cancer	0.88	0.87	0.9 min	9 min
Synthetic (10k rows)	0.95	0.94	3 min	28 min

Accuracy – The encrypted version reproduces the same edge set (≥ 95 % overlap) with only a negligible drop in the scoring metric, confirming that the approximations do not materially distort the learning outcome.
Performance – Thanks to circuit simplifications and SIMD batching, the end‑to‑end FHE pipeline finishes in tens of minutes, a practical window for many offline analytics workloads.
Portability – Switching to a differential‑privacy backend incurs < 5 % overhead, showing that the core algorithm is agnostic to the underlying privacy primitive.

Practical Implications

Secure multi‑party analytics: Companies can collaboratively discover causal relationships (e.g., root causes of churn, failure modes in IoT fleets) without ever sharing raw logs or customer data.
Regulatory compliance: The approach aligns with GDPR, HIPAA, and other data‑protection regimes that forbid raw data export across borders.
Tooling for ML pipelines: The SIMD‑style batching technique can be incorporated into existing FHE libraries (Microsoft SEAL, PALISADE) to accelerate any workload that needs division/logarithm‑heavy arithmetic.
Hybrid privacy stacks: Teams can start with FHE for the most sensitive phases (data ingestion, scoring) and fall back to DP for downstream reporting, reusing the same code base.
Cost‑effective cloud deployment: Since the heavy homomorphic work is done on a single compute node, organizations can provision a modest GPU/CPU instance rather than a full secure enclave network.

Limitations & Future Work

Scalability ceiling: While tens of minutes are acceptable for batch jobs, real‑time or streaming causal inference remains out of reach due to FHE’s inherent latency.
Approximation error bounds: The paper provides empirical error measurements but lacks formal guarantees on how the Newton‑Raphson/Taylor truncations affect the statistical consistency of the learned graph.
Hardware acceleration: Current experiments run on CPUs; leveraging FHE‑friendly ASICs or GPUs could further shrink runtimes.
Broader causal models: Extending the technique to constraint‑based or hybrid causal discovery algorithms (e.g., PC, GES) is an open direction.

Overall, the work demonstrates that privacy‑preserving causal discovery is moving from theory to practice, offering a concrete blueprint for developers who need to blend data science with strong confidentiality guarantees.

Authors

Jian Yang
Yuan Tong
Qinbin Li
Zeyi Wen
Xiaofang Zhou

Paper Information

arXiv ID: 2606.05129v1
Categories: cs.CR, cs.LG
Published: June 3, 2026
PDF: Download PDF

[Paper] Preserving Data Privacy in Learning Causal Structure with Fully Homomorphic Encryption

Overview

Key Contributions

Methodology

Results & Findings

Practical Implications

Limitations & Future Work

Authors

Paper Information

Related posts

[Paper] STRIDE: Training Data Attribution via Sparse Recovery from Subset Perturbations

[Paper] Streaming Communication in Multi-Agent Reasoning

[Paper] Reinforcement Learning from Rich Feedback with Distributional DAgger

[Paper] Multi-Column RBF Neural Network Using Adaptive and Non-Adaptive Particle Swarm Optimization