[Paper] An LLVM-Based Optimization Pipeline for SPDZ

Published: December 11, 2025 at 03:53 PM EST
4 min read
Source: arXiv - 2512.11112v1

Overview

The paper presents a prototype compiler‑runtime stack that plugs the SPDZ secure multiparty computation (MPC) protocol into the LLVM ecosystem. Developers write ordinary C code with lightweight privacy annotations; the system then automatically extracts parallelism, batches arithmetic operations, and overlaps communication with computation, yielding speed‑ups of up to 5.56× over MP‑SPDZ on CPU plus a GPU back‑end that scales with input size.

Key Contributions

  • LLVM‑based front‑end that accepts a small, privacy‑annotated subset of C and lowers it to LLVM IR, reusing LLVM’s mature analyses (see the annotated‑input sketch after this list).
  • Automatic batching of independent arithmetic operations, removing the need for programmers to manually express parallelism.
  • Protocol‑aware scheduler in the back‑end that performs data‑flow and control‑flow analysis to drive a non‑blocking runtime, overlapping network traffic with local computation.
  • GPU off‑loading path that maps large batched arithmetic kernels to CUDA kernels when available.
  • Empirical evaluation showing up to 5.56× speed‑up over MP‑SPDZ on CPU and strong scaling with thread count; GPU back‑end scales better for larger inputs.
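
To make the front‑end contribution concrete, here is a small illustration of what annotated input could look like. The @secret marker is the paper’s; the SECRET macro below is an assumed spelling of it using Clang’s annotate attribute, which Clang carries into the emitted LLVM IR (as llvm.var.annotation calls) where custom passes can read it back:

```cpp
// Sketch only: the paper writes the privacy marker as @secret; the
// SECRET macro is a stand-in based on Clang's annotate attribute,
// which survives into LLVM IR where custom passes can query it.
#define SECRET __attribute__((annotate("secret")))

// Dot product where the inputs are assumed to hold secret shares.
// Every xs[i] * ys[i] is an independent secret multiplication, so a
// batching pass can fuse all n of them into a single SPDZ online
// round instead of n sequential rounds; the additions stay local.
long dot(const long *xs, const long *ys, int n) {
    SECRET long acc = 0;
    for (int i = 0; i < n; ++i) {
        SECRET long prod = xs[i] * ys[i];  // independent secret product
        acc += prod;                       // local, no communication
    }
    return acc;
}
```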

Methodology

  1. Front‑end parsing – Developers write C code with lightweight annotations (e.g., @secret) to mark private values. The parser translates this into LLVM IR, preserving the annotations as metadata.
  2. LLVM optimizations – Standard passes (dead‑code elimination, loop unrolling, etc.) run unchanged. Custom passes then detect independent arithmetic statements and group them into batches (a simplified pass sketch follows this list).
  3. Back‑end analysis – A data‑flow pass builds a dependency graph of the batched operations. A control‑flow pass identifies points where communication (sending/receiving secret shares) can be overlapped with independent local work.
  4. Runtime scheduler – The scheduler is non‑blocking: it issues network messages early, then continues executing any ready batches while awaiting replies. When a batch is large enough and a GPU is present, it is dispatched to a CUDA kernel (a scheduler sketch also follows this list).
  5. Evaluation – The authors run a suite of micro‑benchmarks (matrix multiplication, polynomial evaluation, etc.) in SPDZ’s online phase, comparing against the state‑of‑the‑art MP‑SPDZ implementation on both CPU‑only and CPU+GPU configurations.
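
A simplified sketch of step 2’s batching idea, written against LLVM’s new pass manager. This is an illustration rather than the paper’s actual pass: it only checks direct dependences, ignores the secret annotations, and stops short of rewriting each group into a batched SPDZ call.

```cpp
#include "llvm/ADT/SmallPtrSet.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/PassManager.h"
#include <vector>

using namespace llvm;

// Greedily groups integer multiplications in each basic block so that
// no multiplication consumes the result of another one in its group;
// each group could then be lowered to one SPDZ communication round.
struct MulBatchingSketch : PassInfoMixin<MulBatchingSketch> {
  PreservedAnalyses run(Function &F, FunctionAnalysisManager &) {
    for (BasicBlock &BB : F) {
      std::vector<std::vector<Instruction *>> Batches;
      std::vector<Instruction *> Batch;        // group under construction
      SmallPtrSet<Instruction *, 16> InBatch;  // results the group defines
      for (Instruction &I : BB) {
        if (I.getOpcode() != Instruction::Mul)
          continue;                            // only batch multiplies
        bool UsesBatchResult = false;          // direct deps only; a full
        for (Value *Op : I.operands())         // pass would also follow
          if (auto *Def = dyn_cast<Instruction>(Op))  // transitive chains
            UsesBatchResult |= InBatch.count(Def) != 0;
        if (UsesBatchResult) {                 // dependence: close group
          Batches.push_back(Batch);
          Batch.clear();
          InBatch.clear();
        }
        Batch.push_back(&I);
        InBatch.insert(&I);
      }
      if (!Batch.empty())
        Batches.push_back(Batch);
      (void)Batches;  // a real back end rewrites each group here
    }
    return PreservedAnalyses::all();
  }
};
```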
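
Steps 3 and 4 come together in a dependency‑graph‑driven, non‑blocking loop. The sketch below is a minimal stand‑alone illustration, not the paper’s runtime: Batch, issue_async, poll_reply, and run_local are hypothetical names, and the network is stubbed with a local queue so the code is self‑contained.

```cpp
#include <cstddef>
#include <queue>
#include <vector>

struct Batch {
    std::vector<int> succs;      // batches that depend on this one
    int pending_deps = 0;        // unmet dependency count
    bool needs_network = false;  // true if the batch opens shares
};

// Hypothetical runtime hooks, stubbed so the sketch runs on its own.
static std::queue<int> wire;                      // in-flight messages
static void issue_async(int b) { wire.push(b); }  // post shares, return
static bool poll_reply(int &b) {                  // non-blocking check
    if (wire.empty()) return false;
    b = wire.front(); wire.pop(); return true;
}
static void run_local(int /*b*/) { /* share-local arithmetic */ }

void schedule(std::vector<Batch> &g) {
    std::queue<int> local;  // ready batches needing no communication
    std::size_t done = 0, outstanding = 0;
    auto on_ready = [&](int b) {
        if (g[b].needs_network) { issue_async(b); ++outstanding; }
        else                      local.push(b);
    };
    for (int b = 0; b < (int)g.size(); ++b)  // seed with the roots
        if (g[b].pending_deps == 0) on_ready(b);

    while (done < g.size()) {
        int fin;
        if (outstanding > 0 && poll_reply(fin)) {
            --outstanding;       // a network batch completed
        } else if (!local.empty()) {
            fin = local.front(); local.pop();
            run_local(fin);      // overlap: local work hides latency
        } else {
            continue;            // nothing local left: spin for replies
        }
        ++done;
        for (int s : g[fin].succs)  // release dependents
            if (--g[s].pending_deps == 0) on_ready(s);
    }
}
```

In the paper’s step 4, a sufficiently large batch would be handed to a CUDA kernel instead of run_local; the overlap logic stays the same.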

Results & Findings

| Configuration | Speed‑up vs. MP‑SPDZ | Scaling behavior |
| --- | --- | --- |
| CPU, 1 thread | 1.8×–2.3× (light workloads) | Near‑linear up to 8 cores |
| CPU, 8 threads | Up to 5.56× (heavy algebra) | Strong scaling; diminishing returns after 16 threads |
| GPU (CUDA) | 2.5×–4.0× over CPU‑only for large inputs | Improves as batch size grows; launch/transfer overhead dominates for small problems |

Key takeaways

  • Automatic batching eliminates most of the manual parallelism engineering required by existing SPDZ toolchains.
  • Non‑blocking scheduling hides network latency, which is especially beneficial when the underlying network has high latency but high bandwidth.
  • GPU acceleration becomes worthwhile once the batched workload exceeds a few thousand arithmetic ops, matching the typical size of real‑world MPC tasks (e.g., privacy‑preserving ML inference).

Practical Implications

  • Lower barrier to entry – Developers can now write familiar C code with simple annotations instead of learning domain‑specific languages or hand‑crafting parallel MPC pipelines.
  • Faster production deployments – The observed speed‑ups translate directly into lower compute costs and tighter latency budgets for privacy‑preserving services (e.g., secure auctions, federated analytics).
  • Hardware‑agnostic scaling – The same codebase can run efficiently on multi‑core CPUs or be upgraded to GPU‑accelerated clusters without code changes, enabling a smooth migration path as workloads grow.
  • Potential for integration – Because the front‑end emits standard LLVM IR, existing toolchains (Clang, Rust‑LLVM back‑ends, etc.) could be extended to target SPDZ, opening the door for broader language support.

Limitations & Future Work

  • Subset of C – Only a limited set of language features (straight‑line arithmetic, simple loops) is currently supported; complex data structures and dynamic memory are out of scope.
  • Online‑phase focus – The evaluation concentrates on the online phase; offline preprocessing (pre‑computation of multiplication triples) is not accelerated.
  • Prototype maturity – The system is a proof‑of‑concept; robustness, error handling, and integration with existing MPC frameworks need further engineering.
  • Future directions – Extending the front‑end to full C/C++ (or other languages), adding support for other secure‑computation back ends (e.g., the BGV and CKKS homomorphic‑encryption schemes), and exploring heterogeneous scheduling across CPU, GPU, and FPGA accelerators.

Authors

  • Tianye Dai
  • Hammurabi Mendes
  • Heuichan Lim

Paper Information

  • arXiv ID: 2512.11112v1
  • Categories: cs.CR, cs.DC, cs.SE
  • Published: December 11, 2025