[Paper] An LLVM-Based Optimization Pipeline for SPDZ
Source: arXiv - 2512.11112v1
Overview
The paper presents a prototype compiler‑runtime stack that plugs the SPDZ secure‑multiparty computation (MPC) protocol into the LLVM ecosystem. By letting developers write ordinary (annotated) C code, the system automatically extracts parallelism, batches arithmetic operations, and overlaps communication with computation, yielding up to 5.56× speed‑ups on CPU and GPU acceleration that scales with input size.
Key Contributions
- LLVM‑based front‑end that accepts a small, privacy‑annotated subset of C and lowers it to LLVM IR, reusing LLVM’s mature analyses.
- Automatic batching of independent arithmetic operations, removing the need for programmers to manually express parallelism.
- Protocol‑aware scheduler in the back‑end that performs data‑flow and control‑flow analysis to drive a non‑blocking runtime, overlapping network traffic with local computation.
- GPU off‑loading path that maps large batched arithmetic kernels to CUDA kernels when available.
- Empirical evaluation showing up to 5.56× speed‑up over MP‑SPDZ on CPU and strong scaling with thread count; GPU back‑end scales better for larger inputs.
Methodology
- Front‑end parsing – Developers write C code with lightweight annotations (e.g., @secret) to mark private values. The parser translates this into LLVM IR, preserving the annotations as metadata (a sketch of the annotated‑source style follows this list).
- LLVM optimizations – Standard passes (dead‑code elimination, loop unrolling, etc.) run unchanged. Custom passes then detect independent arithmetic statements and group them into batches (see the batching sketch after this list).
- Back‑end analysis – A data‑flow pass builds a dependency graph of the batched operations. A control‑flow pass identifies points where communication (sending/receiving secret shares) can be overlapped with independent local work.
- Runtime scheduler – The scheduler is non‑blocking: it issues network messages early, then continues executing any ready batches while awaiting replies. When a batch is large enough and a GPU is present, it is dispatched to a CUDA kernel (a minimal scheduling sketch follows this list).
- Evaluation – The authors evaluate a suite of micro‑benchmarks (matrix multiplication, polynomial evaluation, etc.) in the online phase of SPDZ, comparing against the state‑of‑the‑art MP‑SPDZ implementation on both CPU‑only and CPU+GPU configurations.
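To make the annotation step concrete, here is a minimal sketch of the privacy‑annotated input style. The paper's marker is @secret; this sketch substitutes Clang's annotate attribute, which carries a string tag from source into LLVM IR, to show the same idea in compilable form. The variables and function are illustrative, not taken from the paper.

```cpp
// Sketch of privacy-annotated source (assumption: the paper's `@secret`
// marker plays the role that Clang's `annotate` attribute plays here,
// i.e., it tags values so the tag survives into the emitted LLVM IR).
// Compile with: clang++ -O2 -emit-llvm -S annotated.cpp
[[clang::annotate("secret")]] int a[4];   // secret-shared inputs
[[clang::annotate("secret")]] int b[4];

int dot4() {
    int acc = 0;
    // The four products are mutually independent, so a batching pass can
    // group them into a single SPDZ multiplication round.
    for (int i = 0; i < 4; ++i)
        acc += a[i] * b[i];
    return acc;
}
```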
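The batching pass is described only at a high level, so the following self‑contained sketch illustrates just the core idea: assign each operation the depth of its longest dependency chain, and treat each depth as one batch of mutually independent operations whose share‑openings can go out in a single communication round. The Op structure and the levelling heuristic are assumptions for illustration, not the authors' pass.

```cpp
#include <algorithm>
#include <cstdio>
#include <map>
#include <vector>

// One operation in a straight-line secret-shared program; `deps` holds
// indices of earlier ops whose results this op consumes (illustrative
// model, not the paper's IR).
struct Op {
    std::vector<int> deps;
};

// Group ops by longest-dependency-chain depth: ops at the same depth are
// mutually independent, so each depth forms one batch (one network round).
std::map<int, std::vector<int>> batchByDepth(const std::vector<Op>& ops) {
    std::vector<int> depth(ops.size(), 0);
    std::map<int, std::vector<int>> batches;
    for (int i = 0; i < static_cast<int>(ops.size()); ++i) {
        for (int d : ops[i].deps)
            depth[i] = std::max(depth[i], depth[d] + 1);
        batches[depth[i]].push_back(i);
    }
    return batches;
}

int main() {
    // t0 = a*b, t1 = c*d, t2 = t0*t1: t0 and t1 share a round; t2 waits.
    std::vector<Op> ops(3);
    ops[2].deps = {0, 1};
    for (const auto& [d, batch] : batchByDepth(ops)) {
        std::printf("round %d:", d);
        for (int i : batch) std::printf(" t%d", i);
        std::printf("\n");
    }
}
```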
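And here is a minimal sketch of the non‑blocking pattern described above, simulating the network round with a sleep: the runtime issues the share exchange early, performs independent local work in the meantime, and blocks only when a dependent batch actually needs the opened values. Function names and timings are placeholders, not the paper's runtime API.

```cpp
#include <chrono>
#include <cstdio>
#include <future>
#include <thread>

// Stand-in for sending/receiving masked shares for one batch (in the real
// runtime this would be a socket round-trip; simulated here by a sleep).
int exchangeShares(int batchId) {
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
    return batchId;
}

// Stand-in for communication-free local work (e.g., share additions) that
// does not depend on the in-flight batch.
void localWork() {
    std::this_thread::sleep_for(std::chrono::milliseconds(40));
}

int main() {
    // Issue the network round early...
    auto inflight = std::async(std::launch::async, exchangeShares, 0);
    // ...and keep executing ready batches instead of blocking on the reply.
    localWork();
    // Only block once a dependent batch needs the opened values.
    int done = inflight.get();
    std::printf("batch %d opened; dependent batch can now run\n", done);
}
```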
Results & Findings
| Configuration | Speed‑up vs. MP‑SPDZ | Scaling behavior |
|---|---|---|
| CPU, 1 thread | 1.8× – 2.3× (light workloads) | Near‑linear up to 8 cores |
| CPU, 8 threads | up to 5.56× (heavy algebra) | Strong scaling, diminishing returns after 16 threads |
| GPU (CUDA) | 2.5× – 4.0× over CPU‑only for large inputs | Improves as batch size grows; overhead negligible for small problems |
Key Takeaways
- Automatic batching eliminates most of the manual parallelism engineering required by existing SPDZ toolchains.
- Non‑blocking scheduling hides network latency, especially beneficial when the underlying network is high‑latency but high‑bandwidth.
- GPU acceleration becomes worthwhile once the batched workload exceeds a few thousand arithmetic ops, matching the typical size of real‑world MPC tasks (e.g., privacy‑preserving ML inference); a dispatch sketch follows this list.
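A tiny sketch of the dispatch decision implied by this takeaway. The threshold constant is an assumption (the summary says only "a few thousand"), and gpuAvailable, launchGpuBatch, and runCpuBatch are hypothetical stubs rather than the paper's API.

```cpp
#include <cstddef>
#include <cstdio>

// Assumed cutoff: below this, kernel-launch and transfer overhead would
// dominate, so the batch stays on the CPU.
constexpr std::size_t kGpuThreshold = 4096;

bool gpuAvailable() { return false; }                  // illustrative stub
void launchGpuBatch(std::size_t n) { std::printf("GPU: %zu ops\n", n); }
void runCpuBatch(std::size_t n)    { std::printf("CPU: %zu ops\n", n); }

void dispatchBatch(std::size_t numOps) {
    if (numOps >= kGpuThreshold && gpuAvailable())
        launchGpuBatch(numOps);
    else
        runCpuBatch(numOps);
}

int main() {
    dispatchBatch(128);    // small batch: CPU
    dispatchBatch(65536);  // large batch: GPU if present
}
```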
Practical Implications
- Lower barrier to entry – Developers can now write familiar C code with simple annotations instead of learning domain‑specific languages or hand‑crafting parallel MPC pipelines.
- Faster production deployments – The observed speed‑ups translate directly into lower compute costs and tighter latency budgets for privacy‑preserving services (e.g., secure auctions, federated analytics).
- Hardware‑agnostic scaling – The same codebase can run efficiently on multi‑core CPUs or be upgraded to GPU‑accelerated clusters without code changes, enabling a smooth migration path as workloads grow.
- Potential for integration – Because the front‑end emits standard LLVM IR, existing toolchains (Clang, Rust‑LLVM back‑ends, etc.) could be extended to target SPDZ, opening the door for broader language support.
Limitations & Future Work
- Subset of C – Only a limited set of language features (straight‑line arithmetic, simple loops) is currently supported; complex data structures and dynamic memory are out of scope.
- Online‑phase focus – The evaluation concentrates on the online phase; offline preprocessing (pre‑computation of multiplication triples) is not accelerated.
- Prototype maturity – The system is a proof‑of‑concept; robustness, error handling, and integration with existing MPC frameworks need further engineering.
- Future directions – Extending the front‑end to full C/C++ (or other languages), adding support for other secure‑computation back‑ends (e.g., the BGV and CKKS homomorphic‑encryption schemes), and exploring heterogeneous scheduling across CPU, GPU, and FPGA accelerators.
Authors
- Tianye Dai
- Hammurabi Mendes
- Heuichan Lim
Paper Information
- arXiv ID: 2512.11112v1
- Categories: cs.CR, cs.DC, cs.SE
- Published: December 11, 2025