[Paper] Leveraging SIMD for Accelerating Large-number Arithmetic

Published: April 23, 2026 at 07:38 AM EDT
4 min read

Source: arXiv - 2604.21566v1

Overview

The paper introduces DigitsOnTurbo (DoT), a novel way to harness SIMD (single‑instruction‑multiple‑data) units on modern CPUs for big‑integer arithmetic. By redesigning the algorithms rather than merely “vectorizing” existing ones, DoT delivers noticeable speedups for the core operations that power scientific simulations and cryptographic protocols.

Key Contributions

  • Algorithmic redesign: Re‑architects addition, subtraction, and multiplication to expose independent, data‑parallel work that maps naturally onto SIMD lanes.
  • Performance gains: Shows up to 1.85× faster SIMD addition/subtraction and 2.3× faster multiplication compared with prior SIMD attempts.
  • Library integration: When plugged into leading big‑integer libraries, DoT yields up to 4× speedup for addition/subtraction and multiplication, translating into ≈19% end‑to‑end throughput improvement for scientific workloads.
  • Cryptographic impact: Demonstrates 7.9 % latency reduction and 5.9 % throughput boost on real‑world cryptographic primitives (e.g., RSA, ECC).
  • Open‑source prototype: Provides a reference implementation that can be adopted by other developers and benchmarked on commodity CPUs.

Methodology

  1. Data layout transformation – Numbers are stored as arrays of fixed‑size “digit” blocks that align with SIMD register widths (e.g., 128‑ or 256‑bit vectors).
  2. Dependency elimination – Traditional big‑integer algorithms propagate carries or borrows sequentially. DoT replaces these with parallel prefix (scan) operations that compute carries across the whole vector in a few SIMD steps.
  3. Vector‑friendly multiplication – Uses a schoolbook approach but batches the partial products into SIMD lanes, then applies a parallel reduction to combine them, avoiding the serial carry chain.
  4. Implementation details – Hand‑written intrinsics (AVX2/AVX‑512) for the core kernels, with fallback scalar paths. The authors also built a thin abstraction layer so the same code can be compiled for different SIMD widths.
  5. Benchmark suite – Evaluated on Intel Xeon Cascade Lake (AVX‑512) and AMD Zen 4 (AVX2) using micro‑benchmarks (single‑operation latency/throughput) and macro‑benchmarks (matrix multiplication, FFT, RSA key generation).

Results & Findings

| Operation | Prior SIMD (baseline) | DoT SIMD | Speedup vs. baseline | Library‑level gain* |
|---|---|---|---|---|
| Add / Sub | 0.84 ns / digit | 0.45 ns / digit | 1.85× | up to 4× (when integrated) |
| Multiply | 3.6 ns / digit‑pair | 1.56 ns / digit‑pair | 2.30× | up to 4× (library) |
| End‑to‑end scientific kernel (large‑matrix multiply) | | | | +19.3% throughput |
| RSA‑2048 sign/verify | | | | −7.9% latency, +5.9% throughput |

* Gains are measured after swapping the library’s default big‑integer backend with DoT‑enabled kernels.

Key takeaways

  • The parallel‑prefix technique removes the classic carry‑propagation bottleneck, allowing the SIMD lanes to stay fully utilized.
  • Even with modest vector widths (AVX2), DoT still outperforms earlier SIMD attempts, suggesting the approach is largely architecture‑agnostic.
  • Real‑world applications see single‑digit percentage speedups that compound into measurable end‑to‑end improvements.

Practical Implications

  • Cryptography libraries (OpenSSL, libsodium, BoringSSL) can adopt DoT kernels to shave latency off RSA/ECDSA operations, which matters for TLS handshakes and blockchain validation.
  • Scientific computing frameworks (NumPy, SciPy, Julia) that rely on arbitrary‑precision arithmetic for high‑precision simulations (e.g., orbital mechanics, quantum chemistry) can achieve higher throughput without moving to GPUs.
  • Performance‑critical services (certificate authorities, secure messaging servers) can reduce CPU cost per transaction, potentially lowering cloud spend.
  • The design pattern—rethink the algorithm for data‑parallelism rather than force‑fit—is reusable for other dependency‑heavy domains such as big‑integer division, modular exponentiation, or polynomial arithmetic.

Developers can start experimenting by pulling the open‑source DoT prototype, swapping the big‑integer backend in their build system, and benchmarking with their own workloads.

Limitations & Future Work

  • Carry‑scan overhead grows with the number of SIMD lanes; on very wide vectors (e.g., future AVX‑1024) the scan may become a new bottleneck.
  • The current implementation focuses on unsigned integers; signed arithmetic and mixed‑radix representations need additional handling.
  • Multiplication still follows a schoolbook scheme; integrating more advanced algorithms (Karatsuba, Toom‑Cook) with SIMD‑friendly scans is an open research direction.
  • GPU or accelerator extensions are not explored; extending DoT’s principles to heterogeneous platforms could unlock further gains.

Overall, DigitsOnTurbo demonstrates that a thoughtful algorithmic redesign can unlock SIMD’s latent power for big‑integer workloads, offering immediate benefits to developers building high‑performance, security‑sensitive software.

Authors

  • Subhrajit Das
  • Abhishek Bichhawat
  • Yuvraj Patel

Paper Information

  • arXiv ID: 2604.21566v1
  • Categories: cs.DC
  • Published: April 23, 2026