[Paper] Leveraging SIMD for Accelerating Large-number Arithmetic

Published: April 23, 2026 at 07:38 AM EDT
4 min read

Source: arXiv - 2604.21566v1

Overview

The paper introduces DigitsOnTurbo (DoT), a novel way to harness SIMD (single‑instruction‑multiple‑data) units on modern CPUs for big‑integer arithmetic. By redesigning the algorithms rather than merely “vectorizing” existing ones, DoT delivers noticeable speedups for the core operations that power scientific simulations and cryptographic protocols.

Key Contributions

  • Algorithmic redesign: Re‑architects addition, subtraction, and multiplication to expose independent, data‑parallel work that maps naturally onto SIMD lanes.
  • Performance gains: Shows up to 1.85× faster SIMD addition/subtraction and 2.3× faster multiplication compared with prior SIMD attempts.
  • Library integration: When plugged into leading big‑integer libraries, DoT yields up to 4× speedup for addition/subtraction and multiplication, translating into ≈19% end‑to‑end throughput improvement for scientific workloads.
  • Cryptographic impact: Demonstrates 7.9 % latency reduction and 5.9 % throughput boost on real‑world cryptographic primitives (e.g., RSA, ECC).
  • Open‑source prototype: Provides a reference implementation that can be adopted by other developers and benchmarked on commodity CPUs.

Methodology

  1. Data layout transformation – Numbers are stored as arrays of fixed‑size “digit” blocks that align with SIMD register widths (e.g., 128‑ or 256‑bit vectors).
  2. Dependency elimination – Traditional big‑integer algorithms propagate carries or borrows sequentially. DoT replaces these with parallel prefix (scan) operations that compute carries across the whole vector in a few SIMD steps.
  3. Vector‑friendly multiplication – Uses a schoolbook approach but batches the partial products into SIMD lanes, then applies a parallel reduction to combine them, avoiding the serial carry chain.
  4. Implementation details – Hand‑written intrinsics (AVX2/AVX‑512) for the core kernels, with fallback scalar paths. The authors also built a thin abstraction layer so the same code can be compiled for different SIMD widths.
  5. Benchmark suite – Evaluated on Intel Xeon Cascade Lake (AVX‑512) and AMD Zen 4 (AVX2) using micro‑benchmarks (single‑operation latency/throughput) and macro‑benchmarks (matrix multiplication, FFT, RSA key generation).

Results & Findings

| Operation | Prior SIMD (baseline) | DoT SIMD | Speedup vs. baseline | Library‑level gain* |
|---|---|---|---|---|
| Add / Sub | 0.84 ns / digit | 0.45 ns / digit | 1.85× | up to 4× (when integrated) |
| Multiply | 3.6 ns / digit‑pair | 1.56 ns / digit‑pair | 2.30× | up to 4× (library) |
| End‑to‑end scientific kernel (large‑matrix multiply) | | | | +19.3% throughput |
| RSA‑2048 sign/verify | | | | −7.9% latency, +5.9% throughput |

* Gains are measured after swapping the library’s default big‑integer backend with DoT‑enabled kernels.

Key takeaways

  • The parallel‑prefix technique removes the classic carry‑propagation bottleneck, allowing the SIMD lanes to stay fully utilized.
  • Even with modest vector widths (AVX2), DoT still outperforms earlier SIMD attempts, suggesting the approach is largely architecture‑agnostic.
  • Real‑world applications see single‑digit percentage speedups that compound into measurable end‑to‑end improvements.

Practical Implications

  • Cryptography libraries (OpenSSL, libsodium, BoringSSL) can adopt DoT kernels to shave latency off RSA/ECDSA operations, which matters for TLS handshakes and blockchain validation.
  • Scientific computing frameworks (NumPy, SciPy, Julia) that rely on arbitrary‑precision arithmetic for high‑precision simulations (e.g., orbital mechanics, quantum chemistry) can achieve higher throughput without moving to GPUs.
  • Performance‑critical services (certificate authorities, secure messaging servers) can reduce CPU cost per transaction, potentially lowering cloud spend.
  • The design pattern—rethink the algorithm for data‑parallelism rather than force‑fit—is reusable for other dependency‑heavy domains such as big‑integer division, modular exponentiation, or polynomial arithmetic.

Developers can start experimenting by pulling the open‑source DoT prototype, swapping the big‑integer backend in their build system, and benchmarking with their own workloads.

Limitations & Future Work

  • Carry‑scan overhead grows with the number of SIMD lanes; on very wide vectors (e.g., future AVX‑1024) the scan may become a new bottleneck.
  • The current implementation focuses on unsigned integers; signed arithmetic and mixed‑radix representations need additional handling.
  • Multiplication still follows a schoolbook scheme; integrating more advanced algorithms (Karatsuba, Toom‑Cook) with SIMD‑friendly scans is an open research direction.
  • GPU or accelerator extensions are not explored; extending DoT’s principles to heterogeneous platforms could unlock further gains.

Overall, DigitsOnTurbo demonstrates that a thoughtful algorithmic redesign can unlock SIMD’s latent power for big‑integer workloads, offering immediate benefits to developers building high‑performance, security‑sensitive software.

Authors

  • Subhrajit Das
  • Abhishek Bichhawat
  • Yuvraj Patel

Paper Information

  • arXiv ID: 2604.21566v1
  • Categories: cs.DC
  • Published: April 23, 2026