[Paper] Emulation of Complex Matrix Multiplication based on the Chinese Remainder Theorem
Source: arXiv - 2512.08321v1
Overview
Modern GPUs and AI accelerators ship with ultra‑fast INT8 (8‑bit integer) matrix‑multiply units, but many scientific and graphics workloads still need single‑ or double‑precision complex arithmetic. This paper shows how to emulate high‑precision complex matrix multiplication on INT8 hardware using a clever number‑theoretic trick (the Chinese Remainder Theorem) built on the existing Ozaki‑II framework. On an NVIDIA B200 GPU the authors report 4–6× speedups over cuBLAS’s native complex GEMM, opening the door for faster, more power‑efficient HPC and signal‑processing pipelines.
Key Contributions
- Extension of Ozaki‑II to complex numbers: Adapts the integer‑based emulation scheme to handle complex‑valued matrices in both single (FP32) and double (FP64) precision.
- CRT‑driven decomposition: Uses the Chinese Remainder Theorem to split high‑precision operands into several low‑precision INT8 fragments that can be processed in parallel on existing tensor cores.
- High‑throughput kernels for NVIDIA GPUs: Implements and tunes the approach for the B200 architecture, achieving 4.0–5.6× (FP32) and 4.4–6.5× (FP64) speedups versus cuBLAS's cgemm/zgemm.
- Accuracy‑vs‑throughput trade‑off knobs: Provides configurable parameters that let developers choose between higher speed (accepting a small loss in numerical fidelity) and higher accuracy (with only modest overhead).
- Comprehensive evaluation: Benchmarks across a range of matrix sizes, precision levels, and error metrics, demonstrating that the method outperforms cuBLAS in throughput and, when configured for it, matches or surpasses its accuracy.
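As background for the complex‑number extension listed above: a complex GEMM can always be assembled from four real matrix products (the standard "4M" identity), and it is those real products that get emulated on integer hardware. Whether the paper uses exactly this split or a variant (e.g., the three‑multiplication Karatsuba‑style trick) is not restated in this summary, so treat the sketch below as illustrative NumPy code rather than the paper's scheme.

```python
import numpy as np

def complex_gemm_via_real(A, B):
    """Complex matrix product assembled from four real GEMMs (the "4M" identity).

    Illustrative only: the paper emulates the underlying real products with
    INT8 tensor-core GEMMs; here plain FP64 real products stand in for them.
    """
    Ar, Ai = A.real, A.imag
    Br, Bi = B.real, B.imag
    Cr = Ar @ Br - Ai @ Bi   # real part
    Ci = Ar @ Bi + Ai @ Br   # imaginary part
    return Cr + 1j * Ci

# Quick check against NumPy's native complex product.
rng = np.random.default_rng(0)
A = rng.standard_normal((64, 64)) + 1j * rng.standard_normal((64, 64))
B = rng.standard_normal((64, 64)) + 1j * rng.standard_normal((64, 64))
assert np.allclose(complex_gemm_via_real(A, B), A @ B)
```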
Methodology
- Number‑theoretic decomposition:
  - Each element of a high‑precision complex matrix is mapped to several INT8 "residues" modulo carefully chosen, pairwise coprime moduli.
  - The Chinese Remainder Theorem guarantees that the original value can be reconstructed exactly from these residues (a minimal sketch of the decomposition and recombination appears after this list).
- Parallel low‑precision GEMM:
  - The residues are fed to the GPU's INT8 tensor cores, which execute many small GEMM operations simultaneously.
  - Because the tensor cores are heavily pipelined, the overall latency is dominated by data movement rather than arithmetic.
- Reconstruction & error correction:
  - After the INT8 GEMMs finish, the partial results are combined using CRT‑based recombination formulas.
  - Optional refinement steps (e.g., Kahan summation) can be applied to tighten the final rounding error (see the compensated‑summation sketch below).
- Tuning parameters:
  - The number of CRT moduli, the scaling factors, and whether to apply a refinement step are exposed as runtime options, letting developers balance speed against precision.
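To make the decomposition and recombination concrete, the following NumPy sketch emulates an exact integer GEMM with one small GEMM per modulus followed by CRT reconstruction. The moduli (251, 241, 239), the int64 accumulators standing in for INT32 tensor‑core accumulators, and the restriction to non‑negative integers are illustrative assumptions, not the paper's configuration; the real library also has to scale floating‑point (and complex) inputs into an integer range, which is omitted here.

```python
import numpy as np

# Illustrative pairwise-coprime moduli that each fit in 8 bits (not the paper's set).
MODULI = (251, 241, 239)
M = int(np.prod(MODULI))  # products below M are reconstructed exactly

def crt_matmul(A, B):
    """Emulate an exact integer GEMM via one small GEMM per modulus plus CRT.

    A, B: non-negative integer matrices whose true product entries are < M.
    Each residue GEMM only ever sees values below 256 (mimicking INT8 inputs);
    int64 accumulation stands in for the INT32 tensor-core accumulators.
    """
    residue_products = []
    for m in MODULI:
        Am = (A % m).astype(np.int64)
        Bm = (B % m).astype(np.int64)
        residue_products.append((Am @ Bm) % m)

    # CRT recombination: C = sum_i r_i * (M / m_i) * ((M / m_i)^-1 mod m_i)  (mod M)
    C = np.zeros_like(residue_products[0])
    for r, m in zip(residue_products, MODULI):
        Mi = M // m
        C = (C + r * Mi * pow(Mi, -1, m)) % M
    return C

rng = np.random.default_rng(1)
A = rng.integers(0, 100, size=(8, 8))
B = rng.integers(0, 100, size=(8, 8))
assert np.array_equal(crt_matmul(A, B), A @ B)  # exact reconstruction
```

Reconstruction stays exact only while the true product entries stay below the product of the moduli; wider dynamic ranges therefore need more residues (and more INT8 GEMMs), which is the speed‑versus‑accuracy trade‑off exposed by the tuning parameters.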
The whole pipeline is implemented as a drop‑in replacement for cuBLAS's cgemm/zgemm; in most codebases, adopting it requires little more than switching to the header‑only library.
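The optional refinement mentioned in the reconstruction step is compensated (Kahan) summation. Whether the library applies exactly this variant during recombination is an assumption here, but the mechanism itself is standard:

```python
def kahan_sum(values):
    """Compensated (Kahan) summation: recovers low-order bits lost in each add."""
    total = 0.0
    compensation = 0.0
    for v in values:
        y = v - compensation            # re-inject previously lost low-order bits
        t = total + y                   # high-order bits of y may be lost here ...
        compensation = (t - total) - y  # ... and are captured for the next step
        total = t
    return total

# Adding many small terms onto a large one shows the effect.
terms = [1e16] + [1.0] * 1000
print(sum(terms) - 1e16)        # plain summation: the small terms vanish (0.0)
print(kahan_sum(terms) - 1e16)  # compensated summation recovers them (1000.0)
```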
Results & Findings
| Precision | Speedup vs. cuBLAS | Typical Relative Error (ℓ₂) |
|---|---|---|
| FP32 (complex) | 4.0× – 5.6× | ≤ 1.2 × 10⁻⁶ (comparable to native) |
| FP64 (complex) | 4.4× – 6.5× | ≤ 2.5 × 10⁻⁹ (often better than cuBLAS) |
- Large matrices (≥ 4096 × 4096) reap the full benefit; smaller problems are limited by kernel launch overhead.
- When the user relaxes the error bound (e.g., tolerating 1e‑4 relative error), the same kernels can run up to 8× faster.
- Adding a single refinement pass raises accuracy by an order of magnitude while costing < 15 % extra runtime.
- Power measurements on the B200 show ≈30 % lower energy per GEMM compared with native FP64 complex kernels.
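For reference, the relative‑error column above follows the usual ℓ₂ (Frobenius) definition; the exact reference product the paper compares against (e.g., a higher‑precision GEMM) is not restated here, so treat this as an assumption:

```python
import numpy as np

def rel_error_l2(C_emulated, C_reference):
    """ℓ2 (Frobenius-norm) relative error of an emulated product vs. a reference."""
    return np.linalg.norm(C_emulated - C_reference) / np.linalg.norm(C_reference)
```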
Practical Implications
- HPC & scientific simulations: Large‑scale quantum chemistry, electromagnetic solvers, and CFD codes that rely on complex linear algebra can cut wall‑clock time dramatically without sacrificing numerical stability.
- Signal processing & communications: Real‑time MIMO‑OFDM, radar imaging, and beamforming pipelines often operate on complex matrices; the proposed method enables higher throughput on edge GPUs or AI accelerators.
- Deep learning frameworks: Some emerging models (e.g., complex‑valued neural nets) could now run training or inference on INT8‑based back‑ends, reducing memory‑bandwidth demands and inference latency.
- Developer ergonomics: Because the library mimics the cuBLAS API, existing code can be switched to the faster path with minimal refactoring—just link against the new library and optionally set environment variables to tune accuracy.
Limitations & Future Work
- Hardware dependence: The current implementation is tuned for NVIDIA’s B200 tensor cores; performance on other GPUs or dedicated AI chips may vary and will need retuning.
- Small‑matrix overhead: For problem sizes below ~2K × 2K the launch and CRT reconstruction costs dominate, making the method less attractive.
- Precision ceiling: While the CRT approach can theoretically reconstruct any IEEE‑754 value, the practical error budget is set by the chosen moduli; extreme dynamic ranges may require more residues, which erodes the speedup (see the back‑of‑envelope sketch after this list).
- Future directions: Extending the scheme to mixed‑precision workflows (e.g., FP16‑complex), auto‑tuning of CRT parameters per workload, and integrating the technique into mainstream libraries like cuBLAS or oneAPI BLAS.
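A rough back‑of‑envelope check on that precision ceiling: reconstruction is exact only while the exact products fit under the product of the moduli, so each additional 8‑bit residue buys roughly 8 more bits of dynamic range. The moduli below are hypothetical, not the paper's choice:

```python
import math

# Hypothetical pairwise-coprime moduli that fit in 8 bits; the paper's set may differ.
moduli = [256, 255, 253, 251, 247, 241, 239, 233]

for k in range(1, len(moduli) + 1):
    bits = math.log2(math.prod(moduli[:k]))
    print(f"{k} residues cover ~{bits:.1f} bits of exact product range")
```

Covering an FP64‑grade product (53‑bit significands plus accumulation growth) therefore takes several residues, and every additional residue adds another batch of INT8 GEMMs.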
Bottom line: By turning cheap INT8 tensor cores into a high‑precision complex matrix‑multiply engine, this work offers developers a practical, drop‑in performance boost for a wide range of compute‑heavy applications—especially where large, complex‑valued linear algebra dominates the workload.
Authors
- Yuki Uchino
- Qianxiang Ma
- Toshiyuki Imamura
- Katsuhisa Ozaki
- Patrick Lars Gutsche
Paper Information
- arXiv ID: 2512.08321v1
- Categories: cs.DC
- Published: December 9, 2025