[Paper] Six Times to Spare: LDPC Acceleration on DGX Spark for AI-Native Open RAN
Source: arXiv - 2602.04652v1
Overview
This paper measures how much faster 5G‑style LDPC decoding can run when it’s moved from the CPU cores of NVIDIA’s Grace CPU to the integrated Blackwell GB10 GPU on a DGX Spark system. By building a realistic 5G link‑level chain with TensorFlow‑based Sionna components, the authors show a ~6× throughput boost and a dramatic reduction in latency, making the decoder fit comfortably inside the 0.5 ms slot budget that real‑world base stations must meet.
Key Contributions
- Empirical benchmark of LDPC5G decoding on Grace CPU vs. Blackwell GPU across a range of parallel codewords and belief‑propagation iterations.
- Quantified speedup: on average ~6× higher throughput; CPU latency can exceed the 0.5 ms slot (≈0.71 ms at 20 iterations), while GPU latency consumes only 6–24% of the slot budget.
- Resource‑usage profiling: CPU decoding consumes ~10 Grace cores; GPU decoding adds only ~10–15 W over idle and leaves most CPU cores free for higher‑layer processing.
- Methodology that uses high‑level Sionna/TensorFlow APIs (no hand‑tuned CUDA), establishing a conservative lower bound and a reusable scriptable framework for future accelerator evaluations.
- Roadmap for extending the approach to upcoming Grace/Blackwell generations and other physical‑layer kernels (e.g., FFT, channel estimation).
Methodology
- Simulation Stack – The authors assembled an NR‑like PHY chain in TensorFlow using NVIDIA’s open‑source Sionna library:
- LDPC5G encoder & decoder
- 16‑QAM modulation
- AWGN channel model
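The chain above can be sketched in plain NumPy as a stand-in: the paper builds it from Sionna ops (LDPC5GEncoder/LDPC5GDecoder, mapper, AWGN layer), whereas this sketch omits the LDPC stage and hand-rolls a 16-QAM Gray mapping; the SNR value and mapping convention are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Gray mapping for one 16-QAM axis: 2 bits -> amplitude level (one common convention)
GRAY = {(0, 0): -3, (0, 1): -1, (1, 1): +1, (1, 0): +3}
INV = {v: k for k, v in GRAY.items()}
LEVELS = np.array([-3, -1, 1, 3])

def qam16_mod(bits):
    """Map bits (length divisible by 4) to unit-average-power 16-QAM symbols."""
    b = bits.reshape(-1, 4)
    i = np.array([GRAY[(r[0], r[1])] for r in b])
    q = np.array([GRAY[(r[2], r[3])] for r in b])
    return (i + 1j * q) / np.sqrt(10.0)  # E[|x|^2] = 1

def qam16_demod(y):
    """Hard-decision demapping: slice each axis to the nearest constellation level."""
    s = y * np.sqrt(10.0)
    i = LEVELS[np.argmin(np.abs(s.real[:, None] - LEVELS), axis=1)]
    q = LEVELS[np.argmin(np.abs(s.imag[:, None] - LEVELS), axis=1)]
    out = []
    for a, b in zip(i, q):
        out.extend(INV[a] + INV[b])
    return np.array(out)

bits = rng.integers(0, 2, 4000)
x = qam16_mod(bits)
snr_db = 25.0                      # high SNR, so hard decisions are error-free
n0 = 10 ** (-snr_db / 10)
noise = np.sqrt(n0 / 2) * (rng.standard_normal(x.shape)
                           + 1j * rng.standard_normal(x.shape))
bits_hat = qam16_demod(x + noise)
print("bit errors:", int(np.sum(bits != bits_hat)))
```

In the paper's actual chain, an LDPC5G encode step precedes the mapper and a soft demapper feeds log-likelihood ratios to the belief-propagation decoder instead of this hard slicer.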
- Workload Sweep – They varied two key parameters:
- Parallel codewords decoded simultaneously (to stress concurrency)
- Number of belief‑propagation iterations (10, 15, 20, …), which directly trades decoding quality against compute load.
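A sweep over these two knobs is just a small grid; the specific values below are illustrative, not necessarily the paper's exact grid:

```python
from itertools import product

parallel_codewords = [1, 2, 4, 8, 16]   # batch sizes, stressing concurrency
bp_iterations = [10, 15, 20]            # belief-propagation iteration counts

# Every (batch size, iteration count) pair becomes one benchmark configuration
configs = list(product(parallel_codewords, bp_iterations))
print(len(configs), "configurations")   # 15
```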
- Execution Platforms – The same TensorFlow graph was run on:
- Grace CPU (the DGX Spark's ARM‑based cores)
- Blackwell GB10 GPU (leveraging TensorFlow’s GPU backend).
- Metrics Collected – For each configuration they logged:
- Decoding throughput (codewords / second)
- End‑to‑end latency per codeword
- CPU & GPU utilization percentages
- Power draw (via nvidia-smi).
- No Hand‑Optimized Kernels – All compute was performed through Sionna’s high‑level ops, ensuring the results reflect what a typical AI‑native stack could achieve without custom CUDA kernels.
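A minimal timing harness of the kind this setup implies might look as follows. The function name, warmup policy, and toy identity "decoder" are assumptions; in the paper's setup the callable would be a compiled TensorFlow graph wrapping Sionna's LDPC decoder.

```python
import time

def benchmark(decode_fn, batch, n_runs=50, warmup=5):
    """Time a decoding callable and derive throughput and per-batch latency.

    decode_fn: callable taking a batch of codewords; batch: its input.
    """
    for _ in range(warmup):              # exclude one-time graph tracing / JIT cost
        decode_fn(batch)
    t0 = time.perf_counter()
    for _ in range(n_runs):
        decode_fn(batch)
    elapsed = time.perf_counter() - t0
    n_cw = n_runs * len(batch)
    return {
        "throughput_cw_per_s": n_cw / elapsed,
        "latency_ms_per_batch": 1e3 * elapsed / n_runs,
    }

# Toy stand-in decoder (identity), batch of 8 "codewords"
stats = benchmark(lambda b: b, batch=[b"cw"] * 8)
print(stats)
```

Utilization and power would be sampled alongside this loop with external tools (e.g. nvidia-smi) rather than inside the timed region.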
Results & Findings
| Configuration | CPU Throughput (cw/s) | GPU Throughput (cw/s) | Speedup | CPU Latency per cw | GPU Latency per cw |
|---|---|---|---|---|---|
| 20 iterations, 1 cw | 1.4k | 8.6k | ~6× | 0.71 ms (misses slot) | 0.12 ms (well within) |
| 20 iterations, 8 cw | 11k | 65k | ~6× | 0.73 ms | 0.14 ms |
| 10 iterations, 1 cw | 2.9k | 17k | ~6× | 0.38 ms (fits) | 0.06 ms |
- Throughput consistently scales linearly with the number of parallel codewords on the GPU, while the CPU quickly saturates after a few cores.
- Latency on the GPU stays under 0.12 ms even for the most demanding 20‑iteration case, giving a comfortable margin inside the 0.5 ms slot.
- Power: GPU decoding adds only ~10–15 W over idle, whereas the CPU version drives the Grace cores to near‑full power (~120 W for the 10‑core slice).
- Utilization: GPU runs at ~70 % compute utilization, leaving headroom for other AI workloads; the CPU is maxed out, leaving little capacity for higher‑layer tasks like HARQ or MAC scheduling.
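The slot-budget claims in the table can be checked with one-line arithmetic (latencies taken from the single-codeword rows above; the 0.5 ms budget is the paper's stated constraint):

```python
SLOT_MS = 0.5  # per-slot decoding budget assumed throughout the paper

# (platform, BP iterations) -> latency per codeword in ms, from the table above
latencies_ms = {
    ("CPU", 20): 0.71, ("GPU", 20): 0.12,
    ("CPU", 10): 0.38, ("GPU", 10): 0.06,
}
for (dev, iters), lat in latencies_ms.items():
    frac = lat / SLOT_MS
    status = "fits" if lat <= SLOT_MS else "MISSES SLOT"
    print(f"{dev} @ {iters} it: {lat:.2f} ms = {frac:.0%} of slot -> {status}")
```

This reproduces the headline finding: only the CPU at 20 iterations (142% of the slot) violates the budget, while the GPU's worst case leaves roughly a 4× margin.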
Practical Implications
- Base‑station design – Offloading LDPC to the integrated GPU can free up CPU cycles for real‑time control‑plane functions, enabling more users, higher bandwidths, or advanced AI‑driven scheduling without hardware upgrades.
- Cost‑effective scaling – Since the performance gain is achieved with a standard TensorFlow/Sionna stack, operators can reap benefits without writing custom CUDA kernels, lowering development effort and maintenance overhead.
- Energy efficiency – A modest power increase on the GPU translates into a lower overall system TDP for the same throughput, which is attractive for edge‑deployed O‑RAN units where power budgets are tight.
- Future‑proofing – The methodology can be reused to evaluate upcoming Grace/Blackwell chips, as well as other PHY kernels (FFT, channel estimation). This helps vendors decide where to invest in accelerator support for the next 5G‑Advanced or 6G releases.
- AI‑native O‑RAN – The results demonstrate that an AI‑centric software stack (TensorFlow + Sionna) can already meet hard real‑time constraints, encouraging further integration of AI/ML pipelines into the physical layer.
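The energy-efficiency point can be made concrete with a back-of-envelope calculation from the figures above. Caveats: pairing the GPU's idle-delta power with its throughput is a rough comparison (it ignores shared idle power), and the 12.5 W midpoint is my assumption.

```python
# Incremental energy per decoded codeword, single-codeword 20-iteration figures
cpu_power_w, cpu_tput = 120.0, 1.4e3   # ~10 Grace cores near full power
gpu_power_w, gpu_tput = 12.5, 8.6e3    # midpoint of the +10-15 W over idle

cpu_mj_per_cw = 1e3 * cpu_power_w / cpu_tput   # ~86 mJ per codeword
gpu_mj_per_cw = 1e3 * gpu_power_w / gpu_tput   # ~1.5 mJ per codeword
print(f"CPU: {cpu_mj_per_cw:.1f} mJ/cw, GPU: {gpu_mj_per_cw:.2f} mJ/cw "
      f"(~{cpu_mj_per_cw / gpu_mj_per_cw:.0f}x less energy per codeword)")
```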
Limitations & Future Work
- Conservative benchmark – Because the study relies on high‑level Sionna ops, it likely underestimates the ultimate performance achievable with hand‑optimized CUDA kernels or mixed‑precision tricks.
- Single‑node focus – Experiments were run on a single DGX Spark; scaling across multiple nodes or in a distributed O‑RAN deployment remains untested.
- Channel model simplicity – Only AWGN was considered; real‑world fading, mobility, and interference could affect decoding workload and latency.
- Power measurement granularity – System‑wide power was logged; a finer breakdown (GPU core vs. memory vs. CPU) would help pinpoint optimization opportunities.
- Future work – The authors suggest extending the framework to evaluate other NR PHY blocks, exploring mixed‑precision inference for LDPC, and testing on upcoming Grace/Blackwell generations (Aerial/ACAR/AODT) to verify whether the 6× speedup scales further.
Authors
- Ryan Barker
- Fatemeh Afghah
Paper Information
- arXiv ID: 2602.04652v1
- Categories: cs.DC
- Published: February 4, 2026