[Paper] Parallel Quadratic Selected Inversion in Quantum Transport Simulation

Published: January 8, 2026 at 08:03 AM EST
4 min read
Source: arXiv - 2601.04904v1

Overview

The paper presents a new set of distributed algorithms that dramatically speed up quantum‑transport (QT) simulations of nanoscale transistors. By extending the classic recursive Green’s function (RGF) technique to run efficiently across many GPUs, the authors enable fast selected inversion and selected solution of quadratic matrix equations—the two most expensive steps in the non‑equilibrium Green’s function (NEGF) formalism. The result is a solver that can handle much larger, multi‑terminal device geometries than previous approaches.

Key Contributions

  • Distributed RGF‑based solvers for selected inversion (SI) and selected quadratic (SQ) matrix solves that scale across multiple GPUs.
  • Support for block‑tridiagonal matrices with an arrowhead structure (sketched after this list), enabling simulation of multi‑terminal transistor layouts.
  • Fusion of SI and SQ steps into a single pipeline, reducing data movement and memory overhead.
  • Performance comparison against the state‑of‑the‑art sparse direct solver PARDISO, showing a 5.2× speed‑up on 16 GPUs for a realistic nano‑ribbon transistor case.
  • Demonstration that the new method can simulate devices 16× longer than what PARDISO can handle on the same hardware.
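
For readers unfamiliar with the term, the “arrowhead” adds a set of blocks that couple all terminals to the otherwise block‑tridiagonal band. A minimal sketch of the sparsity pattern, assuming the arrowhead blocks sit in the last block row and column (a common convention, not necessarily the paper’s exact ordering):

```latex
% Block-tridiagonal matrix with an arrowhead: the usual BT band plus a
% dense last block row and column coupling all terminals (sketch only;
% the paper's exact block ordering may differ).
A =
\begin{pmatrix}
  A_{1,1} & A_{1,2} &         &             & A_{1,n}   \\
  A_{2,1} & A_{2,2} & A_{2,3} &             & A_{2,n}   \\
          & A_{3,2} & \ddots  & \ddots      & \vdots    \\
          &         & \ddots  & A_{n-1,n-1} & A_{n-1,n} \\
  A_{n,1} & A_{n,2} & \cdots  & A_{n,n-1}   & A_{n,n}
\end{pmatrix}
```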

Methodology

  1. NEGF Background – The NEGF formalism requires the retarded Green’s function \(G = (E I - H - \Sigma)^{-1}\) and related quantities. Computing only a subset of the entries of this inverse (selected inversion) and solving quadratic matrix equations of the form \(X = A^{-1} B A^{-T}\) are the two bottlenecks (the standard relations are written out after this list).
  2. Recursive Green’s Function (RGF) – Traditionally, RGF exploits the block‑tridiagonal (BT) structure of the Hamiltonian \(H\) to compute Green’s functions through a sequential sweep over the diagonal blocks. It is highly GPU‑friendly, but the sequential sweep confines it to shared‑memory parallelism and single‑GPU execution (a minimal serial sketch appears after this list).
  3. Parallel Extension – The authors reorganize the RGF recursion into independent sub‑problems that can be assigned to different GPU ranks. They introduce a pipeline that overlaps communication (MPI) with computation, allowing the selected inversion and quadratic solve to proceed concurrently across the device.
  4. Arrowhead BT Matrices – For multi‑terminal devices, the BT matrix gains an extra “arrowhead” block coupling all terminals. The new algorithm treats this block as a low‑rank update, preserving the parallel efficiency of the original RGF.
  5. Fusion of SI & SQ – By merging the two stages, intermediate results are kept on‑GPU, cutting the costly host‑to‑device transfers and reducing overall memory footprint.
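
To make the two kernels in step 1 concrete, these are the textbook NEGF relations behind them (my rendering of the standard formalism, not notation copied from the paper): the retarded Green’s function comes from a selected inversion, while the lesser Green’s function has exactly the quadratic \(X = A^{-1} B A^{-T}\)-type structure, with the adjoint of \(G^{R}\) playing the role of the transposed factor.

```latex
% Standard NEGF relations behind the two bottleneck kernels
% (textbook form; the paper's notation may differ).
\begin{align}
  G^{R}(E) &= \bigl(E I - H - \Sigma^{R}(E)\bigr)^{-1}
      && \text{selected inversion (SI)} \\
  G^{<}(E) &= G^{R}(E)\,\Sigma^{<}(E)\,G^{A}(E),
      \qquad G^{A} = \bigl(G^{R}\bigr)^{\dagger}
      && \text{selected quadratic solve (SQ)}
\end{align}
```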
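
To ground step 2, the sketch below is the classic serial RGF selected inversion that the paper distributes across GPUs: it computes only the diagonal blocks of \(G = A^{-1}\) for a block‑tridiagonal \(A\). The names (`rgf_selected_inversion`, `A_diag`, `A_upper`, `A_lower`) are illustrative, not the authors’ API, and the distributed pipeline, arrowhead handling, and SI/SQ fusion are not shown.

```python
# Minimal serial RGF sketch: diagonal blocks of G = inv(A) for a
# block-tridiagonal A. Single-process NumPy baseline for illustration;
# the paper's contribution is distributing this recursion across GPUs.
import numpy as np

def rgf_selected_inversion(A_diag, A_upper, A_lower):
    """Return the diagonal blocks of inv(A) for block-tridiagonal A.

    A_diag[i]  : block A[i, i]
    A_upper[i] : block A[i, i+1]
    A_lower[i] : block A[i+1, i]
    """
    n = len(A_diag)

    # Forward sweep: "left-connected" Green's functions g[i].
    g = [None] * n
    g[0] = np.linalg.inv(A_diag[0])
    for i in range(1, n):
        g[i] = np.linalg.inv(A_diag[i] - A_lower[i - 1] @ g[i - 1] @ A_upper[i - 1])

    # Backward sweep: fold in the right-hand part to obtain the true
    # diagonal blocks of the full inverse.
    G_diag = [None] * n
    G_diag[n - 1] = g[n - 1]
    for i in range(n - 2, -1, -1):
        G_diag[i] = g[i] + g[i] @ A_upper[i] @ G_diag[i + 1] @ A_lower[i] @ g[i]
    return G_diag

# Usage: a tiny 3-block system, verified against a dense inverse.
rng = np.random.default_rng(0)
b, n = 4, 3
A_diag = [rng.standard_normal((b, b)) + 10 * np.eye(b) for _ in range(n)]
A_upper = [rng.standard_normal((b, b)) for _ in range(n - 1)]
A_lower = [rng.standard_normal((b, b)) for _ in range(n - 1)]

A = np.zeros((n * b, n * b))
for i in range(n):
    A[i*b:(i+1)*b, i*b:(i+1)*b] = A_diag[i]
for i in range(n - 1):
    A[i*b:(i+1)*b, (i+1)*b:(i+2)*b] = A_upper[i]
    A[(i+1)*b:(i+2)*b, i*b:(i+1)*b] = A_lower[i]

G_dense = np.linalg.inv(A)
for i, Gi in enumerate(rgf_selected_inversion(A_diag, A_upper, A_lower)):
    assert np.allclose(Gi, G_dense[i*b:(i+1)*b, i*b:(i+1)*b])
```

Both sweeps cost one block factorization per block, so the serial algorithm is linear in the number of blocks but strictly sequential; breaking that dependency chain into independent sub-problems is exactly what step 3 addresses.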

Results & Findings

| Metric | PARDISO (single GPU) | New Distributed RGF (16 GPUs) |
| --- | --- | --- |
| Device length simulated | 1 µm (baseline) | 16 µm (16× longer) |
| Total runtime (SI + SQ) | 1.0× (baseline) | 0.19× (5.2× faster) |
| Memory usage per GPU | Near saturation | ~30 % lower (thanks to fusion) |
| Strong scaling efficiency | n/a | ~78 % up to 16 GPUs |

The experiments on a realistic nano‑ribbon transistor confirm that the distributed approach not only scales well but also outperforms a highly optimized sparse direct solver when the problem size grows. The arrowhead extension successfully handled a three‑terminal configuration without sacrificing performance.

Practical Implications

  • Larger Device Simulations – Engineers can now simulate transistors that are an order of magnitude longer or more complex (e.g., multi‑gate, multi‑terminal) without resorting to coarse approximations.
  • GPU‑Centric Workflows – The algorithms fit naturally into existing CUDA‑based HPC stacks, making it easy to integrate into commercial TCAD tools that already leverage GPUs for other physics kernels.
  • Reduced Time‑to‑Solution – Faster NEGF solves translate directly into shorter design cycles for nano‑electronics, enabling rapid prototyping of novel device concepts such as tunnel FETs or 2‑D material channels.
  • Energy‑Efficient Computing – By keeping most data on the GPU and minimizing host‑GPU traffic, the method lowers overall power consumption compared to CPU‑heavy sparse solvers.
  • Open‑Source Potential – The techniques are built on standard MPI + CUDA primitives, suggesting that a community‑driven implementation could quickly spread across research labs and industry.

Limitations & Future Work

  • GPU Memory Bound – Although the fused pipeline reduces memory pressure, extremely large 3‑D device meshes may still exceed the memory capacity of current GPUs.
  • Assumption of BT/Arrowhead Structure – The method relies on the underlying Hamiltonian having a (near) block‑tridiagonal form; highly irregular sparsity patterns would require additional preprocessing.
  • Scalability Beyond 16 GPUs – The paper reports strong scaling up to 16 GPUs; extending to larger GPU clusters will need further optimization of communication patterns and load balancing.
  • Integration with Full TCAD Suites – Future work could focus on coupling the solver with self‑consistent Poisson solvers and electron‑phonon scattering models to deliver end‑to‑end device simulation pipelines.

Overall, the research pushes the frontier of quantum‑transport simulation toward the scale demanded by next‑generation nano‑electronics, offering a practical, GPU‑accelerated path for developers and engineers to explore ever‑smaller transistor designs.

Authors

  • Vincent Maillou
  • Matthias Bollhöfer
  • Olaf Schenk
  • Alexandros Nikolaos Ziogas
  • Mathieu Luisier

Paper Information

  • arXiv ID: 2601.04904v1
  • Categories: cs.DC, cs.PF
  • Published: January 8, 2026