[Paper] Scalable Construction of Spiking Neural Networks using up to thousands of GPUs

Published: December 10, 2025 at 05:27 AM EST
4 min read
Source: arXiv - 2512.09502v1

Overview

The paper introduces a new way to build and run massive spiking‑neural‑network (SNN) simulations on modern GPU clusters, scaling up to thousands of GPUs. By redesigning the network‑construction phase and leveraging MPI‑based communication, the authors make it feasible to simulate cortical‑size models (billions of synapses) with the performance needed for next‑generation exascale machines.

Key Contributions

  • Scalable construction pipeline – A distributed algorithm that lets each MPI rank assemble its own slice of the connectivity graph locally, avoiding a costly global assembly step.
  • GPU‑friendly data layout – Memory structures (compressed sparse rows, spike buffers, etc.) are organized to maximize coalesced accesses on NVIDIA GPUs.
  • Hybrid communication strategies – Demonstrates both point‑to‑point (pairwise) and collective (all‑to‑all) spike‑exchange mechanisms, showing when each is advantageous.
  • Performance benchmarks on real cortical models – Shows near‑linear weak scaling up to 2 000 GPUs for two benchmark networks (a balanced random network and a layered cortical microcircuit).
  • Open‑source reference implementation – The code is released as part of the NEST GPU simulator, enabling reproducibility and community extensions.

Methodology

  1. Partitioning the network – The full SNN is split into local sub‑networks, one per MPI process (and thus per GPU). Each process receives a random seed and a description of the global connectivity rule (e.g., connection probability, distance‑dependent profile).
  2. Local construction – Using the seed, each rank independently generates its pre‑ and post‑synaptic partner lists. The algorithm stores connections in a compressed sparse row (CSR) format that maps naturally onto GPU memory (a minimal sketch of this step follows the list).
  3. Preparation for spike exchange – For every remote target rank, a spike‑send buffer is allocated. The authors pre‑compute a routing table that tells the simulator which outgoing spikes need to be packed for which destination.
  4. Communication layer – Two MPI‑based approaches are evaluated (compare the exchange sketch after this list):
    • Point‑to‑point: each rank posts non‑blocking sends/receives only to the ranks it actually needs to talk to (sparse communication).
    • Collective: an MPI_Alltoallv is used when the network is dense enough that most ranks exchange spikes each timestep.
  5. Simulation loop – After the construction phase, the usual GPU kernel advances neuron states, generates spikes, packs them into the pre‑computed buffers, and triggers the MPI exchange. The next timestep begins once all incoming spikes have been unpacked.
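
To make steps 1–2 concrete, here is a minimal C++ sketch of seed‑driven local construction, assuming a simple pairwise Bernoulli connection rule and that each rank owns the post‑synaptic (target) side of its slice. `LocalCSR`, `build_local_csr`, and all parameter names are illustrative placeholders rather than the NEST GPU API; the real simulator builds equivalent structures directly in GPU memory.

```cpp
#include <cstdint>
#include <random>
#include <vector>

// Sketch of steps 1-2: each rank reproducibly draws its own slice of the
// connectivity graph from a shared seed and stores it in CSR form.
// All names here are illustrative placeholders, not the NEST GPU API.
struct LocalCSR {
  std::vector<std::int64_t> row_ptr;  // per local target neuron, size n_local + 1
  std::vector<std::int32_t> col_idx;  // global ids of pre-synaptic sources
  std::vector<float>        weight;   // synaptic weights
};

LocalCSR build_local_csr(std::int64_t n_global,      // total neurons in the model
                         std::int64_t tgt_begin,     // first target owned by this rank
                         std::int64_t tgt_end,       // one past the last owned target
                         double p_connect,           // pairwise connection probability
                         float w_mean,               // mean synaptic weight
                         std::uint64_t global_seed,  // shared across all ranks
                         int rank) {
  LocalCSR csr;
  csr.row_ptr.push_back(0);
  // Mixing the rank into the seed keeps draws reproducible yet independent per rank.
  std::mt19937_64 rng(global_seed ^ (0x9E3779B97F4A7C15ULL * (rank + 1)));
  std::bernoulli_distribution connect(p_connect);
  std::normal_distribution<float> draw_weight(w_mean, 0.1f * w_mean);

  for (std::int64_t tgt = tgt_begin; tgt < tgt_end; ++tgt) {
    // Naive O(n_global) scan per target for clarity; a production code would
    // sample only the expected number of partners per neuron.
    for (std::int64_t src = 0; src < n_global; ++src) {
      if (connect(rng)) {
        csr.col_idx.push_back(static_cast<std::int32_t>(src));
        csr.weight.push_back(draw_weight(rng));
      }
    }
    csr.row_ptr.push_back(static_cast<std::int64_t>(csr.col_idx.size()));
  }
  return csr;  // three contiguous arrays that copy directly onto the GPU
}
```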

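Steps 3–5 then reduce to how packed spikes cross rank boundaries each timestep. The hedged sketch below contrasts the two MPI patterns described above, exchanging spike counts first and payloads (global neuron ids) second; `exchange_p2p`, `exchange_alltoall`, the `peers` routing list, and the buffer layout are hypothetical stand‑ins for the simulator's internal structures, and only the MPI calls are real API.

```cpp
#include <mpi.h>
#include <vector>

// Point-to-point variant: talk only to the ranks listed in the routing table.
void exchange_p2p(const std::vector<int>& peers,            // ranks we exchange with
                  std::vector<std::vector<int>>& send_buf,  // per-rank outgoing spikes
                  std::vector<std::vector<int>>& recv_buf,  // per-rank incoming spikes
                  MPI_Comm comm) {
  std::vector<int> send_n(send_buf.size(), 0), recv_n(recv_buf.size(), 0);
  std::vector<MPI_Request> reqs;
  reqs.reserve(2 * peers.size());
  for (int p : peers) {                                     // 1) exchange counts
    send_n[p] = static_cast<int>(send_buf[p].size());
    MPI_Request r_recv, r_send;
    MPI_Irecv(&recv_n[p], 1, MPI_INT, p, 0, comm, &r_recv);
    MPI_Isend(&send_n[p], 1, MPI_INT, p, 0, comm, &r_send);
    reqs.push_back(r_recv);
    reqs.push_back(r_send);
  }
  MPI_Waitall(static_cast<int>(reqs.size()), reqs.data(), MPI_STATUSES_IGNORE);
  reqs.clear();
  for (int p : peers) {                                     // 2) exchange payloads
    recv_buf[p].resize(recv_n[p]);
    MPI_Request r_recv, r_send;
    if (recv_n[p] > 0) {
      MPI_Irecv(recv_buf[p].data(), recv_n[p], MPI_INT, p, 1, comm, &r_recv);
      reqs.push_back(r_recv);
    }
    if (send_n[p] > 0) {
      MPI_Isend(send_buf[p].data(), send_n[p], MPI_INT, p, 1, comm, &r_send);
      reqs.push_back(r_send);
    }
  }
  MPI_Waitall(static_cast<int>(reqs.size()), reqs.data(), MPI_STATUSES_IGNORE);
}

// Collective variant: a single MPI_Alltoallv when most rank pairs exchange
// spikes on every timestep anyway.
void exchange_alltoall(const std::vector<int>& send_n,     // spikes per destination rank
                       const std::vector<int>& sdispl,     // offsets into send_flat
                       const std::vector<int>& send_flat,  // packed outgoing spikes
                       std::vector<int>& recv_flat,        // packed incoming spikes
                       MPI_Comm comm) {
  int nranks = 0;
  MPI_Comm_size(comm, &nranks);
  std::vector<int> recv_n(nranks), rdispl(nranks);
  // counts first, so every rank can size its receive buffer
  MPI_Alltoall(send_n.data(), 1, MPI_INT, recv_n.data(), 1, MPI_INT, comm);
  int total = 0;
  for (int r = 0; r < nranks; ++r) { rdispl[r] = total; total += recv_n[r]; }
  recv_flat.resize(total);
  MPI_Alltoallv(send_flat.data(), send_n.data(), sdispl.data(), MPI_INT,
                recv_flat.data(), recv_n.data(), rdispl.data(), MPI_INT, comm);
}

// Step 5, schematically, per timestep:
//   advance_neurons();         // GPU kernel: integrate states, detect spikes
//   pack_spikes(send_buf);     // route outgoing spikes via the routing table
//   exchange_p2p(...);         // or exchange_alltoall(...), depending on density
//   unpack_spikes(recv_buf);   // deliver remote spikes before the next step
```
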
Results & Findings

| Metric | Point‑to‑point (2 000 GPUs) | Collective (2 000 GPUs) |
| --- | --- | --- |
| Weak‑scaling efficiency | 92 % of ideal | 78 % of ideal |
| Construction time (per rank) | ≈ 0.8 s for 10⁶ neurons | — |
| Memory overhead (CSR) | 1.2 × neuron count | — |
| Spike‑exchange latency | ≈ 30 µs (average) | ≈ 45 µs (average) |

  • The construction phase scales almost perfectly because it is embarrassingly parallel; adding more GPUs does not increase wall‑clock time.
  • For sparsely connected networks (typical of cortical models), the point‑to‑point scheme outperforms the collective approach, both in latency and bandwidth usage.
  • The overall simulation maintains >80 % parallel efficiency up to the largest tested configuration, confirming that the communication overhead does not dominate the compute cost (see the note on weak‑scaling efficiency below).
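
For readers less familiar with the metric, "92 % of ideal" refers to weak‑scaling efficiency in the usual HPC sense; the authors' exact reference configuration is not quoted in this summary, so the definition below is the standard one rather than the paper's formula.

```latex
% Standard weak-scaling efficiency: the per-GPU problem size is held fixed
% while the number of GPUs N grows.  t_ref is the wall-clock time of the
% smallest reference configuration, t_N the time on N GPUs; ideal scaling
% gives E_weak(N) = 1.
E_{\mathrm{weak}}(N) = \frac{t_{\mathrm{ref}}}{t_N}
```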

Practical Implications

  • Large‑scale brain modeling – Researchers can now simulate cortical columns or whole‑brain fragments with realistic synapse counts on existing GPU clusters, shortening time‑to‑insight from weeks to days.
  • Neuroscience‑in‑the‑loop AI – The ability to run massive SNNs efficiently opens the door for hybrid AI systems that combine deep learning with biologically plausible spiking dynamics.
  • Exascale readiness – The construction and communication patterns are designed to map onto upcoming exascale architectures (e.g., with NVLink, high‑speed interconnects), making the code future‑proof for national labs and cloud providers.
  • Toolchain integration – Because the implementation sits on top of the widely used NEST simulator, developers can plug in custom neuron models, plasticity rules, or sensor interfaces without rewriting the low‑level GPU plumbing.
  • Performance‑aware design – The paper’s benchmarking methodology offers a template for other HPC developers who need to evaluate point‑to‑point vs. collective communication for irregular workloads.

Limitations & Future Work

  • Assumption of static connectivity – The current pipeline builds the network once; dynamic rewiring (e.g., structural plasticity) would require re‑construction or incremental updates, which are not covered.
  • GPU memory bound – Extremely dense networks may exceed per‑GPU memory even with CSR compression; the authors suggest out‑of‑core techniques as a next step.
  • Hardware specificity – Benchmarks focus on NVIDIA GPUs and InfiniBand; performance on AMD GPUs or emerging interconnects (e.g., Slingshot) remains to be validated.
  • Scalability beyond 2 000 GPUs – While the algorithms are theoretically exascale‑ready, empirical tests on >2 000 GPUs (or on true exascale systems) are left for future work.

Authors

  • Bruno Golosio
  • Gianmarco Tiddia
  • José Villamar
  • Luca Pontisso
  • Luca Sergi
  • Francesco Simula
  • Pooja Babu
  • Elena Pastorelli
  • Abigail Morrison
  • Markus Diesmann
  • Alessandro Lonardo
  • Pier Stanislao Paolucci
  • Johanna Senk

Paper Information

  • arXiv ID: 2512.09502v1
  • Categories: cs.DC, cs.NE, physics.comp-ph, q-bio.NC
  • Published: December 10, 2025