[Paper] EventQueues: Autodifferentiable spike event queues for brain simulation on AI accelerators

Published: December 5, 2025 at 12:39 PM EST
4 min read

Source: arXiv - 2512.05906v1

Overview

The paper “EventQueues: Autodifferentiable spike event queues for brain simulation on AI accelerators” tackles a core bottleneck in spiking neural network (SNN) research: simulating large‑scale, event‑driven neural dynamics efficiently while still supporting gradient‑based learning. By redesigning the data structures that store spike events so that they can be differentiated automatically, the authors make exact‑gradient SNN training practical on modern AI hardware such as GPUs, TPUs, and emerging accelerators like LPUs.

Key Contributions

  • Autodifferentiable event‑queue abstraction that captures both immediate and delayed spikes, enabling exact gradient computation without resorting to dense tensors (a minimal interface sketch follows this list).
  • Memory‑efficient queue implementations (tree‑based, FIFO, ring‑buffer, and sorting‑intrinsic variants) tailored to the strengths of different accelerator architectures.
  • Comprehensive benchmarking across CPUs, GPUs, TPUs, and LPUs, revealing how queue design dictates performance and memory usage.
  • Selective spike‑dropping strategy that offers a controllable trade‑off between simulation speed and training accuracy.
  • Open‑source reference implementation (compatible with major autodiff frameworks) that can be plugged into existing SNN toolkits.
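
To make the queue abstraction concrete, below is a minimal, illustrative sketch of a delayed‑spike queue built as a priority queue, in the spirit of the tree‑based variant listed above. The names (`PrioritySpikeQueue`, `push`, `pop_due`) are hypothetical and do not reflect the authors' API.

```python
# Illustrative only: a priority-queue ("tree-based") spike event queue.
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class SpikeEvent:
    deliver_at: float                      # time at which the spike arrives
    source: int = field(compare=False)     # emitting neuron (not compared)
    weight: float = field(compare=False)   # synaptic weight carried by the event

class PrioritySpikeQueue:
    """Min-heap keyed on delivery time: pops events in arrival order."""

    def __init__(self):
        self._heap = []                    # list of SpikeEvent, heap-ordered

    def push(self, t_spike, delay, source, weight):
        # A spike emitted at t_spike is delivered after its axonal delay.
        heapq.heappush(self._heap, SpikeEvent(t_spike + delay, source, weight))

    def pop_due(self, t_now):
        # Drain every event whose delivery time has been reached.
        while self._heap and self._heap[0].deliver_at <= t_now:
            yield heapq.heappop(self._heap)

# Usage: enqueue spikes during the forward sweep, drain them each time step.
q = PrioritySpikeQueue()
q.push(t_spike=1.0, delay=2.5, source=7, weight=0.3)
q.push(t_spike=1.2, delay=0.1, source=3, weight=-0.1)
for ev in q.pop_due(t_now=1.5):
    print(ev.source, ev.weight)            # only the short-delay spike is due
```

The ring‑buffer and FIFO variants replace the heap with contiguous storage, which is what makes them attractive on GPUs and LPUs.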

Methodology

  1. Mathematical formulation – The authors start from the exact gradient of spike times with respect to network parameters, showing that it can be expressed as a sum over event queues that store spike timestamps and their associated delays (the standard spike‑time identity this builds on is sketched after this list).
  2. Data‑structure design – Four queue variants are built:
    • Tree‑based priority queue (good for irregular, sparse spikes).
    • FIFO queue (simple, low‑overhead for moderate event rates).
    • Ring buffer (continuous memory layout, ideal for GPUs when the event count fits in fast shared memory).
    • Sorting‑intrinsic queue (leverages hardware sorting primitives, exposed through tf.sort‑style ops, to batch‑process spikes on TPUs).
  3. Autodiff integration – Each queue is wrapped in a custom autograd primitive that records forward operations and supplies the corresponding backward logic, ensuring gradients flow through the event handling itself (see the code sketch after this list).
  4. Benchmark suite – Synthetic SNN workloads with varying neuron counts, connectivity sparsity, and delay distributions are executed on each hardware platform. Metrics include runtime, peak memory, and training loss convergence.
  5. Spike‑dropping experiment – A configurable probability is applied to discard low‑impact spikes during forward simulation, measuring the resulting speedup versus any degradation in loss or accuracy.
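
As background for step 1: exact‑gradient formulations of this kind typically start from the implicit‑differentiation identity for a threshold crossing. The notation below (membrane potential V, threshold ϑ, spike time t_k, parameters θ) is generic rather than the paper's own, which further decomposes the gradient into sums over the event queues.

```latex
% Generic spike-time sensitivity at a threshold crossing (implicit function
% theorem); not the paper's notation.
\begin{equation}
  V(t_k;\theta) = \vartheta
  \quad\Longrightarrow\quad
  \frac{\mathrm{d}t_k}{\mathrm{d}\theta}
    = -\,\frac{\partial V(t_k;\theta)/\partial\theta}
              {\partial V(t_k;\theta)/\partial t},
  \qquad \text{provided } \left.\frac{\partial V}{\partial t}\right|_{t_k} \neq 0.
\end{equation}
```

Propagating this sensitivity through every (possibly delayed) delivery of the spike is exactly what the queue abstraction has to support on the backward pass.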
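
For step 3, the sketch below shows what a custom autograd primitive around a ring‑buffer queue can look like in PyTorch. It is an assumption‑laden illustration, not the authors' implementation: gradients here flow back to the queued synaptic contributions (and hence the weights), while exact gradients with respect to spike times additionally require the identity above.

```python
# Minimal sketch (PyTorch): a ring buffer of per-delay slots with a custom
# autograd primitive for the "deliver" step. Not the paper's code.
import torch

class RingDeliver(torch.autograd.Function):
    """Pop the slot that is due at step t from a (D,)-shaped ring buffer."""

    @staticmethod
    def forward(ctx, ring, t):
        slot = t % ring.shape[0]
        ctx.slot = slot
        delivered = ring[slot].clone()     # synaptic current arriving now
        new_ring = ring.clone()
        new_ring[slot] = 0.0               # the slot has been consumed
        return delivered, new_ring

    @staticmethod
    def backward(ctx, grad_delivered, grad_new_ring):
        # The consumed slot only influenced `delivered`; all other slots are
        # passed through to `new_ring` unchanged.
        grad_ring = grad_new_ring.clone()
        grad_ring[ctx.slot] = grad_delivered
        return grad_ring, None             # no gradient for the step index

# Usage: 3 presynaptic neurons feeding one target through integer delays;
# gradients reach `weights` through the queued contributions.
D, N = 5, 3
weights = torch.tensor([0.3, -0.1, 0.2], requires_grad=True)
delays = torch.tensor([1, 3, 2])                      # delays in time steps
ring, total = torch.zeros(D), torch.zeros(())
for t in range(8):
    spikes = (torch.rand(N) < 0.2).float()            # stand-in spike train
    ring = ring.index_add(0, (t + delays) % D, weights * spikes)   # enqueue
    delivered, ring = RingDeliver.apply(ring, t)                   # dequeue
    total = total + delivered
total.backward()
print(weights.grad)
```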

Results & Findings

| Platform | Best Queue | Speedup vs. dense baseline | Memory reduction | Accuracy impact (with 5 % drop) |
|---|---|---|---|---|
| CPU | Tree‑based | 3.2× | ≈ 70 % | < 0.2 % loss |
| GPU (small nets) | Ring buffer | 4.5× | ≈ 60 % | < 0.3 % loss |
| GPU (large nets) | Sparse FIFO | 2.8× | ≈ 80 % | < 0.5 % loss |
| TPU | Sorting‑intrinsic | 3.9× | ≈ 65 % | < 0.2 % loss |
| LPU | Sparse FIFO | 2.5× | ≈ 75 % | < 0.4 % loss |

  • Queue choice matters: CPUs thrive with classic priority‑queue structures; GPUs benefit from contiguous ring buffers until memory pressure forces a switch to sparser representations.
  • Delayed spikes are no longer a performance penalty: The unified queue abstraction handles arbitrary delays without extra copying or padding.
  • Selective spike dropping yields up to an additional 1.5× speedup with negligible impact on training loss, offering a practical knob for large‑scale experiments (a minimal sketch of the mechanism follows this list).
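
As an illustration of the spike‑dropping knob, the snippet below applies a simple random mask to a binary spike tensor before it is enqueued. The paper's criterion for identifying low‑impact spikes is not reproduced here, so the uniform `p_drop` mask is only a stand‑in.

```python
# Hedged sketch of selective spike dropping with a uniform random mask.
import torch

def drop_spikes(spikes: torch.Tensor, p_drop: float = 0.05) -> torch.Tensor:
    """Suppress roughly a fraction p_drop of the spikes in a binary spike tensor."""
    keep = (torch.rand_like(spikes) >= p_drop).to(spikes.dtype)
    return spikes * keep                   # dropped spikes never enter the queue

# Fewer events enqueued means less queue traffic per simulated step.
spikes = (torch.rand(1000) < 0.1).float()
print(spikes.sum().item(), drop_spikes(spikes, 0.05).sum().item())
```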

Practical Implications

  • Faster SNN prototyping: Developers can now train exact‑gradient SNNs on commodity GPUs or TPUs without the memory blow‑up that forced many to use surrogate gradients or event‑free approximations.
  • Scalable neuromorphic ML pipelines: The memory‑lean queues enable training of networks with millions of neurons and realistic synaptic delays—opening doors for biologically plausible models in robotics, brain‑computer interfaces, and low‑power edge AI.
  • Hardware‑aware library design: The benchmark results give clear guidance on which queue implementation to pick for a given target hardware, allowing framework authors (e.g., Brian2, Norse, BindsNET) to expose a simple “backend” selector (see the sketch after this list).
  • Energy‑efficient inference: By dropping low‑impact spikes, inference can be accelerated on LPUs or specialized neuromorphic chips, reducing power consumption while preserving model fidelity.
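
A hypothetical “backend” selector in the spirit of that guidance might look like the sketch below; the enum names and the network‑size threshold are placeholders loosely derived from the benchmark table, not part of the paper or of any existing library.

```python
# Hypothetical hardware-aware queue-backend selector (names and threshold are
# illustrative placeholders, not an existing API).
from enum import Enum

class QueueBackend(Enum):
    TREE = "tree"             # priority queue: CPUs, irregular sparse spiking
    RING = "ring"             # ring buffer: GPUs, smaller networks
    SPARSE_FIFO = "fifo"      # sparse FIFO: large networks, LPUs
    SORT = "sort"             # sorting-intrinsic: TPUs

def select_backend(hardware: str, n_neurons: int) -> QueueBackend:
    if hardware == "cpu":
        return QueueBackend.TREE
    if hardware == "tpu":
        return QueueBackend.SORT
    if hardware == "gpu":
        # Fall back to a sparser layout once memory pressure dominates.
        return QueueBackend.RING if n_neurons < 100_000 else QueueBackend.SPARSE_FIFO
    return QueueBackend.SPARSE_FIFO       # LPUs and other accelerators

print(select_backend("gpu", 10_000))      # QueueBackend.RING
```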

Limitations & Future Work

  • Sparse‑event overhead in very dense spiking regimes (e.g., high‑frequency bursting) can still saturate memory bandwidth, limiting the gains of the current queue designs.
  • Autodiff framework support is currently demonstrated for PyTorch and TensorFlow; integration with JAX or emerging MLIR‑based compilers remains to be explored.
  • Dynamic network topologies (e.g., structural plasticity) were not evaluated; extending the queue abstraction to handle on‑the‑fly graph changes is an open challenge.
  • The authors suggest future work on adaptive queue selection—automatically switching between implementations during a run—and co‑design of autograd primitives that can exploit divergent primal/tangent data structures for even tighter performance‑accuracy trade‑offs.

Authors

  • Lennart P. L. Landsmeer
  • Amirreza Movahedin
  • Said Hamdioui
  • Christos Strydis

Paper Information

  • arXiv ID: 2512.05906v1
  • Categories: cs.NE
  • Published: December 5, 2025
  • PDF: https://arxiv.org/pdf/2512.05906v1