[Paper] Algorithm-hardware co-design of neuromorphic networks with dual memory pathways
Source: arXiv - 2512.07602v1
Overview
This paper tackles a long‑standing bottleneck in neuromorphic engineering: how to keep spiking neural networks (SNNs) both energy‑efficient and memory‑light while still being able to remember context over long time horizons. By co‑designing the algorithm (a dual‑memory pathway network) together with a custom near‑memory compute substrate, the authors demonstrate a system that matches or exceeds state‑of‑the‑art accuracy on long‑sequence tasks while slashing parameters, latency, and power consumption.
Key Contributions
- Dual‑Memory Pathway (DMP) architecture – introduces a slow, low‑dimensional memory vector per layer that aggregates recent spiking activity, mirroring the brain’s fast‑slow cortical organization.
- Parameter‑efficient learning – DMP networks achieve competitive accuracy on long‑sequence benchmarks with 40‑60 % fewer parameters than comparable SNNs.
- Near‑memory compute hardware – a heterogeneous accelerator that keeps the compact DMP state on‑chip, enabling tight coupling of sparse spike processing and dense memory updates.
- Performance gains – experimental silicon results show >4× higher throughput and >5× better energy efficiency versus the best existing neuromorphic implementations.
- Algorithm‑hardware co‑design methodology – demonstrates how biologically inspired abstractions can be turned into concrete hardware primitives that scale.
Methodology
Algorithm side
- Each network layer contains two pathways:
  - Fast pathway: conventional spiking neurons that emit sparse binary events.
  - Slow pathway: a small, continuous‑valued vector (the “slow memory”) that is updated each timestep with a lightweight linear recurrence.
- The slow memory modulates the spiking thresholds and synaptic weights, providing a context window that persists across many spikes without requiring the network to keep a long spike train in memory.
- Training uses surrogate‑gradient back‑propagation, with an additional regularizer that encourages the slow memory to stay low‑dimensional (a minimal layer sketch follows this list).
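As a reading aid, here is a minimal PyTorch‑style sketch of a dual‑pathway layer along the lines described above. The class and parameter names (DMPLayer, slow_dim, the decay factors) and the fast‑sigmoid surrogate are illustrative assumptions, not the authors' implementation; the paper's version also modulates synaptic weights and applies the low‑dimensionality regularizer during training.

```python
import torch
import torch.nn as nn


class SurrogateSpike(torch.autograd.Function):
    """Heaviside spike in the forward pass, fast-sigmoid surrogate in the backward pass."""

    @staticmethod
    def forward(ctx, v):
        ctx.save_for_backward(v)
        return (v > 0).float()

    @staticmethod
    def backward(ctx, grad_out):
        (v,) = ctx.saved_tensors
        return grad_out / (1.0 + 10.0 * v.abs()) ** 2


class DMPLayer(nn.Module):
    """Illustrative dual-memory-pathway layer: fast LIF-style spiking units plus a
    small slow memory vector that modulates the firing threshold each timestep."""

    def __init__(self, n_in, n_fast, slow_dim=16, fast_decay=0.9, slow_decay=0.99):
        super().__init__()
        self.w_in = nn.Linear(n_in, n_fast, bias=False)                # feed-forward synapses
        self.spikes_to_slow = nn.Linear(n_fast, slow_dim, bias=False)  # aggregate spikes into slow memory
        self.slow_to_thresh = nn.Linear(slow_dim, n_fast, bias=False)  # slow memory modulates thresholds
        self.fast_decay, self.slow_decay = fast_decay, slow_decay

    def forward(self, spike_seq):
        # spike_seq: (time, batch, n_in) binary input events
        T, B, _ = spike_seq.shape
        v = spike_seq.new_zeros(B, self.w_in.out_features)             # fast membrane potentials
        m = spike_seq.new_zeros(B, self.spikes_to_slow.out_features)   # slow, low-dimensional memory
        outputs = []
        for t in range(T):
            v = self.fast_decay * v + self.w_in(spike_seq[t])
            theta = 1.0 + self.slow_to_thresh(m)                       # context-dependent threshold
            s = SurrogateSpike.apply(v - theta)
            v = v * (1.0 - s)                                          # reset neurons that fired
            # lightweight linear recurrence: the only state carried across long horizons
            m = self.slow_decay * m + (1.0 - self.slow_decay) * self.spikes_to_slow(s)
            outputs.append(s)
        return torch.stack(outputs), m


# Example: 100 timesteps of random events through one layer.
layer = DMPLayer(n_in=64, n_fast=128, slow_dim=16)
spikes = (torch.rand(100, 4, 64) < 0.05).float()
out, slow_state = layer(spikes)
print(out.shape, slow_state.shape)  # torch.Size([100, 4, 128]) torch.Size([4, 16])
```

The key point the sketch illustrates is that long‑range context lives entirely in the small vector m, not in a buffered spike history.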
Hardware side
- The accelerator is built around a near‑memory compute fabric: the slow memory vectors reside in local SRAM banks adjacent to the compute units, eliminating costly off‑chip traffic.
- Sparse spike engine processes binary events in an event‑driven fashion, while a dense compute engine updates the slow memory using simple matrix‑vector ops.
- A custom dataflow scheduler dynamically routes spikes to the appropriate compute lane and merges the resulting modulation back into the spike generation loop, preserving the event‑driven nature of the system (a behavioral sketch of one timestep follows this list).
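The following is a behavioral model of how one timestep might interleave the two engines. It only illustrates the sparse/event‑driven versus dense split; the function, weight shapes, and constants are assumptions and not the chip's actual microarchitecture.

```python
import numpy as np


def dmp_timestep(events, w_in, w_s2m, w_m2theta, v, m,
                 fast_decay=0.9, slow_decay=0.99):
    """One illustrative timestep: the sparse spike engine touches only active
    input columns, while the dense engine performs a small matrix-vector
    update on the slow memory held in local SRAM (here just a NumPy array)."""
    # --- sparse spike engine: event-driven accumulation, no dense matmul ---
    v = fast_decay * v
    for idx in events:                      # 'events' = indices of input spikes this step
        v = v + w_in[:, idx]                # one weight-column read per event
    theta = 1.0 + w_m2theta @ m             # threshold modulation from the slow memory
    fired = np.flatnonzero(v > theta)
    v[fired] = 0.0                          # reset neurons that fired
    # --- dense engine: compact matrix-vector update of the slow memory ---
    s = np.zeros_like(v)
    s[fired] = 1.0
    m = slow_decay * m + (1.0 - slow_decay) * (w_s2m @ s)
    return fired, v, m


# Example shapes: 64 inputs, 128 fast neurons, 16-d slow memory.
rng = np.random.default_rng(0)
w_in = rng.normal(scale=0.3, size=(128, 64))
w_s2m = rng.normal(scale=0.1, size=(16, 128))
w_m2theta = rng.normal(scale=0.1, size=(128, 16))
v, m = np.zeros(128), np.zeros(16)
fired, v, m = dmp_timestep(events=[3, 17, 42], w_in=w_in, w_s2m=w_s2m,
                           w_m2theta=w_m2theta, v=v, m=m)
print(len(fired), m.shape)
```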
Co‑design loop
- The DMP’s low‑dimensional state size was tuned to match the SRAM capacity of the hardware block, ensuring that the memory footprint stays within a few kilobytes per layer (a back‑of‑envelope footprint check is sketched after this list).
- Simulation‑in‑the‑loop verified that the algorithm’s accuracy was not compromised by the quantization and timing constraints of the hardware.
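A back‑of‑envelope check of the kind this co‑design step implies; the state dimension, bit width, and layer count below are illustrative, not values reported in the paper.

```python
def slow_memory_footprint(slow_dim, bits_per_element=16, n_layers=8):
    """Bytes of on-chip state needed for the slow memory (illustrative only)."""
    per_layer_bytes = slow_dim * bits_per_element / 8
    return per_layer_bytes, per_layer_bytes * n_layers


per_layer, total = slow_memory_footprint(slow_dim=64, bits_per_element=16, n_layers=8)
print(f"{per_layer:.0f} B per layer, {total / 1024:.1f} KB total")
# 128 B per layer, 1.0 KB total -> comfortably within a few KB of local SRAM per layer
```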
Results & Findings
| Metric | DMP + Near‑Memory HW | Prior SNN HW (state‑of‑the‑art) |
|---|---|---|
| Parameters (M) | 0.8‑1.2 (≈ 50 % reduction) | 1.5‑2.5 |
| Top‑1 accuracy (long‑sequence, e.g. DVS‑Gesture) | 92.3 % | 90.8 % |
| Throughput (M events/s) | 4.2× higher | 1× (baseline) |
| Energy / inference (µJ) | 5.3× lower | 1× (baseline) |
| Latency (ms) | < 5 ms for 1 s video | 20‑30 ms |
- The DMP network maintains high sparsity (≈ 2 % active spikes) while still capturing long‑range dependencies thanks to the slow memory.
- Hardware measurements on a 28 nm prototype chip confirm the theoretical gains: the near‑memory layout cuts DRAM accesses by > 90 % and the mixed sparse/dense pipeline keeps the compute units busy, eliminating stalls typical in pure spike‑only accelerators.
Practical Implications
- Edge AI devices (wearables, drones, IoT cameras) can now run sophisticated event‑driven perception models with sub‑millijoule budgets, extending battery life dramatically.
- Real‑time learning becomes feasible on‑chip: the slow memory can be updated online without moving large spike buffers, enabling adaptive filters for robotics or autonomous vehicles.
- The co‑design template (algorithm → compact state → near‑memory accelerator) can be reused for other neuromorphic workloads such as speech processing or tactile sensing, where long temporal context is essential.
- Developers can target the accelerator through a high‑level API (e.g., a PyTorch‑like front‑end) that abstracts away the sparse/dense scheduling, lowering the barrier to entry for software engineers; a hypothetical sketch of such a front‑end follows.
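The paper mentions a PyTorch‑like front‑end but does not specify its API, so the sketch below is purely hypothetical: NeuromorphicCompiler, its compile method, and the target string are invented names used only to illustrate the intended workflow (define a model, hand it to a compiler pass, run inference).

```python
import torch
import torch.nn as nn


class NeuromorphicCompiler:
    """Hypothetical stand-in for a front-end pass that would quantize weights,
    tile slow-memory state into local SRAM, and emit a sparse/dense schedule.
    Here it simply returns the model so the sketch runs end to end."""

    def __init__(self, target: str):
        self.target = target

    def compile(self, model: nn.Module) -> nn.Module:
        # A real toolchain would lower the model to accelerator primitives here.
        return model


# Developer-facing workflow: build a model, compile for the accelerator, run it.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 11))
deployed = NeuromorphicCompiler(target="near_memory_v1").compile(model)
print(deployed(torch.randn(1, 128)).shape)  # torch.Size([1, 11])
```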
Limitations & Future Work
- The current hardware prototype is limited to fixed‑size slow memory vectors; scaling to deeper networks may require hierarchical memory tiling.
- Training still relies on offline surrogate‑gradient back‑propagation; integrating on‑chip learning rules (e.g., STDP) remains an open challenge.
- Benchmarks focus on vision‑centric event datasets; evaluating the DMP approach on audio or multimodal streams would broaden its applicability.
- The authors note that quantization effects on the slow memory become more pronounced at sub‑8‑bit precision, suggesting a need for mixed‑precision strategies in future silicon generations (a small illustration of this rounding error follows this list).
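To make the quantization concern concrete, here is a small illustration of how uniform rounding error on a slow‑memory vector grows as precision drops below 8 bits. The quantizer and the synthetic state are assumptions, not the authors' scheme or data.

```python
import numpy as np


def quantize_uniform(x, bits):
    """Symmetric uniform quantization of a vector to the given bit width (illustrative)."""
    levels = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x)) / levels
    return np.round(x / scale) * scale


rng = np.random.default_rng(0)
m = rng.normal(scale=0.1, size=64)          # a synthetic 64-d slow-memory state
for bits in (16, 8, 6, 4):
    err = np.linalg.norm(m - quantize_uniform(m, bits)) / np.linalg.norm(m)
    print(f"{bits:2d}-bit relative error: {err:.3f}")
```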
Authors
- Pengfei Sun
- Zhe Su
- Jascha Achterberg
- Giacomo Indiveri
- Dan F. M. Goodman
- Danyal Akarca
Paper Information
- arXiv ID: 2512.07602v1
- Categories: cs.NE
- Published: December 8, 2025