How I Built Two Generations of Neuromorphic Processor From Scratch
Source: Dev.to
Overview
Your brain runs on about 20 W. It processes visual scenes, generates speech, and maintains balance — all simultaneously and in real time. The best GPU clusters in the world burn megawatts to approximate what 86 billion neurons do effortlessly.
Neuromorphic processors try to close that gap. Instead of shuttling numbers through ALUs, they mimic biology:
- neurons fire discrete spikes,
- synapses carry weighted connections,
- computation only happens when something actually changes.
Intel’s Loihi chip demonstrated that this can work at scale, but it is proprietary and requires access through Intel’s cloud service.
I built my own – two generations, from scratch, solo, as a university student.
N1 – First‑generation processor
Target: feature‑parity with Intel Loihi 1.
- 128‑core processor
- Each core contains 1 024 CUBA (current‑based) leaky‑integrate‑and‑fire neurons
- 131 072 synapses per core, stored in compressed‑sparse‑row (CSR) format
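As a sketch of why CSR suits a sparse synapse table (the names below are illustrative, not the RTL's): a spike from presynaptic neuron `i` touches only its stored fan-out, never a full dense row.

```python
import numpy as np

# Dense connectivity: rows = presynaptic neurons, cols = postsynaptic targets
dense = np.array([
    [0, 5, 0, 0],
    [0, 0, 0, 7],
    [3, 0, 0, 0],
])

# CSR stores only the nonzeros: row_ptr[i]..row_ptr[i+1] indexes the
# targets/weights belonging to presynaptic neuron i
row_ptr, col_idx, weights = [0], [], []
for row in dense:
    for j, w in enumerate(row):
        if w != 0:
            col_idx.append(j)
            weights.append(int(w))
    row_ptr.append(len(col_idx))

def fanout(pre):
    """Targets and weights reached by a spike from neuron `pre`."""
    s, e = row_ptr[pre], row_ptr[pre + 1]
    return list(zip(col_idx[s:e], weights[s:e]))
```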
Headline features
| Feature | Description |
|---|---|
| Programmable microcode learning engine | 16 registers, 14 op‑codes. Each core runs a small program every timestep (STDP, three‑factor reward learning, homeostatic normalization, or any custom rule) – no RTL changes needed. |
| Dendritic compartment trees | 4 compartments per neuron with configurable join operations (ADD, ABS_MAX, OR, PASS). Dendrites perform local nonlinear processing before signals reach the soma. |
| 8‑bit graded spikes | Neurons carry intensity information, not just fire/no‑fire. This exceeds Loihi 1 (graded spikes only appear in Loihi 2). |
| 24‑bit state precision | One bit more than Loihi 1’s 23‑bit, with RAZ (round‑away‑from‑zero) arithmetic that prevents neurons from getting stuck at non‑resting potentials. |
| Triple RV32IMF RISC‑V cluster | Three embedded processors with IEEE 754 FPU, hardware breakpoints, and a shared mailbox for supervisory control. |
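The kind of rule the learning engine runs can be pictured with pair-based STDP in plain Python. This is the textbook rule in integer arithmetic, not N1's actual microcode ISA; the constants and shift amount are illustrative.

```python
def stdp_update(w, pre_trace, post_trace, pre_spiked, post_spiked,
                a_plus=2, a_minus=3, w_min=0, w_max=255):
    """Pair-based STDP on fixed-point state.

    Potentiate on postsynaptic spikes, scaled by the presynaptic trace
    (pre-before-post); depress on presynaptic spikes, scaled by the
    postsynaptic trace (post-before-pre). Right-shift acts as the
    learning-rate divisor, as an integer datapath would do it.
    """
    if post_spiked:
        w += (a_plus * pre_trace) >> 4   # pre fired recently => strengthen
    if pre_spiked:
        w -= (a_minus * post_trace) >> 4  # post fired recently => weaken
    return max(w_min, min(w_max, w))
```

For example, a post spike arriving while the pre trace is high raises the weight; a pre spike arriving while the post trace is high lowers it.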
Validation
- 25 RTL testbenches covering 98 test scenarios – zero failures.
- SDK at this stage: 168 tests across 14 Python modules.
N2 – Second‑generation processor
Goal: replicate the architectural leap from Loihi 1 → Loihi 2 – making the neuron programmable.
In N1 every neuron runs the same hard-coded CUBA LIF computation, which limits functionality: no bursting, adaptation, oscillation, or graded-error coding without RTL changes.
What changed
- Fixed datapath → fetch‑execute microcode engine.
- Each neuron runs its own program from instruction SRAM.
- Per‑neuron program‑offset register enables different neurons in the same core to execute different programs.
- Register file (R0‑R15) is loaded from neuron‑parameter SRAM each timestep.
- Instruction set includes arithmetic, shifts, min/max, conditional skips, and two spike‑emission modes: HALT (threshold‑based) and EMIT (forced payload).
This shift mirrors graphics: from fixed‑function pixel pipelines to programmable shaders. Once the neuron is programmable, the hardware becomes a platform rather than a fixed implementation.
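A toy fetch-execute interpreter conveys the shape of the change. The opcode names, register layout, and three-instruction program below are invented for illustration and do not match the real N2 ISA.

```python
def run_neuron(program, regs, inp):
    """Execute one timestep of a tiny made-up neuron instruction set."""
    regs = list(regs)
    regs[1] = inp                      # R1 <- synaptic input this timestep
    pc, spike = 0, False
    while pc < len(program):
        op, a, b, dst = program[pc]
        if op == "ADD":
            regs[dst] = regs[a] + regs[b]
        elif op == "SHR":
            regs[dst] = regs[a] >> b
        elif op == "HALT_GE":          # spike and stop if R[a] >= R[b]
            if regs[a] >= regs[b]:
                spike, regs[a] = True, 0
                break
        pc += 1
    return regs, spike

# A LIF-like program: leak the membrane (R0), add input, fire on threshold (R2).
# A different neuron in the same core could point at a different program.
prog = [
    ("SHR", 0, 1, 3),        # R3 = v >> 1  (leak by half)
    ("ADD", 3, 1, 0),        # v  = leaked v + input
    ("HALT_GE", 0, 2, 0),    # spike if v >= threshold
]
regs, spike = run_neuron(prog, regs=[90, 0, 100, 0], inp=60)
```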
Built‑in neuron models (microcode programs)
| Model | Description |
|---|---|
| CUBA LIF | Bit‑identical to N1’s fixed path – reproduces the exact same spike trains. |
| Izhikevich | Two‑variable quadratic model with four presets (regular spiking, intrinsic bursting, chattering, fast spiking). Uses MUL_SHIFT for the v²/2ˢ quadratic term. |
| Adaptive LIF | Adds a slow adaptation current that accumulates on spikes and decays exponentially → spike‑frequency adaptation. |
| Sigma‑Delta | Maintains a running prediction of input; emits the prediction error as a spike payload via EMIT. Achieves temporal sparsity for slowly‑varying signals. |
| Resonate‑and‑Fire | Damped oscillator that fires only when driven at its resonant frequency – no spectral computation needed. |
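For the Izhikevich entry, here is a floating-point Euler sketch of the standard two-variable model with regular-spiking parameters; the hardware would run this in fixed point, computing the quadratic term with MUL_SHIFT.

```python
def izhikevich_step(v, u, I, a=0.02, b=0.2, c=-65.0, d=8.0, dt=1.0):
    """One Euler step of the Izhikevich model (regular-spiking preset).

    v: membrane potential, u: recovery variable, I: input current.
    The 0.04*v*v term is what a fixed-point datapath computes as a
    multiply followed by a right shift.
    """
    v += dt * (0.04 * v * v + 5 * v + 140 - u + I)
    u += dt * a * (b * v - u)
    if v >= 30.0:                      # spike: reset v, bump recovery u
        return c, u + d, True
    return v, u, False

# Constant drive produces regular spiking
v, u, n_spikes = -65.0, -13.0, 0
for _ in range(200):
    v, u, fired = izhikevich_step(v, u, I=10.0)
    n_spikes += fired
```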
Additional architectural enhancements
- 4 graded‑spike payload formats (0/8/16/24 bit) – up from 8‑bit only in N1.
- Variable‑precision weight packing (1/2/4/8/16 bit) – up to 16× memory compression at 1 bit. Loihi 2 tops out at 8 bits; N2's 9‑16 bit range helps networks that need higher precision.
- 5 spike traces (x1, x2, y1, y2, y3) – up from 2 in N1. Enables triplet STDP (Pfister & Gerstner 2006) and complex eligibility traces.
- Convolutional synapse encoding – stores weight kernels once per group; 2‑3× memory reduction for CNN topologies.
- Per‑synapse‑group plasticity enable – 30‑70 % learning‑phase speed‑up in mixed fixed/plastic networks.
- Persistent reward traces with exponential decay – enables temporal credit assignment for reinforcement learning.
- Homeostatic threshold plasticity – epoch‑based proportional error rule, prevents firing‑rate drift in recurrent networks.
- Full observability – 3 performance counters, 25‑variable state probes per neuron, 64‑deep trace FIFO, and energy metering.
- Hardware‑accurate simulation defaults – 24‑bit fixed‑point arithmetic, strict SRAM pool‑depth limits matching RTL.
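Bit-packing weights below a byte boundary is straightforward to sketch. This illustrative packer shows where the 16× figure at 1 bit comes from (relative to 16-bit storage); N2's actual memory layout may differ.

```python
def pack_weights(weights, bits):
    """Pack small unsigned weights into bytes at `bits` bits each."""
    assert all(0 <= w < (1 << bits) for w in weights)
    buf, acc, nacc = bytearray(), 0, 0
    for w in weights:
        acc |= w << nacc              # append the weight's bits
        nacc += bits
        while nacc >= 8:              # flush full bytes
            buf.append(acc & 0xFF)
            acc >>= 8
            nacc -= 8
    if nacc:                          # flush the partial final byte
        buf.append(acc & 0xFF)
    return bytes(buf)

def unpack_weights(buf, bits, count):
    """Recover `count` weights packed at `bits` bits each."""
    out, acc, nacc, i = [], 0, 0, 0
    for _ in range(count):
        while nacc < bits:            # refill the bit accumulator
            acc |= buf[i] << nacc
            i += 1
            nacc += 8
        out.append(acc & ((1 << bits) - 1))
        acc >>= bits
        nacc -= bits
    return out

# Eight binary weights fit in a single byte: 16x smaller than 16-bit words
ones = [1, 0, 1, 1, 0, 1, 0, 0]
packed = pack_weights(ones, bits=1)
```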
Physical validation (AWS F2 instance, Xilinx VU47P)
| Metric | Value |
|---|---|
| Clock | 62.5 MHz neuromorphic clock / 250 MHz PCIe |
| Cores | 16‑core instance (full 128‑core design validated in simulation) |
| Integration tests | 28/28 passing |
| RTL‑level tests | 9 tests generating 163 000+ spikes with zero mismatches |
| Dual‑clock CDC | Gray‑code async FIFOs |
| Throughput | ~8 690 timesteps/second |
| BRAM utilization | 56 % aggregate for 16 cores (BRAM is the binding constraint) |
| Scalability | Full 128‑core design would need a larger device or multi‑FPGA partitioning. |
End‑to‑end demonstration
Task: Train a recurrent SNN on the Spiking Heidelberg Digits (SHD) dataset (10 420 spoken‑digit recordings encoded as 700‑channel cochlea spike trains).
- Architecture: 700 input → 768 recurrent hidden → 20 output.
- Training: Surrogate gradients (fast sigmoid) with AdamW.
- Quantisation: Weights quantised to 16 bits for hardware deployment.
Result:
- Accuracy before quantisation: 85.9 %
- Accuracy after quantisation: 85.4 % (a 0.5‑point drop)
This surpasses published baselines: Cramer et al. (83.2 %) and Zenke & Vogels (83.4 %).
SDK growth
| Version | Tests | Python modules |
|---|---|---|
| N1 | 168 | 14 |
| N2 | 3,091 (≈18×) | 88 |
TL;DR
- N1 – 128‑core, 1 024 CUBA LIF neurons/core, 8‑bit graded spikes, 24‑bit state, programmable microcode learning engine.
- N2 – adds a per‑neuron microcode engine, five built‑in neuron models, richer spike‑payload/weight formats, more trace memory, and a host of plasticity/observability features.
- Both generations have been thoroughly validated (RTL testbenches, integration tests, hardware runs) and demonstrated on a real SNN task with state‑of‑the‑art accuracy.
N2 statistics (vs N1)
| Category | N1 | N2 |
|---|---|---|
| Test cases | 168 | 3,091 |
| Python modules | 14 | 88 |
| Neuron models | 1 | 5 |
| Synapse formats | 3 | 4 |
| Weight precisions | 1 | 5 |
| Features | — | 155 (152 FULL, 3 HW_ONLY) |
| Lines of Python | ~8 K | ~52 K |
Back‑ends
Three back‑ends (CPU cycle‑accurate, GPU via PyTorch, FPGA) share the same deploy / step / get_result API.
The GPU simulator delivers a 100–1000× speed‑up over the CPU version.
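The shared-interface idea can be sketched as a structural protocol. Only the three method names (deploy, step, get_result) come from the post; the argument shapes and the toy CPU backend below are my own assumptions.

```python
from typing import Protocol

class Backend(Protocol):
    """Structural interface the three back-ends have in common."""
    def deploy(self, network: dict) -> None: ...
    def step(self, n: int = 1) -> None: ...
    def get_result(self) -> dict: ...

class ToyCpuBackend:
    """Stand-in backend that just counts timesteps."""
    def __init__(self):
        self.t, self.network = 0, None
    def deploy(self, network):
        self.network = network
    def step(self, n=1):
        self.t += n
    def get_result(self):
        return {"timesteps": self.t}

def run(backend: Backend, network: dict, timesteps: int) -> dict:
    # The same driver code works unchanged against CPU, GPU, or FPGA
    backend.deploy(network)
    backend.step(timesteps)
    return backend.get_result()

result = run(ToyCpuBackend(), {"populations": []}, timesteps=1000)
```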
No‑install option – Catalyst Cloud
```shell
pip install catalyst-cloud
```

```python
from catalyst_cloud import CatalystClient

client = CatalystClient(api_key="your_key")

network = {
    "populations": [
        {"name": "input", "n": 100, "params": {"threshold": 1000}},
        {"name": "output", "n": 10, "params": {"threshold": 600}}
    ],
    "connections": [
        {"from": "input", "to": "output", "weight": 500, "probability": 0.3}
    ]
}

job = client.submit(network, timesteps=1000)
result = job.wait()
print(result.spike_counts)
```
- Free tier for research – no credit‑card required.
- Cloud API:
- Python cloud client: `pip install catalyst-cloud`
Links & Resources
- GitHub repository:
- Full SDK source (from $25 /mo – full N1 + N2 source):
- Support / donations:
- Contact:
License
Licensed under BSL 1.1 – source‑available, free for research; commercial use requires a paid licence.
Project snapshot
- 238 development phases
- 2 processors
- 3,091 tests
- Built by a single developer at the University of Aberdeen
If you work on SNNs, neuromorphic computing, or alternative computing projects, feel free to get in touch!