[Paper] Revisiting Direct Encoding: Learnable Temporal Dynamics for Static Image Spiking Neural Networks
Source: arXiv - 2512.01687v1
Overview
Static images—think of the pictures you feed into a CNN—don’t have an intrinsic time axis, yet spiking neural networks (SNNs) rely on temporal spikes to compute. This paper revisits why “direct encoding” (simply copying the same image over many timesteps) has historically lagged behind rate‑based encodings, and shows that the gap is mostly due to how the network is trained rather than the encoding itself. By adding a tiny, learnable temporal shift to each input channel, the authors enable genuine temporal dynamics without sacrificing the simplicity of direct encoding.
Key Contributions
- Diagnostic analysis that isolates the true cause of the performance gap between direct and rate encodings: convolutional learnability and surrogate‑gradient design, rather than the encoding itself.
- Minimal learnable temporal encoder: a set of adaptive phase‑shift parameters that turn a static image into a temporally varying spike train.
- Empirical validation on standard vision benchmarks (CIFAR‑10/100 and an ImageNet‑mini subset) showing that the new encoder closes the accuracy gap while keeping inference latency low.
- A unified training recipe that works for both direct and rate‑based pipelines, making it easier for practitioners to experiment with SNNs.
Methodology
- Baseline Direct Encoding – The image is duplicated across T timesteps, producing identical input spikes at each step.
- Problem Identification – The authors replace the convolutional layers with a simple linear mapping and observe that the performance gap disappears, indicating that the bottleneck lies in how the network learns temporal features.
- Learnable Temporal Encoder – For each input channel c, a scalar phase shift ϕ_c is learned. The static pixel value x_c is transformed into a spike probability that oscillates over time:

  p_{c,t} = σ( x_c · sin(ω·t + ϕ_c) )

  where σ is a sigmoid surrogate and ω is a fixed angular frequency. This injects a gentle, learnable temporal ripple into the otherwise static signal (a code sketch follows this list).
- Training Loop – Standard surrogate‑gradient back‑propagation is used, but with the added phase‑shift parameters updated jointly with the network weights.
- Evaluation – The same SNN architecture is trained under three conditions: (i) pure direct encoding, (ii) rate‑based Poisson encoding, and (iii) direct encoding + learnable phase shifts.
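The summary gives the encoding formula but no reference implementation, so here is a minimal PyTorch sketch of the phase‑shift encoder under stated assumptions: the module name, the zero initialization of the phases, and the default ω = 1.0 are illustrative choices; only the mapping p_{c,t} = σ(x_c · sin(ω·t + ϕ_c)) comes from the paper.

```python
import torch
import torch.nn as nn


class PhaseShiftEncoder(nn.Module):
    """Sketch of the paper's learnable temporal encoder.

    Turns a static image (B, C, H, W) into a temporal signal
    (T, B, C, H, W) via one learnable phase shift per channel.
    Module name, initialization, and omega are assumptions.
    """

    def __init__(self, num_channels: int, timesteps: int, omega: float = 1.0):
        super().__init__()
        self.timesteps = timesteps
        self.omega = omega
        # One learnable scalar phase phi_c per input channel.
        self.phase = nn.Parameter(torch.zeros(num_channels))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Baseline direct encoding would simply copy the image T times:
        #   x.unsqueeze(0).expand(self.timesteps, -1, -1, -1, -1)
        t = torch.arange(self.timesteps, device=x.device, dtype=x.dtype)
        # sin(omega * t + phi_c), broadcast to shape (T, 1, C, 1, 1).
        ripple = torch.sin(self.omega * t.view(-1, 1, 1, 1, 1)
                           + self.phase.view(1, 1, -1, 1, 1))
        # p_{c,t} = sigmoid(x_c * sin(omega * t + phi_c))
        return torch.sigmoid(x.unsqueeze(0) * ripple)
```

Because self.phase is registered as an nn.Parameter, any optimizer built over model.parameters() updates ϕ_c jointly with the network weights, matching the training‑loop description above; the surrogate gradient itself lives in the downstream spiking layers, not in this encoder.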
Results & Findings
| Dataset | Direct (no encoder) | Rate‑based | Direct + Learnable Phase |
|---|---|---|---|
| CIFAR‑10 | 78.2 % | 80.5 % | 81.1 % |
| CIFAR‑100 | 53.4 % | 55.9 % | 56.7 % |
| ImageNet‑mini | 62.1 % | 64.3 % | 64.9 % |
- The learnable phase encoder consistently outperforms both baselines while adding negligible computational overhead (just a few extra scalar parameters).
- Spike count per inference remains comparable to pure direct encoding, preserving the sparse, energy‑efficient operation that makes SNNs attractive.
- Ablation studies confirm that the improvement stems from the temporal diversity introduced by the phase shifts, not from extra network capacity.
Practical Implications
- Energy‑efficient vision on edge devices – Developers can keep the simple direct‑copy input pipeline (which is cheap to implement on neuromorphic hardware) and still reap the accuracy benefits of temporal coding.
- Plug‑and‑play module – The phase‑shift encoder is a drop‑in layer that can be added to existing SNN frameworks (e.g., BindsNET, Norse) without redesigning the whole architecture; see the usage sketch after this list.
- Faster prototyping – Since the encoder does not require stochastic Poisson spike generation, training pipelines become deterministic and easier to debug, a boon for production‑level ML engineering.
- Potential for multimodal fusion – The same principle could be applied to static sensor data (e.g., LiDAR intensity maps) to give them a temporal “voice” before feeding into spiking perception stacks.
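To make the plug‑and‑play claim concrete, here is a hypothetical usage sketch that reuses the PhaseShiftEncoder from the Methodology section; the backbone stand‑in, batch size, and hyperparameters are placeholders rather than values from the paper.

```python
import torch
import torch.nn as nn

# PhaseShiftEncoder is the sketch shown in the Methodology section.
encoder = PhaseShiftEncoder(num_channels=3, timesteps=4)
backbone = nn.Identity()  # stand-in for any spiking backbone that
                          # consumes a (T, B, C, H, W) tensor
model = nn.Sequential(encoder, backbone)

images = torch.rand(8, 3, 32, 32)  # batch of normalized static images
out = model(images)                # shape (4, 8, 3, 32, 32); deterministic

# Contrast with stochastic rate coding, which resamples on every call:
poisson_spikes = torch.bernoulli(images.expand(4, -1, -1, -1, -1))

# Phase shifts and backbone weights live in one parameter list,
# so a single optimizer trains them jointly.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```

The deterministic forward pass is what simplifies debugging relative to Poisson encoding: the same image always produces the same spike probabilities.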
Limitations & Future Work
- The current encoder uses a single sinusoidal frequency for all channels; richer temporal bases (e.g., learned waveforms) might capture more complex dynamics.
- Experiments are limited to image classification; extending the approach to detection, segmentation, or reinforcement‑learning tasks remains open.
- The study focuses on offline training; investigating how the phase parameters adapt in continual‑learning or on‑device learning scenarios would be valuable.
Bottom line: By injecting a tiny, learnable temporal twist into static inputs, this work shows that direct encoding can be just as powerful as traditional rate‑based schemes—opening a practical path for developers to deploy high‑performing, low‑power spiking vision models.
Authors
- Huaxu He
Paper Information
- arXiv ID: 2512.01687v1
- Categories: cs.NE, cs.CV
- Published: December 1, 2025
- PDF: https://arxiv.org/pdf/2512.01687v1