[Paper] Revisiting Direct Encoding: Learnable Temporal Dynamics for Static Image Spiking Neural Networks
Source: arXiv - 2512.01687v1
Overview
Static images—think of the pictures you feed into a CNN—don’t have an intrinsic time axis, yet spiking neural networks (SNNs) rely on temporal spikes to compute. This paper revisits why “direct encoding” (simply copying the same image over many timesteps) has historically lagged behind rate‑based encodings, and shows that the gap is mostly due to how the network is trained rather than the encoding itself. By adding a tiny, learnable temporal shift to each input channel, the authors enable genuine temporal dynamics without sacrificing the simplicity of direct encoding.
Key Contributions
- Diagnostic analysis that isolates the true cause of the performance gap between direct and rate encodings: convolutional learnability and surrogate‑gradient design, rather than the encoding itself.
- Minimal learnable temporal encoder: a set of adaptive phase‑shift parameters that turn a static image into a temporally varying spike train.
- Empirical validation on standard vision benchmarks (CIFAR‑10/100 and an ImageNet‑mini subset) showing that the new encoder closes the accuracy gap while keeping inference latency low.
- A unified training recipe that works for both direct and rate‑based pipelines, making it easier for practitioners to experiment with SNNs.
Methodology
- Baseline Direct Encoding – The image is duplicated across T timesteps, producing identical input spikes at each step.
- Problem Identification – The authors replace the convolutional layers with a simple linear mapping and observe that the performance gap disappears, indicating that the bottleneck lies in how the network learns temporal features.
- Learnable Temporal Encoder – For each input channel c, a scalar phase shift ϕ_c is learned. The static pixel value x_c is transformed into a spike probability that oscillates over time:

  p_{c,t} = σ( x_c · sin(ω·t + ϕ_c) )

  where σ is a sigmoid surrogate and ω is a fixed angular frequency. This injects a gentle, learnable temporal ripple into the otherwise static signal (a code sketch follows this list).
- Training Loop – Standard surrogate‑gradient back‑propagation is used, but with the added phase‑shift parameters updated jointly with the network weights.
- Evaluation – The same SNN architecture is trained under three conditions: (i) pure direct encoding, (ii) rate‑based Poisson encoding, and (iii) direct encoding + learnable phase shifts.
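The summary gives the encoding formula but no reference implementation, so here is a minimal PyTorch sketch of the phase‑shift encoder under stated assumptions: the module name, the zero initialization of the phases, and the default ω = 1.0 are illustrative choices; only the mapping p_{c,t} = σ(x_c · sin(ω·t + ϕ_c)) comes from the paper.

```python
import torch
import torch.nn as nn


class PhaseShiftEncoder(nn.Module):
    """Sketch of the paper's learnable temporal encoder.

    Turns a static image (B, C, H, W) into a temporal signal
    (T, B, C, H, W) via one learnable phase shift per channel.
    Module name, initialization, and omega are assumptions.
    """

    def __init__(self, num_channels: int, timesteps: int, omega: float = 1.0):
        super().__init__()
        self.timesteps = timesteps
        self.omega = omega
        # One learnable scalar phase phi_c per input channel.
        self.phase = nn.Parameter(torch.zeros(num_channels))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Baseline direct encoding would simply copy the image T times:
        #   x.unsqueeze(0).expand(self.timesteps, -1, -1, -1, -1)
        t = torch.arange(self.timesteps, device=x.device, dtype=x.dtype)
        # sin(omega * t + phi_c), broadcast to shape (T, 1, C, 1, 1).
        ripple = torch.sin(self.omega * t.view(-1, 1, 1, 1, 1)
                           + self.phase.view(1, 1, -1, 1, 1))
        # p_{c,t} = sigmoid(x_c * sin(omega * t + phi_c))
        return torch.sigmoid(x.unsqueeze(0) * ripple)
```

Because self.phase is registered as an nn.Parameter, any optimizer built over model.parameters() updates ϕ_c jointly with the network weights, matching the training‑loop description above; the surrogate gradient itself lives in the downstream spiking layers, not in this encoder.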
Results & Findings
| Dataset | Direct (no encoder) | Rate‑based | Direct + Learnable Phase |
|---|---|---|---|
| CIFAR‑10 | 78.2 % | 80.5 % | 81.1 % |
| CIFAR‑100 | 53.4 % | 55.9 % | 56.7 % |
| ImageNet‑mini | 62.1 % | 64.3 % | 64.9 % |
- The learnable phase encoder consistently outperforms both baselines while adding negligible computational overhead (just a few extra scalar parameters).
- Spike count per inference remains comparable to pure direct encoding, preserving the sparse, energy‑efficient operation that makes SNNs attractive.
- Ablation studies confirm that the improvement stems from the temporal diversity introduced by the phase shifts, not from extra network capacity.
Practical Implications
- Energy‑efficient vision on edge devices – Developers can keep the simple direct‑copy input pipeline (which is cheap to implement on neuromorphic hardware) and still reap the accuracy benefits of temporal coding.
- Plug‑and‑play module – The phase‑shift encoder is a drop‑in layer that can be added to existing SNN frameworks (e.g., BindsNET, Norse) without redesigning the whole architecture; see the usage sketch after this list.
- Faster prototyping – Since the encoder does not require stochastic Poisson spike generation, training pipelines become deterministic and easier to debug, a boon for production‑level ML engineering.
- Potential for multimodal fusion – The same principle could be applied to static sensor data (e.g., LiDAR intensity maps) to give them a temporal “voice” before feeding into spiking perception stacks.
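To make the plug‑and‑play claim concrete, here is a hypothetical usage sketch that reuses the PhaseShiftEncoder from the Methodology section; the backbone stand‑in, batch size, and hyperparameters are placeholders rather than values from the paper.

```python
import torch
import torch.nn as nn

# PhaseShiftEncoder is the sketch shown in the Methodology section.
encoder = PhaseShiftEncoder(num_channels=3, timesteps=4)
backbone = nn.Identity()  # stand-in for any spiking backbone that
                          # consumes a (T, B, C, H, W) tensor
model = nn.Sequential(encoder, backbone)

images = torch.rand(8, 3, 32, 32)  # batch of normalized static images
out = model(images)                # shape (4, 8, 3, 32, 32); deterministic

# Contrast with stochastic rate coding, which resamples on every call:
poisson_spikes = torch.bernoulli(images.expand(4, -1, -1, -1, -1))

# Phase shifts and backbone weights live in one parameter list,
# so a single optimizer trains them jointly.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```

The deterministic forward pass is what simplifies debugging relative to Poisson encoding: the same image always produces the same spike probabilities.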
Limitations & Future Work
- The current encoder uses a single sinusoidal frequency for all channels; richer temporal bases (e.g., learned waveforms) might capture more complex dynamics.
- Experiments are limited to image classification; extending the approach to detection, segmentation, or reinforcement‑learning tasks remains open.
- The study focuses on offline training; investigating how the phase parameters adapt in continual‑learning or on‑device learning scenarios would be valuable.
Bottom line: By injecting a tiny, learnable temporal twist into static inputs, this work shows that direct encoding can be just as powerful as traditional rate‑based schemes—opening a practical path for developers to deploy high‑performing, low‑power spiking vision models.
Authors
- Huaxu He
Paper Information
- arXiv ID: 2512.01687v1
- Categories: cs.NE, cs.CV
- Published: December 1, 2025
- PDF: https://arxiv.org/pdf/2512.01687v1