[Paper] Unleashing Temporal Capacity of Spiking Neural Networks through Spatiotemporal Separation

Published: December 5, 2025 at 02:05 AM EST
4 min read

Source: arXiv - 2512.05472v1

Overview

This paper challenges the common belief that the membrane‑potential dynamics of Spiking Neural Networks (SNNs) are the primary driver of their temporal reasoning ability. By systematically stripping away the “stateful” membrane propagation, the authors uncover a surprising trade‑off between spatial (semantic) and temporal (motion) capacity, and they propose a new architecture—Spatial‑Temporal Separable Network (STSep)—that explicitly splits these two roles to boost video‑understanding performance.

Key Contributions

  • Empirical dissection of temporal modeling in SNNs: Introduces Non‑Stateful (NS) variants that progressively remove membrane propagation, quantifying its impact layer by layer.
  • Discovery of spatio‑temporal resource competition: Shows that excessive reliance on temporal state hurts spatial feature learning, while a moderate reduction can actually improve accuracy.
  • Design of STSep: A residual block split into two independent branches—one purely spatial (semantic extraction) and one purely temporal (explicit frame‑difference processing).
  • State‑of‑the‑art results on video benchmarks: Achieves superior accuracy on Something‑Something V2, UCF101, and HMDB51 compared with prior SNN baselines.
  • Interpretability evidence: Retrieval experiments and attention visualizations demonstrate that the temporal branch focuses on motion cues rather than static appearance.

Methodology

  1. Non‑Stateful (NS) Ablation:

    • Starting from a conventional SNN, the authors replace the membrane update (V[t] = α·V[t‑1] + I[t]) with a stateless version that discards the previous potential (V[t] = I[t]); a minimal code sketch of this switch follows the list.
    • They apply this replacement selectively to shallow layers, deep layers, or all layers, creating a family of NS models.
  2. Quantitative Analysis:

    • Each NS variant is trained on video classification tasks, and performance is compared to the fully stateful baseline.
    • The authors track how accuracy changes as temporal state is removed, revealing a “sweet spot” where partial removal helps.
  3. STSep Architecture:

    • Spatial Branch: Standard convolutional residual block (no temporal state) that extracts high‑level semantics from each frame.
    • Temporal Branch: Computes explicit temporal differences between consecutive frames, feeds them through a lightweight spiking block, and aggregates motion information.
    • The two branches are merged via addition, preserving the overall residual structure while keeping the two processing streams independent (see the STSep sketch after this list).
  4. Training & Evaluation:

    • Use surrogate gradient descent to train the spiking networks end‑to‑end (a brief training‑step sketch also follows the list).
    • Evaluate on three popular video datasets (Something‑Something V2, UCF101, HMDB51) and also perform a video‑retrieval test to probe what the network attends to.
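
The core of the NS ablation is a one‑line change in the neuron update. The sketch below is a minimal PyTorch‑style illustration, assuming a hard‑reset LIF neuron and a rectangular surrogate gradient; the class names, the stateful flag, and the surrogate width are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of the Non-Stateful (NS) ablation: a LIF layer whose membrane
# propagation can be switched off. Names and hyper-parameters are illustrative.
import torch
import torch.nn as nn


class SurrogateSpike(torch.autograd.Function):
    """Heaviside spike in the forward pass, rectangular surrogate gradient in the backward pass."""

    @staticmethod
    def forward(ctx, v_minus_thresh):
        ctx.save_for_backward(v_minus_thresh)
        return (v_minus_thresh >= 0).float()

    @staticmethod
    def backward(ctx, grad_output):
        (v_minus_thresh,) = ctx.saved_tensors
        # Pass gradients only in a window of width 1 around the threshold.
        return grad_output * (v_minus_thresh.abs() < 0.5).float()


class LIFNeuron(nn.Module):
    """LIF layer with a switch that removes membrane propagation (the NS variant)."""

    def __init__(self, alpha: float = 0.5, v_thresh: float = 1.0, stateful: bool = True):
        super().__init__()
        self.alpha, self.v_thresh, self.stateful = alpha, v_thresh, stateful

    def forward(self, currents: torch.Tensor) -> torch.Tensor:
        # currents: [T, B, ...] input current per time step.
        v = torch.zeros_like(currents[0])
        spikes = []
        for t in range(currents.shape[0]):
            if self.stateful:
                v = self.alpha * v + currents[t]   # stateful: V[t] = alpha * V[t-1] + I[t]
            else:
                v = currents[t]                    # NS:       V[t] = I[t]
            s = SurrogateSpike.apply(v - self.v_thresh)
            v = v * (1.0 - s)                      # hard reset after a spike
            spikes.append(s)
        return torch.stack(spikes)
```

Building a family of NS models then amounts to constructing the network with stateful=False in the chosen layers (shallow only, deep only, or all of them).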
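
For the STSep block itself, the sketch below keeps the two streams separate: a stateless spatial branch applied to every frame and a temporal branch fed with explicit frame differences, merged by addition inside a residual connection. It reuses the LIFNeuron defined above; the layer widths, the depthwise convolution in the temporal branch, and the zero difference assigned to the first frame are assumptions made for illustration, not the authors' exact configuration.

```python
# Illustrative sketch of a Spatial-Temporal Separable (STSep) residual block.
import torch
import torch.nn as nn


class STSepBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Spatial branch: per-frame semantics, no temporal state.
        self.spatial = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.spatial_spike = LIFNeuron(stateful=False)
        # Temporal branch: lightweight spiking block on explicit frame differences.
        self.temporal = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, groups=channels, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.temporal_spike = LIFNeuron(stateful=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [T, B, C, H, W] feature tensor over T frames.
        T, B, C, H, W = x.shape
        # Spatial branch processes each frame independently.
        spatial = self.spatial(x.reshape(T * B, C, H, W)).reshape(T, B, C, H, W)
        spatial = self.spatial_spike(spatial)
        # Temporal branch sees frame-to-frame differences (zero for the first frame).
        diff = torch.cat([torch.zeros_like(x[:1]), x[1:] - x[:-1]], dim=0)
        temporal = self.temporal(diff.reshape(T * B, C, H, W)).reshape(T, B, C, H, W)
        temporal = self.temporal_spike(temporal)
        # Merge the branches by addition and keep the residual connection.
        return x + spatial + temporal
```

As a quick shape check, STSepBlock(64)(torch.rand(8, 2, 64, 32, 32)) returns a tensor with the same [T, B, C, H, W] shape as its input.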
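
Finally, a brief training-step sketch under the same assumptions: the network emits class-wise spike trains and predictions are read out by averaging spikes over time (the rate-coded readout is an assumption here); surrogate gradients from the neurons above make the pipeline differentiable end to end.

```python
# Minimal end-to-end training step with surrogate gradients.
# The model, readout, and hyper-parameters are placeholders for illustration;
# the paper's full training recipe is not reproduced here.
import torch
import torch.nn as nn


def train_step(model: nn.Module, clips: torch.Tensor, labels: torch.Tensor,
               optimizer: torch.optim.Optimizer) -> float:
    """One optimization step on a batch of video clips shaped [T, B, C, H, W]."""
    optimizer.zero_grad()
    spike_trains = model(clips)              # assumed output: [T, B, num_classes]
    logits = spike_trains.mean(dim=0)        # rate-coded readout over time steps
    loss = nn.functional.cross_entropy(logits, labels)
    loss.backward()                          # gradients flow through the surrogate spikes
    optimizer.step()
    return loss.item()
```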

Results & Findings

Dataset                | Baseline SNN (stateful) | Best NS variant          | STSep (proposed)
Something‑Something V2 | 58.3 %                  | 60.1 % (partial removal) | 63.7 %
UCF101                 | 84.2 %                  | 85.0 %                   | 86.9 %
HMDB51                 | 55.6 %                  | 56.4 %                   | 58.8 %

  • Partial removal improves performance: Removing temporal state from only shallow or only deep layers yields a 1–2 % boost, confirming the competition hypothesis.
  • Full removal collapses learning: When all layers become stateless, accuracy drops dramatically, showing that some temporal capacity is still essential.
  • STSep outperforms all NS variants: By dedicating separate pathways, STSep captures motion without sacrificing semantic richness, leading to the best scores across the board.
  • Interpretability: Retrieval experiments show that videos retrieved by the temporal branch share motion patterns (e.g., “pushing” vs. “pulling”) rather than visual texture, and attention maps highlight moving regions.

Practical Implications

  • More efficient video models for edge devices: STSep retains the low‑power, event‑driven nature of SNNs while delivering higher accuracy, making it attractive for neuromorphic chips in surveillance cameras, drones, or AR glasses.
  • Design guideline for spiking architectures: When building SNNs for temporal data, allocate dedicated temporal modules (e.g., frame‑difference layers) instead of relying solely on membrane dynamics.
  • Simplified training pipelines: Decoupling spatial and temporal streams reduces the need for carefully tuned membrane decay constants, easing hyper‑parameter search for developers.
  • Potential for multimodal fusion: The spatial branch can be swapped with other feature extractors (e.g., audio or depth encoders) while keeping the temporal branch unchanged, facilitating cross‑modal spiking systems.

Limitations & Future Work

  • Dataset scope: Experiments focus on relatively short, trimmed video clips; performance on long, continuous streams (e.g., real‑time surveillance) remains untested.
  • Hardware validation: The paper reports FLOPs and accuracy but does not provide measurements on actual neuromorphic hardware (Loihi, TrueNorth), leaving energy‑efficiency gains to be verified.
  • Temporal granularity: The explicit difference operation assumes a fixed frame rate; adapting to variable‑rate event streams could require additional mechanisms.
  • Future directions: Extending STSep to handle asynchronous event cameras, exploring adaptive allocation of spatial vs. temporal capacity, and integrating learnable temporal kernels beyond simple differences.

Authors

  • Yiting Dong
  • Zhaofei Yu
  • Jianhao Ding
  • Zijie Xu
  • Tiejun Huang

Paper Information

  • arXiv ID: 2512.05472v1
  • Categories: cs.NE
  • Published: December 5, 2025