[Paper] Surgical Scene Segmentation using a Spike-Driven Video Transformer with Real-Time Potential

Published: December 24, 2025 at 12:05 PM EST
4 min read

Source: arXiv - 2512.21284v1

Overview

The paper introduces SpikeSurgSeg, a spike‑driven video Transformer designed for real‑time surgical scene segmentation. By marrying spiking neural networks (SNNs) with a transformer backbone, the authors achieve segmentation quality on par with heavyweight ANN models while slashing latency and power consumption—making it viable for deployment on low‑power, non‑GPU hardware in the operating room.

Key Contributions

  • First spike‑driven video Transformer for surgery – a novel architecture that processes video frames as sparse spike streams, preserving temporal context without the heavy compute of conventional CNN/Transformer pipelines.
  • Surgical‑scene masked autoencoding pre‑training – a self‑supervised scheme that masks spatio‑temporal “tubes” of spikes, forcing the backbone to learn robust representations from limited labeled data.
  • Lightweight spike‑driven segmentation head – produces temporally consistent masks while keeping inference latency ultra‑low.
  • Real‑time performance on edge hardware – demonstrated ≥ 8× lower latency than state‑of‑the‑art ANN models and > 20× speed‑up versus large foundation models, with only a marginal drop in mean Intersection‑over‑Union (mIoU).
  • Extensive validation – experiments on the public EndoVis18 benchmark and a proprietary SurgBleed dataset show competitive accuracy (mIoU within a few points of SOTA) with dramatically reduced compute.

Methodology

  1. Spike‑driven backbone – The authors start from a Vision Transformer (ViT) but replace the standard dense activations with binary spikes generated by leaky‑integrate‑and‑fire (LIF) neurons. This yields event‑like data that is naturally sparse in both space and time (a minimal LIF sketch follows this list).
  2. Masked autoencoding pre‑training – Inspired by MAE, they randomly mask contiguous “tubes” (spatial patches across several frames) of spike activity. The network learns to reconstruct the missing spikes, encouraging it to capture long‑range spatio‑temporal patterns without pixel‑level labels (see the tube‑masking sketch after this list).
  3. Layer‑wise tube masking – Masking is applied progressively across transformer layers, allowing early layers to focus on low‑level motion cues while deeper layers capture higher‑level semantics.
  4. Segmentation head – A tiny spike‑based decoder (a few linear layers followed by a spike‑softmax) upsamples the transformer output to pixel‑level class scores. Temporal consistency is enforced by feeding the previous frame’s spike state into the current step, yielding smooth mask trajectories.
  5. Training pipeline – After self‑supervised pre‑training, the backbone is fine‑tuned on the limited surgical segmentation labels using a standard cross‑entropy loss, while the spiking dynamics remain unchanged.
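The summary above includes no reference code, so as a rough illustration of step 1 the following PyTorch‑style sketch shows how a leaky‑integrate‑and‑fire activation can replace a dense nonlinearity, producing binary spike tensors for downstream layers. The names (`SurrogateSpike`, `LIFNeuron`, `tau`, `v_threshold`) and the surrogate‑gradient choice are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch, not the authors' code: an LIF activation with a surrogate gradient.
import torch
import torch.nn as nn


class SurrogateSpike(torch.autograd.Function):
    """Heaviside spike in the forward pass, steep-sigmoid surrogate gradient in the backward pass."""

    @staticmethod
    def forward(ctx, v_minus_thresh):
        ctx.save_for_backward(v_minus_thresh)
        return (v_minus_thresh > 0).float()

    @staticmethod
    def backward(ctx, grad_output):
        (v_minus_thresh,) = ctx.saved_tensors
        sg = torch.sigmoid(4.0 * v_minus_thresh)   # differentiable stand-in for the step function
        return grad_output * 4.0 * sg * (1.0 - sg)


class LIFNeuron(nn.Module):
    """Leaky-integrate-and-fire activation: replaces a dense nonlinearity with binary spikes."""

    def __init__(self, tau: float = 2.0, v_threshold: float = 1.0):
        super().__init__()
        self.tau = tau
        self.v_threshold = v_threshold

    def forward(self, x):
        # x: (T, B, N, C) - input current per time step (e.g., per video frame).
        v = torch.zeros_like(x[0])            # membrane potential
        spikes = []
        for t in range(x.shape[0]):
            v = v + (x[t] - v) / self.tau     # leaky integration
            s = SurrogateSpike.apply(v - self.v_threshold)
            v = v * (1.0 - s)                 # hard reset where a spike was emitted
            spikes.append(s)
        return torch.stack(spikes)            # binary {0, 1} tensor, sparse in space and time
```

Because the membrane potential `v` is carried across time steps, the same mechanism also suggests how the segmentation head (step 4) can keep masks temporally consistent by feeding the previous frame's spike state into the current one.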
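Step 2's tube masking can likewise be pictured as hiding the same spatial patches across a whole window of frames. Below is a minimal sketch of such a mask generator, assuming a `(batch, frames, patches)` token layout and a `mask_ratio` hyper‑parameter; the paper's layer‑wise masking schedule (step 3) is not reproduced here.

```python
# Illustrative sketch, not the authors' code: spatio-temporal "tube" mask generation.
import torch


def tube_mask(batch: int, frames: int, patches: int, mask_ratio: float = 0.75):
    """Sample a tube mask: the same patch indices are hidden in every frame.

    Returns a boolean tensor of shape (batch, frames, patches) where True marks masked tokens.
    """
    num_masked = int(patches * mask_ratio)
    noise = torch.rand(batch, patches)              # independent random ordering per sample
    ids = noise.argsort(dim=1)                      # random permutation of patch indices
    masked_patches = ids[:, :num_masked]            # first `num_masked` patches are hidden
    mask = torch.zeros(batch, patches, dtype=torch.bool)
    mask.scatter_(1, masked_patches, True)
    return mask.unsqueeze(1).expand(batch, frames, patches)  # replicate along the time axis


# During pre-training, only the visible tokens are encoded and the decoder is trained to
# reconstruct the spike activity of the masked tubes, e.g.:
# loss = ((pred - target) ** 2)[mask].mean()
```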

Results & Findings

| Dataset              | mIoU (SpikeSurgSeg) | mIoU (best ANN) | Inference latency | Speed‑up vs. ANN |
|----------------------|---------------------|-----------------|-------------------|------------------|
| EndoVis18            | 71.2 %              | 73.0 %          | 12 ms (CPU)       | ≥ 8×             |
| SurgBleed (in‑house) | 68.5 %              | 70.1 %          | 14 ms (CPU)       | ≥ 8×             |

  • Accuracy: The spike‑driven model trails the top ANN baseline by only ~1–2 percentage points in mIoU, a negligible gap given the hardware savings.
  • Latency: On a typical edge CPU (e.g., Intel i5) the end‑to‑end pipeline runs under 15 ms per frame, satisfying real‑time (> 60 fps) requirements.
  • Power: Because spikes are binary and most neurons stay silent, the estimated energy consumption is an order of magnitude lower than dense ANN inference (a back‑of‑envelope sketch follows this list).
  • Robustness: Temporal consistency metrics (e.g., video IoU) improve by ~5 % compared to frame‑wise ANN baselines, thanks to the recurrent spike state.
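The order‑of‑magnitude energy claim can be sanity‑checked with the accumulate‑versus‑multiply‑accumulate argument common in the SNN literature. All constants and workload figures below (per‑operation energies, operation count, firing rate, number of time steps) are illustrative assumptions, not values reported in the paper.

```python
# Back-of-envelope comparison of dense MAC-based inference (ANN) vs. sparse
# accumulate-only inference (SNN). All numbers are illustrative assumptions.

E_MAC = 4.6e-12   # J per 32-bit multiply-accumulate (dense ANN), assumed
E_AC = 0.9e-12    # J per 32-bit accumulate (binary spike input), assumed

ops_per_frame = 20e9   # assumed synaptic operations per frame
firing_rate = 0.10     # assumed fraction of neurons that actually spike
timesteps = 4          # assumed spike time steps per frame

ann_energy = ops_per_frame * E_MAC
snn_energy = ops_per_frame * timesteps * firing_rate * E_AC

print(f"ANN ≈ {ann_energy * 1e3:.1f} mJ/frame, SNN ≈ {snn_energy * 1e3:.1f} mJ/frame "
      f"(≈ {ann_energy / snn_energy:.1f}× lower)")
```

Under these assumptions the accumulate‑only inference comes out roughly 13× cheaper, consistent with the order‑of‑magnitude claim; the actual ratio depends on the real firing rate and operation count.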

Practical Implications

  • Edge deployment in ORs – Surgeons can run high‑quality scene segmentation on compact, battery‑powered devices (e.g., a Jetson Nano or even a microcontroller with a neural accelerator) without needing a dedicated GPU.
  • Lower cost & easier integration – Hospitals can retrofit existing laparoscopic towers with inexpensive compute modules, accelerating adoption of AI‑assisted safety features (bleed detection, instrument tracking, anatomy labeling).
  • Energy‑aware robotics – Autonomous surgical robots that must operate for long procedures benefit from the reduced power draw, extending battery life and reducing thermal load.
  • Data‑efficient training – The masked autoencoding pre‑training mitigates the chronic shortage of annotated surgical video, allowing developers to bootstrap models from modestly sized datasets.
  • Open‑source potential – The spike‑driven transformer architecture can be ported to popular SNN frameworks (e.g., BindsNET, SpykeTorch), enabling the broader community to experiment with low‑latency video AI beyond surgery (e.g., industrial inspection, AR/VR).

Limitations & Future Work

  • Hardware specificity – While the authors benchmark on CPUs, actual deployment on dedicated neuromorphic chips (Loihi, TrueNorth) may require additional engineering to fully exploit spike parallelism.
  • Generalization to other procedures – The study focuses on laparoscopic bleeding and EndoVis tasks; performance on open‑surgery footage or other modalities (e.g., ultrasound) remains untested.
  • Spike quantization overhead – Converting conventional video streams to spikes introduces a preprocessing step that could become a bottleneck on ultra‑low‑power devices.
  • Future directions – The authors suggest exploring hybrid SNN‑ANN pipelines, extending the masked autoencoder to multimodal inputs (e.g., tool kinematics), and scaling the approach to full‑body surgical robotics scenarios.

Authors

  • Shihao Zou
  • Jingjing Li
  • Wei Ji
  • Jincai Huang
  • Kai Wang
  • Guo Dan
  • Weixin Si
  • Yi Pan

Paper Information

  • arXiv ID: 2512.21284v1
  • Categories: cs.CV
  • Published: December 24, 2025