[Paper] EmambaIR: Efficient Visual State Space Model for Event-guided Image Reconstruction

Published: May 8, 2026 at 01:56 PM EDT
Source: arXiv - 2605.08073v1

Overview

The paper presents EmambaIR, a visual State Space Model (SSM) that reconstructs high‑quality images from event‑camera streams while keeping compute and memory footprints low. In place of heavyweight CNN/ViT pipelines, it combines top‑k sparse attention with a gated SSM, and the authors report state‑of‑the‑art results on deblurring, deraining, and HDR enhancement, three tasks that matter for real‑time vision systems.

Key Contributions

  • Top‑k Sparse Attention Module (TSAM) – Performs pixel‑level cross‑modal attention restricted to the k most relevant locations, avoiding much of the quadratic cost of dense attention.
  • Gated State‑Space Module (GSSM) – Enhances a linear‑complexity O(n) SSM with a nonlinear gating mechanism, enabling global temporal context modeling without the O(n²) blow‑up of Vision Transformers.
  • Unified Architecture – Fuses event data (asynchronous, sparse spikes) with conventional frame data in a single model, so the same design applies to multiple image‑reconstruction problems.
  • Efficiency Gains – Empirically cuts memory usage by up to 45 % and FLOPs by ~30 % compared with leading CNN/ViT baselines, while delivering higher PSNR/SSIM scores.
  • Open‑source Release – Full code, pretrained weights, and benchmark scripts are provided, facilitating rapid adoption and reproducibility.
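
To make the TSAM idea concrete, here is a minimal pure‑Python sketch of top‑k cross‑attention. It is an illustration under stated assumptions, not the authors' implementation: the function name `topk_cross_attention`, the choice of frame features as queries and event features as keys/values, and the scaled dot‑product similarity are all hypothetical. Note that this naive version still scores every query‑key pair before selecting the top k; only the softmax and the weighted sum are restricted to k entries.

```python
import heapq
import math


def dot(u, v):
    """Dot product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(u, v))


def topk_cross_attention(queries, keys, values, k):
    """For each query, attend only to the k keys with the highest similarity.

    Assumption: queries are frame features, keys/values are event features.
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs = []
    for q in queries:
        # Naive scoring of all pairs (still O(n * m) in this sketch).
        scores = [dot(q, key) * scale for key in keys]
        # Indices of the k highest-scoring keys.
        top = heapq.nlargest(k, range(len(keys)), key=scores.__getitem__)
        # Numerically stable softmax over the retained scores only.
        peak = max(scores[i] for i in top)
        weights = [math.exp(scores[i] - peak) for i in top]
        total = sum(weights)
        # Weighted sum of the k selected value vectors.
        out = [
            sum(w * values[i][d] for w, i in zip(weights, top)) / total
            for d in range(dim)
        ]
        outputs.append(out)
    return outputs


queries = [[1.0, 0.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0], [20.0], [30.0]]
print(topk_cross_attention(queries, keys, values, k=1))
print(topk_cross_attention(queries, keys, values, k=2))
```

With k=1 the query simply copies the value of its best‑matching key; with k=2 the result is a softmax blend of the two best matches, and the worst match never contributes.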

Methodology

  1. Input Representation – Event streams are first voxelized into a compact spatio‑temporal tensor; a conventional RGB frame (when available) is processed in parallel.
  2. Cross‑modal TSAM – For each pixel, the module computes similarity scores between event‑derived features and frame features, then retains only the k highest scores. This sparse set drives the attention weighting, so for a fixed k the weighting and aggregation cost grows linearly with the number of pixels while still capturing the most informative cross‑modal cues.
  3. Temporal Modeling with GSSM – The retained features feed into a linear‑complexity SSM that treats the sequence of event slices as a dynamical system. A learned gating function (similar to a GRU/GLU) injects nonlinearity, allowing the SSM to model long‑range dependencies and global context without resorting to full‑matrix multiplications.
  4. Reconstruction Head – The fused representation is upsampled and passed through a lightweight decoder (a few convolutional layers) to produce the final restored image.
  5. Training – The network is trained end‑to‑end with a combination of L1 loss, perceptual loss (VGG‑based), and a temporal consistency term that penalizes flicker across successive reconstructions.
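
Step 3 can be pictured as a gated linear recurrence over the sequence of event slices. The sketch below is a scalar toy model, assuming a sigmoid output gate in the GLU style; the function name `gated_ssm_scan` and the parameters `a`, `b`, `c`, and `w_gate` are hypothetical and fixed by hand here, whereas a real layer would learn them per channel. It is not the paper's GSSM, only an instance of the general technique.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def gated_ssm_scan(xs, a=0.9, b=0.1, c=1.0, w_gate=1.0):
    """Scalar state-space recurrence with a multiplicative output gate.

    State update (linear):   h_t = a * h_{t-1} + b * x_t
    Gated output (nonlinear): y_t = sigmoid(w_gate * x_t) * c * h_t
    One pass over the sequence, so cost is O(n) in its length.
    """
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x                        # linear state update
        ys.append(sigmoid(w_gate * x) * c * h)   # gate modulates the readout
    return ys


# An impulse at t=0 keeps influencing later outputs through the state h.
print(gated_ssm_scan([1.0, 0.0, 0.0, 0.0]))
```

The state `h` carries information forward with a single multiply‑add per step, which is where the linear cost comes from; the gate is the only nonlinearity in the sketch.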

Results & Findings

  • Quantitative Gains – Across six public datasets, EmambaIR improves PSNR by 1.2–2.5 dB and SSIM by 0.02–0.04 over the previous best CNN/ViT methods.
  • Speed & Memory – On a 1080 Ti GPU, inference runs at ~45 fps for 720p inputs, compared to ~30 fps for the closest ViT baseline. Peak GPU memory drops from ~6 GB to ~3.3 GB.
  • Task Generality – The same backbone handles motion deblurring, rain removal, and HDR enhancement without task‑specific redesign, demonstrating the flexibility of combining sparse attention with a gated SSM.
  • Ablation Insights – Replacing TSAM with dense attention inflates FLOPs by 2.8× with negligible accuracy change, while disabling the gating in GSSM reduces PSNR by ~0.8 dB. This indicates that TSAM drives the efficiency gain and the gate drives accuracy, so both are needed for the efficiency‑accuracy trade‑off.
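
To put the reported numbers in perspective, the short calculation below converts a PSNR gain into the equivalent reduction in mean squared error, using the standard definition PSNR = 10·log10(MAX² / MSE), and recomputes the memory saving and speed‑up from the figures quoted above. The helper names are illustrative, and nothing here is measured; it only rearranges the stated values.

```python
import math


def psnr(mse: float, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(max_val ** 2 / mse)


def mse_ratio_for_gain(gain_db: float) -> float:
    """Factor by which MSE is multiplied when PSNR rises by gain_db."""
    return 10.0 ** (-gain_db / 10.0)


for gain in (1.2, 2.5):
    print(f"+{gain} dB PSNR -> MSE multiplied by {mse_ratio_for_gain(gain):.3f}")

# Figures quoted in the summary: ~6 GB -> ~3.3 GB, ~30 fps -> ~45 fps.
mem_saving = 1 - 3.3 / 6.0
speedup = 45 / 30
print(f"memory saving: {mem_saving:.0%}, speed-up: {speedup:.1f}x")
```

A gain of 1.2 to 2.5 dB corresponds to roughly 24% to 44% lower MSE, and the quoted memory figures match the "up to 45 %" saving listed under Key Contributions.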

Practical Implications

  • Edge Devices & Robotics – The O(n) temporal core and top‑k attention make EmambaIR suitable for low‑power platforms (e.g., NVIDIA Jetson, ARM‑based drones) that need real‑time image enhancement from event cameras.
  • High‑Resolution Vision – Because the computational cost is reported to scale linearly with pixel count, 4K streams become more tractable than with ViTs, whose dense attention grows quadratically in memory.
  • Plug‑and‑Play Module – TSAM and GSSM are released as independent PyTorch modules, allowing teams to drop them into existing pipelines (e.g., SLAM front‑ends, autonomous driving perception stacks) to boost image quality with minimal code changes.
  • Reduced Bandwidth for Remote Sensing – Event cameras generate sparse data; EmambaIR’s efficient fusion means less data needs to be transmitted for cloud‑based post‑processing, saving bandwidth in IoT deployments.
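
The high‑resolution argument is easy to check with a back‑of‑the‑envelope count of attention weights. The sketch below compares dense attention (one weight per pixel pair) with top‑k attention (k weights per pixel). The value k=16 is a hypothetical choice for illustration, since the summary does not state the k used in the paper, and the count covers retained weights only, not the cost of selecting them.

```python
def attention_weights(height: int, width: int, k=None) -> int:
    """Number of attention weights kept: n*n if dense, n*k if top-k."""
    n = height * width
    return n * n if k is None else n * k


resolutions = {"720p": (720, 1280), "4K": (2160, 3840)}
for name, (h, w) in resolutions.items():
    dense = attention_weights(h, w)
    sparse = attention_weights(h, w, k=16)  # k=16 is an assumed value
    print(f"{name}: dense {dense:.3e}, top-k {sparse:.3e}, ratio {dense // sparse}")
```

Going from 720p to 4K multiplies the pixel count by 9, so the top‑k count also grows by 9 while the dense count grows by 81.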

Limitations & Future Work

  • Event Pre‑processing Overhead – Voxelization still adds a modest latency; exploring learned event encodings could further streamline the pipeline.
  • Fixed Top‑k Hyper‑parameter – The current implementation uses a static k value; adaptive sparsity based on scene dynamics might improve robustness in highly cluttered environments.
  • Temporal Horizon – While GSSM captures long‑range dependencies, extremely long event sequences (>1 s) may still suffer from drift; integrating hierarchical SSMs or recurrent memory could address this.
  • Broader Modalities – Extending the framework to fuse LiDAR or radar streams alongside events is an open avenue the authors suggest for future multimodal perception research.
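
For readers unfamiliar with the voxelization step named in the first limitation, the following is a minimal sketch of one common way to bin events into a spatio‑temporal grid. It assumes events arrive as `(t, x, y, polarity)` tuples with polarity in {-1, +1} and uses simple nearest‑bin accumulation; the function name `voxelize_events` and this event layout are assumptions, and the paper's actual encoding may differ (for example, by interpolating between time bins).

```python
def voxelize_events(events, num_bins, height, width):
    """Accumulate event polarities into a grid indexed as grid[bin][y][x]."""
    events = list(events)
    grid = [[[0.0] * width for _ in range(height)] for _ in range(num_bins)]
    if not events:
        return grid
    t_start = min(e[0] for e in events)
    t_end = max(e[0] for e in events)
    span = (t_end - t_start) or 1.0  # avoid division by zero for a single timestamp
    for t, x, y, polarity in events:
        # Map the timestamp to a bin; the last timestamp falls in the final bin.
        b = min(int((t - t_start) / span * num_bins), num_bins - 1)
        grid[b][y][x] += polarity
    return grid


events = [(0.0, 0, 0, 1), (0.5, 1, 0, -1), (1.0, 1, 1, 1)]
grid = voxelize_events(events, num_bins=2, height=2, width=2)
print(grid)
```

Every event is visited once, so the cost is linear in the number of events; the latency mentioned above comes from having to wait for and scan a full time window before reconstruction can start.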

Authors

  • Wei Yu
  • Yunhang Qian

Paper Information

  • arXiv ID: 2605.08073v1
  • Categories: cs.CV, cs.AI
  • Published: May 8, 2026