[Paper] EmambaIR: Efficient Visual State Space Model for Event-guided Image Reconstruction

Published: May 8, 2026 at 01:56 PM EDT

Source: arXiv - 2605.08073v1

Overview

The paper presents EmambaIR, a visual State Space Model (SSM) that reconstructs high‑quality images from event‑camera streams while keeping compute and memory footprints low. Instead of heavyweight CNN/ViT pipelines, it combines top‑k sparse attention with a gated SSM. The authors report state‑of‑the‑art results on deblurring, deraining, and HDR enhancement, tasks that are directly relevant to real‑time vision systems.

Key Contributions

Top‑k Sparse Attention Module (TSAM) – Performs pixel‑level cross‑modal attention limited to the most relevant k locations, dramatically reducing the quadratic cost of traditional self‑attention.
Gated State‑Space Module (GSSM) – Enhances a linear‑complexity, O(n), SSM with a nonlinear gating mechanism, enabling global temporal context modeling without the O(n²) blow‑up of Vision Transformers.
Unified Architecture – Seamlessly integrates event data (asynchronous, sparse spikes) with conventional frame data, making the model applicable to multiple image‑reconstruction problems.
Efficiency Gains – Empirically cuts memory usage by up to 45 % and FLOPs by ~30 % compared with leading CNN/ViT baselines, while delivering higher PSNR/SSIM scores.
Open‑source Release – Full code, pretrained weights, and benchmark scripts are provided, facilitating rapid adoption and reproducibility.
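
To make the top‑k idea behind TSAM concrete, here is a minimal NumPy sketch, not the authors' implementation. It assumes frame features act as queries and event features as keys and values (the summary does not say which modality plays which role), and the name `topk_cross_attention` and its parameters are hypothetical. The sketch still forms the dense score matrix before selecting; how the paper avoids that cost is not covered here.

```python
import numpy as np

def topk_cross_attention(frame_q, event_k, event_v, k):
    """Cross-modal attention where each query keeps only its k best keys.

    frame_q: (n, d)   queries from frame features (assumed role)
    event_k: (m, d)   keys from event features (assumed role)
    event_v: (m, d_v) values from event features (assumed role)
    k:       number of keys retained per query, 1 <= k <= m
    """
    d = frame_q.shape[1]
    scores = frame_q @ event_k.T / np.sqrt(d)               # (n, m) similarities
    # Indices of the k largest scores per row (order inside the set is arbitrary).
    top_idx = np.argpartition(scores, -k, axis=1)[:, -k:]   # (n, k)
    top_scores = np.take_along_axis(scores, top_idx, axis=1)
    # Softmax over the retained scores only; all other keys get zero weight.
    top_scores = top_scores - top_scores.max(axis=1, keepdims=True)
    weights = np.exp(top_scores)
    weights /= weights.sum(axis=1, keepdims=True)
    # Gather the k selected value vectors per query and mix them.
    selected_v = event_v[top_idx]                            # (n, k, d_v)
    return np.einsum("nk,nkd->nd", weights, selected_v)
```

With k equal to the number of keys this reduces to ordinary dense softmax attention, and with k = 1 each query simply copies the value of its best‑matching key. After selection, the weighting and aggregation touch only n·k entries instead of n·m.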

Methodology

Input Representation – Event streams are first voxelized into a compact spatio‑temporal tensor; a conventional RGB frame (when available) is processed in parallel.
Cross‑modal TSAM – For each pixel, the module computes similarity scores between event‑derived features and frame features, then retains only the k highest scores. Only this sparse set enters the attention weighting, so for a fixed k the weighting and aggregation cost grows linearly with the number of pixels while still capturing the most informative cross‑modal cues.
Temporal Modeling with GSSM – The retained features feed into a linear‑complexity SSM that treats the sequence of event slices as a dynamical system. A learned gating function (similar to a GRU/GLU) injects nonlinearity, allowing the SSM to model long‑range dependencies and global context without resorting to full‑matrix multiplications.
Reconstruction Head – The fused representation is upsampled and passed through a lightweight decoder (a few convolutional layers) to produce the final restored image.
Training – The network is trained end‑to‑end with a combination of L1 loss, perceptual loss (VGG‑based), and a temporal consistency term that penalizes flicker across successive reconstructions.
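
The temporal modeling step above can be illustrated with a gated linear recurrence. This is a minimal sketch under stated assumptions, not the paper's GSSM: it uses a diagonal (per‑channel) state update and a sigmoid gate computed from the current input, and the names `gated_ssm_scan`, `a`, `b`, `c`, and `w_gate` are all hypothetical. The explicit loop is for readability only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_ssm_scan(x, a, b, c, w_gate):
    """Run a diagonal linear SSM over a sequence and gate its output.

    x:      (n, d) input sequence, e.g. n event slices with d channels
    a:      (d,)   per-channel state decay (hypothetical parameter)
    b:      (d,)   per-channel input scale (hypothetical parameter)
    c:      (d,)   per-channel output scale (hypothetical parameter)
    w_gate: (d, d) weights of the gate projection (hypothetical parameter)
    """
    n, d = x.shape
    h = np.zeros(d)                      # hidden state carried across steps
    y = np.empty((n, d))
    for t in range(n):                   # one pass over the sequence: O(n) steps
        h = a * h + b * x[t]             # linear state update
        gate = sigmoid(x[t] @ w_gate)    # input-dependent gate in (0, 1)
        y[t] = gate * (c * h)            # the gate adds the nonlinearity
    return y
```

Each step does a fixed amount of work, so the cost grows linearly with sequence length n, and the state h lets every output depend on all earlier inputs without an n×n interaction matrix.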

Results & Findings

Quantitative Gains – Across six public datasets, EmambaIR improves PSNR by 1.2–2.5 dB and SSIM by 0.02–0.04 over the previous best CNN/ViT methods.
Speed & Memory – On an NVIDIA GeForce GTX 1080 Ti GPU, inference runs at ~45 fps for 720p inputs, compared with ~30 fps for the closest ViT baseline. Peak GPU memory drops from ~6 GB to ~3.3 GB.
Task Generality – The same backbone handles motion deblurring, rain removal, and HDR enhancement without task‑specific redesign, demonstrating the flexibility of combining sparse attention with a gated SSM.
Ablation Insights – Replacing TSAM with dense attention inflates FLOPs by 2.8× with negligible accuracy change, while disabling the gating in GSSM reduces PSNR by ~0.8 dB. TSAM therefore accounts for the efficiency side of the trade‑off and the gating for the accuracy side.
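
As a quick sanity check on the speed and memory figures above, the short calculation below derives the relative memory saving, the throughput ratio, and the per‑frame time budget. The inputs are the approximate values quoted in this summary, so the outputs are approximate as well; the variable names are illustrative.

```python
# Approximate figures quoted in this summary.
mem_baseline_gb, mem_model_gb = 6.0, 3.3
fps_baseline, fps_model = 30.0, 45.0

mem_reduction = 1.0 - mem_model_gb / mem_baseline_gb   # fraction of peak memory saved
speedup = fps_model / fps_baseline                      # throughput ratio vs. baseline
ms_per_frame = 1000.0 / fps_model                       # time available per frame

print(f"memory reduction: {mem_reduction:.0%}")
print(f"speedup: {speedup:.2f}x")
print(f"per-frame time: {ms_per_frame:.1f} ms")
```

The memory saving works out to about 45 %, which matches the "up to 45 %" figure listed under Key Contributions.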

Practical Implications

Edge Devices & Robotics – The O(n) temporal core and top‑k attention make EmambaIR suitable for low‑power platforms (e.g., NVIDIA Jetson, ARM‑based drones) that need real‑time image enhancement from event cameras.
High‑Resolution Vision – Because the computational cost scales linearly with pixel count, 4K streams can be processed without the quadratic memory growth of full self‑attention in ViTs.
Plug‑and‑Play Module – TSAM and GSSM are released as independent PyTorch modules, allowing teams to drop them into existing pipelines (e.g., SLAM front‑ends, autonomous driving perception stacks) to boost image quality with minimal code changes.
Reduced Bandwidth for Remote Sensing – Event cameras generate sparse data; EmambaIR’s efficient fusion means less data needs to be transmitted for cloud‑based post‑processing, saving bandwidth in IoT deployments.
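
The high‑resolution argument above comes down to simple scaling arithmetic, sketched below. It assumes "4K" means UHD (3840×2160) and compares only asymptotic growth, ignoring constant factors, so it says nothing about whether 4K inference is real‑time on a given device.

```python
def pixel_count(width, height):
    return width * height

n_720p = pixel_count(1280, 720)
n_4k = pixel_count(3840, 2160)      # assumption: "4K" means UHD

ratio = n_4k / n_720p               # how many times more pixels 4K has
linear_growth = ratio               # cost growth for an O(n) method
quadratic_growth = ratio ** 2       # cost growth for an O(n^2) method

print(f"pixels: {n_720p} -> {n_4k} ({ratio:.0f}x)")
print(f"O(n) cost grows {linear_growth:.0f}x, O(n^2) cost grows {quadratic_growth:.0f}x")
```

Going from 720p to UHD multiplies the pixel count by 9, so a linear‑cost method needs roughly 9× the work while full pairwise attention over pixels would need roughly 81×.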

Limitations & Future Work

Event Pre‑processing Overhead – Voxelization still adds a modest latency; exploring learned event encodings could further streamline the pipeline.
Fixed Top‑k Hyper‑parameter – The current implementation uses a static k value; adaptive sparsity based on scene dynamics might improve robustness in highly cluttered environments.
Temporal Horizon – While GSSM captures long‑range dependencies, extremely long event sequences (>1 s) may still suffer from drift; integrating hierarchical SSMs or recurrent memory could address this.
Broader Modalities – Extending the framework to fuse LiDAR or radar streams alongside events is an open avenue the authors suggest for future multimodal perception research.
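
For context on the voxelization step named under Input Representation and in the first limitation above, here is a minimal sketch of one common way to bin events into a spatio‑temporal grid. It is an assumption for illustration, not the paper's encoding: events are taken to be (timestamp, x, y, polarity) tuples, each event is assigned to a single temporal bin, and the function name `voxelize_events` is hypothetical.

```python
import numpy as np

def voxelize_events(t, x, y, p, num_bins, height, width):
    """Accumulate event polarities into a (num_bins, height, width) grid.

    t: (n,) float timestamps
    x: (n,) integer column indices
    y: (n,) integer row indices
    p: (n,) polarities, +1 or -1
    """
    grid = np.zeros((num_bins, height, width), dtype=np.float32)
    if t.size == 0:
        return grid
    t0 = t.min()
    span = max(t.max() - t0, 1e-9)                   # avoid division by zero
    # Map each timestamp to a bin index in [0, num_bins - 1].
    bins = ((t - t0) / span * num_bins).astype(np.int64)
    bins = np.minimum(bins, num_bins - 1)            # the latest event goes in the last bin
    # Unbuffered add so repeated (bin, y, x) indices accumulate correctly.
    np.add.at(grid, (bins, y, x), p.astype(np.float32))
    return grid
```

The pass over the event list is linear in the number of events, which is consistent with the modest, non‑zero latency noted in the first limitation.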

Authors

Wei Yu
Yunhang Qian

Paper Information

arXiv ID: 2605.08073v1
Categories: cs.CV, cs.AI
Published: May 8, 2026
