[Paper] Geometry-Aware State Space Model: A New Paradigm for Whole-Slide Image Representation
Source: arXiv - 2605.05164v1
Overview
The paper proposes a Geometry‑Aware State Space Model (BatMIL), a new way to represent whole‑slide histopathology images (WSIs). By embedding patch features in both Euclidean and hyperbolic spaces and processing them with a linear‑time state‑space sequence model, the authors report more accurate slide‑level predictions while keeping computation tractable for gigapixel data.
Key Contributions
- Dual‑geometry embedding: Introduces a hybrid Euclidean‑hyperbolic representation that captures both local cellular details (Euclidean) and hierarchical tissue organization (hyperbolic).
- Linear‑complexity sequence encoder: Leverages the Structured State Space (S4) model to encode thousands of patch embeddings with O(N) time and memory, where N is the number of patches.
- Chunk‑level Mixture‑of‑Experts (MoE): Dynamically groups patches into regional “chunks” and routes each chunk to specialized expert subnetworks, improving expressiveness and reducing redundant computation.
- Comprehensive evaluation: Benchmarks BatMIL on seven WSI datasets covering six cancer types, consistently beating state‑of‑the‑art Multiple Instance Learning (MIL) baselines.
- Open‑source implementation: Provides code and pretrained models, facilitating reproducibility and downstream integration.
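To illustrate the linear‑complexity claim above, here is a minimal sketch of a state‑space recurrence scanned over a sequence. It is not the authors' encoder: S4 uses structured state matrices and is usually trained through an equivalent convolution, whereas this sketch assumes a plain diagonal state matrix with zero‑order‑hold discretization. The function name `ssm_scan` and the parameters `a`, `b`, `c`, `dt` are hypothetical and chosen only for illustration.

```python
import math

def ssm_scan(u, a, b, c, dt=1.0):
    """Run a diagonal linear state-space recurrence over a scalar input sequence.

    x_k = a_bar * x_{k-1} + b_bar * u_k,    y_k = sum_i c_i * x_k[i]
    Assumption: diagonal state matrix with nonzero (typically negative) entries `a`.
    """
    # Zero-order-hold discretization of each diagonal mode.
    a_bar = [math.exp(dt * ai) for ai in a]
    b_bar = [(ab - 1.0) / ai * bi for ab, ai, bi in zip(a_bar, a, b)]

    state = [0.0] * len(a)
    outputs = []
    for u_k in u:
        # One fixed-cost state update per element, so total work grows as O(N).
        state = [ab * s + bb * u_k for ab, bb, s in zip(a_bar, b_bar, state)]
        outputs.append(sum(ci * s for ci, s in zip(c, state)))
    return outputs

# Impulse response of a single decaying mode.
y = ssm_scan([1.0, 0.0, 0.0], a=[-1.0], b=[1.0], c=[1.0])
```

The running state has a fixed size regardless of sequence length, which is why the cost per patch stays constant as N grows.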
Methodology
- Patch extraction & initial embedding – The WSI is tiled into thousands of non‑overlapping patches; each patch is passed through a standard CNN backbone (e.g., ResNet‑50) to obtain a feature vector.
- Dual‑space projection – The same vector is projected into two spaces:
  - Euclidean space for fine‑grained morphology, using a linear layer.
  - Hyperbolic space (Poincaré ball) for hierarchical relationships, using a Möbius linear map.
- Sequence modeling with S4 – The ordered list of dual‑space embeddings is fed to an S4 layer, a state‑space model that captures long‑range dependencies at a cost linear in sequence length, whereas Transformer self‑attention scales quadratically.
- Chunk‑level MoE routing – The sequence is split into contiguous “chunks” (e.g., 64 patches). A lightweight gating network predicts a distribution over a set of expert subnetworks; each chunk is processed by its most relevant expert, allowing region‑specific feature refinement.
- Slide‑level aggregation & classification – The expert‑refined outputs are pooled (attention‑weighted) to produce a slide‑level representation, which is finally classified with a fully‑connected head.
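The dual‑space projection step in the list above can be sketched as follows. This is an illustrative sketch, not the paper's code: it assumes the CNN feature is first placed on the Poincaré ball with the exponential map at the origin and then transformed with the standard Möbius matrix‑vector product from the hyperbolic neural network literature. The curvature parameter `c`, the weight matrices, and all function names are hypothetical.

```python
import math

def linear(weight, x):
    """Plain matrix-vector product (the Euclidean branch)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]

def norm(v):
    return math.sqrt(sum(vi * vi for vi in v))

def expmap0(v, c=1.0, eps=1e-12):
    """Map a Euclidean vector onto the Poincare ball of curvature -c (exp map at the origin)."""
    n = max(norm(v), eps)
    scale = math.tanh(math.sqrt(c) * n) / (math.sqrt(c) * n)
    return [scale * vi for vi in v]

def mobius_matvec(weight, x, c=1.0, eps=1e-12):
    """Mobius matrix-vector product: a hyperbolic analogue of a linear layer."""
    mx = linear(weight, x)
    x_norm, mx_norm = max(norm(x), eps), norm(mx)
    if mx_norm < eps:
        return [0.0] * len(mx)
    sqrt_c = math.sqrt(c)
    # Clamp so atanh stays inside its domain for points very close to the boundary.
    arg = min(sqrt_c * x_norm, 1.0 - 1e-7)
    scale = math.tanh(mx_norm / x_norm * math.atanh(arg)) / (sqrt_c * mx_norm)
    return [scale * v for v in mx]

def dual_projection(feature, w_euc, w_hyp, c=1.0):
    """Return the (Euclidean, hyperbolic) embeddings of one patch feature."""
    euclidean = linear(w_euc, feature)
    hyperbolic = mobius_matvec(w_hyp, expmap0(feature, c), c)
    return euclidean, hyperbolic

identity = [[1.0, 0.0], [0.0, 1.0]]
euc, hyp = dual_projection([3.0, 4.0], identity, identity)
```

With `c = 1.0` the hyperbolic output always has norm below 1, so it stays inside the unit ball, while the Euclidean output is unconstrained.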
The whole pipeline is end‑to‑end differentiable, enabling joint learning of the dual embeddings, the S4 encoder, and the MoE routing.
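The chunk‑level routing and attention pooling steps can be sketched in a few lines. This is a simplified illustration under stated assumptions: the gate scores each chunk by a dot product with the chunk's mean vector, routing is top‑1, the expert output is scaled by the gate probability (a common MoE convention), and pooling uses a single query vector. None of these details are confirmed by the summary, and all names are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean_vector(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def route_chunks(sequence, gate_weights, experts, chunk_size=64):
    """Split the sequence into contiguous chunks and send each to its top-1 expert."""
    refined, choices = [], []
    for start in range(0, len(sequence), chunk_size):
        chunk = sequence[start:start + chunk_size]
        summary = mean_vector(chunk)  # chunk descriptor fed to the gate
        probs = softmax([dot(w, summary) for w in gate_weights])
        best = max(range(len(experts)), key=probs.__getitem__)
        choices.append(best)
        # Apply the chosen expert to every patch, scaled by its gate probability.
        refined.extend([[probs[best] * v for v in experts[best](x)] for x in chunk])
    return refined, choices

def attention_pool(sequence, query):
    """Attention-weighted average of the sequence, giving one slide-level vector."""
    weights = softmax([dot(query, x) for x in sequence])
    return [sum(w * x[d] for w, x in zip(weights, sequence))
            for d in range(len(sequence[0]))]

patches = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
gate = [[1.0, 0.0], [0.0, 1.0]]  # one gate row per expert
experts = [lambda x: [2.0 * v for v in x], lambda x: [-v for v in x]]
refined, choices = route_chunks(patches, gate, experts, chunk_size=2)
slide_vector = attention_pool(refined, query=[0.0, 0.0])
```

In this toy input the first chunk is routed to expert 0 and the second to expert 1; a real model would replace the lambdas with learned subnetworks and learn the gate and query jointly.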
Results & Findings
| Dataset (Cancer) | Baseline MIL accuracy (e.g., CLAM) | BatMIL accuracy | Absolute gain (percentage points) |
|---|---|---|---|
| Camelyon16 (Breast) | 84.2 % | 89.7 % | +5.5 |
| TCGA‑LUAD (Lung) | 78.1 % | 83.4 % | +5.3 |
| TCGA‑COAD (Colon) | 81.5 % | 86.9 % | +5.4 |
| … (4 more) | — | — | — |
- Speed: Processing a 100 k‑patch slide takes ~0.9 s on a single RTX 3090, ~2× faster than a Transformer‑based MIL model with comparable accuracy.
- Ablation: Removing the hyperbolic branch drops accuracy by ~3 %; swapping S4 for a vanilla LSTM reduces performance by ~2 % and increases runtime by ~1.8×.
- Interpretability: Attention maps derived from the hyperbolic embeddings highlight macro‑architectural regions (e.g., tumor nests), while Euclidean attention focuses on cellular details, offering a richer visual explanation.
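The gain column in the table above matches the plain difference between the two accuracy columns, that is, a gain in percentage points rather than a ratio. The short check below recomputes both quantities from the three listed rows; the dictionary and function names are illustrative only.

```python
# Accuracies (baseline, BatMIL) in percent, copied from the three listed table rows.
rows = {
    "Camelyon16": (84.2, 89.7),
    "TCGA-LUAD": (78.1, 83.4),
    "TCGA-COAD": (81.5, 86.9),
}

def gains(baseline, model):
    absolute = model - baseline              # difference in percentage points
    relative = 100.0 * absolute / baseline   # improvement as a percent of the baseline
    return round(absolute, 1), round(relative, 1)

summary = {name: gains(b, m) for name, (b, m) in rows.items()}
for name, (abs_pp, rel_pct) in summary.items():
    print(f"{name}: +{abs_pp} points, +{rel_pct} % relative")
```

The relative improvements come out slightly larger (about 6.5 to 6.8 %) than the absolute gains of 5.3 to 5.5 points.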
Practical Implications
- Scalable pathology pipelines: Developers can integrate BatMIL into digital pathology platforms to obtain slide‑level diagnoses without prohibitive GPU memory footprints.
- Better triage for pathologists: Higher‑accuracy predictions and region‑level attention maps can prioritize slides that need expert review, reducing workload.
- Transferable to other gigapixel domains: The dual‑geometry + S4 + MoE recipe is applicable to satellite imagery, large‑scale document analysis, or any task that requires aggregating millions of local descriptors.
- Modest‑hardware deployment: Linear‑time S4 inference makes it feasible to run on modest GPUs or even high‑end CPU servers, which suits both cloud‑based and on‑premise pathology services.
Limitations & Future Work
- Hyperbolic curvature tuning: The current implementation uses a fixed curvature; learning curvature per dataset could further improve hierarchical modeling.
- Chunk granularity sensitivity: Performance varies with chunk size; an adaptive chunking strategy based on tissue heterogeneity is left for future exploration.
- Limited modality testing: Experiments focus on H&E‑stained slides; extending to multiplexed immunofluorescence or radiology‑pathology multimodal data remains an open direction.
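For context on the curvature point above, the standard Poincaré ball formulas show where a curvature parameter enters. These are the textbook definitions from the hyperbolic neural network literature, written with an assumed curvature of -c for c > 0; the summary does not state which parameterization the paper uses.

```latex
% Poincare ball of curvature -c:  B_c^d = { x in R^d : c ||x||^2 < 1 }
\begin{align}
x \oplus_c y &= \frac{\left(1 + 2c\langle x, y\rangle + c\|y\|^2\right)x + \left(1 - c\|x\|^2\right)y}
                     {1 + 2c\langle x, y\rangle + c^2\|x\|^2\|y\|^2} \\
d_c(x, y) &= \frac{2}{\sqrt{c}}\,\operatorname{artanh}\!\left(\sqrt{c}\,\bigl\|(-x) \oplus_c y\bigr\|\right)
\end{align}
```

As c approaches 0 the distance tends to 2‖x − y‖, so a learnable c would let a model interpolate between nearly flat and strongly hierarchical geometry.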
Overall, BatMIL demonstrates that geometry‑aware representations combined with efficient sequence modeling can push computational pathology toward more accurate, interpretable, and scalable solutions.
Authors
- Enhui Chai
- Sicheng Chen
- Tianyi Zhang
- Chad Wong
- Kecheng Huang
- Zeyu Liu
- Fei Xia
Paper Information
- arXiv ID: 2605.05164v1
- Categories: cs.CV, cs.AI
- Published: May 6, 2026