[Paper] MuViT: Multi-Resolution Vision Transformers for Learning Across Scales in Microscopy
Source: arXiv - 2602.24222v1
Overview
Microscopy generates ever-larger images, often gigapixels in size, that capture biological structures at many different scales, from sub-cellular details to whole-tissue architecture. The paper "MuViT: Multi-Resolution Vision Transformers for Learning Across Scales in Microscopy" proposes a transformer-based model that reasons over these disparate resolutions simultaneously, delivering more accurate analysis than traditional single-scale Vision Transformers (ViTs) or convolutional networks.
Key Contributions
- True multi‑resolution attention: Introduces a transformer encoder that ingests patches taken at different magnifications and fuses them in a shared world‑coordinate system.
- Rotary positional embeddings for coordinates: Extends rotary embeddings to encode absolute spatial positions (in microns or pixels), allowing the model to understand where each patch belongs in the original slide.
- Scale‑consistent pre‑training (Multi‑resolution MAE): Adapts Masked Auto‑Encoder pre‑training to multi‑resolution data, producing representations that remain coherent across scales.
- Comprehensive evaluation: Demonstrates consistent gains on synthetic benchmarks, kidney histopathology classification, and high‑resolution mouse‑brain imaging, outperforming strong ViT and CNN baselines.
- Open‑source implementation: Provides code and pretrained weights, facilitating adoption in microscopy pipelines.
Methodology
- Patch extraction at multiple magnifications – From a gigapixel slide, the authors sample overlapping patches at, for example, 5×, 10×, and 20×. Each patch retains its world coordinates (its physical location on the slide).
- Shared embedding space – All patches are linearly projected into a common token space, regardless of resolution.
- Rotary world‑coordinate embeddings – Instead of the usual 2‑D sinusoidal or learned positional encodings, the model uses rotary embeddings that rotate token vectors according to their absolute (x, y) coordinates. This makes attention aware of real‑world distances, not just token indices.
- Unified transformer encoder – A standard ViT encoder processes the mixed‑resolution token set. Because the positional encoding reflects true geometry, the self‑attention layers can naturally combine a low‑resolution context token with a high‑resolution detail token.
- Multi‑resolution MAE pre‑training – During self‑supervised pre‑training, random patches are masked across all scales, and the model learns to reconstruct the missing pixels. This forces the encoder to learn representations that are consistent whether you look at a coarse or fine view.
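The rotary world-coordinate step above can be sketched as follows. This is a minimal NumPy illustration of a standard 2-D extension of rotary embeddings, where half the channels are rotated by the absolute x coordinate and half by y; the function name and frequency schedule are assumptions, not the authors' exact parameterization:

```python
import numpy as np

def rotary_2d(tokens, coords, base=10000.0):
    """Apply 2-D rotary position embedding to token vectors.

    tokens: (n, d) array with d divisible by 4.
    coords: (n, 2) absolute (x, y) positions, e.g. in microns.
    Half the channels are rotated by x, the other half by y, so
    attention dot products depend on real-world offsets rather
    than token indices.
    """
    n, d = tokens.shape
    assert d % 4 == 0
    half = d // 2
    # one frequency per rotation pair within each half
    freqs = base ** (-np.arange(0, half, 2) / half)  # (half/2,)
    out = np.empty_like(tokens, dtype=float)
    for axis in range(2):  # 0 -> x-driven channels, 1 -> y-driven channels
        seg = tokens[:, axis * half:(axis + 1) * half].astype(float)
        ang = coords[:, axis:axis + 1] * freqs       # (n, half/2)
        cos, sin = np.cos(ang), np.sin(ang)
        even, odd = seg[:, 0::2], seg[:, 1::2]
        rot = np.empty_like(seg)
        rot[:, 0::2] = even * cos - odd * sin
        rot[:, 1::2] = even * sin + odd * cos
        out[:, axis * half:(axis + 1) * half] = rot
    return out
```

The defining rotary property carries over: the dot product between two rotated tokens depends only on the difference of their (x, y) coordinates, so attention is translation-invariant in slide space while still sensitive to real physical distances.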
The overall pipeline is simple: extract multi‑scale patches → embed with world‑coordinate rotary encodings → feed to a ViT encoder → downstream head (classification, segmentation, etc.).
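The first stage of that pipeline, multi-magnification patch sampling in a shared coordinate frame, might look like the following minimal sketch. Nearest-neighbour subsampling stands in for real slide-pyramid levels, and all names and defaults here are hypothetical:

```python
import numpy as np

def extract_multiscale_patches(slide, patch_px=32, downsamples=(1, 2, 4), stride_px=32):
    """Sample patches at several downsample factors from a 2-D image.

    Returns a list of (patch, (x, y), scale) tuples where (x, y) is the
    patch centre in full-resolution pixel coordinates, so patches from
    every magnification live in one shared coordinate frame.
    """
    patches = []
    for ds in downsamples:
        # nearest-neighbour downsample as a stand-in for pyramid levels
        level = slide[::ds, ::ds]
        h, w = level.shape[:2]
        for i in range(0, h - patch_px + 1, stride_px):
            for j in range(0, w - patch_px + 1, stride_px):
                patch = level[i:i + patch_px, j:j + patch_px]
                # map the patch centre back to full-resolution coordinates
                cy = (i + patch_px / 2) * ds
                cx = (j + patch_px / 2) * ds
                patches.append((patch, (cx, cy), ds))
    return patches
```

Each tuple's coordinate pair is exactly what the rotary world-coordinate embedding consumes, so coarse and fine patches covering the same region end up near each other in position space.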
Results & Findings
| Dataset | Task | Baseline (ViT‑B/16) | MuViT (Ours) | Absolute Gain |
|---|---|---|---|---|
| Synthetic multi‑scale benchmark | Multi‑scale classification | 78.3 % | 84.7 % | +6.4 pts |
| Kidney histopathology (TCGA) | Tumor vs. normal | 91.2 % | 94.5 % | +3.3 pts |
| Mouse brain (Allen Institute) | Cell‑type segmentation | 0.71 IoU | 0.78 IoU | +0.07 IoU |
Key observations
- Attention learns cross‑scale relationships – Visualizing attention maps shows low‑resolution tokens providing global context while high‑resolution tokens focus on fine structures.
- Pre‑training matters – Multi‑resolution MAE yields a ~2 % boost over training from scratch, confirming that scale‑consistent representations are beneficial.
- Efficiency – Because the model processes a modest number of tokens (e.g., 256 patches total) rather than the full gigapixel image, inference remains tractable on a single GPU.
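To make the efficiency point concrete, here is a back-of-envelope token count; the slide size and token budget are illustrative numbers, not figures from the paper:

```python
# Back-of-envelope token budget (illustrative numbers, not from the paper).
gigapixel_side = 100_000           # a 100k x 100k pixel slide, ~10 gigapixels
patch_side = 16                    # ViT-B/16 patch size
full_tokens = (gigapixel_side // patch_side) ** 2
multiscale_tokens = 256            # a fixed multi-resolution sample

print(full_tokens)                 # tens of millions of tokens for full coverage
print(multiscale_tokens)           # a modest fixed budget for the sampled view
# Self-attention cost scales as O(n^2), so even the linear token ratio
# (roughly 150,000x here) understates the compute saved.
print(full_tokens // multiscale_tokens)
```

The sampled multi-scale view trades exhaustive coverage for a token set small enough that a single GPU can run inference, which is the practical point the authors emphasize.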
Practical Implications
- Accelerated pathology workflows – Labs can feed whole‑slide images into a single model instead of stitching together separate low‑ and high‑magnification analyses, reducing engineering overhead.
- Better ROI selection – By jointly considering context and detail, MuViT can more reliably flag regions of interest for downstream manual review or targeted high‑resolution scanning.
- Transferable pre‑trained models – The released multi‑resolution MAE weights can serve as a foundation for a variety of microscopy tasks (cell counting, phenotype classification, spatial transcriptomics alignment).
- Scalable to other domains – Any field with multi‑scale imagery—satellite remote sensing, autonomous driving (wide‑angle + zoom lenses), or industrial inspection—can adopt the world‑coordinate rotary embedding trick with minimal changes.
Limitations & Future Work
- Patch selection strategy – The current approach samples patches uniformly; adaptive sampling (e.g., focusing on tissue boundaries) could further reduce token count.
- Memory scaling with many resolutions – Adding more magnifications increases the token count linearly; hierarchical or sparse attention mechanisms may be needed for extreme scale-ups.
- Domain shift – While the authors test on several microscopy modalities, performance on completely different staining protocols or imaging modalities (e.g., electron microscopy) remains to be validated.
- Explainability – Although attention visualizations are informative, rigorous interpretability tools for multi‑resolution transformers are still an open research area.
The authors suggest exploring learned coordinate systems (instead of fixed world coordinates) and integrating downstream segmentation heads directly into the transformer for end‑to‑end training.
Authors
- Albert Dominguez Mantes
- Gioele La Manno
- Martin Weigert
Paper Information
- arXiv ID: 2602.24222v1
- Categories: cs.CV, cs.LG
- Published: February 27, 2026