[Paper] A Multicenter Benchmark of Multiple Instance Learning Models for Lymphoma Subtyping from HE-stained Whole Slide Images
Source: arXiv - 2512.14640v1
Overview
This paper introduces the first multicenter benchmark for lymphoma subtyping directly from routine H&E‑stained whole‑slide images (WSIs). By evaluating several state‑of‑the‑art pathology foundation models and multiple‑instance learning (MIL) aggregators across different image magnifications, the authors expose both the promise and the current generalization limits of deep‑learning‑driven diagnostics in a real‑world, multi‑institutional setting.
Key Contributions
- New multicenter dataset covering four common lymphoma subtypes plus healthy tissue, collected from several pathology labs.
- Systematic evaluation of five publicly available pathology foundation models (H‑optimus‑1, H0‑mini, Virchow2, UNI2, Titan) combined with two MIL aggregators (attention‑based AB‑MIL and transformer‑based TransMIL).
- Magnification study comparing 10×, 20×, and 40× WSIs, showing that 40× yields the best accuracy, with only modest gains over 20×.
- Open benchmarking pipeline (code, data splits, evaluation scripts) released to enable reproducible future research.
- Insight into generalization: in‑distribution balanced accuracy > 80 % but out‑of‑distribution drops to ~60 %, highlighting the need for broader data diversity.
Methodology
- Data preparation – Whole‑slide images from multiple centers were digitized at three standard magnifications (10×, 20×, 40×). Each slide was tiled into non‑overlapping patches (≈224 × 224 px) and labeled at the slide level with one of five classes (four lymphoma subtypes + healthy tissue).
- Feature extraction – Pre‑trained pathology foundation models (the five listed above) were used as frozen encoders to convert each patch into a compact feature vector. This avoids costly end‑to‑end training and mirrors typical “transfer‑learning” workflows in medical imaging.
- Multiple‑Instance Learning – Since only slide‑level labels are available, MIL aggregates patch features into a slide‑level prediction. Two aggregators were tested:
- AB‑MIL – an attention‑based pooling layer that learns to weight the most informative patches (a minimal code sketch appears after this list).
- TransMIL – a transformer‑style encoder that captures interactions among patches before pooling.
- Training & Evaluation – Models were trained on a stratified in‑distribution (ID) split, validated, and finally tested on both ID and an out‑of‑distribution (OOD) hold‑out set from unseen centers. Balanced accuracy (average per‑class recall) was the primary metric.
- Benchmark pipeline – All steps (tiling, feature extraction, MIL training, evaluation) are scripted in a reproducible Docker‑based workflow, enabling other researchers to plug in new encoders or aggregators with minimal effort.
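To make the aggregation step concrete, the following is a minimal sketch of gated attention pooling in the spirit of AB‑MIL (Ilse et al., 2018), assuming patch features have already been extracted by a frozen encoder. The feature dimension (1536), hidden size, and variable names are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class ABMILHead(nn.Module):
    """Minimal gated-attention MIL head: patch features -> slide-level logits."""

    def __init__(self, feat_dim: int = 1536, hidden_dim: int = 256, n_classes: int = 5):
        super().__init__()
        # Gated attention: tanh and sigmoid branches produce one score per patch.
        self.attn_v = nn.Sequential(nn.Linear(feat_dim, hidden_dim), nn.Tanh())
        self.attn_u = nn.Sequential(nn.Linear(feat_dim, hidden_dim), nn.Sigmoid())
        self.attn_w = nn.Linear(hidden_dim, 1)
        self.classifier = nn.Linear(feat_dim, n_classes)

    def forward(self, patch_feats: torch.Tensor):
        # patch_feats: (n_patches, feat_dim) for a single slide (the "bag").
        scores = self.attn_w(self.attn_v(patch_feats) * self.attn_u(patch_feats))  # (n_patches, 1)
        weights = torch.softmax(scores, dim=0)            # attention over patches
        slide_feat = (weights * patch_feats).sum(dim=0)   # weighted pooling -> (feat_dim,)
        return self.classifier(slide_feat), weights       # slide logits + patch attention

# Toy usage: random features standing in for frozen-encoder embeddings of one slide.
feats = torch.randn(12_000, 1536)      # e.g., ~12k tiles, 1536-d embeddings (assumed)
head = ABMILHead(feat_dim=1536, n_classes=5)
logits, attn = head(feats)
print(logits.shape, attn.shape)        # torch.Size([5]) torch.Size([12000, 1])
```

Training such a head amounts to minimizing cross‑entropy between the slide‑level logits and the slide label, which is what makes MIL compatible with slide‑level annotations only.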
Results & Findings
| Magnification | Aggregator | Balanced Accuracy (ID) | Balanced Accuracy (OOD) |
|---|---|---|---|
| 10× | AB‑MIL / TransMIL | 81 % – 84 % | 58 % – 62 % |
| 20× | AB‑MIL / TransMIL | 82 % – 85 % | 59 % – 63 % |
| 40× | AB‑MIL / TransMIL | 84 % – 87 % | 60 % – 64 % |
- Foundation models performed similarly; no single encoder dominated across magnifications.
- AB‑MIL vs. TransMIL: performance differences were marginal (under 2 percentage points); both are viable choices.
- Magnification effect: 40× gave the best ID and OOD scores of the three magnifications tested, though the gain over 20× was small (roughly 1–2 percentage points).
- Generalization gap: OOD balanced accuracy consistently lagged roughly 20 percentage points behind ID, indicating that models overfit to site‑specific staining, scanner, or preprocessing quirks.
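As a brief aside on the primary metric: balanced accuracy is the unweighted mean of per‑class recalls, so a dominant class cannot mask failures on rarer subtypes. A minimal scikit‑learn illustration with invented labels:

```python
from sklearn.metrics import balanced_accuracy_score, recall_score

# Hypothetical slide-level labels for the five classes (0-4); values are made up.
y_true = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
y_pred = [0, 1, 1, 1, 2, 0, 3, 3, 4, 2]

# Balanced accuracy equals the macro-averaged recall.
print(balanced_accuracy_score(y_true, y_pred))        # 0.7
print(recall_score(y_true, y_pred, average="macro"))  # 0.7 (same quantity)
```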
Practical Implications
- Rapid triage tool: A plug‑and‑play MIL pipeline can be integrated into digital pathology workflows to flag suspicious slides for expert review, potentially shaving days off the diagnostic timeline.
- Hardware budgeting: Since 40× scans are sufficient, labs can avoid the storage and compute overhead of ultra‑high‑resolution WSIs.
- Model selection flexibility: Developers can choose any of the five publicly available encoders (or their own) without fearing major performance loss, simplifying deployment pipelines.
- Cross‑institutional collaborations: The benchmark highlights the need for shared, diverse data; vendors of whole‑slide scanners and pathology informatics platforms can use the pipeline to validate their products across sites.
- Regulatory pathways: Balanced accuracy > 80 % on ID data meets early‑stage performance thresholds for AI‑assisted diagnostic tools, but the OOD drop underscores the necessity of extensive multi‑site validation before clinical clearance.
Limitations & Future Work
- Dataset scope: Only four common lymphoma subtypes plus normal tissue were included; rare subtypes remain untested.
- Label granularity: Slide‑level labels ignore intra‑slide heterogeneity, which could be leveraged by more fine‑grained MIL or segmentation approaches.
- Domain shift: The OOD performance drop signals that current models are sensitive to staining protocols and scanner differences; domain‑adaptation or stain‑normalization techniques need exploration.
- Computational cost: While the encoders are frozen, processing millions of patches per slide still demands substantial GPU resources; smarter patch selection (e.g., coarse‑to‑fine attention) could reduce overhead (a toy illustration follows this list).
- Clinical integration: The study stops at algorithmic performance; future work should involve prospective trials, user‑interface design for pathologists, and cost‑benefit analyses.
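Purely to illustrate the patch‑selection idea raised above (not something the paper evaluates), one simple coarse‑to‑fine strategy would rank patches by the attention weights of a cheap low‑magnification MIL pass and keep only the top‑k for high‑magnification processing:

```python
import torch

def select_top_k_patches(patch_feats: torch.Tensor, attn: torch.Tensor, k: int = 1024):
    """Keep the k patches a coarse MIL pass attends to most (toy sketch)."""
    k = min(k, patch_feats.shape[0])
    idx = torch.topk(attn.squeeze(-1), k).indices
    return patch_feats[idx], idx

# 'patch_feats' and 'attn' could come from an AB-MIL-style head run on 10x features;
# the selected tile indices would then be re-extracted and encoded at 40x.
```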
By openly releasing the data splits, code, and evaluation scripts, the authors lay a solid foundation for the community to address these challenges and move AI‑assisted lymphoma diagnostics from research prototypes toward real‑world impact.
Authors
- Rao Muhammad Umer
- Daniel Sens
- Jonathan Noll
- Christian Matek
- Lukas Wolfseher
- Rainer Spang
- Ralf Huss
- Johannes Raffler
- Sarah Reinke
- Wolfram Klapper
- Katja Steiger
- Kristina Schwamborn
- Carsten Marr
Paper Information
- arXiv ID: 2512.14640v1
- Categories: cs.CV, cs.AI
- Published: December 16, 2025