[Paper] Leveraging whole slide difficulty in Multiple Instance Learning to improve prostate cancer grading
Source: arXiv - 2603.09953v1
Overview
This paper tackles a common pain point in computational pathology: whole‑slide images (WSIs) are often labeled by expert pathologists, but the difficulty of interpreting a slide varies widely. By quantifying that “slide difficulty” from the disagreement between experts and non‑experts, the authors show how to make Multiple Instance Learning (MIL) models for prostate cancer Gleason grading more robust—especially on the toughest, high‑grade cases.
Key Contributions
- Whole Slide Difficulty (WSD) metric – a simple, data‑driven score derived from expert vs. non‑expert annotation disagreement.
- Two training strategies to exploit WSD:
- Multi‑task learning – the model jointly predicts the cancer grade and the slide difficulty.
- Weighted loss – the classification loss is scaled by the WSD, giving harder slides more influence during training.
- Extensive empirical validation on prostate cancer WSIs, demonstrating consistent performance gains across multiple MIL backbones (e.g., Attention‑MIL, CLAM) and feature encoders (ResNet‑50, EfficientNet).
- Focused improvement on high Gleason grades, which are clinically the most critical and historically the hardest for AI models to classify correctly.
Methodology
Data & Difficulty Annotation
- A set of prostate WSIs was annotated by a senior pathologist, whose labels serve as ground truth, and by junior (non‑expert) pathologists.
- For each slide, the WSD score is computed as the binary disagreement (0 = agreement, 1 = disagreement) or a normalized count when multiple non‑experts are involved.
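The WSD definition above can be sketched in a few lines of Python; the function name and signature are ours, not the paper's:

```python
from typing import Sequence

def wsd_score(expert_label: int, non_expert_labels: Sequence[int]) -> float:
    """Whole Slide Difficulty: fraction of non-expert labels that disagree
    with the expert's label. With a single non-expert this reduces to the
    binary flag (0 = agreement, 1 = disagreement)."""
    disagreements = sum(1 for lbl in non_expert_labels if lbl != expert_label)
    return disagreements / len(non_expert_labels)

# Single non-expert: binary disagreement flag
print(wsd_score(4, [4]))  # 0.0 (agreement)
print(wsd_score(4, [3]))  # 1.0 (disagreement)
# Multiple non-experts: normalized disagreement count
print(wsd_score(4, [4, 3, 4]))  # 0.3333333333333333
```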
MIL Framework
- WSIs are split into thousands of image patches (instances).
- A pretrained CNN extracts a feature vector for each patch.
- An MIL aggregator (e.g., attention‑based pooling) produces a slide‑level representation that feeds into a classifier.
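The aggregation step can be sketched as a minimal attention-based MIL module in PyTorch. Dimensions, layer sizes, and the class name are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class AttentionMIL(nn.Module):
    """Minimal attention-based MIL aggregator: patch features are pooled
    into a slide-level representation via learned attention weights."""
    def __init__(self, feat_dim: int = 512, attn_dim: int = 128, n_classes: int = 5):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Linear(feat_dim, attn_dim),
            nn.Tanh(),
            nn.Linear(attn_dim, 1),
        )
        self.classifier = nn.Linear(feat_dim, n_classes)

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (n_patches, feat_dim) from a pretrained CNN encoder
        attn = torch.softmax(self.attention(patch_feats), dim=0)  # (n_patches, 1)
        slide_repr = (attn * patch_feats).sum(dim=0)              # (feat_dim,)
        return self.classifier(slide_repr)                        # grade logits

logits = AttentionMIL()(torch.randn(1000, 512))
print(logits.shape)  # torch.Size([5])
```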
Integrating WSD
- Multi‑task: The network has two heads—one for Gleason grade prediction, another for a binary difficulty prediction. The total loss is a weighted sum of the two tasks, encouraging the shared backbone to learn features that are informative for both.
- Weighted loss: The standard cross‑entropy loss for Gleason grading is multiplied by a factor proportional to the slide’s WSD (harder slides → larger weight).
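Both strategies amount to small changes to the loss function. In the PyTorch sketch below, the `1 + alpha * wsd` scaling and the `lam` balance factor are our illustrative choices; the paper only specifies that harder slides receive larger weight and that the two task losses are summed:

```python
import torch
import torch.nn.functional as F

def wsd_weighted_ce(logits: torch.Tensor, target: torch.Tensor,
                    wsd: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """Per-slide cross-entropy scaled by difficulty (harder slides weigh more)."""
    ce = F.cross_entropy(logits, target, reduction="none")  # (batch,)
    return ((1.0 + alpha * wsd) * ce).mean()

def multitask_loss(grade_logits: torch.Tensor, grade_target: torch.Tensor,
                   diff_logits: torch.Tensor, wsd: torch.Tensor,
                   lam: float = 0.5) -> torch.Tensor:
    """Grade head plus binary difficulty head; `lam` balances the two tasks."""
    grade_loss = F.cross_entropy(grade_logits, grade_target)
    diff_loss = F.binary_cross_entropy_with_logits(diff_logits, wsd)
    return grade_loss + lam * diff_loss

# Dummy batch of 4 slides, 5 grade classes
grade_logits = torch.randn(4, 5)
grade_target = torch.tensor([0, 2, 4, 4])
wsd = torch.tensor([0.0, 1.0, 0.0, 1.0])   # per-slide difficulty
diff_logits = torch.randn(4)                # difficulty head output

print(wsd_weighted_ce(grade_logits, grade_target, wsd))
print(multitask_loss(grade_logits, grade_target, diff_logits, wsd))
```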
Training & Evaluation
- Experiments were run with 5‑fold cross‑validation.
- Metrics: macro‑averaged F1, weighted accuracy, and per‑grade recall, with special attention to grades 4/5 (high‑grade cancer).
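For reference, per-grade recall and macro-averaged F1 can be computed with a short, dependency-free sketch (function names and the toy labels are ours):

```python
def per_grade_recall(y_true, y_pred, grades):
    """Recall per grade: TP / (TP + FN) over slides of that true grade."""
    out = {}
    for g in grades:
        support = sum(1 for t in y_true if t == g)
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == g and p == g)
        out[g] = tp / support if support else 0.0
    return out

def macro_f1(y_true, y_pred, grades):
    """Unweighted mean of per-grade F1 scores."""
    f1s = []
    for g in grades:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == g and p == g)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != g and p == g)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == g and p != g)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

y_true = [1, 2, 3, 4, 5, 4, 5]
y_pred = [1, 2, 3, 4, 4, 4, 5]
print(per_grade_recall(y_true, y_pred, [4, 5]))  # {4: 1.0, 5: 0.5}
print(macro_f1(y_true, y_pred, [1, 2, 3, 4, 5]))
```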
Results & Findings
| Setup | Macro‑F1 ↑ | Weighted Acc ↑ | Grade 4/5 Recall ↑ |
|---|---|---|---|
| Baseline MIL (no WSD) | 0.71 | 0.84 | 0.62 |
| + Multi‑task WSD | 0.75 (+5.6%) | 0.88 (+4.8%) | 0.71 (+14.5%) |
| + Weighted‑loss WSD | 0.74 (+4.2%) | 0.87 (+3.6%) | 0.68 (+9.7%) |
- Both WSD‑aware strategies beat the vanilla MIL baseline across all encoders.
- Gains are most pronounced on the highest grades (4/5), reducing the false negatives most likely to miss aggressive tumors.
- The multi‑task variant slightly outperforms the weighted‑loss approach, suggesting that explicitly modeling difficulty helps the network learn richer representations.
Practical Implications
- Better triage tools – Pathology labs can deploy MIL models that are less likely to miss high‑grade prostate cancer, improving patient safety.
- Training data efficiency – By weighting harder slides, developers can achieve higher performance without collecting dramatically more data, saving annotation costs.
- Generalizable recipe – The WSD concept is not limited to prostate cancer; any histopathology task with expert vs. non‑expert disagreement (e.g., breast, lung) can adopt the same multi‑task or weighted‑loss framework.
- Model interpretability – The difficulty head provides a confidence signal that can be surfaced to clinicians, helping them decide when to request a second opinion.
- Integration into pipelines – Since the approach only adds a lightweight auxiliary head or loss scaling, it fits into existing MIL pipelines (e.g., PyTorch‑based CLAM) with minimal engineering overhead.
Limitations & Future Work
- Binary difficulty definition – The current WSD is a simple disagreement flag; richer difficulty signals (e.g., continuous uncertainty, multi‑rater consensus) could capture nuance.
- Dependence on non‑expert quality – If the non‑expert annotator is poorly trained, the WSD may be noisy, potentially harming performance.
- Scope limited to prostate Gleason grading – While results are promising, validation on other cancer types and multi‑institutional datasets is needed to confirm generality.
- Scalability to ultra‑large cohorts – The study used a modest number of slides; future work should test the approach on tens of thousands of WSIs to assess computational overhead and robustness.
Bottom line: By turning “hard‑to‑diagnose” slides into a learning signal rather than a nuisance, this work offers a practical, low‑cost upgrade for MIL‑based pathology models—something developers can start experimenting with today.
Authors
- Marie Arrivat
- Rémy Peyret
- Elsa Angelini
- Pietro Gori
Paper Information
- arXiv ID: 2603.09953v1
- Categories: cs.CV
- Published: March 10, 2026