[Paper] Scanner-Induced Domain Shifts Undermine the Robustness of Pathology Foundation Models
Source: arXiv - 2601.04163v1
Overview
Pathology foundation models (PFMs) promise to be universal feature extractors for whole‑slide images (WSIs), enabling a wide range of downstream analyses in computational pathology. This study uncovers a hidden weakness: PFMs are surprisingly sensitive to the type of scanner that digitizes the tissue, which can jeopardize their reliability in real‑world clinical workflows.
Key Contributions
- Systematic scanner‑shift benchmark: Evaluated 14 PFMs—including the latest vision‑language models, earlier self‑supervised encoders, and a natural‑image baseline—on a curated multi‑scanner breast‑cancer dataset (384 WSIs from five different scanners).
- Dual evaluation strategy: Combined unsupervised embedding analyses (visualizing and quantifying scanner‑specific clustering) with supervised clinicopathological tasks (e.g., tumor grade, hormone‑receptor status) to assess robustness.
- Evidence of hidden bias: Showed that while classification AUCs often stay stable across scanners, the underlying embeddings shift, leading to systematic calibration errors and scanner‑dependent prediction bias.
- No simple robustness predictor: Demonstrated that larger training corpora, newer architectures, or bigger model sizes do not guarantee scanner invariance.
- Insight on vision‑language models: These models, trained on the most heterogeneous data, exhibit relatively better embedding stability but still lag on downstream task performance.
- Call for new evaluation standards: Argues that robustness to acquisition variability must be a first‑class metric when developing and benchmarking PFMs.
Methodology
- Dataset construction – 384 breast‑cancer WSIs were digitized on five commercial scanners (e.g., Aperio, Hamamatsu, Leica). All other variables (tissue block, staining protocol, patient cohort) were held constant to isolate the scanner effect.
- Model suite – The authors selected 14 publicly available PFMs:
  - Recent vision‑language models (e.g., CLIP‑based encoders)
  - State‑of‑the‑art self‑supervised pathology models (e.g., SimCLR, MoCo variants)
  - Earlier self‑supervised models, plus a ResNet‑50 pretrained on ImageNet as a natural‑image baseline
- Embedding analysis – For each model, tile‑level embeddings were extracted from all WSIs. Dimensionality reduction (UMAP/t‑SNE) and clustering metrics (Silhouette score, k‑NN purity) quantified how much embeddings grouped by scanner rather than by biological label.
- Supervised downstream tasks – Linear probes were trained on embeddings to predict clinically relevant outcomes (e.g., ER/PR status, tumor grade). Performance (AUC) and calibration (Brier score, reliability diagrams) were measured separately for each scanner.
- Statistical controls – Mixed‑effects models accounted for repeated measurements from the same patient and for potential residual confounders.
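The core of the embedding analysis above — asking whether tile embeddings group by scanner rather than by biology — can be sketched with a silhouette score computed over scanner labels. This is a minimal illustration on synthetic embeddings (the per‑scanner offsets stand in for a scanner‑induced shift; none of this is the paper's data or code):

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Synthetic stand-in for tile-level embeddings: 5 scanners, 200 tiles each.
# A fixed per-scanner offset mimics a scanner-induced shift in latent space.
n_scanners, tiles_per_scanner, dim = 5, 200, 64
offsets = rng.normal(scale=1.0, size=(n_scanners, dim))
embeddings = np.concatenate([
    rng.normal(size=(tiles_per_scanner, dim)) + offsets[s]
    for s in range(n_scanners)
])
scanner_labels = np.repeat(np.arange(n_scanners), tiles_per_scanner)

# Silhouette computed over *scanner* labels: the higher the score, the more
# the embeddings cluster by scanner rather than by biological content
# (the study reports an average of about 0.35 across models).
score = silhouette_score(embeddings, scanner_labels)
print(f"scanner silhouette: {score:.2f}")
```

In the paper's setup the same statistic would be computed on real tile embeddings from each frozen encoder; a score near zero would indicate scanner‑invariant features.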
Results & Findings
- Scanner‑specific embedding clusters: Most PFMs produced embeddings that clearly separated by scanner (average Silhouette ≈ 0.35), indicating that scanner characteristics dominate the latent space.
- AUC stability masks calibration drift: Across scanners, AUCs for tasks such as ER status varied by < 2 %, yet calibration metrics deteriorated markedly (Brier score increase up to 0.12). This means predictions become over‑ or under‑confident depending on the scanner.
- No correlation with model size or data volume: Large models (≈ 300 M parameters) and those trained on > 10 M patches did not outperform smaller, older models in terms of scanner invariance.
- Vision‑language models fare slightly better: CLIP‑based encoders showed the lowest scanner clustering (Silhouette ≈ 0.18) but achieved lower downstream AUCs (≈ 0.78 vs. ≈ 0.84 for the best self‑supervised model).
- Baseline ImageNet model performed worst: It exhibited the strongest scanner bias and the poorest downstream task results, confirming that natural‑image pretraining is insufficient for pathology.
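The central result — stable AUC masking calibration drift — follows from a simple fact: AUC depends only on the ranking of predicted probabilities, while the Brier score penalizes miscalibrated confidence. The sketch below makes that concrete with synthetic probabilities; the monotone distortion is an illustrative stand‑in for scanner‑induced over‑confidence, not the paper's actual shift:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, brier_score_loss

rng = np.random.default_rng(1)

# Synthetic binary task (think ER status): labels drawn from well-calibrated
# probabilities, as a linear probe might produce on its home scanner.
n = 2000
p_true = rng.uniform(0.05, 0.95, size=n)
y = (rng.uniform(size=n) < p_true).astype(int)

def overconfident(p, t=3.0):
    # Strictly increasing map pushing probabilities toward 0 and 1:
    # it preserves the ranking exactly, so AUC is untouched, but it
    # makes predictions over-confident, so the Brier score worsens.
    return p**t / (p**t + (1 - p)**t)

p_shift = overconfident(p_true)  # "same model, different scanner"

auc_a, auc_b = roc_auc_score(y, p_true), roc_auc_score(y, p_shift)
brier_a, brier_b = brier_score_loss(y, p_true), brier_score_loss(y, p_shift)
print(f"AUC:   {auc_a:.3f} vs {auc_b:.3f}")
print(f"Brier: {brier_a:.3f} vs {brier_b:.3f}")
```

Because the two AUCs are identical while the Brier score degrades, a benchmark that reports only AUC per scanner would miss exactly the failure mode the paper documents.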
Practical Implications
- Deployment caution: Clinics cannot assume that a PFM validated on one scanner will behave identically on another; hidden calibration shifts could lead to systematic over‑diagnosis or missed cases.
- Model selection trade‑offs: Choosing a model solely on benchmark AUC may be risky; developers should also examine embedding stability and calibration across expected scanner fleets.
- Need for scanner‑aware pipelines: Incorporating scanner metadata as an explicit covariate, or applying domain‑adaptation techniques (e.g., adversarial alignment, style transfer) before embedding extraction, can mitigate bias.
- Testing standards: Vendors and research groups should adopt multi‑scanner validation suites as part of regulatory submissions or open‑source releases, similar to cross‑site validation in radiology AI.
- Opportunity for tooling: The community can build open libraries that automatically assess embedding drift (e.g., “ScannerShift‑Check”) and suggest corrective fine‑tuning steps, lowering the barrier for robust PFM adoption.
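"ScannerShift‑Check" above is a hypothetical tool, not an existing library, but the kind of drift statistic it might compute is simple. The sketch below scores the gap between per‑scanner mean embeddings relative to the reference scanner's feature spread; both the data and the `embedding_drift` helper are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic embeddings from a "reference" and a "new" scanner; the constant
# offset on the new scanner stands in for a scanner-induced shift.
dim = 64
ref = rng.normal(size=(300, dim))
new = rng.normal(size=(300, dim)) + 0.5

def embedding_drift(a, b):
    """Gap between per-scanner mean embeddings, scaled by the reference
    scanner's average per-feature spread: a crude, fast drift score."""
    gap = np.linalg.norm(a.mean(axis=0) - b.mean(axis=0))
    spread = a.std(axis=0).mean()
    return gap / spread

drift = embedding_drift(ref, new)
print(f"drift score: {drift:.2f}")
```

A score near zero suggests matched scanners; a large score flags a shift worth investigating before trusting downstream predictions. Real tooling would likely use a stronger two‑sample statistic (e.g., MMD) and per‑class breakdowns, but the interface could look much like this.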
Limitations & Future Work
- Scope limited to breast cancer WSIs: While the multi‑scanner design isolates scanner effects, other tissue types and staining protocols may exhibit different sensitivities.
- Fixed preprocessing pipeline: The study used a single tiling and color‑normalization strategy; alternative pipelines could interact with scanner bias in unpredictable ways.
- No end‑to‑end fine‑tuning: The authors evaluated frozen encoders; future work should explore whether modest fine‑tuning on a small, scanner‑balanced set can restore calibration.
- Broader acquisition variables: Beyond scanner hardware, factors like compression level, file format, and scanning speed were not examined and could compound the observed shifts.
Bottom line: The paper shines a light on a subtle but critical failure mode of pathology foundation models—scanner‑induced domain shift—that can undermine their promise of plug‑and‑play utility. Addressing this issue now will be essential for safe, scalable AI deployment in digital pathology.
Authors
- Erik Thiringer
- Fredrik K. Gustafsson
- Kajsa Ledesma Eriksson
- Mattias Rantalainen
Paper Information
- arXiv ID: 2601.04163v1
- Categories: eess.IV, cs.CV, cs.LG
- Published: January 7, 2026