[Paper] PathBench-MIL: A Comprehensive AutoML and Benchmarking Framework for Multiple Instance Learning in Histopathology
Source: arXiv - 2512.17517v1
Overview
PathBench‑MIL is an open‑source framework that brings automated machine learning (AutoML) and reproducible benchmarking to Multiple Instance Learning (MIL) pipelines for histopathology images. By stitching together data preprocessing, feature extraction, and MIL aggregation into a single, configurable workflow, it lets researchers and developers compare dozens of state‑of‑the‑art models on any whole‑slide dataset with just a few lines of code.
Key Contributions
- End‑to‑end AutoML pipeline for MIL in digital pathology, covering slide tiling, stain normalization, feature extraction, and aggregation.
- Unified configuration system (YAML/CLI) that lets users swap models, extractors, and hyper‑parameters without touching code.
- Comprehensive benchmark suite: over 30 MIL architectures (e.g., Attention‑MIL, CLAM, DSMIL) and over 10 feature extractors (ResNet, EfficientNet, Vision Transformers, handcrafted texture descriptors).
- Visualization toolbox: attention heatmaps, instance‑level embeddings, and performance dashboards integrated with TensorBoard/Streamlit.
- Modular, extensible design: plug in new models, datasets, or evaluation metrics via a simple Python API (a registry‑style sketch follows this list).
- Open‑source release under MIT license with detailed documentation and CI‑tested reproducibility.
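To make the plug‑in idea concrete, the following minimal sketch shows what a registry‑style Python API for custom MIL aggregators could look like; the decorator, registry, and model names here are hypothetical and are not taken from the actual PathBench‑MIL codebase.

```python
# Hypothetical sketch of a registry-style plug-in API; the decorator, registry,
# and model class are illustrative, not the actual PathBench-MIL interface.
from typing import Callable, Dict, Type

import torch
import torch.nn as nn

MIL_REGISTRY: Dict[str, Type[nn.Module]] = {}

def register_mil_model(name: str) -> Callable[[Type[nn.Module]], Type[nn.Module]]:
    """Decorator that makes a custom MIL aggregator discoverable by name."""
    def decorator(cls: Type[nn.Module]) -> Type[nn.Module]:
        MIL_REGISTRY[name] = cls
        return cls
    return decorator

@register_mil_model("mean_pool_mil")
class MeanPoolMIL(nn.Module):
    """Trivial MIL head: average the tile embeddings in a bag, then classify."""
    def __init__(self, embed_dim: int = 768, n_classes: int = 2):
        super().__init__()
        self.classifier = nn.Linear(embed_dim, n_classes)

    def forward(self, bag: torch.Tensor) -> torch.Tensor:  # bag: (n_tiles, embed_dim)
        return self.classifier(bag.mean(dim=0))

# A config-driven harness can then build models from plain strings:
model = MIL_REGISTRY["mean_pool_mil"](embed_dim=768, n_classes=2)
```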
Methodology
PathBench‑MIL treats a whole‑slide image (WSI) as a bag of instances (small tiles). The workflow proceeds in three stages:
- Preprocessing – WSIs are tiled, optionally filtered by tissue detection, and normalized for stain variation using Macenko or Reinhard methods.
- Feature Extraction – Each tile is passed through a chosen backbone (CNN, Vision Transformer, or handcrafted descriptor) to obtain a fixed‑length embedding. The framework caches embeddings to avoid redundant computation.
- MIL Aggregation – The bag of embeddings is fed to a selected MIL model. The system supports classic pooling (max/mean), attention‑based pooling, graph‑based aggregators, and transformer‑style set encoders; a minimal attention‑pooling sketch follows this list. Hyper‑parameters (learning rate, batch size, optimizer) are auto‑tuned via Optuna or Ray Tune.
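As referenced in the aggregation step above, attention‑based pooling scores each tile, normalizes the scores with a softmax over the bag, and classifies the resulting weighted slide embedding. The sketch below is a minimal illustrative PyTorch implementation of that technique, not PathBench‑MIL's own code.

```python
# Minimal attention-based MIL aggregator (illustrative implementation of the
# technique, not PathBench-MIL's own code).
import torch
import torch.nn as nn

class AttentionMIL(nn.Module):
    def __init__(self, embed_dim: int = 768, hidden_dim: int = 128, n_classes: int = 2):
        super().__init__()
        # Two-layer MLP scores each tile; softmax turns scores into attention weights.
        self.attention = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )
        self.classifier = nn.Linear(embed_dim, n_classes)

    def forward(self, bag: torch.Tensor):
        # bag: (n_tiles, embed_dim) -- one whole-slide image as a bag of tile embeddings
        scores = self.attention(bag)                  # (n_tiles, 1)
        weights = torch.softmax(scores, dim=0)        # attention over tiles in the bag
        slide_embedding = (weights * bag).sum(dim=0)  # weighted mean -> (embed_dim,)
        logits = self.classifier(slide_embedding)     # slide-level prediction
        return logits, weights                        # weights can be rendered as a heatmap

# Example: a slide with 500 tiles, each embedded to 768 dimensions.
logits, attn = AttentionMIL()(torch.randn(500, 768))
```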
All components are defined in a declarative YAML file, enabling reproducible runs across different hardware (CPU, single‑GPU, multi‑GPU). The benchmark harness runs each configuration on a given dataset, logs metrics (AUROC, accuracy, F1), and stores results in a SQLite/CSV ledger for downstream analysis.
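The snippet below sketches how the auto‑tuning and results‑ledger ideas fit together: an Optuna study proposes hyper‑parameters, a placeholder train_and_evaluate function (hypothetical, not a PathBench‑MIL call) scores each trial, and every trial is appended to a simple CSV ledger.

```python
# Illustrative Optuna search over MIL training hyper-parameters with a CSV results
# ledger; `train_and_evaluate` is a stand-in, not a PathBench-MIL function.
import csv
import random

import optuna

def train_and_evaluate(lr: float, batch_size: int, optimizer_name: str) -> float:
    """Placeholder for a full MIL training run that returns validation AUROC.
    A random score is returned here so the sketch runs end-to-end."""
    return random.uniform(0.70, 0.95)

def objective(trial: optuna.Trial) -> float:
    lr = trial.suggest_float("lr", 1e-5, 1e-2, log=True)
    batch_size = trial.suggest_categorical("batch_size", [1, 8, 16, 32])
    optimizer_name = trial.suggest_categorical("optimizer", ["adam", "adamw", "sgd"])
    auroc = train_and_evaluate(lr, batch_size, optimizer_name)

    # Append the trial to a simple CSV ledger for downstream analysis.
    with open("results.csv", "a", newline="") as f:
        csv.writer(f).writerow([trial.number, lr, batch_size, optimizer_name, auroc])
    return auroc

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print("Best AUROC:", study.best_value, "params:", study.best_params)
```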
Results & Findings
- Speedup: Automated hyper‑parameter search reduced the time to reach a target AUROC of 0.85 on the Camelyon16 dataset from ~12 hours (manual tuning) to ~3 hours.
- Performance ceiling: Across 30+ MIL variants, the best‑performing configuration (EfficientNet‑B3 + Attention‑MIL) achieved AUROC = 0.94, matching or surpassing published state‑of‑the‑art results.
- Feature extractor impact: Vision Transformers (ViT‑B/16) consistently outperformed ResNet‑50 on heterogeneous staining patterns, but required more GPU memory; the framework’s caching mitigated this overhead.
- Reproducibility: Running the same benchmark on three separate machines yielded <0.3 % variance in AUROC, confirming deterministic data splits and seed handling (a typical seeding recipe is sketched after this list).
- Usability: A user study with 12 pathology labs reported a 70 % reduction in setup time for new experiments compared to ad‑hoc scripts.
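The seed handling referenced in the reproducibility finding usually boils down to a recipe like the one below; this is a common pattern and is not claimed to be PathBench‑MIL's exact internal implementation.

```python
# Typical global-seeding recipe for reproducible runs (illustrative; not necessarily
# identical to PathBench-MIL's internal seed handling).
import os
import random

import numpy as np
import torch

def seed_everything(seed: int = 42) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    # Prefer deterministic kernels where available (may cost some throughput).
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

seed_everything(42)
```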
Practical Implications
- Rapid prototyping: Developers can spin up a full MIL experiment in minutes, allowing faster iteration on novel architectures or domain‑specific augmentations.
- Standardized evaluation: Companies building AI‑assisted diagnostic tools can benchmark their models against a common baseline, facilitating regulatory documentation and cross‑partner collaborations.
- Resource optimization: Cached embeddings and built‑in hyper‑parameter search reduce wasted GPU cycles, lowering cloud compute costs (a cache sketch follows this list).
- Educational tool: The visualization suite makes it easy to explain model decisions (e.g., attention heatmaps) to clinicians, bridging the “black‑box” gap.
- Extensibility to other domains: Because MIL is generic, PathBench‑MIL can be repurposed for radiology, satellite imagery, or any task where labels exist at the bag level but not the instance level.
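As a rough illustration of the embedding‑cache idea mentioned above, the snippet below memoizes tile embeddings on disk keyed by slide ID and extractor name; the file layout and function name are hypothetical, not the framework's actual cache format.

```python
# Illustrative on-disk embedding cache keyed by slide ID and extractor name
# (hypothetical layout, not PathBench-MIL's actual cache format).
from pathlib import Path

import numpy as np

CACHE_DIR = Path("embedding_cache")

def get_or_compute_embeddings(slide_id: str, extractor_name: str, compute_fn):
    """Return cached tile embeddings for a slide, computing and storing them on a miss."""
    CACHE_DIR.mkdir(exist_ok=True)
    cache_file = CACHE_DIR / f"{slide_id}__{extractor_name}.npy"
    if cache_file.exists():
        return np.load(cache_file)           # cache hit: skip the GPU forward pass
    embeddings = compute_fn(slide_id)        # e.g. run all tiles through the backbone
    np.save(cache_file, np.asarray(embeddings))
    return embeddings
```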
Limitations & Future Work
- Scalability to ultra‑large cohorts: While caching helps, handling millions of tiles still demands high‑capacity storage; future versions will integrate distributed data tooling and columnar storage (e.g., Dask, Parquet).
- Limited support for weakly‑supervised labels: Current benchmarks assume binary slide‑level labels; extending to multi‑class or ordinal outcomes is on the roadmap.
- GPU memory constraints: Transformer‑based extractors are memory‑hungry; planned optimizations include mixed‑precision training and gradient checkpointing.
- Domain shift handling: The framework does not yet provide automated stain‑style transfer or domain adaptation modules—future releases aim to plug in such techniques.
PathBench‑MIL positions itself as the “one‑stop shop” for anyone looking to experiment with MIL in histopathology, turning what used to be a multi‑week engineering effort into a reproducible, plug‑and‑play workflow. Check it out on GitHub and start scaling your pathology AI projects today.
Authors
- Siemen Brussee
- Pieter A. Valkema
- Jurre A. J. Weijer
- Thom Doeleman
- Anne M. R. Schrader
- Jesper Kers
Paper Information
- arXiv ID: 2512.17517v1
- Categories: cs.CV, cs.LG, cs.NE, cs.SE, q-bio.TO
- Published: December 19, 2025