[Paper] Wasserstein-Aligned Localisation for VLM-Based Distributional OOD Detection in Medical Imaging

Published: May 6, 2026 at 01:32 PM EDT

Source: arXiv - 2605.05161v1

Overview

Detecting rare pathologies in medical images without any disease‑specific training data is a holy grail for AI‑assisted radiology. The paper introduces WALDO (Wasserstein‑Aligned Localisation for VLM‑Based Distributional OOD Detection), a training‑free framework that turns zero‑shot anomaly localisation into a comparative inference problem. By matching a patient scan against a carefully chosen set of “normal” reference patches, WALDO improves the ability of large vision‑language models (VLMs) to pinpoint abnormal regions in brain MRI scans, for example raising Qwen2.5‑VL‑72B from ~36 % to 43.5 % mAP@30 on the NOVA benchmark.

Key Contributions

Comparative inference formulation – Recasts zero‑shot localisation as a structured comparison between the query image and a distribution of healthy anatomy.
Entropy‑weighted Sliced Wasserstein selection – Uses optimal‑transport distances on DINOv2 patch embeddings to pick anatomically relevant reference patches from a large unlabeled pool.
Goldilocks zone sampling – Theoretically and empirically shows that references with moderate similarity to the query (neither too close nor too far) yield the best bias‑variance trade‑off for anomaly detection.
Self‑consistency aggregation – Combines multiple comparative scores via weighted non‑maximum suppression, producing a robust localisation map without any fine‑tuning.
State‑of‑the‑art zero‑shot performance – On the NOVA brain‑MRI benchmark, WALDO lifts Qwen2.5‑VL‑72B from ~36 % to 43.5 % mAP@30, a 19 % relative gain, and delivers consistent improvements across GPT‑4o and Qwen3‑VL‑32B.
Open‑source release – Full code and demo are provided, enabling immediate experimentation.
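
As a rough illustration of the reference‑selection ideas listed above, the sketch below computes a plain sliced Wasserstein distance between two sets of patch embeddings and then keeps the references that fall in a middle band of distances. It is a minimal sketch under stated assumptions, not the authors' implementation: the entropy weighting is left out because this summary does not specify it, and the function names (`sliced_wasserstein`, `goldilocks_select`), the nearest‑rank quantile approximation, and the 25 %/75 % band limits are all hypothetical choices.

```python
import math
import random


def _unit_direction(dim, rng):
    """Draw a random direction on the unit sphere (normalised Gaussian)."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _quantiles(sorted_vals, levels):
    """Nearest-rank quantiles of an already sorted list."""
    n = len(sorted_vals)
    return [sorted_vals[min(n - 1, int(q * n))] for q in levels]


def sliced_wasserstein(xs, ys, n_proj=64, n_quantiles=50, seed=0):
    """Approximate sliced 1-Wasserstein distance between two point sets.

    Each point set is a list of equal-length float vectors. Every random
    projection reduces the problem to 1-D, where the distance is the mean
    absolute difference between matching quantiles.
    NOTE: unweighted; the paper's entropy weighting is not reproduced here.
    """
    rng = random.Random(seed)
    dim = len(xs[0])
    levels = [(i + 0.5) / n_quantiles for i in range(n_quantiles)]
    total = 0.0
    for _ in range(n_proj):
        d = _unit_direction(dim, rng)
        px = sorted(sum(a * b for a, b in zip(x, d)) for x in xs)
        py = sorted(sum(a * b for a, b in zip(y, d)) for y in ys)
        qx = _quantiles(px, levels)
        qy = _quantiles(py, levels)
        total += sum(abs(a - b) for a, b in zip(qx, qy)) / n_quantiles
    return total / n_proj


def goldilocks_select(distances, lower_q=0.25, upper_q=0.75):
    """Indices of references whose distance lies in a middle quantile band.

    The band limits are hypothetical defaults, not values from the paper.
    Indices are returned from nearest to farthest within the band.
    """
    if not 0.0 <= lower_q < upper_q <= 1.0:
        raise ValueError("need 0 <= lower_q < upper_q <= 1")
    order = sorted(range(len(distances)), key=lambda i: distances[i])
    n = len(order)
    start = int(lower_q * n + 1e-9)
    stop = max(start + 1, math.ceil(upper_q * n - 1e-9))
    return order[start:stop]
```

In use, one would compute `sliced_wasserstein(query_patches, ref_patches)` for every candidate reference and pass the resulting list of distances to `goldilocks_select`, which skips both the nearest and the farthest references.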

Methodology

Patch embedding extraction – The input MRI (or any 2‑D slice) is split into overlapping patches. Each patch is encoded with a frozen vision encoder (e.g., DINOv2) to obtain a high‑dimensional feature vector.
Reference pool construction – A large collection of healthy brain scans is processed the same way, yielding a distribution of normal patch embeddings.
Entropy‑weighted Sliced Wasserstein distance – For each query patch, WALDO computes a sliced Wasserstein distance to the reference distribution, weighting each patch's contribution by the entropy of its DINOv2 token. High‑entropy (more “informative”) patches therefore influence the distance more strongly, which keeps the selected references anatomically relevant.
Goldilocks zone sampling – Instead of using the closest references (which can be overly biased) or the most distant ones (which add noise), WALDO selects references whose similarity falls in a middle “Goldilocks” range. The authors prove that this range minimizes the expected error of the comparative estimator.
Comparative scoring with VLM – Each selected reference is paired with the query patch and fed to a frozen VLM (e.g., Qwen2.5‑VL‑72B). The model outputs a similarity score indicating how well the reference explains the query.
Self‑consistency aggregation – Scores from multiple references are merged using a weighted non‑maximum suppression (NMS) that favors consensus among references while suppressing outliers, producing a final anomaly heatmap.
Zero‑shot localisation – The heatmap is thresholded to obtain pixel‑level anomaly masks, all without any task‑specific training.
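
The aggregation step above could look roughly like the following, assuming each query/reference comparison yields candidate boxes as (x1, y1, x2, y2) with a positive confidence score. The box format, the function names, the IoU threshold of 0.5, and the returned support count are assumptions made for illustration; this summary does not describe the paper's exact weighting scheme.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def weighted_nms(boxes, scores, iou_thr=0.5):
    """Greedy NMS that merges overlapping boxes instead of discarding them.

    Boxes are visited from highest to lowest score. Each unvisited box
    gathers all remaining boxes that overlap it by at least `iou_thr`, and
    the cluster is replaced by its score-weighted average box.
    Returns a list of (merged_box, best_score, support) tuples, where
    `support` is the number of boxes that agreed on that region.
    """
    if any(s <= 0 for s in scores):
        raise ValueError("scores must be positive")
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    used = set()
    merged = []
    for i in order:
        if i in used:
            continue
        cluster = [i] + [
            j for j in order
            if j != i and j not in used and iou(boxes[i], boxes[j]) >= iou_thr
        ]
        used.update(cluster)
        total = sum(scores[j] for j in cluster)
        box = tuple(
            sum(scores[j] * boxes[j][c] for j in cluster) / total
            for c in range(4)
        )
        merged.append((box, max(scores[j] for j in cluster), len(cluster)))
    return merged
```

Returning the support count lets a caller drop regions proposed by only one reference, which is one simple way to favour consensus and suppress outliers.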

Results & Findings

Model (VLM)	mAP@30 (± SD)	Relative gain vs. baseline
Qwen2.5‑VL‑72B	43.5 % ± 1.6	+19 %
GPT‑4o	32.0 % ± 6.5	+14 %
Qwen3‑VL‑32B	32.0 % ± 6.6	+14 %

Statistical significance: Paired McNemar tests give p < 0.01 for all improvements.
Ablation studies: Removing entropy weighting or Goldilocks sampling drops performance by ~5–7 %, confirming each component’s contribution.
Cross‑model robustness: The same reference selection pipeline works across VLMs of different sizes and architectures, indicating that the benefit stems from the comparative framework rather than a specific model.
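
For readers unfamiliar with the significance test mentioned above, here is a minimal sketch of an exact McNemar test. It assumes each test case is scored as a hit or a miss for both the baseline and WALDO, so that only the discordant pairs matter. How the authors binarise their results is not stated in this summary, and the counts in the example are hypothetical, not taken from the paper.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant pair counts.

    b: cases the baseline got right and the new method got wrong
    c: cases the baseline got wrong and the new method got right
    Under the null hypothesis, b follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Hypothetical discordant counts, NOT values reported in the paper.
p_value = mcnemar_exact(b=2, c=15)
```

With these made‑up counts the p‑value comes out at roughly 0.0023, which would fall under the p < 0.01 threshold quoted above.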

Practical Implications

Rapid deployment in low‑resource settings: Since WALDO requires no fine‑tuning, hospitals can plug in any off‑the‑shelf VLM and start detecting rare anomalies immediately.
Scalable to new modalities: The pipeline only needs a pool of healthy examples; extending to CT, X‑ray, or histopathology is a matter of gathering unlabeled normal scans.
Assistive tool for radiologists: The heatmaps highlight suspicious regions, allowing clinicians to focus their review on a smaller area, potentially reducing reading time and missed lesions.
Regulatory friendliness: Training‑free methods sidestep many data‑privacy concerns because the reference pool can be kept on‑premise and never leaves the institution.
Foundation for hybrid AI systems: WALDO’s comparative reasoning can be combined with lightweight downstream classifiers (e.g., a small CNN) for a two‑stage pipeline that first flags candidate regions and then refines diagnosis.

Limitations & Future Work

Dependence on reference quality: If the healthy pool lacks sufficient anatomical diversity (e.g., age, scanner type), the Wasserstein distances may mischaracterize normal variation, leading to false positives.
Computational overhead: Computing sliced Wasserstein distances for every patch and sampling multiple references can be costly; the authors suggest approximate OT solvers as a speed‑up avenue.
2‑D slice focus: Experiments are limited to 2‑D brain MRI slices; extending to full 3‑D volumes and handling inter‑slice consistency remains an open challenge.
Domain shift for VLMs: While the method works across several VLMs, extreme domain gaps (e.g., non‑medical images) could degrade the VLM’s comparative scoring ability. Future work could explore domain‑adapted prompts or lightweight adapters.
User studies: The paper does not include radiologist usability tests; evaluating how clinicians interact with WALDO’s heatmaps will be crucial for real‑world adoption.

WALDO demonstrates that clever use of optimal‑transport theory and comparative reasoning can unlock the zero‑shot potential of large vision‑language models for medical anomaly localisation, opening a path toward more flexible, data‑efficient AI tools in healthcare.

Authors

Bernhard Kainz
Johanna P Mueller
Matthew Baugh
Cosmin Bercea

Paper Information

arXiv ID: 2605.05161v1
Categories: cs.CV
Published: May 6, 2026