[Paper] Med-Scout: Curing MLLMs' Geometric Blindness in Medical Perception via Geometry-Aware RL Post-Training
Source: arXiv - 2601.23220v1
Overview
The paper Med‑Scout tackles a hidden flaw in today’s multimodal large language models (MLLMs) for medicine: they can “see” an image but often ignore its geometry, leading to confident yet factually wrong diagnoses. By introducing a geometry‑aware reinforcement‑learning (RL) post‑training step that extracts supervision from the images themselves, the authors dramatically improve the models’ spatial reasoning without any extra expert labeling.
Key Contributions
- Med‑Scout framework – a lightweight RL‑based post‑training pipeline that injects geometric awareness into any pre‑trained MLLM.
- Three proxy tasks that turn raw medical images into self‑supervised signals:
- Hierarchical Scale Localization – learns absolute and relative size cues.
- Topological Jigsaw Reconstruction – forces the model to understand spatial arrangement by re‑ordering shuffled image patches.
- Anomaly Consistency Detection – checks whether detected lesions respect plausible geometric constraints.
- Med‑Scout‑Bench – a new benchmark that isolates geometric perception from pure language ability, exposing “geometric blindness” in existing models.
- Empirical gains – >40 % improvement over state‑of‑the‑art MLLMs on the benchmark, plus consistent lifts on standard radiology VQA and comprehensive medical QA datasets.
- Annotation‑free – the approach requires no additional radiologist annotations, making it cheap to scale across modalities and institutions.
Methodology
- Base Model – Start with any off‑the‑shelf MLLM (e.g., GPT‑4‑Vision, LLaVA‑Med) that already has strong language grounding.
- Self‑Supervised Signal Extraction
- Scale Localization: The image is down‑sampled at multiple resolutions; the model predicts the correct scale level for each region, learning absolute size relationships.
- Jigsaw Reconstruction: Images are split into a grid, shuffled, and the model must output the correct ordering, encouraging it to infer adjacency and topology.
- Anomaly Consistency: Synthetic lesions are inserted or masked; the model receives a binary reward for correctly flagging geometrically impossible configurations.
- RL Fine‑Tuning – Each proxy task defines a reward function (e.g., +1 for correct ordering, –1 for violations). Using Proximal Policy Optimization (PPO), the MLLM’s policy (its multimodal encoder‑decoder) is updated to maximize these rewards while preserving language fluency via a KL‑regularization term.
- Joint Training – The three tasks are interleaved, so the model simultaneously learns scale, topology, and consistency. Because the signals come directly from the image data, no human labels are needed.
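The Topological Jigsaw Reconstruction task described above can be sketched in a few lines. This is an illustrative sketch, not the paper's implementation: the grid size, the ±1 reward values, and the NumPy-based patching are all assumptions made for the example.

```python
# Illustrative sketch of the Topological Jigsaw Reconstruction proxy task.
import numpy as np

def make_jigsaw(image: np.ndarray, grid: int, rng: np.random.Generator):
    """Split a square image into grid*grid patches and shuffle them.

    Returns the shuffled patch stack and the permutation that was applied;
    the permutation serves as a free self-supervised label (no annotation).
    """
    h, w = image.shape[:2]
    ph, pw = h // grid, w // grid
    patches = [
        image[r * ph:(r + 1) * ph, c * pw:(c + 1) * pw]
        for r in range(grid) for c in range(grid)
    ]
    perm = rng.permutation(grid * grid)
    shuffled = [patches[i] for i in perm]
    return np.stack(shuffled), perm  # the model must recover `perm`

def ordering_reward(predicted: np.ndarray, target: np.ndarray) -> float:
    """Binary reward in the spirit of the paper's scheme:
    +1 for the exact ordering, -1 otherwise."""
    return 1.0 if np.array_equal(predicted, target) else -1.0

rng = np.random.default_rng(0)
img = rng.random((64, 64))
stack, perm = make_jigsaw(img, grid=4, rng=rng)
print(stack.shape)                   # (16, 16, 16): 16 patches of 16x16
print(ordering_reward(perm, perm))   # 1.0
```

Restoring the original layout is just `stack[np.argsort(perm)]`, which is also a convenient sanity check when generating training data.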
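The KL-regularized objective used during RL fine-tuning can be sketched as a shaped scalar reward. The `beta` coefficient and the per-token Monte-Carlo KL estimate below are assumed details for illustration; the summary only states that a KL term keeps the policy close to the original model to preserve language fluency.

```python
# Sketch of a KL-regularised reward for PPO fine-tuning (assumed details).
import numpy as np

def kl_shaped_reward(task_reward: float,
                     logp_policy: np.ndarray,
                     logp_ref: np.ndarray,
                     beta: float = 0.05) -> float:
    """Combine the proxy-task reward with a KL penalty toward the frozen
    reference model. `beta` is a hypothetical hyperparameter; the KL is
    estimated per token as log p_policy - log p_ref over the sampled tokens.
    """
    per_token_kl = logp_policy - logp_ref
    return task_reward - beta * float(np.sum(per_token_kl))

# Toy example: +1 proxy-task reward, slight drift from the reference model.
logp_pi = np.array([-1.0, -0.8, -1.2])
logp_ref = np.array([-1.1, -0.9, -1.2])
print(kl_shaped_reward(1.0, logp_pi, logp_ref))  # ≈ 0.99
```

When the policy has not drifted at all the penalty vanishes and the proxy-task reward passes through unchanged, which is exactly the property that lets the model gain geometric skill without losing fluency.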
Results & Findings
| Model (baseline / + Med‑Scout post‑training) | Med‑Scout‑Bench ↑ (relative Δ) | Radiology VQA (overall) | Comprehensive Med‑QA |
|---|---|---|---|
| GPT‑4‑Vision (baseline) | 58.2 % | 71.4 % | 68.9 % |
| GPT‑4‑Vision + Med‑Scout | 82.7 % (+42 %) | 78.3 % (+6.9 pp) | 74.5 % (+5.6 pp) |
| LLaVA‑Med (baseline) | 55.0 % | 68.1 % | 66.2 % |
| LLaVA‑Med + Med‑Scout | 81.1 % (+47 %) | 76.0 % (+7.9 pp) | 73.0 % (+6.8 pp) |
- Geometric blind spots disappear – The RL‑trained models correctly localize lesions, respect organ boundaries, and avoid impossible size predictions.
- Transferable gains – Even tasks that are not explicitly geometric (e.g., disease classification from text) see modest accuracy bumps, suggesting that a better spatial foundation improves overall reasoning.
- Efficiency – Post‑training converges in ~12 h on a single A100 GPU, with less than 0.5 % of the original model parameters being updated.
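The sub-0.5 % claim above can be illustrated with a back-of-the-envelope check, in the style of adapter-based parameter-efficient fine-tuning. The 7B backbone and 30M adapter sizes are hypothetical, chosen only to show a trainable fraction under the stated budget:

```python
# Back-of-the-envelope check on the parameter-efficiency claim.
# Sizes are hypothetical illustrations, not figures from the paper.
backbone_params = 7_000_000_000   # e.g., a frozen 7B-parameter MLLM backbone
adapter_params = 30_000_000       # small trainable adapter updated by RL

trainable_fraction = adapter_params / (backbone_params + adapter_params)
print(f"{trainable_fraction:.2%}")  # 0.43%, under the 0.5% budget
```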
Practical Implications
- Safer AI‑assisted diagnostics – By grounding answers in geometry, systems are less likely to hallucinate “giant” tumors or misplaced findings, reducing the risk of downstream clinical errors.
- Plug‑and‑play upgrade – Developers can take any existing medical MLLM and run the Med‑Scout RL fine‑tuning script to get immediate performance lifts without re‑training from scratch.
- Cost‑effective scaling – Since no radiologist annotations are required, hospitals and startups can apply the method to proprietary imaging datasets (CT, MRI, X‑ray) and quickly adapt models to new modalities.
- Regulatory friendliness – The explicit geometric validation steps can be logged and audited, helping satisfy emerging AI‑in‑healthcare compliance frameworks that demand traceable reasoning.
- Beyond medicine – Any domain where visual geometry matters—autonomous robotics, satellite imagery analysis, CAD‑based design review—could adopt the same proxy‑task + RL recipe.
Limitations & Future Work
- Domain specificity – The proxy tasks are tuned for typical radiology images; performance on highly irregular modalities (e.g., histopathology slides) may need task redesign.
- Reward shaping sensitivity – The RL component can be unstable if reward magnitudes are not balanced; the authors note occasional “policy collapse” when scaling to very large models.
- Interpretability – While geometry improves factuality, the model’s internal reasoning remains a black box; future work could integrate explicit spatial graphs for better explainability.
- Clinical validation – The paper reports benchmark improvements, but real‑world prospective studies with clinicians are still pending.
Med‑Scout demonstrates that a modest, annotation‑free RL fine‑tuning step can cure a fundamental blind spot in medical AI, opening a practical path for developers to build more trustworthy, geometry‑aware multimodal systems.
Authors
- Anglin Liu
- Ruichao Chen
- Yi Lu
- Hongxia Xu
- Jintai Chen
Paper Information
- arXiv ID: 2601.23220v1
- Categories: cs.CV, cs.AI
- Published: January 30, 2026