[Paper] MedMO: Grounding and Understanding Multimodal Large Language Model for Medical Images
Source: arXiv - 2602.06965v1
Overview
MedMO is a new multimodal large language model (MLLM) that bridges the gap between cutting‑edge vision‑language AI and real‑world medical imaging. By training a unified model on massive, domain‑specific radiology, ophthalmology, and pathology data, the authors demonstrate that a single system can answer visual questions, generate diagnostic reports, retrieve similar cases, and pinpoint disease locations with bounding‑box precision—capabilities that were previously scattered across specialized tools.
Key Contributions
- Domain‑focused pretraining: Aligns multiple visual encoders (CT, fundus, microscopy) with a medical‑language backbone using only publicly available medical image‑text pairs.
- Comprehensive instruction tuning: Covers five core tasks—image captioning, visual QA, report generation, image‑text retrieval, and grounded disease localization.
- Reinforcement learning with verifiable rewards: Introduces a dual‑reward scheme (factuality + box‑level GIoU) that explicitly teaches the model to reason step‑by‑step and produce spatially accurate outputs.
- Two released model sizes (4B & 8B parameters): Enables developers to pick a lightweight version for edge deployment or a larger one for research‑grade performance.
- Cross‑modality generalization: Validated on radiology, ophthalmology, and pathology datasets, showing consistent gains over existing open‑source medical MLLMs.
Methodology
1. Cross‑modal pretraining – Visual encoders (e.g., a ResNet‑based CT encoder, a Swin‑Transformer for fundus images) are initially frozen and then jointly trained with a medical language model (based on LLaMA) to learn a shared embedding space. This step ensures that visual features can be "spoken about" in natural language.
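The alignment step above can be sketched as a standard contrastive objective over paired image and text embeddings. This is a minimal illustration, not the paper's actual training code: the projection into a shared 512‑dimensional space and the InfoNCE‑style loss are assumptions about how such alignment is typically done.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def info_nce_loss(img_feats, txt_feats, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    Matched image-text pairs sit on the diagonal of the similarity
    matrix; the loss pulls them together and pushes mismatches apart.
    """
    img = l2_normalize(img_feats)
    txt = l2_normalize(txt_feats)
    logits = img @ txt.T / temperature          # (B, B) similarity matrix
    labels = np.arange(len(logits))             # matched pairs on the diagonal

    def xent(lg):
        # numerically stable log-softmax, then pick the diagonal targets
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average of image->text and text->image directions
    return 0.5 * (xent(logits) + xent(logits.T))

# Stand-in for frozen encoder output (e.g., CT features) projected
# into a hypothetical shared 512-d space used by the language backbone.
vis = rng.normal(size=(8, 1024)) @ (rng.normal(size=(1024, 512)) * 0.02)
txt = rng.normal(size=(8, 512))
loss = info_nce_loss(vis, txt)
print(float(loss))
```

In practice only the projection layer (and later the language model) would receive gradients while the visual encoders stay frozen, which is what makes the visual features "speakable" without disturbing them.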
2. Instruction tuning – The model is exposed to a curated set of prompts that mimic real clinical workflows:
- Captioning: “Describe the findings in this chest X‑ray.”
- VQA: “Is there evidence of pneumothorax?”
- Report generation: “Write a radiology report for this image.”
- Retrieval: “Find similar cases to this slide.”
- Grounded localization: “Draw a box around the lesion.”
The supervision comes from expert‑annotated datasets, providing both textual answers and bounding‑box labels.
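One plausible way to serialize such supervision is shown below. The field names and the `<box>` tag convention are illustrative assumptions, not the paper's actual schema; grounded tasks simply need the bounding box carried alongside the text answer.

```python
# Hedged sketch of instruction-tuning records for the five tasks;
# the record schema and <box> serialization are hypothetical.

def make_record(task, image_id, instruction, answer, box=None):
    rec = {
        "task": task,
        "image": image_id,
        "instruction": instruction,
        "answer": answer,
    }
    if box is not None:
        # grounded tasks append a normalized [x1, y1, x2, y2] box to the answer
        rec["answer"] += " <box>" + ",".join(f"{v:.2f}" for v in box) + "</box>"
    return rec

samples = [
    make_record("vqa", "cxr_001",
                "Is there evidence of pneumothorax?", "No."),
    make_record("grounding", "cxr_002",
                "Draw a box around the lesion.",
                "Lesion in the right lower lobe.",
                box=(0.62, 0.55, 0.81, 0.78)),
]
print(samples[1]["answer"])
```

Emitting boxes as text tokens like this lets a single language‑model head handle both answers and localization, which is one common way unified MLLMs implement grounding.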
3. Reinforcement learning with verifiable rewards – After instruction tuning, the model is fine‑tuned with PPO (Proximal Policy Optimization). Two reward signals guide learning:
- Factuality reward – A separate verifier checks whether generated text aligns with known medical facts (e.g., using a knowledge base or rule‑based checks).
- Spatial reward – The generalized Intersection‑over‑Union (GIoU) between predicted and ground‑truth boxes is computed; higher overlap yields a higher reward.
This dual‑reward loop pushes the model toward both accurate reasoning and precise visual grounding.
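The spatial half of this reward can be made concrete with the standard GIoU formula: IoU minus a penalty for the empty area of the smallest enclosing box. The `dual_reward` weighting below is a hypothetical combination; the paper's exact reward mixing is not reproduced here.

```python
def giou(box_a, box_b):
    """Generalized IoU between two (x1, y1, x2, y2) boxes.

    Ranges over (-1, 1]; equals plain IoU when the boxes overlap
    tightly, and goes negative for distant, non-overlapping boxes.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    # smallest box enclosing both inputs
    cw = max(ax2, bx2) - min(ax1, bx1)
    ch = max(ay2, by2) - min(ay1, by1)
    enclose = cw * ch
    return inter / union - (enclose - union) / enclose

def dual_reward(factual_ok, pred_box, gt_box, w_fact=0.5, w_spatial=0.5):
    # Hypothetical equal weighting of the factuality and spatial signals.
    return w_fact * float(factual_ok) + w_spatial * giou(pred_box, gt_box)

r = dual_reward(True, (0.1, 0.1, 0.5, 0.5), (0.2, 0.2, 0.6, 0.6))
print(r)
```

Because GIoU still produces a gradient-like signal when boxes do not overlap at all (unlike plain IoU, which is flat at zero), it gives the policy useful feedback even for badly misplaced predictions.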
Results & Findings
| Task | Metric | MedMO‑4B | MedMO‑8B | vs. Fleming‑VL (SOTA) |
|---|---|---|---|---|
| Visual QA (radiology) | Accuracy ↑ | +13.7 % over baseline | – | within 1.9 % of SOTA |
| Text‑based QA | Accuracy ↑ | +6.9 % over baseline | – | +14.5 % over Fleming‑VL |
| Report generation | Clinical BLEU / CheXbert F1 ↑ | ≈+12 % BLEU over baseline | – | – |
| Grounded localization | IoU ↑ | +40.4 % over baseline | – | +37.0 % over Fleming‑VL |
| Cross‑modality generalization (radiology, ophthalmology, pathology) | Consistent gains across datasets | ✓ | ✓ | ✗ |
Takeaway: MedMO not only beats existing open‑source medical MLLMs by a wide margin but also closes the performance gap with the proprietary state‑of‑the‑art Fleming‑VL, especially in spatial reasoning—a critical factor for clinical decision support.
Practical Implications
- Clinical decision support: Radiologists can query images (“Is there a pleural effusion?”) and receive both a concise answer and a highlighted region, reducing time spent on manual inspection.
- Automated reporting: Hospitals can generate first‑draft radiology or pathology reports that already meet semantic and clinical accuracy thresholds, freeing clinicians to focus on interpretation rather than dictation.
- Case‑based learning & education: Medical trainees can retrieve similar historical cases with visual explanations, accelerating learning curves.
- Edge deployment: The 4B version fits on modern consumer GPUs (e.g., an RTX 3080), enabling on‑premise deployment in hospitals with strict data‑privacy policies.
- Multi‑specialty integration: Because the model handles CT, fundus, and microscopy images, a single AI service can be offered across radiology, ophthalmology, and pathology departments, simplifying infrastructure and maintenance.
Limitations & Future Work
- Data bias: Training data are sourced from publicly available repositories, which may under‑represent rare diseases or under‑served populations, potentially limiting generalization.
- Explainability beyond boxes: While bounding‑box grounding is a step forward, clinicians often need richer explanations (e.g., heatmaps, textual rationales) that are not fully addressed.
- Regulatory readiness: The model has not undergone formal clinical validation or FDA‑style evaluation, so deployment in production settings will require additional safety studies.
- Future directions: The authors plan to incorporate multimodal self‑supervision from unlabelled hospital PACS archives, expand to 3‑D imaging (MRI/CT volumes), and integrate structured knowledge graphs for deeper reasoning.
Authors
- Ankan Deria
- Komal Kumar
- Adinath Madhavrao Dukre
- Eran Segal
- Salman Khan
- Imran Razzak
Paper Information
- arXiv ID: 2602.06965v1
- Categories: cs.CV
- Published: February 6, 2026