[Paper] MedMO: Grounding and Understanding Multimodal Large Language Model for Medical Images
Source: arXiv - 2602.06965v1
Overview
MedMO is a new multimodal large language model (MLLM) that bridges the gap between cutting‑edge vision‑language AI and real‑world medical imaging. By training a unified model on massive, domain‑specific radiology, ophthalmology, and pathology data, the authors demonstrate that a single system can answer visual questions, generate diagnostic reports, retrieve similar cases, and pinpoint disease locations with bounding‑box precision—capabilities that were previously scattered across specialized tools.
Key Contributions
- Domain‑focused pretraining: Aligns multiple visual encoders (CT, fundus, microscopy) with a medical‑language backbone using only publicly available medical image‑text pairs.
- Comprehensive instruction tuning: Covers five core tasks—image captioning, visual QA, report generation, image‑text retrieval, and grounded disease localization.
- Reinforcement learning with verifiable rewards: Introduces a dual‑reward scheme (factuality + box‑level GIoU) that explicitly teaches the model to reason step‑by‑step and produce spatially accurate outputs.
- Two released model sizes (4B & 8B parameters): Enables developers to pick a lightweight version for edge deployment or a larger one for research‑grade performance.
- Cross‑modality generalization: Validated on radiology, ophthalmology, and pathology datasets, showing consistent gains over existing open‑source medical MLLMs.
Methodology
1. Cross‑modal pretraining – Visual encoders (e.g., a ResNet‑based CT encoder, a Swin‑Transformer for fundus images) are initially frozen and then jointly trained with a medical language model (based on LLaMA) to learn a shared embedding space. This step ensures that visual features can be "spoken about" in natural language.
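The alignment step above can be sketched as a standard contrastive objective over paired image and text embeddings. This is a minimal illustration, not the paper's actual training code: the projection into a shared 512‑dimensional space and the InfoNCE‑style loss are assumptions about how such alignment is typically done.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def info_nce_loss(img_feats, txt_feats, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    Matched image-text pairs sit on the diagonal of the similarity
    matrix; the loss pulls them together and pushes mismatches apart.
    """
    img = l2_normalize(img_feats)
    txt = l2_normalize(txt_feats)
    logits = img @ txt.T / temperature          # (B, B) similarity matrix
    labels = np.arange(len(logits))             # matched pairs on the diagonal

    def xent(lg):
        # numerically stable log-softmax, then pick the diagonal targets
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average of image->text and text->image directions
    return 0.5 * (xent(logits) + xent(logits.T))

# Stand-in for frozen encoder output (e.g., CT features) projected
# into a hypothetical shared 512-d space used by the language backbone.
vis = rng.normal(size=(8, 1024)) @ (rng.normal(size=(1024, 512)) * 0.02)
txt = rng.normal(size=(8, 512))
loss = info_nce_loss(vis, txt)
print(float(loss))
```

In practice only the projection layer (and later the language model) would receive gradients while the visual encoders stay frozen, which is what makes the visual features "speakable" without disturbing them.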
2. Instruction tuning – The model is exposed to a curated set of prompts that mimic real clinical workflows:
- Captioning: “Describe the findings in this chest X‑ray.”
- VQA: “Is there evidence of pneumothorax?”
- Report generation: “Write a radiology report for this image.”
- Retrieval: “Find similar cases to this slide.”
- Grounded localization: “Draw a box around the lesion.”
The supervision comes from expert‑annotated datasets, providing both textual answers and bounding‑box labels.
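One plausible way to serialize such supervision is shown below. The field names and the `<box>` tag convention are illustrative assumptions, not the paper's actual schema; grounded tasks simply need the bounding box carried alongside the text answer.

```python
# Hedged sketch of instruction-tuning records for the five tasks;
# the record schema and <box> serialization are hypothetical.

def make_record(task, image_id, instruction, answer, box=None):
    rec = {
        "task": task,
        "image": image_id,
        "instruction": instruction,
        "answer": answer,
    }
    if box is not None:
        # grounded tasks append a normalized [x1, y1, x2, y2] box to the answer
        rec["answer"] += " <box>" + ",".join(f"{v:.2f}" for v in box) + "</box>"
    return rec

samples = [
    make_record("vqa", "cxr_001",
                "Is there evidence of pneumothorax?", "No."),
    make_record("grounding", "cxr_002",
                "Draw a box around the lesion.",
                "Lesion in the right lower lobe.",
                box=(0.62, 0.55, 0.81, 0.78)),
]
print(samples[1]["answer"])
```

Emitting boxes as text tokens like this lets a single language‑model head handle both answers and localization, which is one common way unified MLLMs implement grounding.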
3. Reinforcement learning with verifiable rewards – After instruction tuning, the model is fine‑tuned with PPO (Proximal Policy Optimization). Two reward signals guide learning:
- Factuality reward – A separate verifier checks whether generated text aligns with known medical facts (e.g., using a knowledge base or rule‑based checks).
- Spatial reward – The generalized Intersection‑over‑Union (GIoU) between predicted and ground‑truth boxes is computed; higher overlap yields a higher reward.
This dual‑reward loop pushes the model toward both accurate reasoning and precise visual grounding.
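The spatial half of this reward can be made concrete with the standard GIoU formula: IoU minus a penalty for the empty area of the smallest enclosing box. The `dual_reward` weighting below is a hypothetical combination; the paper's exact reward mixing is not reproduced here.

```python
def giou(box_a, box_b):
    """Generalized IoU between two (x1, y1, x2, y2) boxes.

    Ranges over (-1, 1]; equals plain IoU when the boxes overlap
    tightly, and goes negative for distant, non-overlapping boxes.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    # smallest box enclosing both inputs
    cw = max(ax2, bx2) - min(ax1, bx1)
    ch = max(ay2, by2) - min(ay1, by1)
    enclose = cw * ch
    return inter / union - (enclose - union) / enclose

def dual_reward(factual_ok, pred_box, gt_box, w_fact=0.5, w_spatial=0.5):
    # Hypothetical equal weighting of the factuality and spatial signals.
    return w_fact * float(factual_ok) + w_spatial * giou(pred_box, gt_box)

r = dual_reward(True, (0.1, 0.1, 0.5, 0.5), (0.2, 0.2, 0.6, 0.6))
print(r)
```

Because GIoU still produces a gradient-like signal when boxes do not overlap at all (unlike plain IoU, which is flat at zero), it gives the policy useful feedback even for badly misplaced predictions.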
Results & Findings
| Task | Metric | MedMO‑4B | MedMO‑8B | vs. Fleming‑VL (SOTA) |
|---|---|---|---|---|
| Visual QA (radiology) | Accuracy ↑ | +13.7 % over baseline | – | within 1.9 % of SOTA |
| Text‑based QA | Accuracy ↑ | +6.9 % over baseline | – | +14.5 % over Fleming‑VL |
| Report generation | Clinical BLEU / CheXbert F1 ↑ | ≈+12 % BLEU over baseline | – | – |
| Grounded localization | IoU ↑ | +40.4 % over baseline | – | +37.0 % over Fleming‑VL |
| Cross‑modality generalization (radiology, ophthalmology, pathology) | Consistent gains across datasets | ✓ | ✓ | ✗ |
Takeaway: MedMO not only beats existing open‑source medical MLLMs by a wide margin but also closes the performance gap with the proprietary state‑of‑the‑art Fleming‑VL, especially in spatial reasoning—a critical factor for clinical decision support.
Practical Implications
- Clinical decision support: Radiologists can query images (“Is there a pleural effusion?”) and receive both a concise answer and a highlighted region, reducing time spent on manual inspection.
- Automated reporting: Hospitals can generate first‑draft radiology or pathology reports that already meet semantic and clinical accuracy thresholds, freeing clinicians to focus on interpretation rather than dictation.
- Case‑based learning & education: Medical trainees can retrieve similar historical cases with visual explanations, accelerating learning curves.
- Edge deployment: The 4B version fits on modern consumer GPUs (e.g., an RTX 3080), enabling on‑premise deployment in hospitals with strict data‑privacy policies.
- Multi‑specialty integration: Because the model handles CT, fundus, and microscopy images, a single AI service can be offered across radiology, ophthalmology, and pathology departments, simplifying infrastructure and maintenance.
Limitations & Future Work
- Data bias: Training data are sourced from publicly available repositories, which may under‑represent rare diseases or under‑served populations, potentially limiting generalization.
- Explainability beyond boxes: While bounding‑box grounding is a step forward, clinicians often need richer explanations (e.g., heatmaps, textual rationales) that are not fully addressed.
- Regulatory readiness: The model has not undergone formal clinical validation or FDA‑style evaluation, so deployment in production settings will require additional safety studies.
- Future directions: The authors plan to incorporate multimodal self‑supervision from unlabelled hospital PACS archives, expand to 3‑D imaging (MRI/CT volumes), and integrate structured knowledge graphs for deeper reasoning.
Authors
- Ankan Deria
- Komal Kumar
- Adinath Madhavrao Dukre
- Eran Segal
- Salman Khan
- Imran Razzak
Paper Information
- arXiv ID: 2602.06965v1
- Categories: cs.CV
- Published: February 6, 2026