[Paper] MedMO: Grounding and Understanding Multimodal Large Language Model for Medical Images

Published: February 6, 2026 at 01:59 PM EST
4 min read
Source: arXiv


Overview

MedMO is a new multimodal large language model (MLLM) that bridges the gap between cutting‑edge vision‑language AI and real‑world medical imaging. By training a unified model on massive, domain‑specific radiology, ophthalmology, and pathology data, the authors demonstrate that a single system can answer visual questions, generate diagnostic reports, retrieve similar cases, and pinpoint disease locations with bounding‑box precision—capabilities that were previously scattered across specialized tools.

Key Contributions

  • Domain‑focused pretraining: Aligns multiple visual encoders (CT, fundus, microscopy) with a medical‑language backbone using only publicly available medical image‑text pairs.
  • Comprehensive instruction tuning: Covers five core tasks—image captioning, visual QA, report generation, image‑text retrieval, and grounded disease localization.
  • Reinforcement learning with verifiable rewards: Introduces a dual‑reward scheme (factuality + box‑level GIoU) that explicitly teaches the model to reason step‑by‑step and produce spatially accurate outputs.
  • Two released model sizes (4B & 8B parameters): Enables developers to pick a lightweight version for edge deployment or a larger one for research‑grade performance.
  • Cross‑modality generalization: Validated on radiology, ophthalmology, and pathology datasets, showing consistent gains over existing open‑source medical MLLMs.

Methodology

  1. Cross‑modal pretraining – Visual encoders (e.g., a ResNet‑based CT encoder, a Swin‑Transformer for fundus images) are initially frozen, then jointly trained with a medical language model (based on LLaMA) to learn a shared embedding space. This step ensures that visual features can be “spoken about” in natural language.
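Shared image‑text embedding spaces of this kind are typically learned with a CLIP‑style contrastive objective. The summary does not spell out MedMO's exact alignment loss, so the following is a minimal NumPy sketch of a symmetric InfoNCE loss under that assumption:

```python
import numpy as np

def contrastive_alignment_loss(img_feats, txt_feats, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    img_feats[i] and txt_feats[i] are assumed to come from the same
    image-text pair; all other rows act as in-batch negatives.
    """
    # L2-normalize so the dot product is a cosine similarity
    img = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    txt = txt_feats / np.linalg.norm(txt_feats, axis=1, keepdims=True)

    logits = img @ txt.T / temperature      # (B, B) similarity matrix
    labels = np.arange(len(logits))         # matching pairs sit on the diagonal

    def xent(mat):
        # Row-wise softmax cross-entropy against the diagonal labels
        mat = mat - mat.max(axis=1, keepdims=True)
        log_probs = mat - np.log(np.exp(mat).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Symmetric: image-to-text and text-to-image directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

With correctly paired rows the loss is low; shuffling the text side relative to the image side drives it up, which is what pushes matched features together in the shared space.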

  2. Instruction tuning – The model is exposed to a curated set of prompts that mimic real clinical workflows:

    • Captioning: “Describe the findings in this chest X‑ray.”
    • VQA: “Is there evidence of pneumothorax?”
    • Report generation: “Write a radiology report for this image.”
    • Retrieval: “Find similar cases to this slide.”
    • Grounded localization: “Draw a box around the lesion.”
      The supervision comes from expert‑annotated datasets, providing both textual answers and bounding‑box labels.
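The five prompt styles above map naturally onto a per-task template scheme. The paper's actual prompt templates and data schema are not given in this summary, so the sketch below is hypothetical: the field names, template strings, and `build_sample` helper are all invented for illustration.

```python
# Hypothetical instruction templates; slot names in braces are placeholders.
TASK_TEMPLATES = {
    "captioning": "Describe the findings in this {modality}.",
    "vqa": "Is there evidence of {finding}?",
    "report": "Write a {specialty} report for this image.",
    "retrieval": "Find similar cases to this {modality}.",
    "grounding": "Draw a box around the {finding}.",
}

def build_sample(task, image_path, answer, boxes=None, **slots):
    """Assemble one supervised instruction-tuning example."""
    if task not in TASK_TEMPLATES:
        raise ValueError(f"unknown task: {task}")
    sample = {
        "task": task,
        "image": image_path,
        "instruction": TASK_TEMPLATES[task].format(**slots),
        "answer": answer,
    }
    if boxes is not None:          # grounding tasks also carry box labels
        sample["boxes"] = boxes    # one [x1, y1, x2, y2] per lesion
    return sample

# Example: a VQA sample and a grounded-localization sample
vqa = build_sample("vqa", "cxr_001.png", "Yes", finding="pneumothorax")
loc = build_sample("grounding", "ct_002.png", "lesion localized",
                   boxes=[[10, 20, 40, 60]], finding="lesion")
```

Keeping textual answers and bounding boxes in one record mirrors the dual supervision described above: the same example can train both the language head and the localization behavior.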
  3. Reinforcement learning with verifiable rewards – After instruction tuning, the model is fine‑tuned with PPO. Two reward signals guide learning:

    • Factuality reward – A separate verifier checks whether generated text aligns with known medical facts (e.g., using a knowledge base or rule‑based checks).
    • Spatial reward – The Generalized Intersection‑over‑Union (GIoU) between predicted and ground‑truth boxes is computed; higher overlap yields a higher reward.
      This dual‑reward loop pushes the model toward both accurate reasoning and precise visual grounding.
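The spatial reward can be made concrete with the standard GIoU formula: IoU minus the fraction of the smallest enclosing box not covered by the union, giving a signal even when boxes do not overlap. The equal weighting between the two rewards in `dual_reward` below is an assumption; the paper's exact combination is not given in this summary.

```python
def giou(box_a, box_b, eps=1e-9):
    """Generalized IoU for two [x1, y1, x2, y2] boxes; range (-1, 1]."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Intersection area
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih

    union = ((ax2 - ax1) * (ay2 - ay1)
             + (bx2 - bx1) * (by2 - by1) - inter)
    iou = inter / (union + eps)

    # Smallest enclosing box C: GIoU penalizes empty space inside C,
    # so disjoint boxes still receive a (negative) gradient signal
    cw = max(ax2, bx2) - min(ax1, bx1)
    ch = max(ay2, by2) - min(ay1, by1)
    c_area = cw * ch

    return iou - (c_area - union) / (c_area + eps)

def dual_reward(factuality, pred_box, gt_box, w_spatial=0.5):
    """Hypothetical blend of the two reward signals; the actual
    weighting used in the paper is not specified here."""
    return (1.0 - w_spatial) * factuality + w_spatial * giou(pred_box, gt_box)
```

A perfect prediction scores GIoU ≈ 1, while a box far from the target goes negative, so PPO receives useful feedback across the whole range of localization quality rather than only once boxes overlap.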

Results & Findings

| Task | Metric | MedMO (4B/8B) vs. Baseline Open‑Source MLLM | vs. Fleming‑VL (SOTA) |
| --- | --- | --- | --- |
| Visual QA (radiology) | Accuracy ↑ | +13.7 % | within 1.9 % of SOTA |
| Text‑based QA | Accuracy ↑ | +6.9 % | +14.5 % |
| Report Generation | Clinical BLEU / CheXbert F1 ↑ | significant gains (≈ +12 % BLEU) | – |
| Grounded Localization | IoU ↑ | +40.4 % | +37.0 % |
| Cross‑modality (radiology, ophthalmology, pathology) | – | consistent improvement across all datasets | – |

Takeaway: MedMO not only beats existing open‑source medical MLLMs by a wide margin but also closes the performance gap with the proprietary state‑of‑the‑art Fleming‑VL, especially in spatial reasoning—a critical factor for clinical decision support.

Practical Implications

  • Clinical decision support: Radiologists can query images (“Is there a pleural effusion?”) and receive both a concise answer and a highlighted region, reducing time spent on manual inspection.
  • Automated reporting: Hospitals can generate first‑draft radiology or pathology reports that already meet semantic and clinical accuracy thresholds, freeing clinicians to focus on interpretation rather than dictation.
  • Case‑based learning & education: Medical trainees can retrieve similar historical cases with visual explanations, accelerating learning curves.
  • Edge deployment: The 4B version fits on modern consumer GPUs (e.g., an RTX 3080), enabling on‑premise deployment in hospitals with strict data‑privacy policies.
  • Multi‑specialty integration: Because the model handles CT, fundus, and microscopy images, a single AI service can be offered across radiology, ophthalmology, and pathology departments, simplifying infrastructure and maintenance.
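As a back-of-envelope check on the edge-deployment claim, a rough memory estimate (parameter bytes times a hypothetical 1.2× multiplier for activations and KV cache) puts a 4B model in fp16 just under the 10 GB of an RTX 3080:

```python
def vram_estimate_gb(n_params_billion, bytes_per_param, overhead=1.2):
    """Back-of-envelope inference memory: parameter bytes times a rough
    multiplier for activations and KV cache (the 1.2x is an assumption)."""
    return n_params_billion * 1e9 * bytes_per_param * overhead / 1024**3

# A 4B model at different precisions (illustrative figures only):
fp16 = vram_estimate_gb(4, 2)   # roughly 9 GB: tight fit on a 10 GB card
int8 = vram_estimate_gb(4, 1)   # roughly 4.5 GB: comfortable headroom
```

The same arithmetic shows why the 8B variant is positioned as a research‑grade model: at fp16 it would need on the order of 18 GB, pushing it onto workstation or data‑center GPUs.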

Limitations & Future Work

  • Data bias: Training data are sourced from publicly available repositories, which may under‑represent rare diseases or under‑served populations, potentially limiting generalization.
  • Explainability beyond boxes: While bounding‑box grounding is a step forward, clinicians often need richer explanations (e.g., heatmaps, textual rationales) that are not fully addressed.
  • Regulatory readiness: The model has not undergone formal clinical validation or FDA‑style evaluation, so deployment in production settings will require additional safety studies.
  • Future directions: The authors plan to incorporate multimodal self‑supervision from unlabelled hospital PACS archives, expand to 3‑D imaging (MRI/CT volumes), and integrate structured knowledge graphs for deeper reasoning.

Authors

  • Ankan Deria
  • Komal Kumar
  • Adinath Madhavrao Dukre
  • Eran Segal
  • Salman Khan
  • Imran Razzak

Paper Information

  • arXiv ID: 2602.06965v1
  • Categories: cs.CV
  • Published: February 6, 2026
