[Paper] PRIMA: Pre-training with Risk-integrated Image-Metadata Alignment for Medical Diagnosis via LLM

Published: February 26, 2026 at 01:07 PM EST
4 min read
Source: arXiv - 2602.23297v1

Overview

The paper introduces PRIMA, a new pre‑training framework that tightly couples medical images with their accompanying clinical notes, turning raw metadata into actionable diagnostic knowledge. By weaving disease‑risk relationships directly into the model’s language encoder and aligning them with visual features, PRIMA pushes multi‑modal medical AI toward more reliable, data‑efficient diagnosis.

Key Contributions

  • Risk‑aware text encoder: Refines Clinical ModernBERT with a Retrieval‑Augmented Generation (RAG) pipeline that injects expert‑curated disease‑risk correlations.
  • Dual‑encoder pre‑training: Couples a state‑of‑the‑art vision encoder (DINOv3) with the risk‑enhanced BERT, trained jointly on four complementary loss functions for multi‑granular alignment.
  • Soft‑label alignment: Introduces probabilistic (soft) labels to capture the inherent ambiguity in clinical correlations, improving robustness.
  • LLM‑based fusion: Uses Qwen‑3 to fuse the aligned image‑text embeddings, delivering high‑precision disease classification without massive data or compute budgets.
  • Extensive validation: Demonstrates consistent gains over SOTA multi‑modal medical models across several benchmark datasets, with notable improvements in robustness to noisy or incomplete metadata.

Methodology

  1. Curating a risk‑disease corpus

    • The authors query medical literature and expert knowledge bases using a Retrieval‑Augmented Generation loop, producing a structured “risk‑disease” dataset (e.g., “high BMI → increased risk of diabetic retinopathy”).
    • This corpus is used to continue pre‑training Clinical ModernBERT, turning it into a diagnostic prior encoder that already “knows” typical risk patterns.
  2. Dual‑encoder architecture

    • Vision branch: DINOv3, a self‑supervised Vision Transformer, extracts pixel‑level embeddings from radiology images.
    • Text branch: The risk‑aware BERT processes free‑form clinical notes, lab values, and structured metadata.
  3. Alignment losses

    • Contrastive loss (image ↔ text) for coarse‑level matching.
    • Cross‑modal matching loss for fine‑grained region‑to‑phrase alignment.
    • Risk‑aware soft‑label loss that weights pairs by the probability of a true clinical correlation (derived from the curated corpus).
    • Consistency loss that enforces stable representations across augmentations of both modalities.
  4. Fusion & classification

    • The aligned embeddings are fed into Qwen‑3, a large language model adapted for multi‑modal reasoning. Qwen‑3 performs a final classification step, outputting disease predictions and confidence scores.
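The soft labels from steps 1 and 3 could, for instance, be built by mixing the identity matching matrix with corpus‑derived correlation probabilities. The following is a minimal sketch; the function name, the `identity_weight` mixing scheme, and the shape conventions are assumptions for illustration, not details from the paper:

```python
import torch

def soft_alignment_targets(risk_probs: torch.Tensor,
                           identity_weight: float = 0.7) -> torch.Tensor:
    """Build soft targets for a batch of image-text pairs.

    risk_probs: (B, B) matrix whose entry [i, j] is a corpus-derived
    probability that note j's risk factors are clinically consistent
    with image i's diagnosis. The diagonal pair is the true match.
    """
    b = risk_probs.size(0)
    # Zero out the diagonal and renormalize the remaining correlations row-wise.
    off_diag = risk_probs * (1.0 - torch.eye(b))
    off_diag = off_diag / off_diag.sum(dim=1, keepdim=True).clamp(min=1e-8)
    # Mix: most mass on the true pair, the rest spread by clinical plausibility.
    return identity_weight * torch.eye(b) + (1.0 - identity_weight) * off_diag
```

Each row of the result sums to one, so it can serve directly as a target distribution for a KL or soft cross‑entropy objective, rather than the hard one‑hot targets of standard contrastive pre‑training.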

The whole pipeline is trained end‑to‑end on publicly available medical imaging datasets, but thanks to the risk‑aware priors, it requires far fewer labeled examples than conventional approaches.
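The four alignment losses from step 3 might combine roughly as follows. This is a hedged PyTorch sketch under simplifying assumptions: the paper's region‑to‑phrase matching loss is approximated at the embedding level, the consistency loss compares clean and augmented embeddings directly, and all loss weights are set to 1:

```python
import torch
import torch.nn.functional as F

def prima_style_losses(img_emb, txt_emb, soft_targets,
                       img_emb_aug, txt_emb_aug, temperature=0.07):
    """Sketch of four alignment objectives (names and weighting hypothetical)."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.T / temperature

    # 1) Contrastive loss: diagonal image-text pairs are positives.
    labels = torch.arange(logits.size(0))
    l_contrast = 0.5 * (F.cross_entropy(logits, labels)
                        + F.cross_entropy(logits.T, labels))

    # 2) Cross-modal matching, simplified here to a binary
    #    matched-vs-unmatched decision per pair.
    l_match = F.binary_cross_entropy_with_logits(
        logits, torch.eye(logits.size(0)))

    # 3) Risk-aware soft-label loss: KL toward corpus-derived soft targets.
    l_soft = F.kl_div(F.log_softmax(logits, dim=1), soft_targets,
                      reduction="batchmean")

    # 4) Consistency loss: stable embeddings across augmentations.
    l_cons = (F.mse_loss(img, F.normalize(img_emb_aug, dim=-1))
              + F.mse_loss(txt, F.normalize(txt_emb_aug, dim=-1)))

    return l_contrast + l_match + l_soft + l_cons
```

In a real training loop, `img_emb` and `txt_emb` would come from the DINOv3 and risk‑aware BERT branches respectively, and the relative weights of the four terms would be tuned rather than fixed to 1.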

Results & Findings

| Dataset | Baseline (e.g., CLIP‑Med) | PRIMA | Relative Gain |
| --- | --- | --- | --- |
| ChestX‑Ray14 | 78.2 % AUC | 84.7 % AUC | +6.5 % |
| MIMIC‑CXR | 71.5 % AUC | 78.3 % AUC | +6.8 % |
| Ophthalmology (DR) | 82.0 % AUC | 88.9 % AUC | +6.9 % |
  • Robustness: When metadata is partially missing or noisy, PRIMA’s performance drops <2 % versus >8 % for competing models.
  • Data efficiency: Achieves >80 % of its full‑data performance with only 30 % of the training set, thanks to the embedded risk priors.
  • Compute: Training time is comparable to a single‑GPU DINOv3 run; the extra text encoder fine‑tuning adds <15 % overhead.

Overall, the experiments confirm that integrating domain‑specific risk knowledge dramatically improves both accuracy and stability of multi‑modal medical diagnosis models.

Practical Implications

  • Faster model deployment: Hospitals can fine‑tune PRIMA on modestly sized local datasets rather than collecting millions of annotated images.
  • Better decision support: The risk‑aware text encoder surfaces clinically relevant factors (e.g., comorbidities) that pure image models miss, leading to more explainable predictions.
  • Reduced data privacy burden: Since PRIMA leverages publicly available literature for its risk corpus, institutions need not share sensitive patient data to benefit from the priors.
  • Plug‑and‑play: The dual‑encoder and Qwen‑3 fusion modules can replace existing vision‑language backbones in current PACS or AI‑assist pipelines with minimal code changes.
  • Cross‑specialty potential: While demonstrated on radiology and ophthalmology, the same risk‑integration pipeline could be adapted for pathology, dermatology, or even multimodal genomics‑imaging tasks.

Limitations & Future Work

  • Risk corpus quality: The RAG‑generated risk‑disease pairs depend on the underlying literature and retrieval system; biases or outdated guidelines could propagate into the model.
  • Generalization to rare diseases: The current corpus focuses on common risk factors, so performance on ultra‑rare conditions remains untested.
  • Explainability depth: While PRIMA improves alignment, the final Qwen‑3 decision layer is still a black box; future work could add attention‑based visual‑text explanations.
  • Clinical validation: The paper reports retrospective benchmark results; prospective trials in real clinical workflows are needed to confirm safety and utility.

Stay tuned—once the authors release the code, we’ll dive into a hands‑on tutorial showing how to integrate PRIMA into your own medical AI stack.

Authors

  • Yiqing Wang
  • Chunming He
  • Ming-Chen Lu
  • Mercy Pawar
  • Leslie Niziol
  • Maria Woodward
  • Sina Farsiu

Paper Information

  • arXiv ID: 2602.23297v1
  • Categories: cs.CV
  • Published: February 26, 2026
