[Paper] PRIMA: Pre-training with Risk-integrated Image-Metadata Alignment for Medical Diagnosis via LLM

Published: February 26, 2026 at 01:07 PM EST
4 min read
Source: arXiv - 2602.23297v1

Overview

The paper introduces PRIMA, a new pre‑training framework that tightly couples medical images with their accompanying clinical notes, turning raw metadata into actionable diagnostic knowledge. By weaving disease‑risk relationships directly into the model’s language encoder and aligning them with visual features, PRIMA pushes multi‑modal medical AI toward more reliable, data‑efficient diagnosis.

Key Contributions

  • Risk‑aware text encoder: Refines Clinical ModernBERT with a Retrieval‑Augmented Generation (RAG) pipeline that injects expert‑curated disease‑risk correlations.
  • Dual‑encoder pre‑training: Couples a state‑of‑the‑art vision encoder (DINOv3) with the risk‑enhanced BERT, trained jointly on four complementary loss functions for multi‑granular alignment.
  • Soft‑label alignment: Introduces probabilistic (soft) labels to capture the inherent ambiguity in clinical correlations, improving robustness.
  • LLM‑based fusion: Uses Qwen‑3 to fuse the aligned image‑text embeddings, delivering high‑precision disease classification without massive data or compute budgets.
  • Extensive validation: Demonstrates consistent gains over SOTA multi‑modal medical models across several benchmark datasets, with notable improvements in robustness to noisy or incomplete metadata.

Methodology

  1. Curating a risk‑disease corpus

    • The authors query medical literature and expert knowledge bases using a Retrieval‑Augmented Generation loop, producing a structured “risk‑disease” dataset (e.g., “high BMI → increased risk of diabetic retinopathy”).
    • This corpus is used to continue pre‑training Clinical ModernBERT, turning it into a diagnostic prior encoder that already “knows” typical risk patterns.
  2. Dual‑encoder architecture

    • Vision branch: DINOv3, a self‑supervised Vision Transformer, extracts pixel‑level embeddings from radiology images.
    • Text branch: The risk‑aware BERT processes free‑form clinical notes, lab values, and structured metadata.
  3. Alignment losses

    • Contrastive loss (image ↔ text) for coarse‑level matching.
    • Cross‑modal matching loss for fine‑grained region‑to‑phrase alignment.
    • Risk‑aware soft‑label loss that weights pairs by the probability of a true clinical correlation (derived from the curated corpus).
    • Consistency loss that enforces stable representations across augmentations of both modalities.
  4. Fusion & classification

    • The aligned embeddings are fed into Qwen‑3, a large language model adapted for multi‑modal reasoning. Qwen‑3 performs a final classification step, outputting disease predictions and confidence scores.
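The soft labels from steps 1 and 3 could, for instance, be built by mixing the identity matching matrix with corpus‑derived correlation probabilities. The following is a minimal sketch; the function name, the `identity_weight` mixing scheme, and the shape conventions are assumptions for illustration, not details from the paper:

```python
import torch

def soft_alignment_targets(risk_probs: torch.Tensor,
                           identity_weight: float = 0.7) -> torch.Tensor:
    """Build soft targets for a batch of image-text pairs.

    risk_probs: (B, B) matrix whose entry [i, j] is a corpus-derived
    probability that note j's risk factors are clinically consistent
    with image i's diagnosis. The diagonal pair is the true match.
    """
    b = risk_probs.size(0)
    # Zero out the diagonal and renormalize the remaining correlations row-wise.
    off_diag = risk_probs * (1.0 - torch.eye(b))
    off_diag = off_diag / off_diag.sum(dim=1, keepdim=True).clamp(min=1e-8)
    # Mix: most mass on the true pair, the rest spread by clinical plausibility.
    return identity_weight * torch.eye(b) + (1.0 - identity_weight) * off_diag
```

Each row of the result sums to one, so it can serve directly as a target distribution for a KL or soft cross‑entropy objective, rather than the hard one‑hot targets of standard contrastive pre‑training.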

The whole pipeline is trained end‑to‑end on publicly available medical imaging datasets, but thanks to the risk‑aware priors, it requires far fewer labeled examples than conventional approaches.
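The four alignment losses from step 3 might combine roughly as follows. This is a hedged PyTorch sketch under simplifying assumptions: the paper's region‑to‑phrase matching loss is approximated at the embedding level, the consistency loss compares clean and augmented embeddings directly, and all loss weights are set to 1:

```python
import torch
import torch.nn.functional as F

def prima_style_losses(img_emb, txt_emb, soft_targets,
                       img_emb_aug, txt_emb_aug, temperature=0.07):
    """Sketch of four alignment objectives (names and weighting hypothetical)."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.T / temperature

    # 1) Contrastive loss: diagonal image-text pairs are positives.
    labels = torch.arange(logits.size(0))
    l_contrast = 0.5 * (F.cross_entropy(logits, labels)
                        + F.cross_entropy(logits.T, labels))

    # 2) Cross-modal matching, simplified here to a binary
    #    matched-vs-unmatched decision per pair.
    l_match = F.binary_cross_entropy_with_logits(
        logits, torch.eye(logits.size(0)))

    # 3) Risk-aware soft-label loss: KL toward corpus-derived soft targets.
    l_soft = F.kl_div(F.log_softmax(logits, dim=1), soft_targets,
                      reduction="batchmean")

    # 4) Consistency loss: stable embeddings across augmentations.
    l_cons = (F.mse_loss(img, F.normalize(img_emb_aug, dim=-1))
              + F.mse_loss(txt, F.normalize(txt_emb_aug, dim=-1)))

    return l_contrast + l_match + l_soft + l_cons
```

In a real training loop, `img_emb` and `txt_emb` would come from the DINOv3 and risk‑aware BERT branches respectively, and the relative weights of the four terms would be tuned rather than fixed to 1.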

Results & Findings

| Dataset | Baseline (e.g., CLIP‑Med) | PRIMA | Relative Gain |
| --- | --- | --- | --- |
| ChestX‑Ray14 | 78.2 % AUC | 84.7 % AUC | +6.5 % |
| MIMIC‑CXR | 71.5 % AUC | 78.3 % AUC | +6.8 % |
| Ophthalmology (DR) | 82.0 % AUC | 88.9 % AUC | +6.9 % |
  • Robustness: When metadata is partially missing or noisy, PRIMA’s performance drops <2 % versus >8 % for competing models.
  • Data efficiency: Achieves >80 % of its full‑data performance with only 30 % of the training set, thanks to the embedded risk priors.
  • Compute: Training time is comparable to a single‑GPU DINOv3 run; the extra text encoder fine‑tuning adds <15 % overhead.

Overall, the experiments confirm that integrating domain‑specific risk knowledge dramatically improves both accuracy and stability of multi‑modal medical diagnosis models.

Practical Implications

  • Faster model deployment: Hospitals can fine‑tune PRIMA on modestly sized local datasets rather than collecting millions of annotated images.
  • Better decision support: The risk‑aware text encoder surfaces clinically relevant factors (e.g., comorbidities) that pure image models miss, leading to more explainable predictions.
  • Reduced data privacy burden: Since PRIMA leverages publicly available literature for its risk corpus, institutions need not share sensitive patient data to benefit from the priors.
  • Plug‑and‑play: The dual‑encoder and Qwen‑3 fusion modules can replace existing vision‑language backbones in current PACS or AI‑assist pipelines with minimal code changes.
  • Cross‑specialty potential: While demonstrated on radiology and ophthalmology, the same risk‑integration pipeline could be adapted for pathology, dermatology, or even multimodal genomics‑imaging tasks.

Limitations & Future Work

  • Risk corpus quality: The RAG‑generated risk‑disease pairs depend on the underlying literature and retrieval system; biases or outdated guidelines could propagate into the model.
  • Generalization to rare diseases: The current corpus focuses on common risk factors, so performance on ultra‑rare conditions remains untested.
  • Explainability depth: While PRIMA improves alignment, the final Qwen‑3 decision layer is still a black box; future work could add attention‑based visual‑text explanations.
  • Clinical validation: The paper reports retrospective benchmark results; prospective trials in real clinical workflows are needed to confirm safety and utility.

Stay tuned—once the authors release the code, we’ll dive into a hands‑on tutorial showing how to integrate PRIMA into your own medical AI stack.

Authors

  • Yiqing Wang
  • Chunming He
  • Ming-Chen Lu
  • Mercy Pawar
  • Leslie Niziol
  • Maria Woodward
  • Sina Farsiu

Paper Information

  • arXiv ID: 2602.23297v1
  • Categories: cs.CV
  • Published: February 26, 2026
