[Paper] Adaptive Clinical-Aware Latent Diffusion for Multimodal Brain Image Generation and Missing Modality Imputation
Source: arXiv - 2603.09931v1
Overview
The paper introduces ACADiff, a diffusion‑based framework that generates missing brain‑imaging modalities (e.g., structural MRI, FDG‑PET, AV45‑PET) while leveraging patients’ clinical metadata. By framing synthesis as a clinical‑aware latent‑space denoising task, the authors show that even when 80 % of the imaging data are absent, the generated scans remain diagnostically useful for Alzheimer’s disease (AD) research.
Key Contributions
- Adaptive Clinical‑Aware Diffusion: A latent diffusion model that conditions on any subset of available modalities and on structured clinical information (age, MMSE, APOE status, etc.).
- Dynamic Fusion Mechanism: An attention‑based module that re‑weights inputs on‑the‑fly, allowing the same network to handle arbitrary missing‑modality patterns.
- GPT‑4o Prompt Integration: Clinical notes are encoded via GPT‑4o prompts, providing semantic guidance that improves realism and disease‑relevant features.
- Bidirectional Modality Generators: Three specialized generators enable synthesis in both directions (e.g., sMRI → FDG‑PET, PET → sMRI), facilitating flexible data augmentation pipelines.
- State‑of‑the‑Art Performance: On the ADNI cohort, ACADiff outperforms prior imputation methods (e.g., VAE‑based, GAN‑based, and vanilla diffusion) in image quality metrics (PSNR, SSIM) and downstream AD classification accuracy.
- Open‑Source Release: Full code, pretrained checkpoints, and a reproducibility guide are publicly available on GitHub.
Methodology
- Latent Diffusion Backbone – The authors adopt a pre‑trained autoencoder to map high‑resolution brain scans into a compact latent space. A diffusion process then iteratively adds Gaussian noise to these latents and learns to reverse it.
- Clinical Conditioning – Structured clinical variables are embedded and concatenated with the latent representation at each diffusion step. In parallel, free‑form clinical notes are transformed into dense vectors using GPT‑4o, providing richer semantic context.
- Adaptive Fusion Layer – An attention module receives the set of available imaging latents (any subset of the three modalities) and the clinical embeddings, producing a fused context vector that guides denoising. The attention weights are learned to automatically prioritize the most informative inputs.
- Bidirectional Generators – Three separate diffusion models are trained for each source‑target pair (sMRI↔FDG‑PET, sMRI↔AV45‑PET, FDG‑PET↔AV45‑PET). During inference, the appropriate generator is selected based on which modality is missing.
- Training Objective – The standard diffusion loss (mean‑squared error between predicted and true noise) is augmented with a clinical consistency loss that penalizes deviations from known disease biomarkers (e.g., hippocampal volume, SUVr values).
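The adaptive fusion step described above can be sketched as scaled dot‑product attention over whichever modality latents happen to be present, with the clinical embedding acting as the query. This is an illustrative reading of the mechanism, not the authors' implementation: the function name, projection matrices, and dimensions are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def adaptive_fusion(available_latents, clinical_emb, W_q, W_k, W_v):
    """Fuse an arbitrary subset of modality latents with a clinical
    query via scaled dot-product attention (illustrative sketch)."""
    keys = np.stack([W_k @ z for z in available_latents])    # (n_present, d)
    values = np.stack([W_v @ z for z in available_latents])  # (n_present, d)
    query = W_q @ clinical_emb                               # (d,)
    scores = keys @ query / np.sqrt(len(query))              # (n_present,)
    weights = softmax(scores)           # learned modality importance
    return weights @ values, weights    # fused context vector, weights

# Toy usage: only two of the three modalities are available.
rng = np.random.default_rng(0)
d = 8
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))
latents = [rng.standard_normal(d), rng.standard_normal(d)]  # e.g. sMRI, FDG-PET
clin = rng.standard_normal(d)                               # clinical embedding
ctx, w = adaptive_fusion(latents, clin, W_q, W_k, W_v)
```

Because the attention operates over a *set* of latents, the same parameters handle any missing‑modality pattern: dropping a modality simply shrinks the key/value stack.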
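The composite training objective (noise‑prediction MSE plus a clinical consistency penalty) might look like the following minimal sketch. The weighting factor `lam` and the idea of a biomarker head producing `biomarker_pred` are assumptions for illustration; the paper does not specify them.

```python
import numpy as np

def diffusion_training_loss(eps_true, eps_pred, biomarker_true,
                            biomarker_pred, lam=0.1):
    """Standard diffusion noise-prediction MSE plus a clinical
    consistency term; `lam` and the biomarker head are illustrative."""
    noise_mse = np.mean((eps_pred - eps_true) ** 2)
    # Penalize deviation from known disease biomarkers
    # (e.g. hippocampal volume, SUVr) predicted from the latent.
    clinical_loss = np.mean((biomarker_pred - biomarker_true) ** 2)
    return noise_mse + lam * clinical_loss

rng = np.random.default_rng(1)
eps = rng.standard_normal(16)
# Uniform prediction error of 0.1 and a biomarker error of 0.1.
loss = diffusion_training_loss(eps, eps + 0.1,
                               np.array([2.1]), np.array([2.0]))
```

The auxiliary term steers denoising toward outputs whose derived biomarkers match the patient's known values, which is plausibly why the synthesized scans preserve disease‑relevant features.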
Results & Findings
| Scenario | PSNR ↑ | SSIM ↑ | AD Classification AUC ↑ |
|---|---|---|---|
| 0 % missing (full data) | 31.2 | 0.94 | 0.92 |
| 50 % random missing | 28.7 | 0.90 | 0.88 |
| 80 % missing (extreme) | 26.4 | 0.86 | 0.84 |
- ACADiff consistently beats the strongest baseline (a conditional GAN) by +2.3 dB PSNR and +0.05 AUC even when most modalities are absent.
- Qualitative inspection shows that disease‑specific patterns (e.g., cortical thinning, hypometabolism) are faithfully reproduced in the synthesized scans.
- Ablation studies confirm that both the adaptive fusion layer and the GPT‑4o clinical prompts contribute significantly; removing either drops AUC by ~0.03.
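For reference, the PSNR figures in the table follow the standard definition, 10 · log10(MAX² / MSE). A minimal sketch (the helper name and toy data are ours, not the paper's):

```python
import numpy as np

def psnr(ref, test, data_range=1.0):
    """Peak signal-to-noise ratio in dB between a reference image and a
    reconstruction, for intensities in [0, data_range]."""
    mse = np.mean((ref - test) ** 2)
    if mse == 0:
        return float("inf")
    return 10 * np.log10(data_range ** 2 / mse)

ref = np.zeros((4, 4))
noisy = ref + 0.1                  # uniform 0.1 error -> MSE = 0.01
print(round(psnr(ref, noisy), 1))  # -> 20.0
```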
Practical Implications
- Data Augmentation for Rare Cohorts: Researchers can fill in missing PET scans for legacy sMRI‑only studies, dramatically expanding usable sample sizes without costly new acquisitions.
- Robust Clinical Decision Support: Diagnostic pipelines that rely on multimodal inputs become tolerant to incomplete scans, reducing the need for repeat imaging appointments.
- Edge‑Device Deployment: Because synthesis occurs in latent space, the computational footprint is modest; a modern GPU can generate a missing modality in under 2 seconds, making it feasible for on‑site preprocessing in hospitals.
- Cross‑Modality Transfer Learning: The bidirectional generators enable pre‑training of downstream models (e.g., segmentation, disease progression) on synthetic PET data when only MRI is available.
Limitations & Future Work
- Generalization Beyond ADNI: The model is trained on a single, well‑curated dataset; performance on heterogeneous clinical sites with different scanner protocols remains untested.
- Reliance on GPT‑4o Prompts: The clinical‑note encoder depends on a proprietary LLM, which may limit reproducibility for groups without access to the same API.
- Interpretability of Fusion Weights: While the adaptive fusion improves flexibility, the authors note that visualizing why certain modalities dominate in specific cases is still an open challenge.
- Future Directions: Extending ACADiff to incorporate longitudinal time‑points, exploring lightweight transformer alternatives for the clinical encoder, and validating the approach on other neurodegenerative diseases (e.g., Parkinson’s) are suggested next steps.
Authors
- Rong Zhou
- Houliang Zhou
- Yao Su
- Brian Y. Chen
- Yu Zhang
- Lifang He
- Alzheimer’s Disease Neuroimaging Initiative
Paper Information
- arXiv ID: 2603.09931v1
- Categories: cs.CV, cs.AI
- Published: March 10, 2026