[Paper] Adaptive Clinical-Aware Latent Diffusion for Multimodal Brain Image Generation and Missing Modality Imputation
Source: arXiv - 2603.09931v1
Overview
The paper introduces ACADiff, a diffusion‑based framework that generates missing brain‑imaging modalities (e.g., structural MRI, FDG‑PET, AV45‑PET) while leveraging patients’ clinical metadata. By framing synthesis as a clinical‑aware latent‑space denoising task, the authors show that even when 80 % of the imaging data are absent, the generated scans remain diagnostically useful for Alzheimer’s disease (AD) research.
Key Contributions
- Adaptive Clinical‑Aware Diffusion: A latent diffusion model that conditions on any subset of available modalities and on structured clinical information (age, MMSE, APOE status, etc.).
- Dynamic Fusion Mechanism: An attention‑based module that re‑weights inputs on‑the‑fly, allowing the same network to handle arbitrary missing‑modality patterns.
- GPT‑4o Prompt Integration: Clinical notes are encoded via GPT‑4o prompts, providing semantic guidance that improves realism and disease‑relevant features.
- Bidirectional Modality Generators: Three specialized generators enable synthesis in both directions (e.g., sMRI → FDG‑PET, PET → sMRI), facilitating flexible data augmentation pipelines.
- State‑of‑the‑Art Performance: On the ADNI cohort, ACADiff outperforms prior imputation methods (e.g., VAE‑based, GAN‑based, and vanilla diffusion) in image quality metrics (PSNR, SSIM) and downstream AD classification accuracy.
- Open‑Source Release: Full code, pretrained checkpoints, and a reproducibility guide are publicly available on GitHub.
Methodology
- Latent Diffusion Backbone – The authors adopt a pre‑trained autoencoder to map high‑resolution brain scans into a compact latent space. A diffusion process then iteratively adds Gaussian noise to these latents and learns to reverse it.
- Clinical Conditioning – Structured clinical variables are embedded and concatenated with the latent representation at each diffusion step. In parallel, free‑form clinical notes are transformed into dense vectors using GPT‑4o, providing richer semantic context.
- Adaptive Fusion Layer – An attention module receives the set of available imaging latents (any subset of the three modalities) and the clinical embeddings, producing a fused context vector that guides denoising. The attention weights are learned to automatically prioritize the most informative inputs.
- Bidirectional Generators – Three separate diffusion models are trained for each source‑target pair (sMRI↔FDG‑PET, sMRI↔AV45‑PET, FDG‑PET↔AV45‑PET). During inference, the appropriate generator is selected based on which modality is missing.
- Training Objective – The standard diffusion loss (mean‑squared error between predicted and true noise) is augmented with a clinical consistency loss that penalizes deviations from known disease biomarkers (e.g., hippocampal volume, SUVr values).
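The adaptive fusion step described above can be sketched as scaled dot‑product attention over whichever modality latents happen to be present, with the clinical embedding acting as the query. This is an illustrative reading of the mechanism, not the authors' implementation: the function name, projection matrices, and dimensions are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def adaptive_fusion(available_latents, clinical_emb, W_q, W_k, W_v):
    """Fuse an arbitrary subset of modality latents with a clinical
    query via scaled dot-product attention (illustrative sketch)."""
    keys = np.stack([W_k @ z for z in available_latents])    # (n_present, d)
    values = np.stack([W_v @ z for z in available_latents])  # (n_present, d)
    query = W_q @ clinical_emb                               # (d,)
    scores = keys @ query / np.sqrt(len(query))              # (n_present,)
    weights = softmax(scores)           # learned modality importance
    return weights @ values, weights    # fused context vector, weights

# Toy usage: only two of the three modalities are available.
rng = np.random.default_rng(0)
d = 8
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))
latents = [rng.standard_normal(d), rng.standard_normal(d)]  # e.g. sMRI, FDG-PET
clin = rng.standard_normal(d)                               # clinical embedding
ctx, w = adaptive_fusion(latents, clin, W_q, W_k, W_v)
```

Because the attention operates over a *set* of latents, the same parameters handle any missing‑modality pattern: dropping a modality simply shrinks the key/value stack.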
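The composite training objective (noise‑prediction MSE plus a clinical consistency penalty) might look like the following minimal sketch. The weighting factor `lam` and the idea of a biomarker head producing `biomarker_pred` are assumptions for illustration; the paper does not specify them.

```python
import numpy as np

def diffusion_training_loss(eps_true, eps_pred, biomarker_true,
                            biomarker_pred, lam=0.1):
    """Standard diffusion noise-prediction MSE plus a clinical
    consistency term; `lam` and the biomarker head are illustrative."""
    noise_mse = np.mean((eps_pred - eps_true) ** 2)
    # Penalize deviation from known disease biomarkers
    # (e.g. hippocampal volume, SUVr) predicted from the latent.
    clinical_loss = np.mean((biomarker_pred - biomarker_true) ** 2)
    return noise_mse + lam * clinical_loss

rng = np.random.default_rng(1)
eps = rng.standard_normal(16)
# Uniform prediction error of 0.1 and a biomarker error of 0.1.
loss = diffusion_training_loss(eps, eps + 0.1,
                               np.array([2.1]), np.array([2.0]))
```

The auxiliary term steers denoising toward outputs whose derived biomarkers match the patient's known values, which is plausibly why the synthesized scans preserve disease‑relevant features.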
Results & Findings
| Scenario | PSNR ↑ | SSIM ↑ | AD Classification AUC ↑ |
|---|---|---|---|
| 0 % missing (full data) | 31.2 | 0.94 | 0.92 |
| 50 % random missing | 28.7 | 0.90 | 0.88 |
| 80 % missing (extreme) | 26.4 | 0.86 | 0.84 |
- ACADiff consistently beats the strongest baseline (a conditional GAN) by +2.3 dB PSNR and +0.05 AUC even when most modalities are absent.
- Qualitative inspection shows that disease‑specific patterns (e.g., cortical thinning, hypometabolism) are faithfully reproduced in the synthesized scans.
- Ablation studies confirm that both the adaptive fusion layer and the GPT‑4o clinical prompts contribute significantly; removing either drops AUC by ~0.03.
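For reference, the PSNR figures in the table follow the standard definition, 10 · log10(MAX² / MSE). A minimal sketch (the helper name and toy data are ours, not the paper's):

```python
import numpy as np

def psnr(ref, test, data_range=1.0):
    """Peak signal-to-noise ratio in dB between a reference image and a
    reconstruction, for intensities in [0, data_range]."""
    mse = np.mean((ref - test) ** 2)
    if mse == 0:
        return float("inf")
    return 10 * np.log10(data_range ** 2 / mse)

ref = np.zeros((4, 4))
noisy = ref + 0.1                  # uniform 0.1 error -> MSE = 0.01
print(round(psnr(ref, noisy), 1))  # -> 20.0
```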
Practical Implications
- Data Augmentation for Rare Cohorts: Researchers can fill in missing PET scans for legacy sMRI‑only studies, dramatically expanding usable sample sizes without costly new acquisitions.
- Robust Clinical Decision Support: Diagnostic pipelines that rely on multimodal inputs become tolerant to incomplete scans, reducing the need for repeat imaging appointments.
- Edge‑Device Deployment: Because synthesis occurs in latent space, the computational footprint is modest; a modern GPU can generate a missing modality in under 2 seconds, making it feasible for on‑site preprocessing in hospitals.
- Cross‑Modality Transfer Learning: The bidirectional generators enable pre‑training of downstream models (e.g., segmentation, disease progression) on synthetic PET data when only MRI is available.
Limitations & Future Work
- Generalization Beyond ADNI: The model is trained on a single, well‑curated dataset; performance on heterogeneous clinical sites with different scanner protocols remains untested.
- Reliance on GPT‑4o Prompts: The clinical‑note encoder depends on a proprietary LLM, which may limit reproducibility for groups without access to the same API.
- Interpretability of Fusion Weights: While the adaptive fusion improves flexibility, the authors note that visualizing why certain modalities dominate in specific cases is still an open challenge.
- Future Directions: Extending ACADiff to incorporate longitudinal time‑points, exploring lightweight transformer alternatives for the clinical encoder, and validating the approach on other neurodegenerative diseases (e.g., Parkinson’s) are suggested next steps.
Authors
- Rong Zhou
- Houliang Zhou
- Yao Su
- Brian Y. Chen
- Yu Zhang
- Lifang He
- Alzheimer’s Disease Neuroimaging Initiative
Paper Information
- arXiv ID: 2603.09931v1
- Categories: cs.CV, cs.AI
- Published: March 10, 2026