[Paper] Bridging the gap between Performance and Interpretability: An Explainable Disentangled Multimodal Framework for Cancer Survival Prediction

Published: March 2, 2026
Source: arXiv


Overview

The paper presents DIMAFx, a new multimodal deep‑learning framework that predicts cancer patient survival by jointly analyzing whole‑slide histopathology images and bulk transcriptomics data. Unlike many high‑performing models that act as black boxes, DIMAFx is built to be explainable: it learns separate (modality‑specific) and shared representations, and couples them with SHAP‑based attribution to reveal which visual and molecular cues drive each prediction. The authors demonstrate that this design not only matches or exceeds state‑of‑the‑art accuracy on several cancer cohorts but also provides biologically meaningful insights that could be directly useful for clinicians and data scientists working on precision oncology.

Key Contributions

  • Disentangled multimodal architecture that explicitly separates modality‑specific and modality‑shared latent spaces, improving interpretability without sacrificing predictive power.
  • DIMAFx achieves state‑of‑the‑art survival prediction across multiple cancer types (e.g., breast, lung, colorectal) using only two data modalities: whole‑slide images (WSI) and bulk RNA‑seq.
  • Integrated SHAP analysis on the disentangled representations, enabling systematic identification of the most influential multimodal interactions.
  • Biological validation: the most predictive shared features correspond to known cancer pathways (e.g., estrogen‑response signaling in breast cancer) and to histopathological patterns (e.g., high‑grade tumor morphology).
  • Open‑source implementation (released with the paper) that can be readily adapted to other multimodal biomedical tasks.

Methodology

  1. Data preprocessing

    • WSIs are tiled into 256 × 256 px patches, filtered for tissue content, and encoded with a pretrained ResNet‑50 backbone.
    • Transcriptomics profiles are log‑normalized, filtered for the most variable genes, and projected into a lower‑dimensional space using a simple fully‑connected encoder.
  2. Disentangled representation learning

    • The model contains three parallel encoders:
      • Image‑specific encoder → latent vector z_img
      • RNA‑specific encoder → latent vector z_rna
      • Shared encoder that receives concatenated image and RNA embeddings and outputs z_shared.
    • A disentanglement loss (based on orthogonality constraints and mutual information minimization) forces each latent subspace to capture information unique to its modality or truly shared across modalities.
  3. Survival prediction head

    • The three latent vectors are concatenated and fed into a Cox proportional‑hazards layer, producing a risk score for each patient.
  4. Explainability pipeline

    • After training, SHAP values are computed for each latent dimension, then back‑propagated to the original inputs (image patches and gene expression).
    • This yields modality‑specific heatmaps on the slides and gene‑importance rankings, which can be inspected jointly to understand cross‑modal interactions.
  5. Evaluation

    • Concordance index (C‑index) is used as the primary metric.
    • Experiments compare DIMAFx against unimodal baselines, early‑fusion deep nets, and recent multimodal survival models.
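The disentanglement loss in step 2 combines orthogonality constraints with mutual‑information minimization. The orthogonality part can be approximated with a simple cross‑covariance penalty between two latent batches (a minimal numpy sketch; the paper's exact loss, including the MI term, may differ):

```python
import numpy as np

def orthogonality_penalty(z_a, z_b):
    """Squared Frobenius norm of the cross-covariance between two latent batches.

    z_a, z_b: (batch, dim) latent matrices, e.g. z_img and z_shared.
    Driving this toward zero encourages the two subspaces to carry
    non-overlapping information; the mutual-information minimization
    term from the paper is omitted in this sketch.
    """
    z_a = z_a - z_a.mean(axis=0)               # center each latent dimension
    z_b = z_b - z_b.mean(axis=0)
    cross = z_a.T @ z_b / len(z_a)             # (dim_a, dim_b) cross-covariance
    return float(np.sum(cross ** 2))
```

In practice this penalty would be added (with a weight) to the survival loss, so the encoders are pushed toward genuinely separate representations.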
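The Cox proportional‑hazards head in step 3 is typically trained by minimizing the negative partial log‑likelihood over each batch. A compact sketch (assuming `risk` scores, survival `time`, and binary `event` indicators as arrays; ties are ignored for simplicity):

```python
import numpy as np

def cox_neg_log_partial_likelihood(risk, time, event):
    """Negative Cox partial log-likelihood.

    risk:  (n,) predicted log-hazard scores (higher = higher risk)
    time:  (n,) observed survival or censoring times
    event: (n,) 1 if the event (death) was observed, 0 if censored
    """
    order = np.argsort(-time)                  # sort patients by descending time
    risk, event = risk[order], event[order]
    # cumulative log-sum-exp gives log of the risk set {j : t_j >= t_i}
    log_cumsum = np.logaddexp.accumulate(risk)
    return float(-np.sum((risk - log_cumsum)[event == 1]))
```

A model that assigns higher risk to patients who die earlier achieves a lower loss, which is exactly what the gradient pushes the three concatenated latents toward.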
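The explainability pipeline in step 4 attributes the risk score to individual latent dimensions. As a lightweight stand‑in for the paper's SHAP computation, a permutation‑based importance over latent dimensions conveys the same idea (hypothetical `predict_fn`; real SHAP values additionally satisfy efficiency and symmetry axioms):

```python
import numpy as np

def latent_importance(predict_fn, z, n_repeats=10, seed=0):
    """Permutation importance per latent dimension.

    predict_fn: maps a (n, d) latent batch to (n,) risk scores
    Returns the mean absolute change in prediction when each latent
    dimension is shuffled across the batch, breaking its link
    to the risk score.
    """
    rng = np.random.default_rng(seed)
    base = predict_fn(z)
    scores = np.zeros(z.shape[1])
    for j in range(z.shape[1]):
        for _ in range(n_repeats):
            zp = z.copy()
            zp[:, j] = rng.permutation(zp[:, j])   # destroy dimension j's signal
            scores[j] += np.mean(np.abs(predict_fn(zp) - base))
    return scores / n_repeats
```

In the paper's pipeline these per‑dimension attributions are then propagated back to image patches and genes, yielding the heatmaps and gene rankings described above.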
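The concordance index used for evaluation in step 5 measures how often the model ranks patient pairs consistently with their observed outcomes. A straightforward O(n²) implementation of Harrell's C‑index:

```python
import numpy as np

def concordance_index(risk, time, event):
    """Harrell's C-index: fraction of comparable patient pairs whose
    predicted risks are ordered consistently with observed survival.

    A pair (i, j) is comparable when patient i has the shorter time
    and actually experienced the event; tied risks count as 0.5.
    """
    n_concordant, n_comparable = 0.0, 0.0
    for i in range(len(time)):
        for j in range(len(time)):
            if time[i] < time[j] and event[i] == 1:   # comparable pair
                n_comparable += 1
                if risk[i] > risk[j]:
                    n_concordant += 1
                elif risk[i] == risk[j]:
                    n_concordant += 0.5
    return n_concordant / n_comparable
```

A C‑index of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which puts the 0.70–0.75 scores in the table below in context.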

Results & Findings

| Cohort | Unimodal baseline C‑index | Early‑fusion multimodal C‑index | DIMAFx C‑index |
|---|---|---|---|
| Breast (TCGA‑BRCA) | 0.68 (image) / 0.66 (RNA) | 0.71 | 0.75 |
| Lung (TCGA‑LUAD) | 0.63 / 0.61 | 0.66 | 0.70 |
| Colorectal (TCGA‑COAD) | 0.66 / 0.64 | 0.68 | 0.72 |
  • Disentanglement quality: Mutual information between modality‑specific and shared latents drops by ~30 % compared with a naïve joint encoder, confirming that the model learns truly separate representations.
  • Biological relevance:
    • The top shared feature in breast cancer links a solid‑tumor morphology pattern on the slides with strong weighting on the late‑estrogen‑response gene set. Higher SHAP scores for this feature align with higher tumor grade and poorer survival, consistent with clinical knowledge.
    • Modality‑specific image features highlight adipose and stromal regions, suggesting the model captures micro‑environmental cues that are invisible to transcriptomics alone.
  • Explainability case study: For a high‑risk patient, SHAP heatmaps pinpoint a region of necrotic tissue, while the corresponding RNA SHAP values flag upregulation of hypoxia‑related pathways—demonstrating a coherent multimodal story.

Practical Implications

  • Clinical decision support: By surfacing interpretable image‑gene interactions, oncologists can validate model suggestions against pathology reports and molecular diagnostics, increasing trust in AI‑driven risk scores.
  • Feature engineering for downstream tasks: The disentangled latent vectors can be reused as compact, biologically meaningful embeddings for other predictive tasks (e.g., treatment response, drug sensitivity).
  • Data‑efficient model building: Because the shared encoder learns from both modalities, DIMAFx can maintain performance even when one data source is partially missing—a common scenario in real‑world registries.
  • Regulatory friendliness: Explainable AI is a growing requirement for medical device approval; DIMAFx’s SHAP‑based audit trail aligns with emerging FDA guidance on model transparency.
  • Open‑source toolkit: The released code includes utilities for WSI tiling, gene‑set enrichment of SHAP scores, and visualization dashboards, lowering the barrier for hospitals or biotech firms to prototype their own multimodal pipelines.

Limitations & Future Work

  • Scalability to larger modality sets: The current design focuses on two modalities; extending the disentanglement framework to three or more (e.g., radiology, proteomics) may require more sophisticated orthogonality constraints.
  • Computational cost: Training on whole‑slide images still demands high‑memory GPUs and long runtimes; the authors note that patch selection heuristics could be optimized.
  • Cohort diversity: Experiments are limited to TCGA datasets, which have relatively homogeneous patient demographics; validation on multi‑institutional, real‑world cohorts is needed.
  • Causal interpretation: While SHAP highlights associations, it does not prove causality; future work could integrate causal inference methods to distinguish predictive biomarkers from confounders.

Overall, DIMAFx demonstrates that it is possible to bridge the classic trade‑off between predictive performance and interpretability in multimodal cancer survival models, opening the door for more transparent AI tools in precision medicine.

Authors

  • Aniek Eijpe
  • Soufyan Lakbir
  • Melis Erdal Cesur
  • Sara P. Oliveira
  • Angelos Chatzimparmpas
  • Sanne Abeln
  • Wilson Silva

Paper Information

  • arXiv ID: 2603.02162v1
  • Categories: cs.CV
  • Published: March 2, 2026