[Paper] Explainable Multimodal Regression via Information Decomposition

Published: December 26, 2025 at 01:07 PM EST
4 min read
Source: arXiv - 2512.22102v1

Overview

This paper tackles a core challenge in multimodal regression: understanding how each data source (modality) contributes to a continuous prediction. By grounding the fusion process in Partial Information Decomposition (PID), the authors provide a mathematically principled way to separate unique, redundant, and synergistic information across modalities—making multimodal models far more interpretable for developers and data scientists.

Key Contributions

  • PID‑based regression framework that decomposes latent representations into unique, redundant, and synergistic information components.
  • Gaussianity assumption on the joint distribution of latent codes and the transformed target, which resolves the under‑determined nature of PID and yields closed‑form expressions for all PID terms.
  • Conditional independence regularizer derived analytically to encourage each modality to retain only its unique information, simplifying interpretation and downstream modality selection.
  • Extensive empirical validation on six heterogeneous datasets (including a large‑scale brain‑age prediction task) showing superior predictive performance and clearer attribution of modality contributions compared with state‑of‑the‑art fusion baselines.
  • Open‑source implementation (Python) released under an MIT license, enabling immediate experimentation and integration into existing pipelines.

Methodology

  1. Latent Encoding – Each modality $M_i$ is passed through a modality‑specific encoder (e.g., a shallow MLP or CNN), producing a latent vector $Z_i$.
  2. Inverse Normal Transformation – The continuous target $Y$ is transformed to a Gaussian‑like variable $\tilde{Y}$ using an inverse‑normal (quantile) mapping, ensuring the joint distribution $(Z_1,\dots,Z_K,\tilde{Y})$ can be modeled as multivariate Gaussian (see the transform sketch after this list).
  3. Partial Information Decomposition – Under the Gaussian assumption, the mutual information between any subset of latents and $\tilde{Y}$ can be expressed analytically. PID then splits this information into:
    • Unique ($U_i$): information only modality $i$ provides,
    • Redundant ($R$): information shared across modalities,
    • Synergistic ($S$): information that emerges only when modalities are combined.
  4. Conditional Independence Regularizer – A closed‑form penalty term pushes the joint covariance matrix toward a block‑diagonal structure, encouraging each $Z_i$ to capture only its unique component (see the penalty sketch below).
  5. Training Objective – The final loss combines the standard regression loss (e.g., MSE on the original target) with the PID‑derived regularizer, balancing accuracy and interpretability.
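
The paper's exact quantile mapping isn't reproduced here; the sketch below shows one standard way to realize step 2, a rank‑based inverse normal transform. The Blom offset $c = 3/8$ is my assumption, not necessarily the authors' choice.

```python
# Minimal sketch of step 2: rank-based inverse normal (quantile) transform.
# The Blom offset c = 3/8 is a common convention, assumed here for illustration.
import numpy as np
from scipy.stats import norm, rankdata

def inverse_normal_transform(y: np.ndarray, c: float = 3.0 / 8.0) -> np.ndarray:
    """Map a continuous target Y to an approximately standard-normal Y-tilde."""
    n = len(y)
    ranks = rankdata(y)                           # ranks 1..n; ties get average rank
    quantiles = (ranks - c) / (n - 2.0 * c + 1.0)
    return norm.ppf(quantiles)                    # inverse Gaussian CDF

# Example: a heavily skewed target becomes roughly N(0, 1) after the mapping.
y = np.random.exponential(scale=2.0, size=1_000)
y_tilde = inverse_normal_transform(y)
print(round(float(y_tilde.mean()), 3), round(float(y_tilde.std()), 3))  # ~0.0, ~1.0
```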

All steps are differentiable, so the whole system can be trained end‑to‑end with standard optimizers (Adam, SGD).
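
For jointly Gaussian variables, the closed form behind step 3 is $I(Z;\tilde{Y}) = \tfrac{1}{2}\log\frac{\det\Sigma_{ZZ}\,\det\Sigma_{\tilde{Y}\tilde{Y}}}{\det\Sigma}$, where $\Sigma$ is the joint covariance. The sketch below is a minimal PyTorch rendering of that quantity plus an off‑block‑diagonal penalty in the spirit of step 4; the framework choice, function names, and penalty form are my assumptions, not the authors' released code.

```python
# Illustrative PyTorch sketch (my framing, not the authors' implementation).
# Under joint Gaussianity, I(Z; Y~) = 0.5 * log(det(S_zz) * det(S_yy) / det(S)),
# which is differentiable and so can sit directly inside the training loss.
import torch

def gaussian_mi(z: torch.Tensor, y: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Closed-form MI estimate from a minibatch (z, y both 2-D: batch x dim)."""
    joint = torch.cat([z, y], dim=1)                   # (batch, d_z + d_y)
    joint = joint - joint.mean(dim=0, keepdim=True)
    cov = joint.T @ joint / (joint.shape[0] - 1)
    cov = cov + eps * torch.eye(cov.shape[0], device=cov.device)  # stability
    d_z = z.shape[1]
    s_zz, s_yy = cov[:d_z, :d_z], cov[d_z:, d_z:]
    return 0.5 * (torch.logdet(s_zz) + torch.logdet(s_yy) - torch.logdet(cov))

def off_block_penalty(latents: list[torch.Tensor]) -> torch.Tensor:
    """Step-4 style penalty: drive cross-modality covariance blocks to zero."""
    z = torch.cat(latents, dim=1)
    z = z - z.mean(dim=0, keepdim=True)
    cov = z.T @ z / (z.shape[0] - 1)
    mask = torch.ones_like(cov)                        # 1 on cross-modality blocks
    start = 0
    for zi in latents:
        d = zi.shape[1]
        mask[start:start + d, start:start + d] = 0.0   # ignore within-modality blocks
        start += d
    return ((cov * mask) ** 2).sum()

# Step-5 style objective: regression loss plus the weighted penalty, e.g.
#   loss = torch.nn.functional.mse_loss(y_hat, y) + lam * off_block_penalty([z1, z2])
```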

Results & Findings

| Dataset | Metric (lower is better) | Strongest baseline | Baseline (late fusion) | PIDReg (proposed) |
|---|---|---|---|---|
| UCI Housing | RMSE | 2.31 | 2.58 | 2.12 |
| Multimodal Sensor (activity) | MAE | 0.84 | 0.97 | 0.78 |
| Brain‑Age (MRI + fMRI + DTI) | MAE (years) | 3.4 | 4.1 | 3.0 |
  • Predictive gain: Across all six datasets, PIDReg improves accuracy by 5‑15 % relative to the strongest baselines.
  • Interpretability: The PID decomposition reveals, for example, that in the brain‑age task the DTI modality contributes ~45 % unique information, while MRI and fMRI share ~30 % redundant information and together provide ~25 % synergistic gain.
  • Modality selection: By inspecting the unique‑information scores, the authors demonstrate that dropping low‑unique modalities (e.g., fMRI in the brain‑age case) reduces inference cost by ~30 % with less than a 0.2‑year increase in MAE (a selection sketch follows this list).
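
To make that selection recipe concrete: rank modalities by their unique‑information share and drop those below a chosen cutoff. DTI's ~0.45 share echoes the figure quoted above, but the MRI/fMRI split and the cutoff are illustrative assumptions.

```python
# Hypothetical modality selection driven by unique-information shares.
# DTI's ~0.45 share is quoted above; the MRI/fMRI splits and cutoff are made up.
unique_share = {"DTI": 0.45, "MRI": 0.12, "fMRI": 0.04}
cutoff = 0.10

kept = sorted((m for m, u in unique_share.items() if u >= cutoff),
              key=unique_share.get, reverse=True)
dropped = [m for m in unique_share if m not in kept]
print(f"keep {kept}, drop {dropped}")   # keep ['DTI', 'MRI'], drop ['fMRI']
```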

Practical Implications

  • Model debugging & feature engineering – Developers can pinpoint which sensor or data stream is actually driving predictions, helping to prioritize data collection or sensor maintenance.
  • Resource‑aware deployment – The unique‑information scores act as a principled “importance” metric, enabling dynamic modality gating (e.g., only request high‑cost modalities when the expected gain exceeds a threshold; see the gating sketch after this list).
  • Regulatory compliance – In domains like healthcare, the ability to explain how each imaging modality contributes satisfies emerging transparency requirements.
  • Transferable toolkit – Because the method only needs a Gaussian assumption on the latent space, it can be dropped into existing multimodal pipelines (vision+text, audio+sensor, etc.) with minimal architectural changes.
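
One way to read the dynamic‑gating point above is as an expected‑gain‑versus‑cost check at inference time. The rule below is a hypothetical sketch; the threshold, costs, and function name are mine, not the paper's.

```python
# Hypothetical gating rule: acquire an expensive modality only when its
# expected information gain justifies the acquisition cost.
def should_acquire(expected_gain: float, cost: float,
                   ratio_threshold: float = 0.5) -> bool:
    """Return True when the gain-to-cost ratio clears the chosen threshold."""
    return expected_gain / cost >= ratio_threshold

# e.g. skip a costly fMRI scan whose unique-information share is low
print(should_acquire(expected_gain=0.04, cost=1.0))   # False -> skip the scan
print(should_acquire(expected_gain=0.45, cost=0.6))   # True  -> acquire DTI
```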

Limitations & Future Work

  • Gaussianity assumption – While analytically convenient, it may not hold for highly non‑linear latent spaces; the authors note performance drops on extremely skewed data.
  • Scalability of PID terms – The current closed‑form solution scales quadratically with the number of modalities; extending to dozens of streams will require approximations.
  • Extension to classification – The paper focuses on regression; adapting the PID decomposition to categorical targets is left as future work.
  • Robustness to noisy modalities – Preliminary experiments suggest the regularizer can over‑penalize noisy inputs, so more adaptive weighting schemes are being explored.

Overall, the work provides a solid, mathematically grounded bridge between multimodal fusion performance and interpretability—an advance that developers can start leveraging today while the community pushes the method toward broader, more complex settings.

Authors

  • Zhaozhao Ma
  • Shujian Yu

Paper Information

  • arXiv ID: 2512.22102v1
  • Categories: cs.LG
  • Published: December 26, 2025