[Paper] CG-DMER: Hybrid Contrastive-Generative Framework for Disentangled Multimodal ECG Representation Learning

Published: February 24, 2026 at 12:59 PM EST

Source: arXiv - 2602.21154v1

Overview

The paper introduces CG‑DMER, a hybrid contrastive‑generative framework that learns richer, disentangled representations from multimodal electrocardiogram (ECG) data paired with clinical reports. By explicitly modeling both the spatial‑temporal structure of multi‑lead ECGs and the noisy, free‑text nature of accompanying reports, the authors achieve state‑of‑the‑art results on several public cardiac datasets, opening the door to more reliable AI‑assisted diagnosis tools.

Key Contributions

Spatial‑Temporal Masked Modeling (ST‑MM): A novel pre‑training task that masks patches across both lead (spatial) and time (temporal) dimensions, forcing the model to reconstruct fine‑grained dynamics that are often missed by lead‑agnostic approaches.
Disentangled Representation Architecture: Separate modality‑specific encoders (for ECG and text) and a shared encoder that captures modality‑invariant features, reducing cross‑modal bias.
Contrastive‑Generative Hybrid Objective: Combines a contrastive loss (aligning shared representations across modalities) with a generative reconstruction loss (ST‑MM), balancing alignment and preservation of modality‑unique information.
Comprehensive Evaluation: Benchmarks on three public ECG‑report datasets (PTB‑XL, CPSC, and ICBHI) spanning classification, segmentation, and retrieval tasks, where CG‑DMER consistently outperforms prior multimodal baselines.
Open‑Source Implementation: The authors release code and pretrained checkpoints, facilitating reproducibility and downstream integration.

Methodology

Data Preparation
- ECG Input: Multi‑lead signals (typically 12 leads) are treated as a 2‑D matrix (lead × time).
- Clinical Report: Tokenized free‑text descriptions of the ECG study.
Encoder Stack
- Modality‑Specific Encoders:
  - ECG Encoder: A transformer‑style backbone that ingests the masked ECG matrix.
  - Text Encoder: A lightweight BERT‑like model for the report.
- Shared Encoder: Takes the output of each modality‑specific encoder and projects it into a common latent space.
Spatial‑Temporal Masked Modeling
- Randomly mask contiguous blocks across both dimensions (e.g., 20% of leads for 30% of the time window).
- The ECG encoder learns to reconstruct the missing signal segment, encouraging it to capture inter‑lead correlations and temporal patterns.
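
The masking step can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a single contiguous block per sample, a zero fill value, and a mean-squared-error reconstruction loss computed only on masked positions. The function names (`block_mask`, `apply_mask`, `masked_mse`) are hypothetical, and plain Python lists stand in for tensors to keep the example dependency-free.

```python
import random

def block_mask(n_leads, n_steps, lead_frac=0.2, time_frac=0.3, seed=None):
    """Return an n_leads x n_steps grid of booleans; True marks a masked sample.

    Assumption: one contiguous block of leads over one contiguous time window.
    """
    rng = random.Random(seed)
    n_masked_leads = max(1, round(n_leads * lead_frac))
    span = max(1, round(n_steps * time_frac))
    lead_start = rng.randrange(n_leads - n_masked_leads + 1)
    t_start = rng.randrange(n_steps - span + 1)
    mask = [[False] * n_steps for _ in range(n_leads)]
    for lead in range(lead_start, lead_start + n_masked_leads):
        for t in range(t_start, t_start + span):
            mask[lead][t] = True
    return mask

def apply_mask(ecg, mask, fill=0.0):
    """Replace masked samples with a fill value (zero is an assumption)."""
    return [[fill if m else x for x, m in zip(row, mrow)]
            for row, mrow in zip(ecg, mask)]

def masked_mse(pred, target, mask):
    """Reconstruction error averaged over masked positions only."""
    errs = [(p - t) ** 2
            for prow, trow, mrow in zip(pred, target, mask)
            for p, t, m in zip(prow, trow, mrow) if m]
    return sum(errs) / len(errs)

# Usage: a 12-lead, 100-sample toy signal
ecg = [[1.0] * 100 for _ in range(12)]
mask = block_mask(12, 100, seed=0)
masked_ecg = apply_mask(ecg, mask)
```

Restricting the loss to masked positions means the encoder gets no credit for copying visible samples, which is the usual design in masked-modeling objectives.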
Disentanglement & Alignment
- Modality‑Specific Branches preserve information unique to ECG or text (e.g., noise patterns, report phrasing).
- Shared Branch is trained with a contrastive loss (InfoNCE) that pulls together ECG‑text pairs from the same patient while pushing apart mismatched pairs.
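
For readers unfamiliar with InfoNCE, the sketch below shows one common form of the loss on a batch of paired embeddings. It assumes a symmetric (ECG-to-text plus text-to-ECG) formulation, cosine similarity, and a temperature of 0.07; the summary does not state which of these choices the paper makes, so treat all three as assumptions. `info_nce` and `l2_normalize` are illustrative names.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def info_nce(ecg_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE; row i of each list is assumed to be a matched pair."""
    a = [l2_normalize(v) for v in ecg_emb]
    b = [l2_normalize(v) for v in text_emb]
    n = len(a)
    # Pairwise cosine similarities scaled by the temperature
    logits = [[sum(x * y for x, y in zip(a[i], b[j])) / temperature
               for j in range(n)] for i in range(n)]

    def cross_entropy_diag(rows):
        # Cross-entropy where the correct class for row i is column i
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    cols = [list(c) for c in zip(*logits)]  # text-to-ECG direction
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(cols))

# Usage: matched pairs give a low loss, shuffled pairs a high one
ecg_vecs = [[1.0, 0.0], [0.0, 1.0]]
matched = info_nce(ecg_vecs, [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce(ecg_vecs, [[0.0, 1.0], [1.0, 0.0]])
```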
Training Objective
\[ \mathcal{L} = \lambda_{\text{ctr}}\,\mathcal{L}_{\text{contrastive}} + \lambda_{\text{gen}}\,\mathcal{L}_{\text{ST-MM}} + \lambda_{\text{disc}}\,\mathcal{L}_{\text{disentangle}} \]
where the disentangle term penalizes overlap between modality‑specific and shared representations.
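
As a rough illustration of how the three terms combine, the sketch below uses a squared-cosine-similarity penalty as a stand-in for the disentangle term. The summary only says the term penalizes overlap between modality-specific and shared representations, so the exact penalty, the function names, and the default weights here are all assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return dot / (nu * nv) if nu and nv else 0.0

def disentangle_penalty(shared, specific):
    """Mean squared cosine similarity between paired shared/specific vectors.

    Hypothetical choice: zero when the two are orthogonal, one when collinear.
    """
    sims = [cosine(s, p) ** 2 for s, p in zip(shared, specific)]
    return sum(sims) / len(sims)

def total_loss(l_contrastive, l_stmm, l_disentangle,
               lam_ctr=1.0, lam_gen=1.0, lam_disc=0.1):
    """Weighted sum of the three terms; the default weights are placeholders."""
    return (lam_ctr * l_contrastive
            + lam_gen * l_stmm
            + lam_disc * l_disentangle)

# Usage: orthogonal shared/specific vectors incur no penalty
penalty = disentangle_penalty([[1.0, 0.0]], [[0.0, 1.0]])
loss = total_loss(0.5, 0.2, penalty)
```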

Results & Findings

Dataset	Task	Metric (↑ better)	CG‑DMER	Best Prior
PTB‑XL	12‑class diagnosis	F1‑score	0.89	0.84
CPSC	Arrhythmia detection	AUROC	0.96	0.92
ICBHI	Beat‑level segmentation	Dice	0.78	0.71

Robustness to Missing Leads: When up to 3 leads are dropped, CG‑DMER's performance degrades by less than 3%, versus more than 10% for lead‑agnostic baselines.
Cross‑Modal Retrieval: The shared embedding enables accurate ECG‑to‑report and report‑to‑ECG retrieval (Recall@10 > 0.85).
Ablation: Removing ST‑MM lowers F1 by roughly 4%, and removing the disentangled branch reduces alignment quality, indicating that both components contribute to the final results.
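
The retrieval metric quoted above can be computed as follows. This is a generic Recall@K sketch, assuming one correct report per ECG and a square similarity matrix whose diagonal holds the matched pairs; the paper's exact evaluation protocol is not described in this summary.

```python
def recall_at_k(similarity, k=10):
    """Fraction of queries whose matching candidate ranks in the top k.

    similarity[i][j] is the score between query i and candidate j;
    the correct candidate for query i is assumed to be j == i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)

# Usage: three ECG queries scored against three reports
sim = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],  # the correct report (index 1) is only ranked second
    [0.1, 0.2, 0.7],
]
r1 = recall_at_k(sim, k=1)
r2 = recall_at_k(sim, k=2)
```

Transposing the similarity matrix before the call gives the report-to-ECG direction.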

Practical Implications

Improved Diagnostic Assistants: Developers can plug the pretrained shared encoder into triage systems, gaining more reliable predictions even when some leads are noisy or missing.
Efficient Data Annotation: The retrieval capability allows clinicians to search for similar past cases based on a new ECG, accelerating report generation.
Transferable Foundations: Because the shared latent space is modality‑agnostic, the same embeddings can be reused for downstream tasks like patient similarity clustering, risk stratification, or federated learning across hospitals.
Edge Deployment: The modular encoder design lets teams run the lightweight ECG‑specific encoder on portable devices while offloading the shared encoder to a cloud service for alignment with textual records.

Limitations & Future Work

Report Quality Dependency: The framework assumes reasonably well‑structured clinical narratives; highly abbreviated or non‑English reports may degrade alignment.
Computational Overhead: Joint contrastive‑generative training is more resource‑intensive than pure classification heads, which could be a barrier for small labs.
Generalization to Other Modalities: While the disentanglement idea is promising, extending it to imaging (e.g., echocardiograms) or wearable sensor streams remains unexplored.

Future research directions include: (1) multilingual report handling via cross‑lingual encoders, (2) lightweight distillation of the full CG‑DMER pipeline for on‑device inference, and (3) integration with multimodal federated learning frameworks to respect patient privacy while benefiting from pooled data.

Authors

Ziwei Niu
Hao Sun
Shujun Bian
Xihong Yang
Lanfen Lin
Yuxin Liu
Yueming Jin

Paper Information

arXiv ID: 2602.21154v1
Categories: cs.AI
Published: February 24, 2026
