[Paper] CG-DMER: Hybrid Contrastive-Generative Framework for Disentangled Multimodal ECG Representation Learning
Source: arXiv - 2602.21154v1
Overview
The paper introduces CG‑DMER, a hybrid contrastive‑generative framework that learns richer, disentangled representations from multimodal electrocardiogram (ECG) data paired with clinical reports. By explicitly modeling both the spatial‑temporal structure of multi‑lead ECGs and the noisy, free‑text nature of accompanying reports, the authors achieve state‑of‑the‑art results on several public cardiac datasets, opening the door to more reliable AI‑assisted diagnosis tools.
Key Contributions
- Spatial‑Temporal Masked Modeling (ST‑MM): A novel pre‑training task that masks patches across both lead (spatial) and time (temporal) dimensions, forcing the model to reconstruct fine‑grained dynamics that are often missed by lead‑agnostic approaches.
- Disentangled Representation Architecture: Separate modality‑specific encoders (for ECG and text) and a shared encoder that captures modality‑invariant features, reducing cross‑modal bias.
- Contrastive‑Generative Hybrid Objective: Combines a contrastive loss (aligning shared representations across modalities) with a generative reconstruction loss (ST‑MM), balancing alignment and preservation of modality‑unique information.
- Comprehensive Evaluation: Benchmarks on three public ECG‑report datasets (e.g., PTB‑XL, CPSC, and ICBHI) across classification, segmentation, and retrieval tasks, consistently outperforming prior multimodal baselines.
- Open‑Source Implementation: The authors release code and pretrained checkpoints, facilitating reproducibility and downstream integration.
Methodology
Data Preparation
- ECG Input: Multi‑lead signals (typically 12 leads) are treated as a 2‑D matrix (lead × time).
- Clinical Report: Tokenized free‑text descriptions of the ECG study.
Encoder Stack
- Modality‑Specific Encoders:
  - ECG Encoder: A transformer‑style backbone that ingests the masked ECG matrix.
  - Text Encoder: A lightweight BERT‑like model for the report.
- Shared Encoder: Takes the output of each modality‑specific encoder and projects it into a common latent space.
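The data flow through the encoder stack can be sketched as follows. This is only an illustration under loud assumptions: plain random linear maps stand in for the transformer and BERT-like encoders, and the class name `DisentangledEncoders`, the helper `make_linear`, and all dimensions are hypothetical rather than taken from the paper's code.

```python
import random


def make_linear(in_dim, out_dim, rng):
    """Return a function computing W @ x for a small random weight matrix W."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]

    def apply(x):
        return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

    return apply


class DisentangledEncoders:
    """Two modality-specific encoders feeding one shared encoder (toy stand-in)."""

    def __init__(self, ecg_dim, text_dim, hidden_dim=8, shared_dim=4, seed=0):
        rng = random.Random(seed)
        # Hypothetical stand-ins for the ECG transformer and the BERT-like text model.
        self.ecg_encoder = make_linear(ecg_dim, hidden_dim, rng)
        self.text_encoder = make_linear(text_dim, hidden_dim, rng)
        # One shared projection, reused for both modalities.
        self.shared_encoder = make_linear(hidden_dim, shared_dim, rng)

    def encode_ecg(self, ecg_features):
        specific = self.ecg_encoder(ecg_features)
        return specific, self.shared_encoder(specific)

    def encode_text(self, text_features):
        specific = self.text_encoder(text_features)
        return specific, self.shared_encoder(specific)
```

Each call returns a pair: the modality-specific representation and its projection into the common latent space, which is the part a contrastive loss would operate on.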
Spatial‑Temporal Masked Modeling
- Randomly mask contiguous blocks across both dimensions (e.g., 20% of leads for 30% of the time window).
- The ECG encoder learns to reconstruct the missing signal segment, encouraging it to capture inter‑lead correlations and temporal patterns.
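A minimal sketch of the masking step, assuming a single rectangular block per sample, adjacent leads treated as contiguous rows, and zero as the fill value. The function names and the default fractions (taken from the 20% / 30% example above) are illustrative, not the authors' implementation.

```python
import random


def st_block_mask(n_leads, n_steps, lead_frac=0.2, time_frac=0.3, seed=None):
    """Build a boolean mask (lead x time); True marks samples hidden from the encoder.

    One contiguous block of leads is hidden over one contiguous time window.
    """
    rng = random.Random(seed)
    n_masked_leads = max(1, round(lead_frac * n_leads))
    n_masked_steps = max(1, round(time_frac * n_steps))
    lead_start = rng.randint(0, n_leads - n_masked_leads)
    t_start = rng.randint(0, n_steps - n_masked_steps)
    return [
        [
            lead_start <= lead < lead_start + n_masked_leads
            and t_start <= t < t_start + n_masked_steps
            for t in range(n_steps)
        ]
        for lead in range(n_leads)
    ]


def apply_mask(ecg, mask, fill=0.0):
    """Replace masked samples with a fill value (zero is an assumption)."""
    return [
        [fill if hidden else x for x, hidden in zip(row, mask_row)]
        for row, mask_row in zip(ecg, mask)
    ]


def masked_mse(pred, target, mask):
    """Reconstruction error computed only over the masked positions."""
    errors = [
        (p - t) ** 2
        for pred_row, target_row, mask_row in zip(pred, target, mask)
        for p, t, hidden in zip(pred_row, target_row, mask_row)
        if hidden
    ]
    return sum(errors) / len(errors)
```

Scoring only the hidden positions means the model is rewarded for inferring missing samples from neighbouring leads and time steps, not for copying visible input.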
Disentanglement & Alignment
- Modality‑Specific Branches preserve information unique to ECG or text (e.g., noise patterns, report phrasing).
- Shared Branch is trained with a contrastive loss (InfoNCE) that pulls together ECG‑text pairs from the same patient while pushing apart mismatched pairs.
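The contrastive term can be illustrated with a small pure-Python InfoNCE. The symmetric (ECG-to-text plus text-to-ECG) form and the temperature of 0.07 are assumptions borrowed from common practice; the summary does not state which variant the authors use.

```python
import math


def _normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def _cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def info_nce(ecg_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch; row i of each input is a matched pair."""
    e = [_normalize(v) for v in ecg_emb]
    t = [_normalize(v) for v in txt_emb]
    # Cosine similarity of every ECG with every report, scaled by temperature.
    sim = [[sum(a * b for a, b in zip(ei, tj)) / temperature for tj in t] for ei in e]
    sim_t = [list(col) for col in zip(*sim)]
    return 0.5 * (_cross_entropy_diag(sim) + _cross_entropy_diag(sim_t))
```

With perfectly aligned pairs the loss approaches zero, and when every embedding in the batch is identical it equals the log of the batch size, which is a quick sanity check for any implementation.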
Training Objective
\[ \mathcal{L} = \lambda_{\text{ctr}}\,\mathcal{L}_{\text{contrastive}} + \lambda_{\text{gen}}\,\mathcal{L}_{\text{ST-MM}} + \lambda_{\text{disc}}\,\mathcal{L}_{\text{disentangle}} \]
where each λ is a scalar weight on its loss term, and the disentanglement term penalizes overlap between the modality‑specific and shared representations.
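As a rough illustration of how the three terms combine, the sketch below uses the mean squared cosine similarity between paired shared and modality-specific vectors as the overlap penalty. That penalty is a stand-in chosen for simplicity: the summary does not give the exact form of the disentanglement term, and the default weights are placeholders, not the paper's values.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def disentangle_penalty(shared, specific):
    """Assumed overlap penalty: mean squared cosine between paired vectors.

    It is 0 when each shared vector is orthogonal to its modality-specific
    counterpart and 1 when they are parallel.
    """
    return sum(cosine(s, p) ** 2 for s, p in zip(shared, specific)) / len(shared)


def total_loss(l_contrastive, l_stmm, l_disentangle,
               lam_ctr=1.0, lam_gen=1.0, lam_disc=0.1):
    """Weighted sum of the three terms; the default weights are placeholders."""
    return lam_ctr * l_contrastive + lam_gen * l_stmm + lam_disc * l_disentangle
```

In practice the three weights would be tuned so that no single term dominates the gradient early in training.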
Results & Findings
| Dataset | Task | Metric (higher is better) | CG‑DMER | Best Prior |
|---|---|---|---|---|
| PTB‑XL | 12‑class diagnosis | F1‑score | 0.89 | 0.84 |
| CPSC | Arrhythmia detection | AUROC | 0.96 | 0.92 |
| ICBHI | Beat‑level segmentation | Dice | 0.78 | 0.71 |
- Robustness to Missing Leads: When up to 3 leads are dropped, CG‑DMER’s performance degrades by less than 3%, versus more than 10% for lead‑agnostic baselines.
- Cross‑Modal Retrieval: The shared embedding enables accurate ECG‑to‑report and report‑to‑ECG retrieval (Recall@10 > 0.85).
- Ablation: Removing ST‑MM drops F1 by ~4%; removing the disentangled branch reduces alignment quality, confirming each component’s necessity.
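For readers unfamiliar with the retrieval metric, Recall@K can be computed as sketched below. The sketch assumes exactly one ground-truth report per ECG, stored at the same index; the function name and the similarity-matrix input are illustrative choices, not the paper's evaluation code.

```python
def recall_at_k(sim, k):
    """Fraction of queries whose true match appears among the top-k candidates.

    sim[i][j] is the similarity of query i to candidate j, and the true match
    for query i is assumed to be candidate i.
    """
    hits = 0
    for i, row in enumerate(sim):
        top_k = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        hits += i in top_k
    return hits / len(sim)
```

Passing the transposed similarity matrix gives the report-to-ECG direction with the same function.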
Practical Implications
- Improved Diagnostic Assistants: Developers can plug the pretrained shared encoder into triage systems, gaining more reliable predictions even when some leads are noisy or missing.
- Efficient Data Annotation: The retrieval capability allows clinicians to search for similar past cases based on a new ECG, accelerating report generation.
- Transferable Foundations: Because the shared latent space is modality‑agnostic, the same embeddings can be reused for downstream tasks like patient similarity clustering, risk stratification, or federated learning across hospitals.
- Edge Deployment: The modular encoder design lets teams run the lightweight ECG‑specific encoder on portable devices while offloading the shared encoder to a cloud service for alignment with textual records.
Limitations & Future Work
- Report Quality Dependency: The framework assumes reasonably well‑structured clinical narratives; highly abbreviated or non‑English reports may degrade alignment.
- Computational Overhead: Joint contrastive‑generative training is more resource‑intensive than pure classification heads, which could be a barrier for small labs.
- Generalization to Other Modalities: While the disentanglement idea is promising, extending it to imaging (e.g., echocardiograms) or wearable sensor streams remains unexplored.
Future research directions include: (1) multilingual report handling via cross‑lingual encoders, (2) lightweight distillation of the full CG‑DMER pipeline for on‑device inference, and (3) integration with multimodal federated learning frameworks to respect patient privacy while benefiting from pooled data.
Authors
- Ziwei Niu
- Hao Sun
- Shujun Bian
- Xihong Yang
- Lanfen Lin
- Yuxin Liu
- Yueming Jin
Paper Information
- arXiv ID: 2602.21154v1
- Categories: cs.AI
- Published: February 24, 2026