[Paper] SignalMC-MED: A Multimodal Benchmark for Evaluating Biosignal Foundation Models on Single-Lead ECG and PPG

Published: March 10, 2026 at 01:32 PM EDT

Source: arXiv - 2603.09940v1

Overview

The paper introduces SignalMC‑MED, a benchmark for rigorously comparing biosignal foundation models (large pretrained networks) on synchronized single‑lead ECG and PPG recordings. It packages 22 256 ten‑minute ECG + PPG recordings with 20 clinically relevant prediction tasks, giving researchers and engineers a realistic multimodal testbed for measuring how well these models turn raw biosignals into clinically useful predictions.

Key Contributions

SignalMC‑MED benchmark: 22 256 ten‑minute ECG + PPG pairs with 20 downstream tasks (demographics, ED disposition, lab value regression, ICD‑10 diagnosis detection).
Systematic evaluation of a spectrum of models: generic time‑series transformers, biosignal‑specific foundation models, and hand‑crafted feature baselines.
Multimodal fusion analysis: Demonstrates consistent gains when combining ECG and PPG versus using either modality alone.
Signal length study: Shows that the full 10‑minute window outperforms shorter snippets, highlighting the value of long‑duration recordings.
Model scaling insight: Larger model variants do not guarantee better performance on these tasks.
Feature‑model hybrid: Hand‑engineered ECG features remain competitive and complement learned representations when fused.

Methodology

Data preparation – The authors start from the publicly available MC‑MED dataset, extract the overlapping 10‑minute segments where a single‑lead ECG and a fingertip PPG were recorded simultaneously, and align them at the sample level.
Task definition – Twenty downstream tasks are defined, ranging from binary classification (e.g., “will the patient be admitted?”) to regression (e.g., predicting serum creatinine). Labels are derived from electronic health records linked to each visit.
Model families
- General time‑series models: vanilla Transformers, InceptionTime, and a recent time‑series FM (e.g., TS‑Transformer).
- Biosignal‑specific FMs: models pretrained on large ECG/PPG corpora (e.g., ECG‑BERT, PPG‑ResNet).
- Hand‑crafted baseline: a set of domain‑knowledge features (RR intervals, QRS width, PPG amplitude, etc.) fed to a gradient‑boosted tree.
Training regimes – Each model is fine‑tuned on the training split of SignalMC‑MED for each task, using the same hyper‑parameter budget to ensure a fair comparison.
Fusion strategies – For multimodal experiments, the authors explore early concatenation of raw waveforms, late concatenation of learned embeddings, and attention‑based cross‑modal fusion.
Evaluation – Standard metrics (AUROC for classification, RMSE for regression) are reported on a held‑out test set, with statistical significance testing across runs.
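The two metrics named in the evaluation step can be sketched in a few lines. This is a minimal illustration, not the authors' evaluation code: the function names `auroc` and `rmse` and the toy inputs are assumptions made here, and the pairwise AUROC is written for clarity rather than speed.

```python
import math


def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def rmse(targets, predictions):
    """Root-mean-square error between regression targets and predictions."""
    if len(targets) != len(predictions) or not targets:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(t - p) ** 2 for t, p in zip(targets, predictions)]
    return math.sqrt(sum(squared) / len(squared))


# Toy values for illustration only (not from the paper).
toy_auroc = auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
toy_rmse = rmse([1.0, 2.0], [1.0, 4.0])
print(toy_auroc, toy_rmse)
```

The pairwise form makes the interpretation explicit: an AUROC of 0.5 means positives and negatives are ranked no better than chance, and 1.0 means every positive outranks every negative.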

Results & Findings

Setting	Result (avg. AUROC unless noted)	Observations
ECG‑only (biosignal FM)	0.84	Outperforms generic time‑series FM (≈0.78).
PPG‑only (biosignal FM)	0.81	Slightly lower than ECG but still strong.
ECG + PPG (early fusion)	0.88	Consistent boost over unimodal inputs.
Hand‑crafted features + FM	0.90	Hybrid model yields the highest scores.
Full 10‑min vs. 30‑sec windows	+5‑7 % AUROC gain	Longer context matters.
Small vs. large model variants	No clear advantage for larger models	Suggests diminishing returns on parameter count for these tasks.

In plain terms, domain‑specific pretrained models beat generic ones, and combining ECG and PPG gives a noticeable lift. Moreover, the classic approach of extracting physiologic features still holds value, especially when merged with learned embeddings.
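To make the window-length comparison concrete, the sketch below cuts a ten-minute recording into non-overlapping 30-second windows. The sampling rate of 125 Hz and the helper name `split_into_windows` are assumptions for illustration; the summary does not state the benchmark's sampling rates or how the shorter snippets were selected.

```python
def split_into_windows(signal, sampling_rate_hz, window_seconds):
    """Split a 1-D sequence into non-overlapping windows.

    Any trailing samples that do not fill a whole window are dropped.
    """
    window_len = int(sampling_rate_hz * window_seconds)
    if window_len <= 0:
        raise ValueError("window length must be at least one sample")
    n_windows = len(signal) // window_len
    return [
        signal[i * window_len:(i + 1) * window_len]
        for i in range(n_windows)
    ]


SAMPLING_RATE_HZ = 125  # assumed value, not taken from the paper
ten_minutes = [0.0] * (SAMPLING_RATE_HZ * 600)  # placeholder waveform

windows = split_into_windows(ten_minutes, SAMPLING_RATE_HZ, 30)
print(len(windows), len(windows[0]))
```

Under these assumptions a model that sees a single 30-second window receives one twentieth of the samples available in the full recording.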

Practical Implications

Model selection: For developers building triage or remote monitoring tools, start with a biosignal‑specific FM (e.g., ECG‑BERT) rather than a generic time‑series transformer.
Multimodal design: If a device can capture both ECG and PPG (many wearables already do), design pipelines that fuse the two streams early or via cross‑attention to squeeze out extra performance.
Data collection strategy: Investing in longer recordings (≈10 min) is worthwhile; short bursts may miss subtle temporal patterns crucial for tasks like lab value prediction.
Hybrid pipelines: Pairing a lightweight feature extractor (RR intervals, heart‑rate variability) with a deep FM and fusing the two outputs can boost accuracy without heavy compute overhead, which is useful for edge deployments.
Model sizing: Bigger isn’t always better; a modest‑size FM can meet or exceed the performance of a heavyweight counterpart, reducing inference latency and memory footprint on embedded devices.
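The hybrid idea above can be sketched as follows: compute a few heart-rate-variability statistics from RR intervals, then concatenate them with a model embedding before a downstream classifier. This is a hedged sketch, not the paper's feature set or fusion code. The helper names `hrv_features` and `fuse`, the choice of mean heart rate, SDNN and RMSSD, and the example embedding values are all assumptions made for illustration.

```python
import math
import statistics


def hrv_features(rr_intervals_ms):
    """Return [mean heart rate (bpm), SDNN (ms), RMSSD (ms)].

    rr_intervals_ms: successive RR intervals in milliseconds.
    """
    if len(rr_intervals_ms) < 2:
        raise ValueError("need at least two RR intervals")
    mean_rr = statistics.fmean(rr_intervals_ms)
    mean_hr_bpm = 60000.0 / mean_rr
    # SDNN: sample standard deviation of the RR intervals.
    sdnn = statistics.stdev(rr_intervals_ms)
    # RMSSD: root mean square of successive RR differences.
    diffs = [b - a for a, b in zip(rr_intervals_ms, rr_intervals_ms[1:])]
    rmssd = math.sqrt(statistics.fmean([d * d for d in diffs]))
    return [mean_hr_bpm, sdnn, rmssd]


def fuse(embedding, features):
    """Concatenate a learned embedding with hand-crafted features."""
    return list(embedding) + list(features)


# Hypothetical inputs: four RR intervals and a 4-dimensional embedding.
rr = [800.0, 810.0, 790.0, 800.0]
embedding = [0.12, -0.40, 0.05, 0.91]

feats = hrv_features(rr)
fused = fuse(embedding, feats)
print(feats, len(fused))
```

In practice the hand-crafted features would usually be standardized before concatenation so that their scale does not dominate the embedding dimensions.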

Limitations & Future Work

Population bias: The benchmark derives from a single hospital system; external validation on other demographics (e.g., pediatric, non‑Western cohorts) is needed.
Single‑lead focus: Multi‑lead ECG, which carries richer spatial information, is not covered. Extending the benchmark to 12‑lead data could reveal different scaling behaviours.
Label noise: Some downstream labels (e.g., ICD‑10 codes) may be imperfect proxies for the underlying physiology, potentially capping achievable performance.
Fusion exploration: The study evaluates a few fusion strategies; more sophisticated approaches (e.g., graph‑based multimodal reasoning) remain open.
Real‑time constraints: Benchmarks are offline; future work should assess latency and power consumption for on‑device inference.

By addressing these gaps, the community can turn SignalMC‑MED from a solid evaluation suite into a launchpad for next‑generation, clinically ready biosignal AI.

Authors

Fredrik K. Gustafsson
Xiao Gu
Mattia Carletti
Patitapaban Palo
David W. Eyre
David A. Clifton

Paper Information

arXiv ID: 2603.09940v1
Categories: cs.LG
Published: March 10, 2026
