[Paper] SignalMC-MED: A Multimodal Benchmark for Evaluating Biosignal Foundation Models on Single-Lead ECG and PPG
Source: arXiv - 2603.09940v1
Overview
The paper introduces SignalMC‑MED, a new benchmark that lets researchers and engineers rigorously compare “foundation models” (large pretrained networks) on synchronized single‑lead ECG and PPG recordings. By packaging over 22 k ten‑minute visits and 20 clinically relevant prediction tasks, the authors provide a realistic, multimodal playground for evaluating how well these models can turn raw biosignals into actionable health insights.
Key Contributions
- SignalMC‑MED benchmark: 22 256 ten‑minute ECG + PPG pairs with 20 downstream tasks (demographics, ED disposition, lab value regression, ICD‑10 diagnosis detection).
- Systematic evaluation of a spectrum of models: generic time‑series transformers, biosignal‑specific foundation models, and hand‑crafted feature baselines.
- Multimodal fusion analysis: Demonstrates consistent gains when combining ECG and PPG versus using either modality alone.
- Signal length study: Shows that the full 10‑minute window outperforms shorter snippets, highlighting the value of long‑duration recordings.
- Model scaling insight: Larger model variants do not guarantee better performance on these tasks.
- Feature‑model hybrid: Hand‑engineered ECG features remain competitive and complement learned representations when fused.
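The signal length study compares the full 10‑minute recording against shorter snippets. A minimal sketch of how such snippets could be cut from a longer recording is shown here; the helper name `crop_window`, the 100 Hz sampling rate, and the choice to crop from the start of the recording are illustrative assumptions, not details taken from the paper.

```python
def crop_window(signal, fs_hz, duration_s, start_s=0.0):
    """Return `duration_s` seconds of `signal`, starting at `start_s`.

    `signal` is a sequence of samples and `fs_hz` its sampling rate in Hz.
    Hypothetical helper for illustration; not the authors' code.
    """
    start = int(round(start_s * fs_hz))
    length = int(round(duration_s * fs_hz))
    if start < 0 or length <= 0 or start + length > len(signal):
        raise ValueError("requested window does not fit inside the recording")
    return signal[start:start + length]


# Toy example: a 10-minute recording at an assumed 100 Hz sampling rate.
fs_hz = 100
recording = [0.0] * (10 * 60 * fs_hz)
snippet_30s = crop_window(recording, fs_hz, duration_s=30)
full_10min = crop_window(recording, fs_hz, duration_s=600)
```

Feeding the same model both `snippet_30s` and `full_10min` style inputs is one way to reproduce a length comparison of this kind on your own data.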
Methodology
- Data preparation – The authors start from the publicly available MC‑MED dataset, extract the overlapping 10‑minute segments where a single‑lead ECG and a fingertip PPG were recorded simultaneously, and align them at the sample level.
- Task definition – Twenty downstream tasks are defined, ranging from binary classification (e.g., “will the patient be admitted?”) to regression (e.g., predicting serum creatinine). Labels are derived from electronic health records linked to each visit.
- Model families:
  - General time‑series models: vanilla Transformers, InceptionTime, and a recent time‑series FM (e.g., TS‑Transformer).
  - Biosignal‑specific FMs: models pretrained on large ECG/PPG corpora (e.g., ECG‑BERT, PPG‑ResNet).
  - Hand‑crafted baseline: a set of domain‑knowledge features (RR intervals, QRS width, PPG amplitude, etc.) fed to a gradient‑boosted tree.
- Training regimes – Each model is fine‑tuned on the training split of SignalMC‑MED for each task, using the same hyper‑parameter budget to ensure a fair comparison.
- Fusion strategies – For multimodal experiments, the authors explore early concatenation of raw waveforms, late concatenation of learned embeddings, and attention‑based cross‑modal fusion.
- Evaluation – Standard metrics (AUROC for classification, RMSE for regression) are reported on a held‑out test set, with statistical significance testing across runs.
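For readers less familiar with the two reported metrics, the sketch below computes AUROC (as the fraction of positive/negative pairs ranked correctly, with ties counted as half) and RMSE in plain Python. The function names are hypothetical and this is not the authors' evaluation code; in practice a library implementation would normally be used.

```python
import math


def auroc(labels, scores):
    """AUROC as the probability that a positive outranks a negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # correctly ranked pair
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


def rmse(targets, predictions):
    """Root-mean-square error between targets and predictions."""
    if len(targets) != len(predictions) or not targets:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(t - p) ** 2 for t, p in zip(targets, predictions)]
    return math.sqrt(sum(squared) / len(squared))


# Toy values, invented purely for illustration.
example_auroc = auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
example_rmse = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```

The pairwise loop is quadratic in the number of samples, which is fine for a small check but too slow for a full test set; a rank-based implementation gives the same value more efficiently.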
Results & Findings
| Setting | Result (avg. AUROC) | Observations |
|---|---|---|
| ECG‑only (biosignal FM) | 0.84 | Outperforms generic time‑series FM (≈0.78). |
| PPG‑only (biosignal FM) | 0.81 | Slightly lower than ECG but still strong. |
| ECG + PPG (early fusion) | 0.88 | Consistent boost over unimodal inputs. |
| Hand‑crafted features + FM | 0.90 | Hybrid model yields the highest scores. |
| Full 10‑min vs. 30‑sec windows | +5‑7 % AUROC gain | Longer context matters. |
| Small vs. large model variants | No clear advantage for larger models | Suggests diminishing returns on parameter count for these tasks. |
In plain terms, domain‑specific pretrained models beat generic ones, and combining ECG and PPG gives a noticeable lift. Moreover, the classic approach of extracting physiologic features still holds value, especially when merged with learned embeddings.
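The two simplest ways of combining modalities mentioned in this summary, early concatenation of raw waveforms and late concatenation of embeddings (optionally with hand‑crafted features), can be sketched as follows. The function names and the toy vectors are assumptions made for illustration; the paper's actual fusion code and embedding sizes are not reproduced here, and attention‑based fusion is not shown.

```python
def early_fusion(ecg, ppg):
    """Stack two sample-aligned waveforms into one two-channel sequence."""
    if len(ecg) != len(ppg):
        raise ValueError("early fusion needs aligned signals of equal length")
    return [(e, p) for e, p in zip(ecg, ppg)]


def late_fusion(*vectors):
    """Concatenate per-modality embeddings and optional feature vectors."""
    fused = []
    for vector in vectors:
        fused.extend(vector)
    return fused


# Toy inputs, invented for illustration only.
ecg_wave = [0.1, 0.9, 0.2]
ppg_wave = [0.3, 0.4, 0.5]
two_channel = early_fusion(ecg_wave, ppg_wave)

ecg_embedding = [0.12, -0.40]       # stand-in for an ECG model embedding
ppg_embedding = [0.05, 0.33]        # stand-in for a PPG model embedding
handcrafted = [72.0, 50.0]          # stand-in for hand-crafted features
fused_vector = late_fusion(ecg_embedding, ppg_embedding, handcrafted)
```

A downstream classifier or regressor would then be trained on `two_channel` (early fusion) or on `fused_vector` (late fusion). When hand‑crafted features sit on a very different scale from embeddings, as in the toy values above, standardizing each block before concatenation is usually advisable.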
Practical Implications
- Model selection: For developers building triage or remote monitoring tools, start with a biosignal‑specific FM (e.g., ECG‑BERT) rather than a generic time‑series transformer.
- Multimodal design: If a device can capture both ECG and PPG (many wearables already do), design pipelines that fuse the two streams early or via cross‑attention to squeeze out extra performance.
- Data collection strategy: Investing in longer recordings (≈10 min) is worthwhile; short bursts may miss subtle temporal patterns crucial for tasks like lab value prediction.
- Hybrid pipelines: Pairing a lightweight feature extractor (RR intervals, heart‑rate variability) with a deep FM can boost accuracy without heavy compute overhead, which is useful for edge deployments.
- Model sizing: Bigger isn’t always better; a modest‑size FM can meet or exceed the performance of a heavyweight counterpart, reducing inference latency and memory footprint on embedded devices.
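As an illustration of the kind of lightweight feature extractor mentioned above, the sketch below derives three standard heart‑rate variability features from R‑peak timestamps: mean heart rate, SDNN (standard deviation of RR intervals), and RMSSD (root mean square of successive RR differences). It assumes R‑peak detection has already been done upstream; the function name and the example timestamps are hypothetical and are not the feature set used in the paper.

```python
import math
import statistics


def hrv_features(r_peak_times_s):
    """Compute basic HRV features from R-peak times given in seconds."""
    if len(r_peak_times_s) < 3:
        raise ValueError("need at least three R-peaks")
    # RR intervals: time between consecutive R-peaks.
    rr = [b - a for a, b in zip(r_peak_times_s, r_peak_times_s[1:])]
    if any(interval <= 0 for interval in rr):
        raise ValueError("R-peak times must be strictly increasing")
    # Differences between successive RR intervals, used for RMSSD.
    diffs = [b - a for a, b in zip(rr, rr[1:])]
    return {
        "mean_hr_bpm": 60.0 / statistics.fmean(rr),
        "sdnn_ms": 1000.0 * statistics.stdev(rr),  # sample std. deviation
        "rmssd_ms": 1000.0 * math.sqrt(
            statistics.fmean([d * d for d in diffs])
        ),
    }


# Toy R-peak times in seconds, invented for illustration.
features = hrv_features([0.0, 0.8, 1.6, 2.5, 3.3])
```

These values cost only a few arithmetic operations per beat, so they can be computed on an embedded device and concatenated with a model embedding before the final prediction layer.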
Limitations & Future Work
- Population bias: The benchmark derives from a single hospital system; external validation on other demographics (e.g., pediatric, non‑Western cohorts) is needed.
- Single‑lead focus: Multi‑lead ECG, which carries richer spatial information, is not covered. Extending the benchmark to 12‑lead data could reveal different scaling behaviours.
- Label noise: Some downstream labels (e.g., ICD‑10 codes) may be imperfect proxies for the underlying physiology, potentially capping achievable performance.
- Fusion exploration: The study evaluates a few fusion strategies; more sophisticated approaches (e.g., graph‑based multimodal reasoning) remain open.
- Real‑time constraints: Benchmarks are offline; future work should assess latency and power consumption for on‑device inference.
By addressing these gaps, the community can turn SignalMC‑MED from a solid evaluation suite into a launchpad for next‑generation, clinically ready biosignal AI.
Authors
- Fredrik K. Gustafsson
- Xiao Gu
- Mattia Carletti
- Patitapaban Palo
- David W. Eyre
- David A. Clifton
Paper Information
- arXiv ID: 2603.09940v1
- Categories: cs.LG
- Published: March 10, 2026