[Paper] ECG-Lens: Benchmarking ML & DL Models on PTB-XL Dataset

Published: April 17, 2026 at 04:20 AM EDT
5 min read

Source: arXiv - 2604.15822v1

Overview

The paper ECG‑Lens presents a systematic benchmark of six machine‑learning pipelines (three classic classifiers and three deep‑learning architectures) on the large‑scale PTB‑XL 12‑lead electrocardiogram (ECG) dataset. By training the deep models directly on raw waveforms and augmenting the data with a Stationary Wavelet Transform, the authors show that a purpose‑built complex CNN can reach 80 % classification accuracy and a 0.90 ROC‑AUC, setting a new practical reference point for automated ECG analysis.

Key Contributions

  • Comprehensive benchmark of traditional ML (Decision Tree, Random Forest, Logistic Regression) vs. deep learning (simple CNN, LSTM, and the proposed “ECG‑Lens” complex CNN) on the same raw 12‑lead PTB‑XL data.
  • End‑to‑end raw‑signal training: No hand‑crafted feature extraction; the networks learn discriminative patterns directly from the waveform.
  • Wavelet‑based data augmentation using Stationary Wavelet Transform (SWT) to increase sample diversity while preserving clinically relevant morphology.
  • Multi‑metric evaluation (accuracy, precision, recall, F1, ROC‑AUC) providing a holistic view of model performance across imbalanced cardiac classes.
  • Open‑source benchmark code & trained weights (released with the paper) to accelerate reproducibility and downstream research.

Methodology

  1. Dataset – PTB‑XL, a publicly available collection of 21,837 12‑lead ECG recordings (10 seconds each) labeled with five diagnostic superclasses: NORM (normal), MI (myocardial infarction), STTC (ST/T change), CD (conduction disturbance), and HYP (hypertrophy).
  2. Pre‑processing – Signals are resampled to a common frequency, normalized, and split into training/validation/test sets (70/15/15).
  3. Data Augmentation – For each training record, an SWT decomposition is performed; selected sub‑bands are recombined with random scaling to generate synthetic variants that retain QRS complexes, P‑waves, and ST‑segments.
  4. Model families
    • Traditional ML: Feature vectors are built from time‑domain statistics (mean, variance, skewness) and frequency‑domain descriptors (power spectral density). These feed into Decision Tree, Random Forest, and Logistic Regression classifiers.
    • Deep Learning:
      • Simple CNN – 3 convolutional layers + global max‑pooling, trained on raw 12‑lead tensors.
      • LSTM – Two stacked LSTM layers capture temporal dependencies across the 10‑second trace.
      • ECG‑Lens (Complex CNN) – 7 convolutional blocks with residual connections, multi‑scale kernels (3, 5, 7), and squeeze‑and‑excitation modules to adaptively weight lead‑specific information. Ends with a fully‑connected head for multi‑class output.
  5. Training – Adam optimizer, cosine‑annealing learning‑rate schedule, early stopping on validation loss. Class imbalance is mitigated with weighted cross‑entropy.
  6. Evaluation – Confusion matrices, per‑class precision/recall, macro‑averaged F1, and ROC‑AUC (one‑vs‑rest) are reported.
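The augmentation in step 3 can be sketched in a few lines of NumPy. The snippet below is a minimal single‑level Haar stationary‑wavelet (à‑trous) decomposition with circular padding and a randomly rescaled detail band; the wavelet choice, decomposition level, and scaling range are illustrative assumptions, not the authors' exact settings:

```python
import numpy as np

def haar_swt_augment(x, detail_scale=None, rng=None):
    """Augment a 1-D ECG lead via a single-level stationary (undecimated)
    Haar wavelet transform: split into approximation + detail bands,
    rescale the detail band, and recombine.

    With detail_scale=1.0 the reconstruction is exact.
    """
    if rng is None:
        rng = np.random.default_rng()
    if detail_scale is None:
        detail_scale = rng.uniform(0.8, 1.2)  # assumed scaling range
    x_prev = np.roll(x, 1)                    # circular boundary handling
    approx = 0.5 * (x + x_prev)               # low-pass (approximation) band
    detail = 0.5 * (x - x_prev)               # high-pass (detail) band
    return approx + detail_scale * detail     # synthetic variant

# Example: a 1000-sample synthetic lead (10 s at 100 Hz)
t = np.linspace(0.0, 10.0, 1000, endpoint=False)
lead = np.sin(2 * np.pi * 1.2 * t)            # ~72 bpm sinusoid stand-in
augmented = haar_swt_augment(lead)
```

Because the transform is undecimated, the augmented signal keeps the original length and sample alignment, which is what lets QRS complexes, P‑waves, and ST‑segments survive the perturbation.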

Results & Findings

| Model | Accuracy | Macro F1 | ROC‑AUC |
|---|---|---|---|
| Decision Tree | 58 % | 0.52 | 0.71 |
| Random Forest | 63 % | 0.58 | 0.77 |
| Logistic Regression | 61 % | 0.55 | 0.74 |
| Simple CNN | 71 % | 0.66 | 0.84 |
| LSTM | 73 % | 0.68 | 0.86 |
| ECG‑Lens (Complex CNN) | 80 % | 0.75 | 0.90 |
  • Deep models consistently outperformed classic ML, confirming that raw waveform learning captures richer morphology than handcrafted statistics.
  • ECG‑Lens achieved the best trade‑off across all metrics, especially ROC‑AUC, indicating strong discriminative power even for minority classes.
  • Wavelet augmentation contributed ~3‑4 % absolute gain in accuracy for the deep models, demonstrating its utility for limited‑size medical time‑series.
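The per‑class and macro‑averaged metrics behind a table like the one above can be reproduced from a confusion matrix alone. As a minimal NumPy sketch (the toy 3‑class matrix below is illustrative only, not the paper's results):

```python
import numpy as np

def per_class_metrics(cm):
    """Per-class precision/recall and macro-F1 from a confusion matrix,
    where cm[i, j] counts samples of true class i predicted as class j."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    precision = tp / np.clip(cm.sum(axis=0), 1e-12, None)  # column sums = predicted counts
    recall = tp / np.clip(cm.sum(axis=1), 1e-12, None)     # row sums = true counts
    f1 = 2 * precision * recall / np.clip(precision + recall, 1e-12, None)
    return precision, recall, f1.mean()  # macro-F1: unweighted mean over classes

# Toy confusion matrix (rows = true class, cols = predicted class)
cm = [[50,  5,  5],
      [10, 30, 10],
      [ 0,  5, 45]]
prec, rec, macro_f1 = per_class_metrics(cm)
```

Macro averaging weights every class equally, which is why it is the appropriate F1 variant for the imbalanced PTB‑XL superclasses: a model that ignores minority classes is penalized even if overall accuracy stays high.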

Practical Implications

  • Rapid prototyping for health‑tech startups – The benchmark shows that a well‑designed CNN can be trained on publicly available data and reach clinically relevant performance without costly feature engineering.
  • Edge deployment – ECG‑Lens’s architecture, while deeper than a simple CNN, remains lightweight enough (≈1.2 M parameters) for inference on modern micro‑controllers or mobile devices, enabling point‑of‑care arrhythmia screening.
  • Model selection guidance – Teams can use the provided performance table to justify a shift from traditional ML pipelines (easier to interpret but less accurate) to deep CNNs when higher diagnostic sensitivity is required.
  • Data‑augmentation recipe – The SWT‑based augmentation can be plugged into existing pipelines to mitigate class imbalance, a common pain point in medical datasets.
  • Regulatory pathways – By benchmarking against a recognized standard (PTB‑XL) and reporting a full suite of metrics, developers gain a baseline that can be referenced in FDA/EMA submissions for AI‑based ECG analysis tools.

Limitations & Future Work

  • Dataset scope – PTB‑XL, while large, contains only 10‑second recordings from a single acquisition protocol; performance on longer Holter or wearable ECG streams remains untested.
  • Interpretability – The paper focuses on accuracy metrics; explainability methods (e.g., saliency maps, attention), which are crucial for clinical acceptance, are not explored.
  • Generalization to rare pathologies – Some diagnostic subclasses have very few examples; even the best model shows reduced recall on these, suggesting a need for targeted data collection or few‑shot learning techniques.
  • Real‑world validation – No external validation on a separate hospital dataset is presented; future work should assess domain shift and robustness to noise/artifact variations.

Bottom line: ECG‑Lens demonstrates that a thoughtfully engineered convolutional network, trained end‑to‑end on raw 12‑lead ECGs and bolstered by wavelet‑based augmentation, can set a new performance bar for automated cardiac diagnosis. For developers building AI‑driven health products, the paper offers a ready‑to‑use architecture, a reproducible benchmark, and practical insights on scaling from research to production.

Authors

  • Saloni Garg
  • Ukant Jadia
  • Amit Sagtani
  • Kamal Kant Hiran

Paper Information

  • arXiv ID: 2604.15822v1
  • Categories: cs.LG, cs.AI, cs.CE, cs.NE, eess.SP
  • Published: April 17, 2026
