[Paper] Benchmarking ERP Analysis: Manual Features, Deep Learning, and Foundation Models
Source: arXiv - 2601.00573v1
Overview
This paper presents the first large‑scale benchmark that pits classic hand‑crafted EEG features against modern deep‑learning and foundation‑model approaches for event‑related potential (ERP) analysis. By evaluating these approaches on 12 public ERP datasets across two core tasks—stimulus classification and disease detection—the authors give developers a clear picture of which methods actually work in realistic ERP scenarios.
Key Contributions
- Unified benchmarking pipeline: Standardized preprocessing, training, and evaluation across 12 ERP datasets, eliminating “apples‑to‑oranges” comparisons.
- Comprehensive method comparison: Includes (1) traditional manual features + linear classifier, (2) state‑of‑the‑art deep‑learning models (CNNs, RNNs, Transformers), and (3) pre‑trained EEG foundation models (e.g., EEG‑BERT, SEED‑Transformer).
- Patch‑embedding study for Transformers: Systematic exploration of how to split ERP time‑series into patches (temporal, spatial, or spatio‑temporal) and how these choices affect performance.
- Open‑source codebase: All scripts, model configs, and evaluation metrics are released at https://github.com/DL4mHealth/ERP-Benchmark, enabling reproducibility and easy extension.
- Practical guidelines: The authors synthesize their findings into actionable recommendations for selecting or designing ERP models in applied settings.
Methodology
- Data collection & preprocessing
  - 12 publicly available ERP datasets covering visual, auditory, and oddball paradigms.
  - Uniform pipeline: band‑pass filtering (0.5–40 Hz), epoch extraction (−200 ms to 800 ms relative to stimulus onset), baseline correction, and optional artifact rejection.
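The uniform pipeline above can be sketched with NumPy and SciPy. This is a minimal illustration, not the authors' released code: the sampling rate (250 Hz), filter order, and event positions are assumptions for the toy example.

```python
import numpy as np
from scipy.signal import butter, filtfilt

FS = 250  # sampling rate in Hz (assumed; the paper does not specify one)

def preprocess_epochs(raw, events, fs=FS, band=(0.5, 40.0), tmin=-0.2, tmax=0.8):
    """Band-pass filter a (channels, samples) recording, cut epochs around
    each event sample, and baseline-correct using the pre-stimulus window."""
    b, a = butter(4, [band[0] / (fs / 2), band[1] / (fs / 2)], btype="band")
    filtered = filtfilt(b, a, raw, axis=-1)          # zero-phase 0.5-40 Hz filter
    n_pre, n_post = int(-tmin * fs), int(tmax * fs)  # samples before/after stimulus
    epochs = []
    for ev in events:
        epoch = filtered[:, ev - n_pre: ev + n_post]                   # -200..800 ms
        epoch = epoch - epoch[:, :n_pre].mean(axis=-1, keepdims=True)  # baseline
        epochs.append(epoch)
    return np.stack(epochs)  # (n_events, n_channels, n_samples)

# toy usage: 8-channel recording with 3 stimulus onsets
raw = np.random.randn(8, 5000)
epochs = preprocess_epochs(raw, events=[1000, 2000, 3000])
print(epochs.shape)  # (3, 8, 250)
```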
- Feature & model families
  - Manual features: power spectral density, peak latency/amplitude, Hjorth parameters, etc., fed into a linear SVM.
  - Deep learning: CNNs (e.g., EEGNet), RNNs (GRU/LSTM), and vanilla Transformers that ingest raw epochs.
  - Foundation models: pre‑trained on massive EEG corpora (≥1 M recordings) and fine‑tuned on each ERP task.
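A minimal sketch of the manual-feature baseline, assuming per-channel Hjorth parameters as the feature set and scikit-learn's `LinearSVC` as the classifier; the exact feature list and classifier settings in the paper may differ.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

def hjorth(x):
    """Hjorth activity, mobility, and complexity for one 1-D signal."""
    dx, ddx = np.diff(x), np.diff(x, 2)
    activity = np.var(x)
    mobility = np.sqrt(np.var(dx) / activity)
    complexity = np.sqrt(np.var(ddx) / np.var(dx)) / mobility
    return activity, mobility, complexity

def featurize(epochs):
    """Per-channel Hjorth features, flattened: (n_epochs, n_channels * 3)."""
    return np.array([[f for ch in ep for f in hjorth(ch)] for ep in epochs])

# toy usage: 40 epochs, 8 channels, 250 samples, binary labels
rng = np.random.default_rng(0)
X = featurize(rng.standard_normal((40, 8, 250)))
y = rng.integers(0, 2, 40)
clf = make_pipeline(StandardScaler(), LinearSVC()).fit(X, y)
print(clf.score(X, y))
```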
- Patch‑embedding strategies
  - Temporal patches: split each channel's time series into fixed‑length windows.
  - Spatial patches: group electrodes (e.g., frontal vs. occipital) and treat each group as a token.
  - Spatio‑temporal patches: combine both dimensions, akin to image patches in Vision Transformers.
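The spatio-temporal variant can be illustrated with a single reshape: slice an epoch into electrode groups and fixed-length time windows, then flatten each (group, window) cell into one token. The group size and window length here are hypothetical; the paper explores these choices systematically.

```python
import numpy as np

def spatiotemporal_patches(epoch, n_groups=4, win=50):
    """Split a (channels, samples) epoch into spatio-temporal tokens:
    one token per (electrode group, time window), each flattened."""
    c, t = epoch.shape
    assert c % n_groups == 0 and t % win == 0
    g = epoch.reshape(n_groups, c // n_groups, t // win, win)  # group, ch, win, time
    tokens = g.transpose(0, 2, 1, 3).reshape(n_groups * (t // win), -1)
    return tokens  # (n_tokens, channels_per_group * win)

# toy usage: 8 channels x 250 samples -> 4 groups x 5 windows = 20 tokens
epoch = np.random.randn(8, 250)
tokens = spatiotemporal_patches(epoch, n_groups=4, win=50)
print(tokens.shape)  # (20, 100)
```

Each token preserves a contiguous electrode neighborhood and time window, which is the property the paper credits for the strategy's strong performance.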
- Evaluation
  - Two downstream tasks: (a) stimulus classification (e.g., target vs. non‑target) and (b) disease detection (e.g., Alzheimer's vs. healthy).
  - Metrics: accuracy, F1‑score, and area under the ROC curve (AUROC).
  - Repeated 5‑fold cross‑validation to ensure robustness.
Results & Findings
| Approach | Stimulus Classification (Avg. Acc.) | Disease Detection (Avg. Acc.) |
|---|---|---|
| Manual features + Linear SVM | 71.2 % | 68.5 % |
| CNN (EEGNet) | 78.9 % | 74.3 % |
| RNN (GRU) | 80.1 % | 75.6 % |
| Vanilla Transformer (temporal patches) | 81.4 % | 77.2 % |
| Foundation model (EEG‑BERT, spatio‑temporal patches) | 84.7 % | 81.9 % |
- Foundation models consistently outperformed both manual and vanilla deep‑learning pipelines, especially on disease‑detection tasks where subtle patterns matter.
- Spatio‑temporal patch embeddings gave the best Transformer performance, confirming that preserving electrode topology while capturing temporal dynamics is crucial for ERP data.
- Training from scratch with the same architecture lagged behind pre‑trained models by ~3–5 % absolute accuracy, highlighting the value of transfer learning.
Practical Implications
- Faster prototyping: Developers can start with the released fine‑tuned foundation model and achieve state‑of‑the‑art performance without collecting massive ERP datasets.
- Reduced reliance on domain expertise: Manual feature engineering, which often requires neurophysiology knowledge, can be largely replaced by pre‑trained models.
- Edge deployment: The benchmark shows that a compact CNN (e.g., EEGNet) still delivers respectable results (~78 % accuracy) with a fraction of the compute, making it suitable for wearable or bedside devices.
- Design guidance for new ERP products: When building BCI‑enabled applications (e.g., attention monitoring, neuro‑feedback), the study suggests using spatio‑temporal patch embeddings in Transformers or leveraging existing EEG foundation models for rapid iteration.
Limitations & Future Work
- Dataset diversity: Although 12 datasets were used, they are all laboratory‑controlled ERP paradigms; real‑world noisy recordings (e.g., mobile EEG) remain untested.
- Model size vs. latency: The best‑performing foundation models are large; the paper does not explore quantization or pruning for low‑latency inference.
- Interpretability: While accuracy improves, the authors note a lack of insight into which ERP components drive decisions—a gap for clinical adoption.
- Future directions suggested include (1) extending the benchmark to on‑device inference, (2) integrating explainable‑AI techniques to map model decisions back to classic ERP components, and (3) evaluating continual‑learning setups where models adapt to new subjects over time.
Authors
- Yihe Wang
- Zhiqiao Kang
- Bohan Chen
- Yu Zhang
- Xiang Zhang
Paper Information
- arXiv ID: 2601.00573v1
- Categories: cs.NE, cs.CE
- Published: January 2, 2026