[Paper] MEGConformer: Conformer-Based MEG Decoder for Robust Speech and Phoneme Classification

Published: December 1, 2025 at 04:25 AM EST
3 min read
Source: arXiv - 2512.01443v1

Overview

The paper introduces MEGConformer, a compact Conformer‑based decoder that translates raw magnetoencephalography (MEG) recordings into two fundamental speech‑related outputs:

  1. Detecting when a person is speaking.
  2. Classifying the phoneme being uttered.

By tailoring a state‑of‑the‑art Conformer architecture to the high‑dimensional, 306‑channel MEG data used in the LibriBrain 2025 PNPL competition, the authors achieve performance that outstrips the competition baselines and lands them in the top‑10 for both tasks.

Key Contributions

  • Conformer adaptation for MEG – a lightweight Conformer encoder paired with a simple convolutional projection layer that can ingest raw 306‑channel MEG streams.
  • Task‑specific heads – separate output modules for binary speech detection and 100‑class phoneme classification.
  • MEG‑oriented SpecAugment – a novel augmentation strategy that masks time‑frequency patches directly on MEG spectrograms, improving robustness to sensor noise.
  • Class‑balanced training – inverse‑square‑root class weighting plus dynamic grouping loaders that form averaged groups of 100 samples, handling the heavily imbalanced phoneme distribution.
  • Instance‑level normalization – a cheap yet effective preprocessing step that mitigates distribution shift between training and hold‑out splits.
  • Open‑source release – full code, documentation, and pretrained checkpoints are publicly available on GitHub.

Methodology

  1. Data preprocessing – Raw MEG recordings (306 channels, 1 kHz sampling) are transformed into short‑time Fourier spectra. An instance‑level z‑normalization is applied per recording to align sensor statistics (see the first sketch after this list).
  2. Projection layer – A shallow 1‑D convolution reduces the 306‑channel tensor to a lower‑dimensional embedding (e.g., 64 channels) while preserving temporal resolution.
  3. Conformer encoder – The compact Conformer (≈4 M parameters) stacks self‑attention, convolutional, and feed‑forward modules, enabling the model to capture both long‑range temporal dependencies and local sensor patterns (steps 2–4 are sketched together below).
  4. Task heads
    • Speech Detection: a binary classifier head (sigmoid) trained with binary cross‑entropy.
    • Phoneme Classification: a 100‑way softmax head trained with cross‑entropy, using inverse‑square‑root class weights to counteract the natural phoneme frequency imbalance (weight computation sketched below).
  5. Training tricks
    • MEG‑SpecAugment: random time‑masking and frequency‑masking applied directly on the MEG spectrograms (see the augmentation sketch below).
    • Dynamic grouping loader: batches are constructed to contain a balanced mix of averaged phoneme samples (groups of 100), reducing variance during training (see the loader sketch below).
    • Optimization: AdamW optimizer with a cosine learning‑rate schedule; early stopping based on macro‑F1 on the validation split (training‑loop sketch below).
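
To make step 1 concrete, here is a minimal sketch of the preprocessing in PyTorch. The STFT parameters (n_fft, hop) are illustrative assumptions; the summary only specifies 306 channels, 1 kHz sampling, short‑time Fourier spectra, and per‑recording z‑normalization.

```python
import torch

def preprocess_meg(x: torch.Tensor, n_fft: int = 64, hop: int = 16) -> torch.Tensor:
    """Instance-level z-normalization followed by per-channel STFT magnitudes.

    x: raw MEG recording of shape (channels=306, time) at 1 kHz.
    Returns a (channels, freq_bins, frames) magnitude spectrogram.
    n_fft and hop are illustrative choices, not values from the paper.
    """
    # Instance-level z-normalization: per recording, per channel.
    x = (x - x.mean(dim=-1, keepdim=True)) / (x.std(dim=-1, keepdim=True) + 1e-8)
    # Short-time Fourier transform per channel; keep magnitudes only.
    spec = torch.stft(x, n_fft=n_fft, hop_length=hop,
                      window=torch.hann_window(n_fft), return_complex=True)
    return spec.abs()
```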
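
Steps 2–4 can be combined into one sketch, here using torchaudio's stock Conformer as a stand‑in for the paper's compact encoder. The embedding width (64) follows the example in step 2; layer and head counts, and the mean‑pooling before the heads, are assumptions rather than confirmed details.

```python
import torch
import torch.nn as nn
from torchaudio.models import Conformer

class MEGConformerSketch(nn.Module):
    """Minimal sketch of the described architecture (dims are assumptions)."""

    def __init__(self, in_ch: int = 306, emb: int = 64, n_classes: int = 100):
        super().__init__()
        # Projection: 1-D channel-mixing conv; kernel_size=1 preserves temporal resolution.
        self.proj = nn.Conv1d(in_ch, emb, kernel_size=1)
        # Compact Conformer encoder; layer/head counts are illustrative,
        # not verified against the paper's ~4 M-parameter configuration.
        self.encoder = Conformer(input_dim=emb, num_heads=4, ffn_dim=256,
                                 num_layers=8, depthwise_conv_kernel_size=31)
        self.speech_head = nn.Linear(emb, 1)           # binary speech detection (logit)
        self.phoneme_head = nn.Linear(emb, n_classes)  # 100-way phoneme classification

    def forward(self, x: torch.Tensor, lengths: torch.Tensor):
        # x: (batch, 306, time) -> (batch, time, emb) for the Conformer.
        h = self.proj(x).transpose(1, 2)
        h, lengths = self.encoder(h, lengths)
        pooled = h.mean(dim=1)  # mean-pool over time (pooling choice is an assumption)
        return self.speech_head(pooled).squeeze(-1), self.phoneme_head(pooled)
```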
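
A toy version of the MEG‑SpecAugment from step 5: zero out random time and frequency stripes directly on a (channels, freq, frames) MEG spectrogram. Mask widths and counts are assumptions.

```python
import torch

def meg_specaugment(spec: torch.Tensor, max_t: int = 20, max_f: int = 8,
                    n_masks: int = 2) -> torch.Tensor:
    """Apply random time and frequency masks to an MEG spectrogram
    of shape (channels, freq, frames). Mask sizes are illustrative."""
    spec = spec.clone()
    _, n_freq, n_frames = spec.shape
    for _ in range(n_masks):
        # Time mask: zero a random stripe of frames across all channels/freqs.
        t0 = torch.randint(0, max(1, n_frames - max_t), (1,)).item()
        w = torch.randint(1, max_t + 1, (1,)).item()
        spec[:, :, t0:t0 + w] = 0.0
        # Frequency mask: zero a random band of frequency bins.
        f0 = torch.randint(0, max(1, n_freq - max_f), (1,)).item()
        h = torch.randint(1, max_f + 1, (1,)).item()
        spec[:, f0:f0 + h, :] = 0.0
    return spec
```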
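
The inverse‑square‑root class weighting from step 4 can be computed as below; normalizing the weights to mean 1 is a common convention assumed here, not a detail taken from the paper.

```python
import torch

def inv_sqrt_class_weights(labels: torch.Tensor, n_classes: int = 100) -> torch.Tensor:
    """w_c = 1 / sqrt(count_c), rescaled so the weights average to 1."""
    counts = torch.bincount(labels, minlength=n_classes).clamp(min=1).float()
    w = counts.rsqrt()
    return w * (n_classes / w.sum())

# Usage with cross-entropy (train_labels: 1-D tensor of phoneme indices):
# criterion = torch.nn.CrossEntropyLoss(weight=inv_sqrt_class_weights(train_labels))
```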
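
One plausible reading of the dynamic grouping loader: collect same‑phoneme windows, average each group of 100 into a single example (mirroring the averaged evaluation samples), and batch the results. Group size, shuffling, and batching details are assumptions.

```python
import random
from collections import defaultdict
import torch

def dynamic_grouping_batches(dataset, group_size: int = 100, batch_size: int = 16):
    """Yield (inputs, labels) batches of averaged same-phoneme groups.

    dataset is assumed to yield (tensor, label) pairs; this is a sketch,
    not the paper's loader. Leftover partial groups/batches are dropped.
    """
    by_label = defaultdict(list)
    for i, (_, y) in enumerate(dataset):
        by_label[y].append(i)
    batch_x, batch_y = [], []
    for y, idxs in by_label.items():
        random.shuffle(idxs)  # regrouped each pass, hence "dynamic"
        for k in range(0, len(idxs) - group_size + 1, group_size):
            group = [dataset[j][0] for j in idxs[k:k + group_size]]
            batch_x.append(torch.stack(group).mean(dim=0))  # averaged MEG segment
            batch_y.append(y)
            if len(batch_x) == batch_size:
                yield torch.stack(batch_x), torch.tensor(batch_y)
                batch_x, batch_y = [], []
```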
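
Finally, a sketch of the training setup from step 5: AdamW with cosine annealing and early stopping on validation macro‑F1. Learning rate, weight decay, patience, epoch count, and the two helper functions are hypothetical.

```python
import torch

model = MEGConformerSketch()  # from the model sketch above
num_epochs = 50               # assumption; not stated in the summary
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs)

best_f1, patience, bad_epochs = 0.0, 5, 0
for epoch in range(num_epochs):
    train_one_epoch(model, optimizer)    # hypothetical training helper
    scheduler.step()
    f1 = macro_f1_on_validation(model)   # hypothetical evaluation helper
    if f1 > best_f1:
        best_f1, bad_epochs = f1, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:       # early stopping on validation macro-F1
            break
```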

Results & Findings

Task                     Macro‑F1   Leaderboard Rank
Speech Detection         88.9 %     Top‑10
Phoneme Classification   65.8 %     Top‑10
  • Both scores surpass the official competition baselines by a comfortable margin (≈7 pp for speech detection, ≈12 pp for phoneme classification).
  • Ablation experiments show that removing instance‑level normalization drops phoneme F1 by ~4 pp, while disabling MEG‑SpecAugment reduces speech detection F1 by ~2 pp.
  • The compact Conformer (≈4 M parameters) runs inference at ~30 ms per second of MEG data on a single RTX 3080, making it feasible for near‑real‑time applications.

Practical Implications

  • Brain‑computer interfaces (BCIs) – Reliable detection of speech onset and phoneme decoding from MEG opens the door to silent‑speech BCI systems for users with motor impairments.
  • Neuro‑feedback & language research – Real‑time phoneme classification can be used to study speech production dynamics, providing immediate feedback to clinicians or language‑learning tools.
  • Edge deployment – The model’s modest size and fast inference mean it can be integrated into portable MEG setups or cloud‑based pipelines without prohibitive compute costs.
  • Cross‑modal translation – Coupling MEGConformer with text‑to‑speech or translation models could enable end‑to‑end pipelines that convert neural activity directly into synthesized speech in another language.

Limitations & Future Work

  • Dataset specificity – The model is tuned to the LibriBrain 2025 PNPL data (clean, read speech). Generalization to spontaneous or noisy speech remains untested.
  • Sensor coverage – Performance may degrade on MEG systems with fewer channels or different sensor layouts, as the projection layer assumes 306 channels.
  • Temporal resolution – While the Conformer captures long‑range dependencies, the current pipeline processes 1‑second windows, limiting sub‑phoneme granularity.
  • Future directions proposed by the authors include:
    • Extending the architecture to multimodal inputs (e.g., simultaneous EEG).
    • Exploring self‑supervised pre‑training on large unlabeled MEG corpora.
    • Adapting the model for real‑time closed‑loop BCI control.

Authors

  • Xabier de Zuazo
  • Ibon Saratxaga
  • Eva Navas

Paper Information

  • arXiv ID: 2512.01443v1
  • Categories: cs.CL, cs.LG, cs.NE, cs.SD
  • Published: December 1, 2025
  • PDF: https://arxiv.org/pdf/2512.01443v1