[Paper] Zero-Shot Imagined Speech Decoding via Imagined-to-Listened MEG Mapping
Source: arXiv - 2605.08075v1
Overview
This paper tackles the hard problem of decoding imagined speech (the internal “voice” we hear when we think of words) from non‑invasive magnetoencephalography (MEG). By mapping imagined‑speech recordings onto their listened‑speech counterparts, the authors demonstrate a zero‑shot pipeline that ranks candidate words for what a person is silently rehearsing, even for subjects the model has never seen.
Key Contributions
- Cross‑modal mapping: Trains models that translate imagined‑MEG signals into their “listened” counterparts, preserving stimulus‑specific information.
- Two‑stage decoder reuse: Leverages a word decoder trained only on listened data (no imagined labels needed) and applies it to the mapped imagined signals.
- Zero‑shot evaluation: Demonstrates successful decoding on completely held‑out subjects, confirming subject‑independent generalization.
- Scalability insight: Shows decoding accuracy improves with more paired listened/imagined data, suggesting the approach can scale to larger datasets.
- Proof‑of‑concept for BCI: Provides a concrete pipeline that could be integrated into brain‑computer interfaces for silent communication.
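The zero‑shot claim in the list above rests on a subject‑wise split: no trial from a test subject may appear in training. The sketch below shows one way to build such splits, assuming a leave‑one‑subject‑out scheme; the summary only states that test subjects are completely held out, so the exact scheme, the function name `leave_one_subject_out`, and the five‑trials‑per‑subject layout are illustrative assumptions.

```python
import numpy as np

def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_idx, test_idx) for each unique subject."""
    subject_ids = np.asarray(subject_ids)
    for subject in np.unique(subject_ids):
        test_mask = subject_ids == subject
        # Training indices come only from the other subjects.
        yield subject, np.flatnonzero(~test_mask), np.flatnonzero(test_mask)

# Hypothetical layout: 12 subjects with 5 trials each (trial count is made up).
subject_ids = np.repeat(np.arange(12), 5)
splits = list(leave_one_subject_out(subject_ids))
```

Each split trains the mapping on 11 subjects and tests on the remaining one, so any decoding success on the test trials cannot come from subject‑specific fitting.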
Methodology
- Data collection – 12 trained musicians performed two tasks while MEG was recorded:
  - Listening: hearing rhythmic melodic and spoken stimuli.
  - Imagining: silently rehearsing the same stimuli.
  Using musicians helped keep the timing of imagined speech aligned with the actual audio.
- Imagined‑to‑Listened mapping – Six models (linear regressors and shallow neural nets) were trained on paired imagined‑MEG ↔ listened‑MEG data to predict what the brain activity would look like if the subject were actually listening.
- Word decoding – A contrastive decoder (trained only on listened‑MEG) learns to map brain activity to word embeddings. Four embedding spaces were tested:
  - Semantic (e.g., GloVe)
  - Acoustic (spectrogram‑derived)
  - Phonetic (phone‑level vectors)
  - Hybrid combinations
- Zero‑shot pipeline – For a new subject:
  - Feed imagined MEG through the best mapping model → synthetic listened MEG.
  - Run the synthetic signal through the pre‑trained word decoder → ranked list of candidate words.
- Evaluation – Rank‑based metrics (e.g., top‑k accuracy, mean reciprocal rank) compare the decoded word list against the true imagined word, using only unseen subjects for testing.
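The two‑stage idea above can be sketched on synthetic data. This is a minimal illustration, not the authors' implementation: it assumes a ridge regressor for the imagined‑to‑listened mapping (one plausible member of the linear family mentioned above), swaps the contrastive decoder for a plain ridge regression onto word embeddings followed by cosine‑similarity ranking, and uses made‑up dimensions and a made‑up generative model (`E`, `A`, `B`, and `simulate` are all hypothetical).

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha I)^-1 X^T Y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ Y)

def rank_words(pred_emb, vocab_emb):
    """Return vocabulary indices sorted from most to least cosine-similar."""
    p = pred_emb / np.linalg.norm(pred_emb, axis=1, keepdims=True)
    v = vocab_emb / np.linalg.norm(vocab_emb, axis=1, keepdims=True)
    return np.argsort(-(p @ v.T), axis=1)

# Assumed toy sizes: 10 words, 16-d embeddings, 32 flattened MEG features.
n_words, emb_dim, n_feat = 10, 16, 32
E = rng.standard_normal((n_words, emb_dim))                   # word embeddings
A = rng.standard_normal((emb_dim, n_feat)) / np.sqrt(emb_dim)  # embedding -> listened
B = rng.standard_normal((n_feat, n_feat)) / np.sqrt(n_feat)    # listened -> imagined

def simulate(n_trials):
    """Draw paired (label, listened, imagined) trials from the toy model."""
    labels = rng.integers(0, n_words, size=n_trials)
    listened = E[labels] @ A + 0.1 * rng.standard_normal((n_trials, n_feat))
    imagined = listened @ B + 0.1 * rng.standard_normal((n_trials, n_feat))
    return labels, listened, imagined

y_train, lis_train, img_train = simulate(400)
y_test, _, img_test = simulate(100)  # only imagined data is used at test time

# Stage 1: mapping trained on paired imagined/listened trials.
W_map = fit_ridge(img_train, lis_train)
# Stage 2: word decoder trained on listened data only (no imagined labels).
W_dec = fit_ridge(lis_train, E[y_train])

# Zero-shot use: imagined -> synthetic listened -> embedding -> ranked words.
synthetic_listened = img_test @ W_map
ranking = rank_words(synthetic_listened @ W_dec, E)
top1 = float(np.mean(ranking[:, 0] == y_test))
```

The toy test trials share the training trials' mixing matrices, so this only checks the plumbing of the pipeline; it says nothing about how well a mapping transfers across real subjects.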
Results & Findings
- Mapping success: All six mapping models outperformed a null baseline (random mapping) on held‑out subjects, confirming that stimulus‑specific structure survives the transformation.
- Decoding performance: The best configuration (neural mapping + semantic embeddings) achieved ~30 % top‑1 accuracy and >70 % top‑5 accuracy on a 10‑word vocabulary. For a 10‑word vocabulary, chance is 10 % for top‑1 and 50 % for top‑5.
- Data scaling: Doubling the number of paired sessions raised top‑1 accuracy by ~5 %, indicating that decoding continues to benefit from additional paired training data.
- Embedding impact: Semantic embeddings yielded the highest ranks, while purely acoustic embeddings performed worse, suggesting imagined speech aligns more with meaning than with exact acoustic patterns.
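To put the reported figures in context, it helps to compute the rank metrics and their chance levels explicitly. The sketch below uses the standard definitions of top‑k accuracy and mean reciprocal rank; the function names and the tiny example ranking are illustrative, not taken from the paper. Under a uniformly random ranking of N words, expected top‑k accuracy is k/N and expected reciprocal rank is the mean of 1/r over r = 1..N.

```python
import numpy as np

def top_k_accuracy(rankings, targets, k):
    """Fraction of trials whose true word is among the first k candidates."""
    rankings = np.asarray(rankings)
    targets = np.asarray(targets)
    return float(np.mean((rankings[:, :k] == targets[:, None]).any(axis=1)))

def mean_reciprocal_rank(rankings, targets):
    """Average of 1 / (1-based rank of the true word)."""
    rankings = np.asarray(rankings)
    targets = np.asarray(targets)
    # argmax finds the first position where the ranking equals the target.
    ranks = np.argmax(rankings == targets[:, None], axis=1) + 1
    return float(np.mean(1.0 / ranks))

def chance_top_k(vocab_size, k):
    """Expected top-k accuracy of a uniformly random ranking."""
    return min(k, vocab_size) / vocab_size

def chance_mrr(vocab_size):
    """Expected reciprocal rank of a uniformly random ranking."""
    return float(np.mean(1.0 / np.arange(1, vocab_size + 1)))

# Made-up example: 3 trials over a 4-word vocabulary, best candidate first.
rankings = np.array([[2, 0, 1, 3], [1, 3, 0, 2], [3, 2, 1, 0]])
targets = np.array([2, 0, 2])
top1 = top_k_accuracy(rankings, targets, k=1)
mrr = mean_reciprocal_rank(rankings, targets)
```

For the 10‑word vocabulary mentioned above, `chance_top_k(10, 1)` is 0.1 and `chance_top_k(10, 5)` is 0.5, so a top‑5 figure should be read against a 50 % baseline rather than 10 %.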
Practical Implications
- Silent communication interfaces: Developers of brain‑computer interfaces (BCIs) could embed this pipeline to let users issue commands or type by merely thinking words, without needing invasive electrodes.
- Assistive technology: For patients with speech motor impairments (e.g., ALS), a zero‑shot decoder could reduce the calibration burden, since the word decoder is trained only on listened data and the mapping is evaluated on subjects it has never seen.
- Neuro‑feedback tools: Real‑time mapping could provide musicians or language learners with feedback on internal rehearsal quality, opening new training paradigms.
- Scalable data collection: Because the decoder relies on abundant listened data, existing speech‑MEG corpora can be repurposed, accelerating development cycles for commercial BCI products.
Limitations & Future Work
- Small participant pool: The study involved only 12 trained musicians; testing on a broader and non‑musician population is needed to confirm generalizability.
- Vocabulary size: Experiments were limited to a modest set of words; scaling to open‑vocabulary speech will require richer embedding and language models.
- Temporal alignment: MEG provides high temporal resolution, but the pipeline relies on imagined speech being precisely time‑aligned with the stimulus, which may not hold for participants without musical training.
- Model complexity: Only linear regressors and shallow neural networks were explored; deeper architectures (e.g., transformers) could capture subtler imagined‑listening relationships.
- Real‑time feasibility: The current pipeline processes whole trials offline; future work should optimize for low‑latency, online decoding suitable for interactive applications.
Authors
- Maryam Maghsoudi
- Shihab Shamma
Paper Information
- arXiv ID: 2605.08075v1
- Categories: cs.LG, eess.AS
- Published: May 8, 2026