[Paper] Cross-Modal Masking for Robust Silent Speech Synthesis Using sEMG and Lipreading

Published: June 8, 2026 at 11:50 AM EDT

Source: arXiv - 2606.09667v1

Overview

Speech restoration through silent speech interfaces (SSIs) has emerged as a promising assistive technology for individuals with impaired or absent laryngeal voice production. Among non-invasive SSI modalities, surface electromyography (sEMG) and video-based lipreading provide complementary articulatory information, yet their integration for continuous speech synthesis remains underexplored. Moreover, existing multimodal approaches rarely address robustness to modality degradation or temporary sensor failure, limiting their applicability in realistic scenarios. In this work, we propose a masked multimodal speech synthesis framework that jointly leverages sEMG and lipreading signals through modality masking during training. Under multispeaker settings, the proposed approach reduces word error rate by up to 14 absolute percentage points compared to the strongest unimodal baseline. Experimental results show not only that masking strategies are critical for these performance gains and for robustness under low-bitrate conditions, but also that they generalize better than degradation-specific data augmentations when a modality is absent. Phone-level analyses further reveal complementary contributions across modalities, with particularly strong benefits for vowels and for specific consonant groups. Overall, these findings demonstrate the effectiveness and robustness of masked multimodal integration for silent speech synthesis, although adaptation to laryngectomized speakers remains an open research challenge.
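The headline result is stated in absolute percentage points of word error rate (WER), which is easy to confuse with a relative improvement. The sketch below shows one common way to compute WER (word-level edit distance divided by the reference length) and the difference between the two kinds of reduction. The function names and the numbers in the usage lines are illustrative assumptions, not the paper's evaluation code or results.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level Levenshtein distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the first i-1 reference
    # words and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def wer_reduction(baseline_wer: float, system_wer: float) -> tuple[float, float]:
    """Return (absolute reduction in percentage points, relative reduction in %)."""
    absolute_points = (baseline_wer - system_wer) * 100.0
    relative_percent = (baseline_wer - system_wer) / baseline_wer * 100.0
    return absolute_points, relative_percent


# One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")

# Made-up WERs, NOT taken from the paper: 40% baseline vs. 26% multimodal.
points, percent = wer_reduction(0.40, 0.26)
```

With these made-up inputs the absolute reduction is about 14 points while the relative reduction is about 35%, which is why the unit matters when reading the reported figure.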

Key Contributions

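As summarized in the abstract, the paper contributes:

A masked multimodal speech synthesis framework that combines sEMG and lipreading signals, with modality masking applied during training.
A reduction in word error rate of up to 14 absolute percentage points over the strongest unimodal baseline in multispeaker settings.
Evidence that masking strategies are critical to these gains and to robustness under low-bitrate conditions, and that they generalize better than degradation-specific data augmentations when a modality is absent.
Phone-level analyses showing complementary contributions across modalities, with particularly strong benefits for vowels and for specific consonant groups.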

Methodology

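The framework takes both sEMG and lipreading signals as input and applies modality masking during training, which the authors propose as a way to address modality degradation and temporary sensor failure. The abstract gives no further detail; please refer to the full paper for the complete methodology.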
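To make the idea of modality masking during training concrete, here is a minimal sketch of one generic variant: for each training example, at most one modality is replaced with zeros, so a model would sometimes see only sEMG or only video. This is an assumption-laden illustration, not the authors' method; the function name, the zero-fill choice, the masking probabilities, and the per-frame list representation are all hypothetical.

```python
import random


def mask_modalities(emg, video, p_mask_emg=0.25, p_mask_video=0.25, rng=None):
    """Randomly zero out at most one modality for a single training example.

    emg, video: sequences of per-frame feature vectors (lists of floats).
    Returns (emg_out, video_out, present), where `present` flags which
    modalities were left intact. All parameters are illustrative assumptions.
    """
    if p_mask_emg < 0 or p_mask_video < 0 or p_mask_emg + p_mask_video > 1:
        raise ValueError("masking probabilities must be >= 0 and sum to <= 1")
    rng = rng or random.Random()

    # A single draw decides the outcome, so both modalities are never
    # masked together and the model always receives some input.
    u = rng.random()
    drop_emg = u < p_mask_emg
    drop_video = (not drop_emg) and u < p_mask_emg + p_mask_video

    def zeros_like(frames):
        return [[0.0] * len(frame) for frame in frames]

    def copy_of(frames):
        return [list(frame) for frame in frames]

    emg_out = zeros_like(emg) if drop_emg else copy_of(emg)
    video_out = zeros_like(video) if drop_video else copy_of(video)
    present = {"emg": not drop_emg, "video": not drop_video}
    return emg_out, video_out, present


# Toy example: 3 sEMG frames of 2 features, 2 video frames of 3 features.
emg_frames = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
video_frames = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
masked_emg, masked_video, present = mask_modalities(
    emg_frames, video_frames, rng=random.Random(0)
)
```

At inference time, the same zero-filled input could stand in for a sensor that has dropped out, which is the situation this kind of training is meant to prepare a model for. Whether the paper uses zero-filling, learned mask tokens, or partial masking is not stated in the abstract.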

Practical Implications

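The work targets silent speech interfaces as an assistive technology for individuals with impaired or absent laryngeal voice production. Its focus on robustness to modality degradation and temporary sensor failure addresses a limitation that the authors say restricts existing multimodal approaches in realistic scenarios. Adaptation to laryngectomized speakers remains an open research challenge.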

Authors

Eder del Blanco
David Gimeno-Gómez
Eva Navas
Carlos-D. Martínez-Hinarejos
Inma Hernáez

Paper Information

arXiv ID: 2606.09667v1
Categories: eess.AS, cs.CL, cs.SD
Published: June 8, 2026
