[Paper] From Muscle to Text with MyoText: sEMG to Text via Finger Classification and Transformer-Based Decoding

Published: January 6, 2026 at 10:30 AM EST
4 min read

Source: arXiv - 2601.03098v1

Overview

The paper introduces MyoText, a new pipeline that turns surface electromyography (sEMG) recordings of hand muscles into typed text. By first recognizing which finger is being activated, then mapping those activations to letters using ergonomic typing rules, and finally polishing the output with a language‑model transformer, the authors achieve a markedly more accurate and scalable sEMG‑to‑text system—paving the way for truly keyboard‑free interaction in wearables and mixed‑reality (XR) environments.

Key Contributions

  • Hierarchical decoding architecture – separates the problem into (1) finger‑activation classification, (2) ergonomically‑guided letter inference, and (3) transformer‑based sentence reconstruction.
  • CNN‑BiLSTM‑Attention model for robust multi‑channel sEMG finger classification, reaching 85.4 % accuracy across 30 participants.
  • Ergonomic typing priors that constrain the letter‑selection space based on realistic finger‑to‑key mappings, dramatically reducing decoding ambiguity.
  • Fine‑tuned T5 transformer that corrects residual errors and produces fluent sentences, delivering 5.4 % character error rate (CER) and 6.5 % word error rate (WER)—substantially better than prior end‑to‑end baselines.
  • Comprehensive evaluation on the public emg2qwerty dataset, demonstrating reproducibility and user‑independent performance.

Methodology

1. Signal Acquisition & Pre‑processing

Multichannel sEMG is recorded from forearm muscles while users type on a virtual QWERTY layout. Standard band‑pass filtering and windowing (≈200 ms frames) prepare the data for neural processing.
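A minimal sketch of this step in Python, assuming 16 electrode channels, a 2 kHz sampling rate, and a 20–450 Hz passband (illustrative values, not the paper's stated settings):

```python
import numpy as np
from scipy.signal import butter, filtfilt

FS = 2000            # assumed sampling rate (Hz)
WIN = int(0.2 * FS)  # ~200 ms frames, as described above

def bandpass(x, low=20.0, high=450.0, fs=FS, order=4):
    """Zero-phase band-pass filter applied per channel."""
    b, a = butter(order, [low / (fs / 2), high / (fs / 2)], btype="band")
    return filtfilt(b, a, x, axis=-1)

def frame(x, win=WIN, hop=WIN // 2):
    """Slice a (channels, samples) array into overlapping windows."""
    starts = range(0, x.shape[-1] - win + 1, hop)
    return np.stack([x[..., s:s + win] for s in starts])  # (frames, channels, win)

emg = np.random.randn(16, 10 * FS)  # stand-in for a 10 s multichannel recording
frames = frame(bandpass(emg))       # ready for the neural classifier
```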

2. Finger Classification (CNN‑BiLSTM‑Attention)

  • CNN: A shallow 1‑D CNN extracts spatial patterns across the electrode array.
  • BiLSTM: A bidirectional LSTM captures temporal dynamics of muscle activation within each window.
  • Attention: Highlights the most informative time steps, improving robustness to noise and inter‑user variability.
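Combining the three stages above, a minimal PyTorch sketch of such a classifier (layer sizes and finger count are illustrative assumptions, not the paper's exact architecture) might look like:

```python
import torch
import torch.nn as nn

class FingerClassifier(nn.Module):
    """CNN -> BiLSTM -> attention -> finger logits (illustrative sizes)."""
    def __init__(self, n_channels=16, n_fingers=10, hidden=128):
        super().__init__()
        # Shallow 1-D CNN over the electrode array
        self.cnn = nn.Sequential(
            nn.Conv1d(n_channels, 64, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.MaxPool1d(2),
        )
        # Bidirectional LSTM over the time axis of each window
        self.lstm = nn.LSTM(64, hidden, batch_first=True, bidirectional=True)
        # Additive attention over time steps
        self.attn = nn.Linear(2 * hidden, 1)
        self.head = nn.Linear(2 * hidden, n_fingers)

    def forward(self, x):                  # x: (batch, channels, time)
        h = self.cnn(x).transpose(1, 2)    # (batch, time', 64)
        h, _ = self.lstm(h)                # (batch, time', 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)  # attention weights over time
        ctx = (w * h).sum(dim=1)           # weighted summary of the window
        return self.head(ctx)              # finger logits

logits = FingerClassifier()(torch.randn(8, 16, 400))  # 8 windows of 200 ms @ 2 kHz
```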

3. Ergonomic Letter Inference

The predicted finger (e.g., index, middle) is combined with a typing prior that encodes which keys each finger normally reaches on a QWERTY keyboard. A simple probabilistic mapping (softmax over candidate letters) yields a shortlist of likely characters for each frame.
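A toy sketch of how such a prior could be applied; the finger-to-key assignment below follows standard touch-typing convention, and the scoring is a simplified stand-in for the paper's probabilistic mapping:

```python
# Standard QWERTY touch-typing assignment (illustrative)
FINGER_KEYS = {
    "L_pinky": "qaz", "L_ring": "wsx", "L_middle": "edc", "L_index": "rfvtgb",
    "R_index": "yhnujm", "R_middle": "ik", "R_ring": "ol", "R_pinky": "p",
}

def letter_candidates(finger_probs, letter_scores):
    """Mask letter scores by the ergonomic prior, then renormalize."""
    scores = {}
    for finger, p_finger in finger_probs.items():
        for key in FINGER_KEYS[finger]:
            # P(letter) is proportional to P(finger) * score(letter | frame)
            scores[key] = scores.get(key, 0.0) + p_finger * letter_scores.get(key, 1.0)
    total = sum(scores.values())
    return {k: v / total for k, v in sorted(scores.items(), key=lambda kv: -kv[1])}

# e.g. the classifier is fairly sure the right index finger moved
shortlist = letter_candidates({"R_index": 0.8, "R_middle": 0.2}, {})
print(list(shortlist)[:3])  # top candidate keys for this frame
```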

4. Transformer‑Based Decoding (T5)

The sequence of candidate letters (including blanks for “no‑key” frames) is fed into a pre‑trained T5 model fine‑tuned on the same sEMG‑text pairs. The transformer leverages language context to resolve ambiguities, insert missing spaces, and correct spelling, outputting the final sentence.
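As a sketch of the interface, such a corrector could be invoked via Hugging Face transformers; `t5-base` below stands in for the fine-tuned weights (which the paper trains on sEMG–text pairs), so the output here only illustrates the call pattern:

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

# "t5-base" is a stand-in; the paper fine-tunes T5 on sEMG-text pairs
tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# Noisy candidate-letter sequence from the ergonomic stage ("_" = no-key frame)
noisy = "thw quick briwn fox_jumps ovet the lazy dog"
inputs = tokenizer("correct: " + noisy, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```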

The modular design mirrors how a human typist thinks: first decide which finger to move, then which key that finger should hit, and finally what the sentence means.

Results & Findings

| Metric | MyoText | Best Prior Baseline |
| --- | --- | --- |
| Finger‑classification accuracy | 85.4 % | ~78 % |
| Character Error Rate (CER) | 5.4 % | 9.8 % |
| Word Error Rate (WER) | 6.5 % | 12.3 % |
  • Error reduction: The hierarchical approach cuts CER by roughly 45 % compared with end‑to‑end CNN‑only models.
  • User generalization: Performance remains stable across participants, indicating the model learns physiologically relevant features rather than over‑fitting to a single user’s muscle patterns.
  • Ablation studies show that removing the ergonomic prior or the transformer stage degrades CER/WER by >2 %, confirming each component’s contribution.
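The roughly 45 % figure can be checked against the table, assuming the end‑to‑end baseline corresponds to the 9.8 % CER entry:

$$\frac{9.8\,\% - 5.4\,\%}{9.8\,\%} \approx 44.9\,\% \approx 45\,\%.$$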

Practical Implications

  • Keyboard‑free XR input: Developers can embed MyoText into AR glasses or VR headsets, letting users “type” by subtle finger muscle activations without any physical hardware.
  • Assistive technology: For users with limited hand mobility, the system offers a low‑fatigue, high‑accuracy alternative to eye‑tracking or switch‑based text entry.
  • Wearable integration: The modular pipeline can run on edge devices (e.g., a microcontroller + on‑device inference accelerator) because the heavy language model can be off‑loaded or quantized, while the CNN‑BiLSTM runs in real time.
  • Extensibility: The ergonomic prior can be swapped for other layouts (Dvorak, custom virtual keyboards) or even for non‑typing gestures, making the framework a general “muscle‑to‑command” engine.
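Picking up the quantization point above, a minimal sketch with PyTorch's dynamic quantization (one common edge-deployment approach; the paper does not prescribe a specific scheme, and the model below is a stand-in for the CNN‑BiLSTM front‑end):

```python
import torch
import torch.nn as nn

# Stand-in for the CNN-BiLSTM classifier sketched in the Methodology section
class TinyDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(64, 128, batch_first=True, bidirectional=True)
        self.head = nn.Linear(256, 10)

    def forward(self, x):
        h, _ = self.lstm(x)
        return self.head(h[:, -1])

model = TinyDecoder().eval()

# Dynamic quantization: LSTM/Linear weights stored as int8, activations in float
quantized = torch.quantization.quantize_dynamic(
    model, {nn.LSTM, nn.Linear}, dtype=torch.qint8
)
print(quantized(torch.randn(1, 200, 64)).shape)  # same interface, smaller weights
```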

Limitations & Future Work

  • Dataset scope: Experiments are limited to the emg2qwerty dataset (30 participants, controlled typing tasks). Real‑world conditions—varying arm positions, motion artifacts, or outdoor environments—remain untested.
  • Latency: The current window‑based processing introduces a modest delay (~200 ms). Optimizing the pipeline for sub‑100 ms latency will be crucial for fluid conversational typing.
  • Generalization to other languages: The ergonomic prior and T5 fine‑tuning are English‑centric; extending to multilingual keyboards will require new priors and language models.
  • Hardware constraints: High‑density sEMG arrays improve accuracy but increase power and form‑factor demands; future work should explore sparse electrode layouts and on‑sensor preprocessing.

Overall, MyoText demonstrates that a physiologically grounded, hierarchical decoding strategy can bridge the gap between raw muscle signals and natural language, offering a compelling blueprint for the next generation of neural‑driven user interfaces.

Authors

  • Meghna Roy Chowdhury
  • Shreyas Sen
  • Yi Ding

Paper Information

  • arXiv ID: 2601.03098v1
  • Categories: cs.LG, cs.NE
  • Published: January 6, 2026