[Paper] ViSpeechFormer: A Phonemic Approach for Vietnamese Automatic Speech Recognition

Published: February 10, 2026 at 12:26 PM EST
4 min read
Source: arXiv - 2602.10003v1

Overview

The paper introduces ViSpeechFormer, a new Vietnamese automatic speech‑recognition (ASR) system that works at the phoneme level instead of the more common character‑ or word‑level modeling. Because Vietnamese orthography is highly phonetic—each written letter maps almost one‑to‑one to a sound—the authors argue that a phoneme‑centric approach can boost accuracy, especially for out‑of‑vocabulary (OOV) words and noisy training data.

Key Contributions

  • First phoneme‑based Vietnamese ASR framework that explicitly learns phonemic representations.
  • Transformer‑style architecture (ViSpeechFormer) that integrates acoustic feature extraction with a phoneme decoder, bridging speech and phonology.
  • Empirical validation on two public Vietnamese corpora, showing superior word error rates (WER) compared with strong baselines.
  • Demonstrated robustness to OOV words and reduced sensitivity to training‑set bias, thanks to the language‑independent phoneme modeling.
  • Generalizable design that can be adapted to other languages with transparent orthographies (e.g., Korean, Finnish).

Methodology

  1. Data preprocessing – Audio recordings are converted to log‑Mel filterbank features. A grapheme‑to‑phoneme (G2P) lexicon for Vietnamese is built using existing pronunciation dictionaries, yielding a phoneme sequence for each transcript.
  2. Model architecture
    • Encoder: A stack of Conformer blocks (convolution‑augmented Transformers) processes the acoustic features, capturing both local and global temporal patterns.
    • Decoder: A standard Transformer decoder attends to the encoder output and predicts phoneme tokens autoregressively.
    • CTC auxiliary loss is applied on the encoder output to stabilize training.
  3. Training objective – A weighted sum of the cross‑entropy loss (decoder) and CTC loss (encoder) is minimized.
  4. Inference – Beam search with a phoneme‑to‑grapheme (P2G) conversion step produces the final Vietnamese text. The P2G step is deterministic because of the near‑one‑to‑one mapping, making post‑processing simple and fast.
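The G2P and P2G steps above can be sketched with a toy lexicon. The entries below are illustrative placeholders, not the paper's actual Vietnamese pronunciation dictionary; the point is that the inverse lookup is deterministic, which is what keeps the P2G post‑processing simple:

```python
# Hypothetical grapheme-to-phoneme lexicon (word -> phoneme sequence).
# Real Vietnamese entries would come from pronunciation dictionaries.
G2P = {
    "xin": ["s", "i", "n"],
    "chao": ["c", "a", "w"],
}

# Deterministic inverse mapping used in the P2G post-processing step;
# it exists because the word <-> phoneme mapping is (near) one-to-one.
P2G = {tuple(phones): word for word, phones in G2P.items()}

def transcript_to_phonemes(words):
    """Preprocessing: map each transcript word to its phoneme sequence."""
    return [G2P[w] for w in words]

def phonemes_to_transcript(phoneme_seqs):
    """Inference post-processing: deterministic phoneme-to-grapheme lookup."""
    return [P2G[tuple(p)] for p in phoneme_seqs]

words = ["xin", "chao"]
phones = transcript_to_phonemes(words)
assert phonemes_to_transcript(phones) == words  # lossless round trip
```

Because the inverse table is built once from the lexicon, the P2G step is a constant‑time lookup per word rather than a learned decoding pass.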

The pipeline is deliberately kept modular: you can swap the encoder (e.g., replace Conformer with a CNN) or the decoder (e.g., use a lightweight LSTM) without breaking the phoneme‑centric logic.
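That modularity claim can be made concrete with a minimal interface sketch. The names and signatures here are illustrative, not the paper's code: any encoder exposing the same `encode` method is a drop‑in replacement.

```python
from typing import List, Protocol

class Encoder(Protocol):
    """Anything with this shape can plug into the phoneme pipeline."""
    def encode(self, features: List[float]) -> List[float]: ...

class ConformerEncoder:
    def encode(self, features: List[float]) -> List[float]:
        return features  # stand-in for the Conformer block stack

class CNNEncoder:
    def encode(self, features: List[float]) -> List[float]:
        return features  # drop-in replacement with the same interface

def recognize(encoder: Encoder, features: List[float]) -> List[float]:
    hidden = encoder.encode(features)
    # ...phoneme decoding would attend to `hidden` here...
    return hidden
```

Swapping `ConformerEncoder` for `CNNEncoder` changes nothing downstream, which is exactly the property the phoneme‑centric design relies on.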

Results & Findings

| Dataset | Baseline (char‑level Transformer) | ViSpeechFormer (phoneme) | Relative WER ↓ |
|---|---|---|---|
| VCTK‑VI (≈100 h) | 12.8 % | 10.3 % | 19 % |
| VLSP‑ASR (≈200 h) | 9.5 % | 7.9 % | 17 % |

  • OOV robustness: When the test set contains a high proportion of rare words (e.g., proper nouns), ViSpeechFormer’s error rate drops by ~25 % relative to the character baseline.
  • Training bias: Experiments where the training data is artificially skewed toward a subset of speakers show that the phoneme model degrades far less than the character model, indicating better generalization across speaker variations.
  • Ablation: Removing the CTC auxiliary loss increases WER by ~1.5 %, confirming its regularizing effect.
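The relative reductions in the table follow directly from the absolute WERs; a quick sanity check (the helper function is illustrative, not from the paper):

```python
def relative_wer_reduction(baseline_wer: float, model_wer: float) -> float:
    """Relative WER reduction in percent: the fraction of the
    baseline's error rate that the new model removes."""
    return 100.0 * (baseline_wer - model_wer) / baseline_wer

# WER figures from the results table above.
vctk = relative_wer_reduction(12.8, 10.3)  # ~19.5 %, reported as 19 %
vlsp = relative_wer_reduction(9.5, 7.9)    # ~16.8 %, reported as 17 %
assert 19.0 <= vctk <= 20.0
assert 16.0 <= vlsp <= 17.0
```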

Overall, the phoneme‑first paradigm yields a cleaner alignment between acoustic signals and linguistic units, which translates into measurable accuracy gains.

Practical Implications

  • Faster deployment: The deterministic P2G conversion eliminates the need for large language models at inference time, reducing latency for real‑time applications (e.g., voice assistants, transcription services).
  • Better handling of new vocabulary: Companies can roll out updates (new product names, slang) without retraining the entire acoustic model—just extend the phoneme lexicon.
  • Cross‑language portability: The same architecture can be re‑trained on any language with a transparent orthography, offering a reusable ASR stack for multilingual products.
  • Lower data requirements: Because phonemes abstract away from spelling idiosyncrasies, the model learns more efficiently from limited labeled audio, which is valuable for low‑resource Vietnamese domains (e.g., regional dialects).
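The "extend the lexicon instead of retraining" workflow reduces to a dictionary update. The lexicon format, words, and phoneme sequences below are illustrative assumptions, not the paper's lexicon:

```python
# Hypothetical phoneme lexicon (word -> phoneme sequence).
lexicon = {"pho": ["f", "o"]}

def add_word(lexicon, word, phonemes):
    """Register a new word (e.g. a product name) without touching the
    trained acoustic model; only the G2P/P2G lookup tables grow."""
    lexicon[word] = list(phonemes)
    return lexicon

# Rolling out a new proper noun (phonemes here are made up):
add_word(lexicon, "VinFast", ["v", "i", "n", "f", "a", "s", "t"])
assert "VinFast" in lexicon
```

As long as the new word's phonemes stay within the model's existing phoneme inventory, no acoustic retraining is needed.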

Developers building Vietnamese speech interfaces can thus expect higher accuracy, lower latency, and easier maintenance compared with traditional character‑based ASR pipelines.

Limitations & Future Work

  • Dialectal variation: The current G2P lexicon assumes Standard Vietnamese pronunciation; regional accents may still cause mismatches.
  • Lexicon dependence: Errors in the phoneme dictionary propagate directly to the final transcript; building a high‑quality, exhaustive lexicon remains a bottleneck.
  • Scalability to truly low‑resource settings: While phoneme modeling reduces data needs, the experiments still rely on several hundred hours of labeled speech.
  • Future directions suggested by the authors include:
    1. Integrating a learnable G2P module to handle out‑of‑lexicon phonemes.
    2. Extending the framework to code‑switching scenarios (Vietnamese–English).
    3. Exploring self‑supervised pre‑training on massive unlabeled Vietnamese audio to further close the gap for under‑represented dialects.

Authors

  • Khoa Anh Nguyen
  • Long Minh Hoang
  • Nghia Hieu Nguyen
  • Luan Thanh Nguyen
  • Ngan Luu-Thuy Nguyen

Paper Information

  • arXiv ID: 2602.10003v1
  • Categories: cs.CL
  • Published: February 10, 2026