[Paper] ASR Error Correction in Low-Resource Burmese with Alignment-Enhanced Transformers using Phonetic Features

Published: November 26, 2025 at 01:13 AM EST
4 min read

Source: arXiv - 2511.21088v1

Overview

The paper presents the first systematic study of automatic speech‑recognition (ASR) error correction for Burmese, a language with very limited annotated speech data. By augmenting a standard Transformer‑based sequence‑to‑sequence model with phonetic (IPA) cues and alignment information, the authors achieve a sizable drop in word error rate (WER) and boost character‑level quality metrics, even when the underlying ASR system is weak.

Key Contributions

  • First Burmese-specific ASR error correction (AEC) study – establishes a benchmark for a truly low-resource language.
  • Feature‑enhanced Transformer architecture that injects (i) International Phonetic Alphabet (IPA) representations of the input text and (ii) token‑level alignment masks into the encoder‑decoder attention.
  • Comprehensive evaluation across five diverse ASR backbones (CNN‑RNN, CTC, wav2vec‑2.0, etc.), showing consistent improvements regardless of the base model.
  • Robustness analysis with and without data augmentation, demonstrating that the proposed AEC still yields gains when the ASR training data are artificially expanded.
  • Open‑source release of the code, pretrained models, and a small Burmese speech‑text corpus for reproducibility.

Methodology

  1. Baseline ASR pipeline – Train five off‑the‑shelf ASR models on the same low‑resource Burmese corpus (≈ 30 h of transcribed speech).
  2. Error-correction model (AEC) – A standard Transformer encoder-decoder is modified in two ways (illustrative code sketches of this model, its training loss, and the scoring follow the list):
    • Phonetic embedding: Each input token is paired with its IPA transcription (generated via a rule‑based grapheme‑to‑phoneme converter). The IPA token is embedded and summed with the original word embedding, giving the model a pronunciation‑aware view of the text.
    • Alignment mask: Using the ASR’s token‑level confidence scores and forced alignment, a binary mask tells the attention layers which positions are likely erroneous, encouraging the decoder to focus on correcting those spots.
  3. Training – The AEC is trained on pairs of raw ASR output → gold transcription using a cross‑entropy loss plus a small auxiliary loss that penalizes changes to high‑confidence tokens (to avoid over‑correction).
  4. Evaluation – Word Error Rate (WER) and chrF++ (character‑level F‑score) are computed on a held‑out test set. Experiments are run both on the raw ASR outputs and on outputs after simple data augmentation (speed‑perturbation, noise injection).
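
To make step 2 concrete, below is a minimal PyTorch-style sketch of the two modifications on top of a vanilla nn.Transformer. The summed embeddings, the soft -1.0 attention penalty, and all sizes are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the feature-enhanced AEC model (step 2). Sizes and the
# soft attention penalty are assumptions chosen for illustration only.
import torch.nn as nn

class PhoneticAlignedAEC(nn.Module):
    def __init__(self, vocab_size, ipa_vocab_size, d_model=256, nhead=4,
                 num_layers=4):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d_model)
        self.ipa_emb = nn.Embedding(ipa_vocab_size, d_model)   # pronunciation view
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, src_ids, src_ipa_ids, tgt_ids, error_mask):
        # (i) Phonetic embedding: word embedding + IPA embedding, summed.
        src = self.word_emb(src_ids) + self.ipa_emb(src_ipa_ids)
        tgt = self.word_emb(tgt_ids)
        # (ii) Alignment mask: error_mask is True where confidence scores and
        # forced alignment flag an ASR token as likely erroneous. Here it is
        # turned into a soft additive cross-attention bias that mildly
        # penalises attending to high-confidence (probably correct) positions.
        bias = (~error_mask).float() * -1.0                     # (B, S)
        memory_mask = bias[:, None, :].expand(-1, tgt_ids.size(1), -1)
        memory_mask = memory_mask.repeat_interleave(self.transformer.nhead, dim=0)
        # Standard causal mask for the decoder side of the seq2seq model.
        tgt_mask = nn.Transformer.generate_square_subsequent_mask(
            tgt_ids.size(1)).to(src.device)
        hidden = self.transformer(src, tgt, tgt_mask=tgt_mask,
                                  memory_mask=memory_mask)
        return self.out(hidden)                                 # (B, T, vocab)
```

Summing the word and IPA embeddings keeps the model size unchanged; concatenating the two views followed by a linear projection would be an equally plausible reading of "feature-enhanced".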
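
Step 3's objective can be sketched in the same spirit. The auxiliary term below rewards reproducing the original ASR token at high-confidence positions; the weight lambda_aux and the assumption that hypothesis and reference positions line up one-to-one are simplifications for illustration, not the paper's exact loss.

```python
# Rough sketch of the step-3 objective: cross-entropy against the gold
# transcription plus a small penalty for changing tokens the ASR was
# confident about. Position-aligned src/tgt sequences are assumed for brevity.
import torch.nn.functional as F

def aec_loss(logits, gold_ids, asr_ids, error_mask, lambda_aux=0.1, pad_id=0):
    # logits: (B, T, V); gold_ids / asr_ids: (B, T); error_mask: (B, T) bool,
    # True where the ASR token is flagged as likely erroneous.
    vocab = logits.size(-1)
    main = F.cross_entropy(logits.reshape(-1, vocab), gold_ids.reshape(-1),
                           ignore_index=pad_id)
    # Auxiliary term: at high-confidence positions, reward keeping the
    # original ASR token, which discourages over-correction.
    keep = F.cross_entropy(logits.reshape(-1, vocab), asr_ids.reshape(-1),
                           ignore_index=pad_id,
                           reduction="none").view_as(asr_ids)
    high_conf = (~error_mask) & (asr_ids != pad_id)
    aux = (keep * high_conf.float()).sum() / high_conf.sum().clamp(min=1)
    return main + lambda_aux * aux
```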
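
The step 4 metrics can be computed with standard open-source scorers, as in this small sketch using jiwer and sacrebleu. The paper's own Burmese segmentation and scoring setup may differ, and sacrebleu reports chrF on a 0-100 scale while the results table below uses 0-1.

```python
# Sketch of the step-4 scoring with off-the-shelf tools; the strings here are
# placeholders, and Burmese text would first need word/syllable segmentation.
import jiwer
from sacrebleu.metrics import CHRF

references = ["this is the gold transcription"]     # gold transcripts (placeholders)
hypotheses = ["this is the gold transkription"]     # AEC-corrected ASR outputs

wer = jiwer.wer(references, hypotheses)             # word error rate
chrf_pp = CHRF(word_order=2)                        # word_order=2 gives chrF++
score = chrf_pp.corpus_score(hypotheses, [references])

print(f"WER: {wer:.4f}   chrF++: {score.score:.2f}")   # sacrebleu's chrF is 0-100
```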

Results & Findings

Metric                      | Avg. ASR (5 models) | + AEC (IPA + Alignment) | Δ Improvement
WER (no augmentation)       | 51.56 %             | 39.82 %                 | −11.74 % absolute
WER (with augmentation)     | 51.56 %             | 43.59 %                 | −7.97 % absolute
chrF++ (no augmentation)    | 0.5864              | 0.627                   | +0.0406
chrF++ (with augmentation)  | 0.5864              | 0.618                   | +0.0316

  • All five ASR backbones benefited from the same AEC model, confirming model‑agnostic robustness.
  • Adding only IPA features or only the alignment mask gave modest gains; the combined configuration consistently outperformed either alone, highlighting the complementary nature of phonetic and alignment cues.
  • The AEC rarely introduced new errors on high‑confidence tokens, thanks to the auxiliary loss, which kept the correction focused on truly problematic regions.

Practical Implications

  • Rapid quality lift for low-resource speech products – Deploying an AEC layer on top of an existing Burmese ASR system (or one for a similarly under-resourced language) can cut WER by roughly 8-12 absolute points without retraining the acoustic model.
  • Cost-effective pipeline – Since the AEC operates on text, it sidesteps the need for more expensive acoustic data collection; developers can improve user-facing voice assistants, transcription services, or captioning tools with a lightweight post-processor (see the integration sketch after this list).
  • Phonetic‑aware NLP – The IPA embedding technique can be reused for other downstream tasks (e.g., spelling correction, language modeling) where pronunciation information is valuable.
  • Open‑source toolkit – The authors provide a ready‑to‑run Docker image and scripts, making it easy for engineers to plug the correction model into existing speech pipelines (e.g., Kaldi, ESPnet, Hugging Face 🤗 Transformers).
  • Transferability – The alignment‑mask concept works with any confidence‑scoring ASR, so the same approach could be adapted to languages like Khmer, Lao, or even dialectal variants of larger languages.
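
To make the "lightweight post-processor" idea concrete, here is a purely illustrative integration sketch. Every name in it (recognize, encode_ipa, generate, the 0.8 confidence threshold) is a hypothetical placeholder for whatever a given ASR stack, G2P converter, and correction model expose; none of it is an API from the paper's release.

```python
# Illustrative only: stacking a text-level correction model on top of an
# existing recognizer. All APIs below are hypothetical placeholders.
import torch

CONF_THRESHOLD = 0.8   # assumed cut-off between "trusted" and "suspect" ASR tokens

def correct_transcript(asr, g2p, aec_model, tokenizer, audio):
    hyp_text, confidences = asr.recognize(audio)         # hypothetical ASR call
    src_ids = tokenizer.encode(hyp_text)                  # ASR hypothesis tokens
    ipa_ids = tokenizer.encode_ipa(g2p(hyp_text))         # IPA view of the hypothesis
    error_mask = torch.tensor([c < CONF_THRESHOLD for c in confidences])
    with torch.no_grad():
        out_ids = aec_model.generate(src_ids, ipa_ids, error_mask)   # greedy/beam decode
    return tokenizer.decode(out_ids)
```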

Limitations & Future Work

  • Data size – The study is constrained to ~30 h of Burmese speech; performance on larger corpora or with more diverse speakers remains untested.
  • Rule‑based IPA conversion – Errors in the grapheme‑to‑phoneme step can propagate to the AEC; a learned G2P model could improve robustness.
  • Real‑time latency – Adding a Transformer‑based post‑processor introduces extra inference time; optimizing for on‑device or streaming scenarios is an open challenge.
  • Cross‑language validation – While the authors hypothesize similar gains for other low‑resource languages, empirical verification is needed.

Bottom line: For developers building speech‑enabled applications in Burmese—or any language where high‑quality ASR data are scarce—stacking a phonetic‑ and alignment‑enhanced Transformer on top of the recognizer offers a pragmatic, plug‑and‑play route to markedly better transcription quality.

Authors

  • Ye Bhone Lin
  • Thura Aung
  • Ye Kyaw Thu
  • Thazin Myint Oo

Paper Information

  • arXiv ID: 2511.21088v1
  • Categories: cs.CL, cs.LG, cs.SD
  • Published: November 26, 2025