[Paper] ASR Error Correction in Low-Resource Burmese with Alignment-Enhanced Transformers using Phonetic Features
Source: arXiv - 2511.21088v1
Overview
The paper presents the first systematic study of automatic speech‑recognition (ASR) error correction for Burmese, a language with very limited annotated speech data. By augmenting a standard Transformer‑based sequence‑to‑sequence model with phonetic (IPA) cues and alignment information, the authors achieve a sizable drop in word error rate (WER) and a clear gain in character‑level quality (chrF++), even when the underlying ASR system is weak.
Key Contributions
- First Burmese‑specific ASR error‑correction (AEC) study – establishes a benchmark for a truly low‑resource language.
- Feature‑enhanced Transformer architecture that injects (i) International Phonetic Alphabet (IPA) representations of the input text and (ii) token‑level alignment masks into the encoder‑decoder attention.
- Comprehensive evaluation across five diverse ASR backbones (CNN‑RNN, CTC, wav2vec‑2.0, etc.), showing consistent improvements regardless of the base model.
- Robustness analysis with and without data augmentation, demonstrating that the proposed AEC still yields gains when the ASR training data are artificially expanded.
- Open‑source release of the code, pretrained models, and a small Burmese speech‑text corpus for reproducibility.
Methodology
- Baseline ASR pipeline – Train five off‑the‑shelf ASR models on the same low‑resource Burmese corpus (≈ 30 h of transcribed speech).
- Error‑correction model (AEC) – A standard Transformer encoder‑decoder is modified in two ways (an illustrative sketch follows this list):
- Phonetic embedding: Each input token is paired with its IPA transcription (generated via a rule‑based grapheme‑to‑phoneme converter). The IPA token is embedded and summed with the original word embedding, giving the model a pronunciation‑aware view of the text.
- Alignment mask: Using the ASR’s token‑level confidence scores and forced alignment, a binary mask tells the attention layers which positions are likely erroneous, encouraging the decoder to focus on correcting those spots.
- Training – The AEC is trained on pairs of raw ASR output → gold transcription using a cross‑entropy loss plus a small auxiliary loss that penalizes changes to high‑confidence tokens (to avoid over‑correction); a hedged sketch of this objective also appears below.
- Evaluation – Word Error Rate (WER) and chrF++ (character‑level F‑score) are computed on a held‑out test set. Experiments are run both on the raw ASR outputs and on outputs after simple data augmentation (speed‑perturbation, noise injection); a minimal metric‑computation sketch appears below as well.
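The summary describes the two architectural changes only at a high level, so the following PyTorch sketch shows one plausible way to realize them: summing word and IPA embeddings, and turning a binary error mask (derived from ASR confidences) into an additive attention bias. Every name here (PhoneticAwareEmbedding, error_mask_from_confidences, the additive‑bias formulation, the 0.5 confidence threshold) is an illustrative assumption, not the authors' implementation.

```python
import torch
import torch.nn as nn


class PhoneticAwareEmbedding(nn.Module):
    """Sums a subword embedding with an embedding of its IPA transcription.

    `ipa_ids` is assumed to hold one IPA-token id per input position,
    produced offline by a rule-based grapheme-to-phoneme converter.
    """

    def __init__(self, vocab_size: int, ipa_vocab_size: int, d_model: int):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model, padding_idx=0)
        self.ipa_emb = nn.Embedding(ipa_vocab_size, d_model, padding_idx=0)

    def forward(self, token_ids: torch.Tensor, ipa_ids: torch.Tensor) -> torch.Tensor:
        # Element-wise sum gives the encoder a pronunciation-aware view of each token.
        return self.tok_emb(token_ids) + self.ipa_emb(ipa_ids)


def error_mask_from_confidences(confidences: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    # Mark positions whose ASR token confidence falls below a threshold (1 = likely error).
    # The 0.5 threshold is a placeholder; the paper derives its mask from token-level
    # confidences and forced alignment without giving an explicit formula.
    return (confidences < threshold).long()


def alignment_attention_bias(error_mask: torch.Tensor, penalty: float = 2.0) -> torch.Tensor:
    # Turn the binary mask (batch, src_len) into an additive attention bias
    # (batch, 1, src_len): suspect positions receive a bonus so cross-attention
    # focuses on them; high-confidence positions are left untouched.
    return (error_mask.float() * penalty).unsqueeze(1)


if __name__ == "__main__":
    batch, src_len, d_model = 2, 6, 256
    embed = PhoneticAwareEmbedding(vocab_size=8000, ipa_vocab_size=200, d_model=d_model)

    token_ids = torch.randint(1, 8000, (batch, src_len))      # ASR output tokens
    ipa_ids = torch.randint(1, 200, (batch, src_len))         # IPA ids from the G2P step
    confidences = torch.rand(batch, src_len)                  # ASR token confidences

    src = embed(token_ids, ipa_ids)                           # (batch, src_len, d_model)
    bias = alignment_attention_bias(error_mask_from_confidences(confidences))
    print(src.shape, bias.shape)                              # sanity check of shapes
```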
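The training objective can be sketched in the same spirit: standard cross‑entropy against the gold transcription plus a small term that discourages edits at high‑confidence positions. The exact auxiliary formulation below (negative log‑likelihood of reproducing the ASR token where confidence is high, a 0.1 weight, and a 1:1 alignment between ASR tokens and decoder positions) is assumed for illustration only.

```python
import torch
import torch.nn.functional as F


def aec_loss(logits, gold_ids, asr_ids, confidence_mask, aux_weight: float = 0.1):
    # logits: (batch, length, vocab) decoder outputs
    # gold_ids / asr_ids: (batch, length) gold and raw-ASR token ids (assumed aligned 1:1)
    # confidence_mask: (batch, length), 1 where the ASR token is high-confidence
    vocab = logits.size(-1)
    ce = F.cross_entropy(logits.reshape(-1, vocab), gold_ids.reshape(-1))

    # Auxiliary term: penalize low probability of *keeping* the ASR token at
    # high-confidence positions, which discourages over-correction there.
    log_probs = F.log_softmax(logits, dim=-1)
    keep_nll = -log_probs.gather(-1, asr_ids.unsqueeze(-1)).squeeze(-1)   # (batch, length)
    conf = confidence_mask.float()
    aux = (keep_nll * conf).sum() / conf.sum().clamp(min=1.0)

    return ce + aux_weight * aux
```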
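For scoring, a minimal computation with common open‑source tools might look like this; the paper does not name its evaluation toolkits, the Burmese strings below are placeholders, and WER assumes the text is already word‑segmented. Note that sacrebleu reports chrF++ on a 0–100 scale, whereas the results table below uses a 0–1 scale.

```python
# Hedged evaluation sketch: jiwer for WER, sacrebleu's CHRF for chrF++ (word_order=2).
import jiwer
from sacrebleu.metrics import CHRF

references = ["မင်္ဂလာ ပါ"]   # gold transcription (placeholder, word-segmented)
hypotheses = ["မင်္ဂလာ ပါ"]   # corrected ASR output (placeholder)

wer = jiwer.wer(references, hypotheses)            # word error rate in [0, 1]
chrf = CHRF(word_order=2)                          # word_order=2 gives chrF++
chrf_score = chrf.corpus_score(hypotheses, [references])

print(f"WER: {wer:.4f}  chrF++: {chrf_score.score:.2f}")
```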
Results & Findings
| Metric | Avg. ASR baseline (5 models) | + AEC (IPA + alignment) | Δ (absolute) |
|---|---|---|---|
| WER ↓ (no augmentation) | 51.56 % | 39.82 % | −11.74 points |
| WER ↓ (with augmentation) | 51.56 % | 43.59 % | −7.97 points |
| chrF++ ↑ (no augmentation) | 0.5864 | 0.6270 | +0.0406 |
| chrF++ ↑ (with augmentation) | 0.5864 | 0.6180 | +0.0316 |
- All five ASR backbones benefited from the same AEC model, confirming model‑agnostic robustness.
- Adding only IPA or only alignment gave modest gains; the combined configuration consistently outperformed either alone, highlighting the complementary nature of phonetic and positional cues.
- The AEC rarely introduced new errors on high‑confidence tokens, thanks to the auxiliary loss, which kept the correction focused on truly problematic regions.
Practical Implications
- Rapid quality lift for low‑resource speech products – Deploying an AEC layer on top of any existing Burmese ASR (or similar under‑resourced languages) can cut WER by roughly 10 absolute points without retraining the acoustic model.
- Cost‑effective pipeline – Since the AEC operates on text, it sidesteps the need for more expensive acoustic data collection; developers can improve user‑facing voice assistants, transcription services, or captioning tools with a lightweight post‑processor.
- Phonetic‑aware NLP – The IPA embedding technique can be reused for other downstream tasks (e.g., spelling correction, language modeling) where pronunciation information is valuable.
- Open‑source toolkit – The authors provide a ready‑to‑run Docker image and scripts, making it easy for engineers to plug the correction model into existing speech pipelines (e.g., Kaldi, ESPnet, Hugging Face 🤗 Transformers); a minimal post‑processing sketch follows this list.
- Transferability – The alignment‑mask concept works with any confidence‑scoring ASR, so the same approach could be adapted to languages like Khmer, Lao, or even dialectal variants of larger languages.
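As a concrete illustration of the lightweight, text‑only post‑processing idea, the snippet below shows how a corrector could sit behind any ASR system, assuming the released checkpoint loads as a Hugging Face seq2seq model. The checkpoint path and generation settings are placeholders, not the authors' published artifacts.

```python
# Hedged deployment sketch: the acoustic model stays untouched; only its text
# output is rewritten by the correction model.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

CHECKPOINT = "path/to/burmese-aec-checkpoint"   # placeholder, not a released model name

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSeq2SeqLM.from_pretrained(CHECKPOINT)


def correct(asr_hypothesis: str) -> str:
    # Post-process one ASR hypothesis produced by Kaldi, ESPnet, wav2vec 2.0, etc.
    inputs = tokenizer(asr_hypothesis, return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, max_new_tokens=128, num_beams=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# corrected_text = correct(raw_asr_output)
```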
Limitations & Future Work
- Data size – The study is constrained to ~30 h of Burmese speech; performance on larger corpora or with more diverse speakers remains untested.
- Rule‑based IPA conversion – Errors in the grapheme‑to‑phoneme step can propagate to the AEC; a learned G2P model could improve robustness.
- Real‑time latency – Adding a Transformer‑based post‑processor introduces extra inference time; optimizing for on‑device or streaming scenarios is an open challenge.
- Cross‑language validation – While the authors hypothesize similar gains for other low‑resource languages, empirical verification is needed.
Bottom line: For developers building speech‑enabled applications in Burmese—or any language where high‑quality ASR data are scarce—stacking a phonetic‑ and alignment‑enhanced Transformer on top of the recognizer offers a pragmatic, plug‑and‑play route to markedly better transcription quality.
Authors
- Ye Bhone Lin
- Thura Aung
- Ye Kyaw Thu
- Thazin Myint Oo
Paper Information
- arXiv ID: 2511.21088v1
- Categories: cs.CL, cs.LG, cs.SD
- Published: November 26, 2025