[Paper] ASR Error Correction in Low-Resource Burmese with Alignment-Enhanced Transformers using Phonetic Features
Source: arXiv - 2511.21088v1
Overview
The paper presents the first systematic study of automatic speech‑recognition (ASR) error correction for Burmese, a language with very limited annotated speech data. By augmenting a standard Transformer‑based sequence‑to‑sequence model with phonetic (IPA) cues and alignment information, the authors achieve a sizable drop in word error rate (WER) and a clear gain in character‑level quality (chrF++), even when the underlying ASR system is weak.
Key Contributions
- First Burmese‑specific ASR error‑correction (AEC) study – establishes a benchmark for a truly low‑resource language.
- Feature‑enhanced Transformer architecture that injects (i) International Phonetic Alphabet (IPA) representations of the input text and (ii) token‑level alignment masks into the encoder‑decoder attention.
- Comprehensive evaluation across five diverse ASR backbones (CNN‑RNN, CTC, wav2vec‑2.0, etc.), showing consistent improvements regardless of the base model.
- Robustness analysis with and without data augmentation, demonstrating that the proposed AEC still yields gains when the ASR training data are artificially expanded.
- Open‑source release of the code, pretrained models, and a small Burmese speech‑text corpus for reproducibility.
Methodology
- Baseline ASR pipeline – Train five off‑the‑shelf ASR models on the same low‑resource Burmese corpus (≈ 30 h of transcribed speech).
- Error‑correction model (AEC) – A standard Transformer encoder‑decoder is modified in two ways (an illustrative sketch follows this list):
- Phonetic embedding: Each input token is paired with its IPA transcription (generated via a rule‑based grapheme‑to‑phoneme converter). The IPA token is embedded and summed with the original word embedding, giving the model a pronunciation‑aware view of the text.
- Alignment mask: Using the ASR’s token‑level confidence scores and forced alignment, a binary mask tells the attention layers which positions are likely erroneous, encouraging the decoder to focus on correcting those spots.
- Training – The AEC is trained on pairs of raw ASR output → gold transcription using a cross‑entropy loss plus a small auxiliary loss that penalizes changes to high‑confidence tokens (to avoid over‑correction); a hedged sketch of this objective also appears below.
- Evaluation – Word Error Rate (WER) and chrF++ (character‑level F‑score) are computed on a held‑out test set. Experiments are run both on the raw ASR outputs and on outputs after simple data augmentation (speed‑perturbation, noise injection); a minimal metric‑computation sketch appears below as well.
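The summary describes the two architectural changes only at a high level, so the following PyTorch sketch shows one plausible way to realize them: summing word and IPA embeddings, and turning a binary error mask (derived from ASR confidences) into an additive attention bias. Every name here (PhoneticAwareEmbedding, error_mask_from_confidences, the additive‑bias formulation, the 0.5 confidence threshold) is an illustrative assumption, not the authors' implementation.

```python
import torch
import torch.nn as nn


class PhoneticAwareEmbedding(nn.Module):
    """Sums a subword embedding with an embedding of its IPA transcription.

    `ipa_ids` is assumed to hold one IPA-token id per input position,
    produced offline by a rule-based grapheme-to-phoneme converter.
    """

    def __init__(self, vocab_size: int, ipa_vocab_size: int, d_model: int):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model, padding_idx=0)
        self.ipa_emb = nn.Embedding(ipa_vocab_size, d_model, padding_idx=0)

    def forward(self, token_ids: torch.Tensor, ipa_ids: torch.Tensor) -> torch.Tensor:
        # Element-wise sum gives the encoder a pronunciation-aware view of each token.
        return self.tok_emb(token_ids) + self.ipa_emb(ipa_ids)


def error_mask_from_confidences(confidences: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    # Mark positions whose ASR token confidence falls below a threshold (1 = likely error).
    # The 0.5 threshold is a placeholder; the paper derives its mask from token-level
    # confidences and forced alignment without giving an explicit formula.
    return (confidences < threshold).long()


def alignment_attention_bias(error_mask: torch.Tensor, penalty: float = 2.0) -> torch.Tensor:
    # Turn the binary mask (batch, src_len) into an additive attention bias
    # (batch, 1, src_len): suspect positions receive a bonus so cross-attention
    # focuses on them; high-confidence positions are left untouched.
    return (error_mask.float() * penalty).unsqueeze(1)


if __name__ == "__main__":
    batch, src_len, d_model = 2, 6, 256
    embed = PhoneticAwareEmbedding(vocab_size=8000, ipa_vocab_size=200, d_model=d_model)

    token_ids = torch.randint(1, 8000, (batch, src_len))      # ASR output tokens
    ipa_ids = torch.randint(1, 200, (batch, src_len))         # IPA ids from the G2P step
    confidences = torch.rand(batch, src_len)                  # ASR token confidences

    src = embed(token_ids, ipa_ids)                           # (batch, src_len, d_model)
    bias = alignment_attention_bias(error_mask_from_confidences(confidences))
    print(src.shape, bias.shape)                              # sanity check of shapes
```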
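The training objective can be sketched in the same spirit: standard cross‑entropy against the gold transcription plus a small term that discourages edits at high‑confidence positions. The exact auxiliary formulation below (negative log‑likelihood of reproducing the ASR token where confidence is high, a 0.1 weight, and a 1:1 alignment between ASR tokens and decoder positions) is assumed for illustration only.

```python
import torch
import torch.nn.functional as F


def aec_loss(logits, gold_ids, asr_ids, confidence_mask, aux_weight: float = 0.1):
    # logits: (batch, length, vocab) decoder outputs
    # gold_ids / asr_ids: (batch, length) gold and raw-ASR token ids (assumed aligned 1:1)
    # confidence_mask: (batch, length), 1 where the ASR token is high-confidence
    vocab = logits.size(-1)
    ce = F.cross_entropy(logits.reshape(-1, vocab), gold_ids.reshape(-1))

    # Auxiliary term: penalize low probability of *keeping* the ASR token at
    # high-confidence positions, which discourages over-correction there.
    log_probs = F.log_softmax(logits, dim=-1)
    keep_nll = -log_probs.gather(-1, asr_ids.unsqueeze(-1)).squeeze(-1)   # (batch, length)
    conf = confidence_mask.float()
    aux = (keep_nll * conf).sum() / conf.sum().clamp(min=1.0)

    return ce + aux_weight * aux
```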
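For scoring, a minimal computation with common open‑source tools might look like this; the paper does not name its evaluation toolkits, the Burmese strings below are placeholders, and WER assumes the text is already word‑segmented. Note that sacrebleu reports chrF++ on a 0–100 scale, whereas the results table below uses a 0–1 scale.

```python
# Hedged evaluation sketch: jiwer for WER, sacrebleu's CHRF for chrF++ (word_order=2).
import jiwer
from sacrebleu.metrics import CHRF

references = ["မင်္ဂလာ ပါ"]   # gold transcription (placeholder, word-segmented)
hypotheses = ["မင်္ဂလာ ပါ"]   # corrected ASR output (placeholder)

wer = jiwer.wer(references, hypotheses)            # word error rate in [0, 1]
chrf = CHRF(word_order=2)                          # word_order=2 gives chrF++
chrf_score = chrf.corpus_score(hypotheses, [references])

print(f"WER: {wer:.4f}  chrF++: {chrf_score.score:.2f}")
```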
Results & Findings
| Metric | Avg. ASR baseline (5 models) | + AEC (IPA + alignment) | Δ (absolute) |
|---|---|---|---|
| WER ↓ (no augmentation) | 51.56 % | 39.82 % | −11.74 points |
| WER ↓ (with augmentation) | 51.56 % | 43.59 % | −7.97 points |
| chrF++ ↑ (no augmentation) | 0.5864 | 0.6270 | +0.0406 |
| chrF++ ↑ (with augmentation) | 0.5864 | 0.6180 | +0.0316 |
- All five ASR backbones benefited from the same AEC model, confirming model‑agnostic robustness.
- Adding only IPA or only alignment gave modest gains; the combined configuration consistently outperformed either alone, highlighting the complementary nature of phonetic and positional cues.
- The AEC rarely introduced new errors on high‑confidence tokens, thanks to the auxiliary loss, which kept the correction focused on truly problematic regions.
Practical Implications
- Rapid quality lift for low‑resource speech products – Deploying an AEC layer on top of any existing Burmese ASR (or similar under‑resourced languages) can cut WER by roughly 10 absolute points without retraining the acoustic model.
- Cost‑effective pipeline – Since the AEC operates on text, it sidesteps the need for more expensive acoustic data collection; developers can improve user‑facing voice assistants, transcription services, or captioning tools with a lightweight post‑processor.
- Phonetic‑aware NLP – The IPA embedding technique can be reused for other downstream tasks (e.g., spelling correction, language modeling) where pronunciation information is valuable.
- Open‑source toolkit – The authors provide a ready‑to‑run Docker image and scripts, making it easy for engineers to plug the correction model into existing speech pipelines (e.g., Kaldi, ESPnet, Hugging Face 🤗 Transformers); a minimal post‑processing sketch follows this list.
- Transferability – The alignment‑mask concept works with any confidence‑scoring ASR, so the same approach could be adapted to languages like Khmer, Lao, or even dialectal variants of larger languages.
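As a concrete illustration of the lightweight, text‑only post‑processing idea, the snippet below shows how a corrector could sit behind any ASR system, assuming the released checkpoint loads as a Hugging Face seq2seq model. The checkpoint path and generation settings are placeholders, not the authors' published artifacts.

```python
# Hedged deployment sketch: the acoustic model stays untouched; only its text
# output is rewritten by the correction model.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

CHECKPOINT = "path/to/burmese-aec-checkpoint"   # placeholder, not a released model name

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSeq2SeqLM.from_pretrained(CHECKPOINT)


def correct(asr_hypothesis: str) -> str:
    # Post-process one ASR hypothesis produced by Kaldi, ESPnet, wav2vec 2.0, etc.
    inputs = tokenizer(asr_hypothesis, return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, max_new_tokens=128, num_beams=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# corrected_text = correct(raw_asr_output)
```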
Limitations & Future Work
- Data size – The study is constrained to ~30 h of Burmese speech; performance on larger corpora or with more diverse speakers remains untested.
- Rule‑based IPA conversion – Errors in the grapheme‑to‑phoneme step can propagate to the AEC; a learned G2P model could improve robustness.
- Real‑time latency – Adding a Transformer‑based post‑processor introduces extra inference time; optimizing for on‑device or streaming scenarios is an open challenge.
- Cross‑language validation – While the authors hypothesize similar gains for other low‑resource languages, empirical verification is needed.
Bottom line: For developers building speech‑enabled applications in Burmese—or any language where high‑quality ASR data are scarce—stacking a phonetic‑ and alignment‑enhanced Transformer on top of the recognizer offers a pragmatic, plug‑and‑play route to markedly better transcription quality.
Authors
- Ye Bhone Lin
- Thura Aung
- Ye Kyaw Thu
- Thazin Myint Oo
Paper Information
- arXiv ID: 2511.21088v1
- Categories: cs.CL, cs.LG, cs.SD
- Published: November 26, 2025