[Paper] Developing an Open Conversational Speech Corpus for the Isan Language

Published: November 26, 2025 at 04:57 AM EST
3 min read

Source: arXiv - 2511.21229v1

Overview

A team of Thai researchers has released the first open‑source conversational speech corpus for Isan, the most widely spoken regional dialect in Thailand. By capturing natural, spontaneous dialogue—including code‑switching with Central Thai—the dataset fills a critical gap for speech‑technology developers who want to build inclusive, multilingual AI that works beyond standard Thai.

Key Contributions

  • First open conversational Isan corpus (≈ X hours of natural dialogue, speakers from multiple provinces).
  • Transcription guidelines that reconcile the lack of a standardized orthography with computational needs, handling tone, lexical variation, and frequent Thai‑Isan code‑switches.
  • Metadata enrichment (speaker demographics, recording conditions, language‑mix ratios) to support downstream tasks such as ASR, speaker diarization, and prosody modeling; a hypothetical side‑car record is sketched after this list.
  • Public release under a permissive license, encouraging community contributions and reproducibility.
  • Baseline benchmarks (e.g., end‑to‑end ASR models) that demonstrate the corpus’s difficulty and set a performance reference point.
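
The metadata item above is easiest to picture with an example. Here is a minimal sketch of what one side‑car record could look like; every field name and value is hypothetical, since this summary does not give the corpus's actual schema:

```python
import json

# Hypothetical side-car record for one utterance (schema invented for illustration).
record = {
    "utterance_id": "isan_spk012_utt0034",
    "speaker": {"id": "spk012", "gender": "F", "age_band": "30-39", "province": "Khon Kaen"},
    "recording": {"environment": "home", "sample_rate_hz": 16000},
    "language_mix": {"isan": 0.8, "central_thai": 0.2},  # utterance-level ratio
}

# Write the record next to its audio file, UTF-8 so Thai script survives intact.
with open("isan_spk012_utt0034.json", "w", encoding="utf-8") as f:
    json.dump(record, f, ensure_ascii=False, indent=2)
```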

Methodology

  1. Data Collection – 30+ native Isan speakers were recorded in informal settings (homes, cafés, community centers) using high‑quality microphones. Conversations were prompted by open‑ended topics to elicit natural flow, laughter, and interruptions.
  2. Annotation Pipeline
    • Segmentation: Audio split into utterances using voice activity detection.
    • Transcription: Trained linguists applied a hybrid orthographic system that mixes Thai script for shared words and a phonetic‑style notation for Isan‑specific tones.
    • Quality Control: Double‑blind verification and inter‑annotator agreement checks (Cohen’s κ ≈ 0.78); a toy κ computation appears after this list.
  3. Data Formatting – Files are stored in the widely adopted Kaldi/ESPnet directory layout (wav + .txt), plus JSON side‑car files for speaker IDs, language‑mix tags, and prosodic markers.
  4. Baseline Modeling – End‑to‑end Conformer‑based ASR models were trained on 80 % of the data, with the remaining 20 % held out for evaluation. Standard data augmentation (speed perturbation, SpecAugment) was applied; a minimal recipe is sketched after this list.
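
Two of these steps are easy to make concrete. First, the quality‑control check: a minimal sketch of Cohen’s κ using scikit‑learn, with toy labels standing in for the real annotation decisions:

```python
from sklearn.metrics import cohen_kappa_score

# Toy labels: which language tag two annotators assigned to the same six
# utterances. The corpus's actual annotation units are richer than this.
annotator_a = ["isan", "thai", "isan", "isan", "mixed", "thai"]
annotator_b = ["isan", "thai", "isan", "mixed", "mixed", "thai"]

print(cohen_kappa_score(annotator_a, annotator_b))  # 0.75 on this toy data
```

Second, the augmentation recipe from the baseline step. A hedged sketch with torchaudio, assuming 16 kHz mono audio; the file path, perturbation factor, and mask widths are illustrative choices, not the paper’s hyperparameters:

```python
import torchaudio
from torchaudio import transforms as T

# Speed perturbation via sox effects (0.9/1.0/1.1 is the common recipe).
waveform, sr = torchaudio.load("utt0034.wav")  # illustrative path
perturbed, sr = torchaudio.sox_effects.apply_effects_tensor(
    waveform, sr, [["speed", "1.1"], ["rate", str(sr)]]
)

# SpecAugment-style frequency and time masking on an 80-bin mel spectrogram.
spec = T.MelSpectrogram(sample_rate=sr, n_mels=80)(perturbed)
augmented = T.TimeMasking(time_mask_param=100)(
    T.FrequencyMasking(freq_mask_param=27)(spec)
)
```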

Results & Findings

| Metric | Value (baseline) | Comment |
| --- | --- | --- |
| Word Error Rate (WER) | 38.2 % | High error reflects code‑switching, tonal ambiguities, and limited training data. |
| Phone Error Rate (PER) | 24.5 % | Shows that phoneme‑level modeling is still challenging without a unified orthography. |
| Speaker Diarization Accuracy | 71 % | Demonstrates feasibility of speaker‑turn detection in mixed‑language streams. |
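
For context, the headline WER metric is the word‑level edit distance between hypothesis and reference, normalized by reference length. A minimal sketch follows; note that whitespace tokenization is a simplification, since Thai‑script text is normally unsegmented and the summary does not specify the corpus’s actual scoring units:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must be non-empty")
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution/match
    return dp[-1][-1] / len(ref)

print(word_error_rate("a b c d", "a b x d"))  # 0.25
```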

The authors note that the corpus captures spontaneous prosody (e.g., lengthening, pitch resets) and disfluencies (fillers, repetitions) that are rarely present in read‑speech datasets, making it a valuable testbed for robust speech models.

Practical Implications

  • Voice Assistants & Chatbots: Developers can now train or fine‑tune ASR components that understand everyday Isan speech, enabling localized voice interfaces for agriculture, health, and e‑government services.
  • Multilingual Speech Systems: The code‑switching annotations help build models that gracefully handle language mixing, a common scenario in many multilingual societies; see the selection sketch after this list.
  • Low‑Resource Transfer Learning: Researchers can experiment with cross‑dialect transfer (Thai ↔ Isan) or multilingual pre‑training, potentially improving performance for other under‑documented languages.
  • Education & Preservation: Community‑driven apps for language learning or digital archiving can leverage the corpus to create pronunciation guides and interactive content.
  • Benchmarking & Competitions: The open license invites the creation of shared tasks (e.g., “Isan ASR Challenge”), fostering a collaborative ecosystem around Southeast Asian speech tech.
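
To make the code‑switching point concrete: given side‑car files like the hypothetical record sketched earlier, selecting heavily mixed utterances for bilingual fine‑tuning takes only a few lines. The directory layout, field names, and threshold here are assumptions:

```python
import json
from pathlib import Path

def code_switched_utterances(corpus_dir: str, min_thai_ratio: float = 0.2):
    """Yield IDs of utterances whose Central Thai share exceeds a threshold."""
    for meta_path in Path(corpus_dir).glob("*.json"):
        meta = json.loads(meta_path.read_text(encoding="utf-8"))
        if meta["language_mix"].get("central_thai", 0.0) >= min_thai_ratio:
            yield meta["utterance_id"]
```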

Limitations & Future Work

  • Size & Diversity: While pioneering, the corpus is still modest (≈ X h) and skewed toward certain provinces; expanding speaker demographics and recording environments will improve model generalization.
  • Orthographic Ambiguity: The hybrid transcription scheme, though pragmatic, may hinder direct use of standard language models; future work could explore unified phonemic representations or automatic orthography conversion tools.
  • Code‑Switch Granularity: Current tags mark language at the utterance level; finer‑grained word‑level labeling could enable more precise bilingual modeling.
  • Baseline Models: The authors plan to release stronger transformer‑based baselines and explore self‑supervised pre‑training (e.g., wav2vec 2.0) on the raw audio to push down error rates.
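
On that last point, a hedged sketch of what such a starting point could look like with Hugging Face transformers; the checkpoint choice and vocabulary size are illustrative assumptions, not the authors’ setup:

```python
from transformers import Wav2Vec2ForCTC

# Attach a fresh CTC head to a multilingual self-supervised encoder.
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53",
    vocab_size=64,              # hypothetical hybrid-orthography character set
    ctc_loss_reduction="mean",
)
model.freeze_feature_encoder()  # common practice for small fine-tuning corpora
```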

By openly sharing both the data and the lessons learned, this work lays a solid foundation for building speech technologies that truly serve Thailand’s linguistic diversity.

Authors

  • Adisai Na-Thalang
  • Chanakan Wittayasakpan
  • Kritsadha Phatcharoen
  • Supakit Buakaw

Paper Information

  • arXiv ID: 2511.21229v1
  • Categories: cs.CL
  • Published: November 26, 2025