[Paper] Speaker-Aware Simulation Improves Conversational Speech Recognition
Source: arXiv - 2602.04776v1
Overview
The paper explores how to boost automatic speech recognition (ASR) for everyday conversations by turning single‑speaker recordings into realistic multi‑speaker dialogues. By adapting the Speaker‑Aware Simulation (SASC) technique to Hungarian—and extending it with a new C‑SASC variant that better models pauses—the authors show that synthetic conversation data can meaningfully improve ASR performance, even for a lower‑resource language.
Key Contributions
- Adaptation of SASC to Hungarian – Demonstrates that the speaker‑aware simulation pipeline, previously validated only on English, works well for a typologically different, lower‑resource language.
- Introduction of C‑SASC – Adds a duration‑conditioned pause model that captures the fine‑grained timing patterns of natural turn‑taking.
- Large‑scale synthetic dialogue generation – Produces thousands of simulated Hungarian dialogues from the BEA‑Large single‑speaker corpus, using turn‑taking statistics from three real conversational corpora (CallHome, BEA‑Dialogue, GRASS).
- Comprehensive evaluation – Benchmarks SASC and C‑SASC against naive concatenation across multiple simulation settings and reports consistent gains in word‑ and character‑error rates.
- Insight into statistical matching – Shows that the benefit of C‑SASC hinges on how closely the simulated turn‑taking statistics align with the target domain.
Methodology
- Base Corpus – The authors start with the BEA‑Large dataset, which contains clean, single‑speaker Hungarian speech recordings and transcriptions.
- Speaker‑Aware Simulation (SASC)
- Randomly assign each utterance a synthetic speaker ID.
- Concatenate utterances from different speakers according to a turn‑taking distribution (e.g., probability of a speaker change after a given number of words).
- Insert short silences between turns to mimic natural pauses.
- C‑SASC Extension
- Augments the pause‑insertion step by conditioning pause length on the duration of the preceding utterance.
- Uses empirical pause‑duration curves derived from real Hungarian dialogues, so longer utterances tend to be followed by longer gaps, reflecting human conversational rhythm.
- Statistical Sources – Turn‑taking and pause statistics are extracted from three corpora:
- CallHome (telephone conversations)
- BEA‑Dialogue (in‑house Hungarian dialogues)
- GRASS (spontaneous speech)
- Training Pipeline – Synthetic dialogues are mixed with the limited amount of genuine conversational data. A standard end‑to‑end transformer‑based ASR model is trained on the combined set.
- Evaluation – Models are tested on held‑out Hungarian conversational test sets, reporting Word Error Rate (WER) and Character Error Rate (CER).
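The simulation pipeline above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the authors' code: the function name, the two‑speaker setup, and the linear pause model for C‑SASC are all assumptions made for clarity.

```python
# Sketch of SASC / C-SASC dialogue simulation (illustrative; all names
# and the linear pause model are assumptions, not the paper's code).
import random

def simulate_dialogue(utterances, n_turns=6, p_speaker_change=0.6,
                      duration_conditioned=False, seed=0):
    """Build one synthetic two-speaker dialogue timeline.

    utterances: list of (utt_id, duration_sec) from a single-speaker corpus.
    Returns a list of (speaker, utt_id, start_time_sec) tuples.
    """
    rng = random.Random(seed)
    speaker, t, timeline = 0, 0.0, []
    prev_dur = 0.0
    for _ in range(n_turns):
        utt_id, dur = rng.choice(utterances)
        # C-SASC: pause length grows with the preceding utterance's duration;
        # plain SASC draws the pause independently of context.
        if duration_conditioned:
            pause = 0.1 + 0.05 * prev_dur + rng.uniform(0.0, 0.2)
        else:
            pause = rng.uniform(0.1, 0.5)
        t += pause if timeline else 0.0  # no leading pause before first turn
        timeline.append((speaker, utt_id, round(t, 3)))
        t += dur
        prev_dur = dur
        # Turn-taking: switch speakers with some probability after each turn.
        if rng.random() < p_speaker_change:
            speaker = 1 - speaker
    return timeline
```

In a real pipeline the timeline would drive concatenation of the corresponding waveforms (with silence of the chosen length between turns), and the pause and speaker-change parameters would be fitted to the statistics of a target corpus such as CallHome or BEA‑Dialogue.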
Results & Findings
| System | WER ↓ | CER ↓ |
|---|---|---|
| Baseline (real data only) | 23.5 % | 12.8 % |
| Baseline + naive concatenation | 22.9 % | 12.4 % |
| Baseline + SASC (best config) | 21.7 % | 11.6 % |
| Baseline + C‑SASC (matched stats) | 21.4 % | 11.3 % |
- SASC consistently outperforms naive concatenation, confirming that speaker‑aware turn modeling adds useful acoustic variability.
- C‑SASC delivers modest but systematic improvements, especially in CER, indicating better handling of fine‑grained timing cues.
- Gains are largest when the simulated turn‑taking statistics closely match the test domain (e.g., using CallHome stats for telephone‑style evaluation).
- The improvements hold across different model sizes, suggesting the approach is model‑agnostic.
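The WER figures in the table are the standard word‑level edit distance normalized by reference length (CER is the same computation over characters). A minimal implementation of the metric, for reference only and not taken from the paper's evaluation code, is:

```python
# Minimal word error rate: Levenshtein distance over word tokens,
# normalized by reference length. Standard definition, not paper-specific.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)
```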
Practical Implications
- Data‑efficient ASR development – Teams building speech recognizers for languages with limited conversational corpora can generate high‑quality synthetic dialogues from existing single‑speaker recordings, reducing the need for costly multi‑speaker annotation.
- Rapid prototyping for voice assistants – By plugging in a language‑specific single‑speaker dataset, developers can quickly produce a conversational ASR model suitable for chat‑bots, call‑center automation, or smart‑home devices.
- Domain adaptation – Adjusting the turn‑taking and pause statistics to match a target use‑case (e.g., call‑center vs. casual chat) can tailor the synthetic data for better performance without collecting new recordings.
- Open‑source pipeline potential – The SASC/C‑SASC workflow is lightweight (no complex TTS or speaker‑conversion models) and can be integrated into the data‑augmentation scripts of common ASR toolkits (e.g., ESPnet, Kaldi, Whisper‑style models).
Limitations & Future Work
- Statistical dependency – The benefit of C‑SASC diminishes when simulated statistics diverge from the target domain, highlighting a reliance on accurate turn‑taking data.
- Synthetic realism ceiling – While SASC improves acoustic diversity, it does not capture higher‑level discourse phenomena (e.g., back‑channeling, overlapping speech).
- Language‑specific tuning – The pause‑conditioning model was handcrafted for Hungarian; extending it to languages with different prosodic patterns may require additional research.
- Future directions suggested by the authors include:
- Incorporating overlap modeling and speaker emotion cues into the simulation.
- Exploring self‑supervised pre‑training on synthetic dialogues to further reduce reliance on real conversational data.
- Scaling the approach to multilingual settings where a single source corpus can be repurposed for several low‑resource languages.
Authors
- Máté Gedeon
- Péter Mihajlik
Paper Information
- arXiv ID: 2602.04776v1
- Categories: cs.SD, cs.CL, eess.AS
- Published: February 4, 2026