[Paper] SpokenUS: A Spoken User Simulator for Task-Oriented Dialogue

Published: March 17, 2026 at 12:58 PM EDT

Source: arXiv - 2603.16783v1

Overview

The paper introduces SpokenUS, a spoken‑language user simulator designed for training and evaluating task‑oriented dialogue (TOD) systems. The authors first release a large spoken dialogue corpus, SpokenTOD, comprising 52,390 dialogues and 1,034 hours of audio, and show how realistic speech phenomena (cross‑turn slot mentions, barge‑in, disfluencies, and emotional prosody) can be systematically injected into existing data, enabling a more faithful simulation of how real users converse with voice assistants.

Key Contributions

  • SpokenTOD dataset: 52,390 spoken TOD dialogues covering multiple domains, annotated with four user behaviors (cross‑turn slots, barge‑in, disfluency, emotional prosody).
  • SpokenUS simulator: A modular architecture that generates spoken user utterances with the above behaviors, including a dedicated barge‑in module that lets the user interrupt the system mid‑response.
  • Goal‑coverage parity: Despite being far smaller than general‑purpose large language models, SpokenUS matches them in the variety of user goals it can express.
  • Human evaluation advantage: MOS (Mean Opinion Score) tests show SpokenUS produces more natural, human‑like speech than baseline simulators, especially in the gradual revelation of slot values.
  • Open‑source pipeline: The authors release code and data augmentation scripts, providing a reproducible way to enrich existing TOD corpora with realistic spoken phenomena.

Methodology

  1. Data Augmentation – Starting from existing text‑based TOD corpora, the team applied rule‑based and neural transformations to insert the four target behaviors. For example, slot mentions were deferred to later turns (yielding cross‑turn slots), disfluencies (e.g., “uh”, “um”) were injected using a trained filler‑insertion model, and emotional prosody was added by conditioning a TTS system on affect labels.
  2. SpokenUS Architecture – The simulator consists of three tightly coupled modules:
    • Goal Planner: selects a user goal and decides the order of slot requests.
    • Behavior Controller: decides at each turn whether to barge‑in, add a disfluency, or modify prosody, based on a learned policy that mimics human turn‑taking statistics.
    • Speech Generator: renders the final utterance with a neural TTS model that can vary pitch, rate, and intensity to convey the chosen emotion.
  3. Training & Evaluation – The behavior controller is trained on the augmented SpokenTOD data using supervised learning, while the TTS component is fine‑tuned on the same audio to capture the prosodic patterns. Human judges rated naturalness (MOS) and the realism of slot‑value timing, and automatic metrics measured goal coverage and dialogue success rates.
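As an illustration of the rule‑based side of step 1, a minimal filler‑insertion pass could look like the sketch below. The filler inventory and insertion probability here are hypothetical placeholders; the paper uses a trained filler‑insertion model rather than a fixed rule.

```python
import random

# Hypothetical filler inventory (not taken from the paper).
FILLERS = ["uh", "um", "you know"]

def inject_disfluencies(utterance: str, p: float = 0.2, seed: int = 0) -> str:
    """Insert a filler word before each token with probability p.

    A toy stand-in for a learned filler-insertion model: the original
    tokens are preserved in order, with fillers interleaved.
    """
    rng = random.Random(seed)
    out = []
    for tok in utterance.split():
        if rng.random() < p:
            out.append(rng.choice(FILLERS))
        out.append(tok)
    return " ".join(out)

print(inject_disfluencies("I need a table for two at seven"))
```

A real pipeline would condition insertion positions on syntax and dialogue context instead of a uniform coin flip, but the interface (clean utterance in, disfluent utterance out) is the same.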

Results & Findings

| Metric | SpokenUS | Baseline simulators | Large LM (e.g., GPT‑4) |
| --- | --- | --- | --- |
| Goal coverage (unique goal combinations) | ≈ 98 % of large‑LM | 85 % | 100 % |
| Human MOS (naturalness) | 4.2 / 5 | 3.5 / 5 | 4.0 / 5 |
| Slot‑value reveal timing (human‑like) | Gradual; 78 % match human patterns | 45 % (often front‑loaded) | 70 % |
| Barge‑in handling (agent error rate) | 12 % | 28 % | 15 % |

Key takeaways

  • SpokenUS produces utterances that humans rate as more natural than existing simulators and even competitive with large language models, despite being far smaller.
  • The simulator’s ability to delay slot disclosure mirrors real user behavior, which is crucial for training agents that must ask clarifying questions.
  • Introducing barge‑in and emotional prosody creates measurable stress tests for downstream dialogue managers, exposing weaknesses that text‑only training misses.
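The gradual slot disclosure noted above can be pictured as a scheduler that spreads a goal's slot values over several turns. This is only a toy illustration; in the paper the Goal Planner learns the ordering from data rather than batching slots mechanically.

```python
from collections import deque

def schedule_slot_reveals(goal: dict, slots_per_turn: int = 1) -> list[dict]:
    """Split a user goal's slots into per-turn batches.

    Mimics gradual (cross-turn) slot revelation: instead of
    front-loading every constraint in turn one, the simulated user
    discloses at most `slots_per_turn` slots each turn.
    """
    pending = deque(goal.items())
    turns = []
    while pending:
        batch = [pending.popleft()
                 for _ in range(min(slots_per_turn, len(pending)))]
        turns.append(dict(batch))
    return turns

goal = {"cuisine": "italian", "time": "7pm", "party_size": "2"}
print(schedule_slot_reveals(goal))
```

An agent trained only against front-loaded goals never learns to ask follow-up questions; a schedule like this forces that behavior to be exercised.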

Practical Implications

  • Robust Voice Assistant Development – Teams can plug SpokenUS into their training pipelines to expose their dialogue policies to realistic interruptions and hesitations, reducing failure cases when the product reaches real users.
  • Automated Testing – The simulator can generate thousands of varied spoken interactions on demand, enabling continuous integration (CI) testing of speech‑recognition, intent‑classification, and policy‑selection components.
  • Domain Expansion – Because the augmentation pipeline is domain‑agnostic, developers can quickly adapt existing text‑based datasets (e.g., restaurant booking, travel) into spoken form, saving months of data collection.
  • Emotion‑aware Systems – By providing emotional prosody, SpokenUS helps developers prototype agents that adapt responses based on user affect (e.g., calming tone for frustrated users).
  • Open‑source Ecosystem – The released code and data lower the barrier for startups and research labs to build more resilient spoken dialogue agents without needing massive in‑house speech corpora.
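As a sketch of how a user simulator could slot into an automated test loop, the harness below drives a dialogue system with simulated users and computes a task-success rate. Both `SimulatedUser` and `DialogueSystem` are hypothetical stand-ins, not the released SpokenUS API.

```python
class SimulatedUser:
    """Toy user that reveals one goal slot per turn."""
    def __init__(self, goal: dict):
        self.pending = dict(goal)

    def next_utterance(self):
        if not self.pending:
            return None  # goal exhausted: user is done
        slot, value = self.pending.popitem()
        return f"I want {slot} to be {value}"

class DialogueSystem:
    """Trivial system under test: parses 'I want X to be Y'."""
    def __init__(self):
        self.filled = {}

    def respond(self, utterance: str):
        words = utterance.split()
        self.filled[words[2]] = words[-1]

def run_episode(goal: dict) -> bool:
    """One simulated dialogue; success = every goal slot was filled."""
    user, system = SimulatedUser(goal), DialogueSystem()
    while (utt := user.next_utterance()) is not None:
        system.respond(utt)
    return system.filled == goal

goals = [{"cuisine": "thai", "time": "6pm"}, {"area": "north"}]
success_rate = sum(run_episode(g) for g in goals) / len(goals)
print(f"success rate: {success_rate:.0%}")
```

In a real CI setup the episode loop would pass audio through ASR and the dialogue policy, and the success threshold would gate the build.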

Limitations & Future Work

  • Speaker Diversity – While SpokenTOD includes many speakers, the acoustic variety still falls short of the full range of accents, dialects, and background noises encountered in the wild.
  • Rule‑Based Augmentation Bias – Some behavior insertions rely on handcrafted rules, which may not capture all nuanced human speech patterns.
  • Scalability of Emotional Labels – The current prosody model uses a limited set of emotion categories; richer affective states remain unexplored.
  • Evaluation Scope – Human MOS was collected on a subset of domains; broader user studies (e.g., longitudinal interaction) are needed to confirm long‑term benefits.

Future directions include expanding the speaker pool with crowd‑sourced recordings, integrating end‑to‑end neural augmentation (removing rule‑based steps), and extending the simulator to multimodal contexts (e.g., visual cues alongside speech).

Authors

  • Jonggeun Lee
  • Junseong Pyo
  • Jeongmin Park
  • Yohan Jo

Paper Information

  • arXiv ID: 2603.16783v1
  • Categories: cs.CL
  • Published: March 17, 2026
  • PDF: Download PDF