[Paper] SpokenUS: A Spoken User Simulator for Task-Oriented Dialogue

Published: March 17, 2026 at 12:58 PM EDT

Source: arXiv - 2603.16783v1

Overview

The paper introduces SpokenUS, a spoken‑language user simulator designed for training and evaluating task‑oriented dialogue (TOD) systems. The authors first release a large spoken dialogue corpus, SpokenTOD, comprising 52,390 dialogues and 1,034 hours of audio, and show how realistic speech phenomena (cross‑turn slot mentions, barge‑in, disfluencies, and emotional prosody) can be systematically injected into existing data, enabling a more faithful simulation of how real users converse with voice assistants.

Key Contributions

  • SpokenTOD dataset: 52,390 spoken TOD dialogues covering multiple domains, annotated with four user behaviors (cross‑turn slots, barge‑in, disfluency, emotional prosody).
  • SpokenUS simulator: A modular architecture that generates spoken user utterances with the above behaviors, including a dedicated barge‑in module that lets the user interrupt the system mid‑response.
  • Goal‑coverage parity: Despite being far smaller than general‑purpose large language models, SpokenUS matches them in the variety of user goals it can express.
  • Human evaluation advantage: MOS (Mean Opinion Score) tests show SpokenUS produces more natural, human‑like speech than baseline simulators, especially in the gradual revelation of slot values.
  • Open‑source pipeline: The authors release code and data augmentation scripts, providing a reproducible way to enrich existing TOD corpora with realistic spoken phenomena.

Methodology

  1. Data Augmentation – Starting from existing text‑based TOD corpora, the team applied rule‑based and neural transformations to insert the four target behaviors. For example, slot mentions were deferred to later turns (yielding cross‑turn slots), disfluencies (e.g., “uh”, “um”) were injected using a trained filler‑insertion model, and emotional prosody was added by conditioning a TTS system on affect labels.
  2. SpokenUS Architecture – The simulator consists of three tightly coupled modules:
    • Goal Planner: selects a user goal and decides the order of slot requests.
    • Behavior Controller: decides at each turn whether to barge‑in, add a disfluency, or modify prosody, based on a learned policy that mimics human turn‑taking statistics.
    • Speech Generator: renders the final utterance with a neural TTS model that can vary pitch, rate, and intensity to convey the chosen emotion.
  3. Training & Evaluation – The behavior controller is trained on the augmented SpokenTOD data using supervised learning, while the TTS component is fine‑tuned on the same audio to capture the prosodic patterns. Human judges rated naturalness (MOS) and the realism of slot‑value timing, and automatic metrics measured goal coverage and dialogue success rates.
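As an illustration of the rule‑based side of step 1, a minimal filler‑insertion pass could look like the sketch below. The filler inventory and insertion probability here are hypothetical placeholders; the paper uses a trained filler‑insertion model rather than a fixed rule.

```python
import random

# Hypothetical filler inventory (not taken from the paper).
FILLERS = ["uh", "um", "you know"]

def inject_disfluencies(utterance: str, p: float = 0.2, seed: int = 0) -> str:
    """Insert a filler word before each token with probability p.

    A toy stand-in for a learned filler-insertion model: the original
    tokens are preserved in order, with fillers interleaved.
    """
    rng = random.Random(seed)
    out = []
    for tok in utterance.split():
        if rng.random() < p:
            out.append(rng.choice(FILLERS))
        out.append(tok)
    return " ".join(out)

print(inject_disfluencies("I need a table for two at seven"))
```

A real pipeline would condition insertion positions on syntax and dialogue context instead of a uniform coin flip, but the interface (clean utterance in, disfluent utterance out) is the same.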

Results & Findings

| Metric | SpokenUS | Baseline simulators | Large LM (e.g., GPT‑4) |
| --- | --- | --- | --- |
| Goal coverage (unique goal combinations) | ≈ 98 % of large‑LM | 85 % | 100 % |
| Human MOS (naturalness) | 4.2 / 5 | 3.5 / 5 | 4.0 / 5 |
| Slot‑value reveal timing (human‑like) | Gradual; 78 % match human patterns | 45 % (often front‑loaded) | 70 % |
| Barge‑in handling (agent error rate) | 12 % | 28 % | 15 % |

Key takeaways

  • SpokenUS produces utterances that humans rate as more natural than existing simulators and even competitive with large language models, despite being far smaller.
  • The simulator’s ability to delay slot disclosure mirrors real user behavior, which is crucial for training agents that must ask clarifying questions.
  • Introducing barge‑in and emotional prosody creates measurable stress tests for downstream dialogue managers, exposing weaknesses that text‑only training misses.
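The gradual slot disclosure noted above can be pictured as a scheduler that spreads a goal's slot values over several turns. This is only a toy illustration; in the paper the Goal Planner learns the ordering from data rather than batching slots mechanically.

```python
from collections import deque

def schedule_slot_reveals(goal: dict, slots_per_turn: int = 1) -> list[dict]:
    """Split a user goal's slots into per-turn batches.

    Mimics gradual (cross-turn) slot revelation: instead of
    front-loading every constraint in turn one, the simulated user
    discloses at most `slots_per_turn` slots each turn.
    """
    pending = deque(goal.items())
    turns = []
    while pending:
        batch = [pending.popleft()
                 for _ in range(min(slots_per_turn, len(pending)))]
        turns.append(dict(batch))
    return turns

goal = {"cuisine": "italian", "time": "7pm", "party_size": "2"}
print(schedule_slot_reveals(goal))
```

An agent trained only against front-loaded goals never learns to ask follow-up questions; a schedule like this forces that behavior to be exercised.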

Practical Implications

  • Robust Voice Assistant Development – Teams can plug SpokenUS into their training pipelines to expose their dialogue policies to realistic interruptions and hesitations, reducing failure cases when the product reaches real users.
  • Automated Testing – The simulator can generate thousands of varied spoken interactions on demand, enabling continuous integration (CI) testing of speech‑recognition, intent‑classification, and policy‑selection components.
  • Domain Expansion – Because the augmentation pipeline is domain‑agnostic, developers can quickly adapt existing text‑based datasets (e.g., restaurant booking, travel) into spoken form, saving months of data collection.
  • Emotion‑aware Systems – By providing emotional prosody, SpokenUS helps developers prototype agents that adapt responses based on user affect (e.g., calming tone for frustrated users).
  • Open‑source Ecosystem – The released code and data lower the barrier for startups and research labs to build more resilient spoken dialogue agents without needing massive in‑house speech corpora.
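As a sketch of how a user simulator could slot into an automated test loop, the harness below drives a dialogue system with simulated users and computes a task-success rate. Both `SimulatedUser` and `DialogueSystem` are hypothetical stand-ins, not the released SpokenUS API.

```python
class SimulatedUser:
    """Toy user that reveals one goal slot per turn."""
    def __init__(self, goal: dict):
        self.pending = dict(goal)

    def next_utterance(self):
        if not self.pending:
            return None  # goal exhausted: user is done
        slot, value = self.pending.popitem()
        return f"I want {slot} to be {value}"

class DialogueSystem:
    """Trivial system under test: parses 'I want X to be Y'."""
    def __init__(self):
        self.filled = {}

    def respond(self, utterance: str):
        words = utterance.split()
        self.filled[words[2]] = words[-1]

def run_episode(goal: dict) -> bool:
    """One simulated dialogue; success = every goal slot was filled."""
    user, system = SimulatedUser(goal), DialogueSystem()
    while (utt := user.next_utterance()) is not None:
        system.respond(utt)
    return system.filled == goal

goals = [{"cuisine": "thai", "time": "6pm"}, {"area": "north"}]
success_rate = sum(run_episode(g) for g in goals) / len(goals)
print(f"success rate: {success_rate:.0%}")
```

In a real CI setup the episode loop would pass audio through ASR and the dialogue policy, and the success threshold would gate the build.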

Limitations & Future Work

  • Speaker Diversity – While SpokenTOD includes many speakers, the acoustic variety still falls short of the full range of accents, dialects, and background noises encountered in the wild.
  • Rule‑Based Augmentation Bias – Some behavior insertions rely on handcrafted rules, which may not capture all nuanced human speech patterns.
  • Scalability of Emotional Labels – The current prosody model uses a limited set of emotion categories; richer affective states remain unexplored.
  • Evaluation Scope – Human MOS was collected on a subset of domains; broader user studies (e.g., longitudinal interaction) are needed to confirm long‑term benefits.

Future directions include expanding the speaker pool with crowd‑sourced recordings, integrating end‑to‑end neural augmentation (removing rule‑based steps), and extending the simulator to multimodal contexts (e.g., visual cues alongside speech).

Authors

  • Jonggeun Lee
  • Junseong Pyo
  • Jeongmin Park
  • Yohan Jo

Paper Information

  • arXiv ID: 2603.16783v1
  • Categories: cs.CL
  • Published: March 17, 2026
  • PDF: Download PDF