[Paper] Do What I Say: A Spoken Prompt Dataset for Instruction-Following
Source: arXiv - 2603.09881v1
Overview
The paper introduces DoWhatISay (DOWIS), a multilingual spoken‑prompt dataset that lets researchers and engineers evaluate Speech Large Language Models (SLLMs) with realistic, voice‑based instructions instead of the usual text‑only prompts. By covering nine tasks, eleven languages, and multiple speaking styles, DOWIS reveals how current SLLMs handle the nuances of spoken interaction—a scenario that matters for voice assistants, transcription tools, and any product that expects users to talk to AI.
Key Contributions
- A first‑of‑its‑kind spoken‑prompt benchmark that can be overlaid on any existing text‑based dataset, turning it into a “speech‑ready” evaluation suite.
- Broad multilingual coverage: 11 languages (including low‑resource ones) and 9 diverse tasks such as translation, summarization, question answering, and speech‑to‑speech generation.
- Rich prompt variability: 10 prompt variants per task‑language pair, spanning five speaking styles (formal, casual, instructional, emphatic, and noisy).
- Comprehensive SLLM evaluation: State‑of‑the‑art models are tested across prompt modality (text vs. spoken), style, language, and task type.
- Insightful analysis showing where spoken prompts hurt performance and where they close the gap (notably for speech‑output tasks).
Methodology
Dataset Construction
- Started from established text‑prompt benchmarks (e.g., XNLI, CoVoST).
- For each task‑language pair, native speakers recorded ten spoken versions of the original text prompt, deliberately varying prosody, speed, and background noise to capture real‑world usage.
- All recordings were transcribed and time‑aligned, creating parallel “spoken ↔ text” prompt pairs.
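The construction steps above yield one record per recording. A minimal sketch of such a record is shown below; the field names and file layout are illustrative assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass

# Hypothetical record for one DOWIS prompt pair; the field names and
# paths are illustrative assumptions, not the dataset's real format.
@dataclass
class PromptPair:
    task: str          # e.g. "translation"
    language: str      # ISO code, e.g. "sw"
    style: str         # formal, casual, instructional, emphatic, or noisy
    variant_id: int    # 1..10 variants per task-language pair
    text_prompt: str   # original text prompt from the source benchmark
    audio_path: str    # recorded spoken version of the same prompt
    transcript: str    # time-aligned transcription of the recording

pair = PromptPair(
    task="translation", language="sw", style="noisy", variant_id=3,
    text_prompt="Translate the following sentence into French.",
    audio_path="recordings/sw/translation/noisy_03.wav",
    transcript="translate the following sentence into french",
)
```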
Prompt Styles
- Five styles were defined: formal, casual, instructional, emphatic, and noisy (simulated background).
- Each style was represented across the ten variants, giving models exposure to a spectrum of user behaviours.
Evaluation Pipeline
- Existing SLLMs (e.g., Whisper‑LLM, SpeechGPT) were fed either the raw audio prompt (via their speech encoder) or the original text prompt.
- Model outputs were compared against the gold standard using task‑specific metrics (BLEU for translation, ROUGE for summarization, accuracy for QA, etc.).
- Experiments were run in three settings: monolingual, cross‑lingual (prompt language ≠ output language), and low‑resource (languages with ≤ 1 M speakers).
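The core of the pipeline is a paired comparison: the same evaluation items are scored once with the text prompt and once with the spoken prompt, and the accuracy difference is reported. The sketch below assumes a hypothetical `model` callable that accepts either modality; real SLLMs such as SpeechGPT expose different interfaces.

```python
# Minimal sketch of the text-vs-spoken modality comparison. `model` is a
# hypothetical callable taking either `prompt=` (text) or `audio=` (raw
# audio); exact-match accuracy stands in for the task-specific metrics
# (BLEU, ROUGE, etc.) used in the paper.
def accuracy(predictions, references):
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

def modality_gap(model, items):
    """items: dicts with 'text_prompt', 'audio', and 'reference' keys.

    Returns text-prompt accuracy minus spoken-prompt accuracy, so a
    positive value means spoken prompts underperform.
    """
    text_preds = [model(prompt=it["text_prompt"]) for it in items]
    spoken_preds = [model(audio=it["audio"]) for it in items]
    refs = [it["reference"] for it in items]
    return accuracy(text_preds, refs) - accuracy(spoken_preds, refs)
```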
Results & Findings
| Setting | Text Prompt Accuracy | Spoken Prompt Accuracy | Gap |
|---|---|---|---|
| High‑resource monolingual | 88 % | 73 % | 15 pts |
| Low‑resource (e.g., Swahili) | 62 % | 44 % | 18 pts |
| Cross‑lingual (EN→FR) | 81 % | 66 % | 15 pts |
| Speech‑output tasks (e.g., TTS) | 79 % | 77 % | 2 pts |
- Text prompts consistently outperform spoken prompts, especially for low‑resource languages and cross‑lingual transfer.
- The performance drop is largely style‑dependent; noisy and emphatic styles cause the biggest degradation.
- When the task’s output is speech, spoken prompts nearly match text prompts, indicating that the speech‑to‑speech pipeline can compensate for input modality loss.
- Error analysis shows that ASR errors (mis‑recognised keywords) are the primary driver of the performance gap, rather than downstream language‑model weaknesses.
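The "Gap" column in the table is simply the text-prompt accuracy minus the spoken-prompt accuracy, in percentage points. Using the reported numbers:

```python
# Reproducing the "Gap" column from the reported accuracies (percentage
# points); the (text, spoken) values come from the results table above.
results = {
    "high_resource_monolingual": (88, 73),
    "low_resource_swahili": (62, 44),
    "cross_lingual_en_fr": (81, 66),
    "speech_output_tasks": (79, 77),
}
gaps = {name: text - spoken for name, (text, spoken) in results.items()}
# gaps == {"high_resource_monolingual": 15, "low_resource_swahili": 18,
#          "cross_lingual_en_fr": 15, "speech_output_tasks": 2}
```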
Practical Implications
- Voice‑first product design: Developers building voice assistants should anticipate a 10‑20 percentage‑point performance dip when users issue spoken instructions, especially in multilingual or noisy environments.
- Model selection: For applications that require speech‑to‑speech interaction (e.g., real‑time translation devices), current SLLMs are already close to text‑prompt performance, making them viable today.
- Data augmentation: Incorporating spoken‑prompt data (like DOWIS) during fine‑tuning can reduce the ASR‑induced error cascade, a practical recipe for improving robustness.
- Testing pipelines: DOWIS offers a plug‑and‑play benchmark for CI/CD of voice‑enabled AI services, enabling teams to catch regressions that only appear under spoken input.
- Low‑resource language support: The pronounced gap highlights the need for better multilingual ASR components; investing in language‑specific acoustic models will pay off for global products.
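A regression gate of the kind suggested for CI/CD pipelines can be sketched in a few lines: compare the current spoken-prompt accuracy against a stored baseline and fail the build when the drop exceeds a tolerance. The function name, baseline values, and tolerance below are illustrative assumptions, not part of DOWIS.

```python
# Hedged sketch of a CI regression gate over a spoken-prompt benchmark:
# fail the build when spoken-prompt accuracy drops more than `tolerance`
# below a stored baseline. All values here are illustrative.
def check_regression(current_acc, baseline_acc, tolerance=0.02):
    drop = baseline_acc - current_acc
    if drop > tolerance:
        raise AssertionError(
            f"spoken-prompt accuracy regressed by {drop:.3f} "
            f"(baseline {baseline_acc:.3f}, current {current_acc:.3f})"
        )
    return drop

check_regression(current_acc=0.72, baseline_acc=0.73)  # within tolerance
```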
Limitations & Future Work
- ASR dependency: The study isolates the SLLM but does not jointly optimise the speech recogniser and language model, which could mask potential gains from end‑to‑end training.
- Prompt diversity ceiling: While five styles cover many real‑world cases, they omit domain‑specific jargon (e.g., medical or legal speech) that could further stress models.
- Scalability: Extending DOWIS to more than 11 languages and to additional tasks (code generation, reasoning) remains an open engineering challenge.
- Future directions suggested by the authors include: (1) training SLLMs with multimodal prompt mixtures (text + audio), (2) exploring self‑supervised adaptation on user‑generated speech, and (3) integrating noise‑robust ASR front‑ends to shrink the spoken‑prompt performance gap.
Authors
- Maike Züfle
- Sara Papi
- Fabian Retkowski
- Szymon Mazurek
- Marek Kasztelnik
- Alexander Waibel
- Luisa Bentivogli
- Jan Niehues
Paper Information
- arXiv ID: 2603.09881v1
- Categories: cs.CL
- Published: March 10, 2026