[Paper] Do What I Say: A Spoken Prompt Dataset for Instruction-Following
Source: arXiv - 2603.09881v1
Overview
The paper introduces DoWhatISay (DOWIS), a multilingual spoken‑prompt dataset that lets researchers and engineers evaluate Speech Large Language Models (SLLMs) with realistic, voice‑based instructions instead of the usual text‑only prompts. By covering nine tasks, eleven languages, and multiple speaking styles, DOWIS reveals how current SLLMs handle the nuances of spoken interaction—a scenario that matters for voice assistants, transcription tools, and any product that expects users to talk to AI.
Key Contributions
- A first‑of‑its‑kind spoken‑prompt benchmark that can be overlaid on any existing text‑based dataset, turning it into a “speech‑ready” evaluation suite.
- Broad multilingual coverage: 11 languages (including low‑resource ones) and 9 diverse tasks such as translation, summarization, question answering, and speech‑to‑speech generation.
- Rich prompt variability: 10 prompt variants per task‑language pair, spanning five speaking styles (formal, casual, instructional, emphatic, and noisy).
- Comprehensive SLLM evaluation: State‑of‑the‑art models are tested across prompt modality (text vs. spoken), style, language, and task type.
- Insightful analysis showing where spoken prompts hurt performance and where they close the gap (notably for speech‑output tasks).
Methodology
Dataset Construction
- Started from established text‑prompt benchmarks (e.g., XNLI, CoVoST).
- For each task‑language pair, native speakers recorded ten spoken versions of the original text prompt, deliberately varying prosody, speed, and background noise to capture real‑world usage.
- All recordings were transcribed and time‑aligned, creating parallel “spoken ↔ text” prompt pairs.
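The construction steps above yield one record per recording. A minimal sketch of such a record is shown below; the field names and file layout are illustrative assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass

# Hypothetical record for one DOWIS prompt pair; the field names and
# paths are illustrative assumptions, not the dataset's real format.
@dataclass
class PromptPair:
    task: str          # e.g. "translation"
    language: str      # ISO code, e.g. "sw"
    style: str         # formal, casual, instructional, emphatic, or noisy
    variant_id: int    # 1..10 variants per task-language pair
    text_prompt: str   # original text prompt from the source benchmark
    audio_path: str    # recorded spoken version of the same prompt
    transcript: str    # time-aligned transcription of the recording

pair = PromptPair(
    task="translation", language="sw", style="noisy", variant_id=3,
    text_prompt="Translate the following sentence into French.",
    audio_path="recordings/sw/translation/noisy_03.wav",
    transcript="translate the following sentence into french",
)
```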
Prompt Styles
- Five styles were defined: formal, casual, instructional, emphatic, and noisy (simulated background).
- Each style was represented across the ten variants, giving models exposure to a spectrum of user behaviours.
Evaluation Pipeline
- Existing SLLMs (e.g., Whisper‑LLM, SpeechGPT) were fed either the raw audio prompt (via their speech encoder) or the original text prompt.
- Model outputs were compared against the gold standard using task‑specific metrics (BLEU for translation, ROUGE for summarization, accuracy for QA, etc.).
- Experiments were run in three settings: monolingual, cross‑lingual (prompt language ≠ output language), and low‑resource (languages with ≤ 1 M speakers).
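The core of the pipeline is a paired comparison: the same evaluation items are scored once with the text prompt and once with the spoken prompt, and the accuracy difference is reported. The sketch below assumes a hypothetical `model` callable that accepts either modality; real SLLMs such as SpeechGPT expose different interfaces.

```python
# Minimal sketch of the text-vs-spoken modality comparison. `model` is a
# hypothetical callable taking either `prompt=` (text) or `audio=` (raw
# audio); exact-match accuracy stands in for the task-specific metrics
# (BLEU, ROUGE, etc.) used in the paper.
def accuracy(predictions, references):
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

def modality_gap(model, items):
    """items: dicts with 'text_prompt', 'audio', and 'reference' keys.

    Returns text-prompt accuracy minus spoken-prompt accuracy, so a
    positive value means spoken prompts underperform.
    """
    text_preds = [model(prompt=it["text_prompt"]) for it in items]
    spoken_preds = [model(audio=it["audio"]) for it in items]
    refs = [it["reference"] for it in items]
    return accuracy(text_preds, refs) - accuracy(spoken_preds, refs)
```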
Results & Findings
| Setting | Text Prompt Accuracy | Spoken Prompt Accuracy | Gap |
|---|---|---|---|
| High‑resource monolingual | 88 % | 73 % | 15 pts |
| Low‑resource (e.g., Swahili) | 62 % | 44 % | 18 pts |
| Cross‑lingual (EN→FR) | 81 % | 66 % | 15 pts |
| Speech‑output tasks (e.g., TTS) | 79 % | 77 % | 2 pts |
- Text prompts consistently outperform spoken prompts, especially for low‑resource languages and cross‑lingual transfer.
- The performance drop is largely style‑dependent; noisy and emphatic styles cause the biggest degradation.
- When the task’s output is speech, spoken prompts nearly match text prompts, indicating that the speech‑to‑speech pipeline can compensate for input modality loss.
- Error analysis shows that ASR errors (mis‑recognised keywords) are the primary driver of the performance gap, rather than downstream language‑model weaknesses.
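The "Gap" column in the table is simply the text-prompt accuracy minus the spoken-prompt accuracy, in percentage points. Using the reported numbers:

```python
# Reproducing the "Gap" column from the reported accuracies (percentage
# points); the (text, spoken) values come from the results table above.
results = {
    "high_resource_monolingual": (88, 73),
    "low_resource_swahili": (62, 44),
    "cross_lingual_en_fr": (81, 66),
    "speech_output_tasks": (79, 77),
}
gaps = {name: text - spoken for name, (text, spoken) in results.items()}
# gaps == {"high_resource_monolingual": 15, "low_resource_swahili": 18,
#          "cross_lingual_en_fr": 15, "speech_output_tasks": 2}
```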
Practical Implications
- Voice‑first product design: Developers building voice assistants should anticipate a 10‑20 percentage‑point performance dip when users issue spoken instructions, especially in multilingual or noisy environments.
- Model selection: For applications that require speech‑to‑speech interaction (e.g., real‑time translation devices), current SLLMs are already close to text‑prompt performance, making them viable today.
- Data augmentation: Incorporating spoken‑prompt data (like DOWIS) during fine‑tuning can reduce the ASR‑induced error cascade, a practical recipe for improving robustness.
- Testing pipelines: DOWIS offers a plug‑and‑play benchmark for CI/CD of voice‑enabled AI services, enabling teams to catch regressions that only appear under spoken input.
- Low‑resource language support: The pronounced gap highlights the need for better multilingual ASR components; investing in language‑specific acoustic models will pay off for global products.
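A regression gate of the kind suggested for CI/CD pipelines can be sketched in a few lines: compare the current spoken-prompt accuracy against a stored baseline and fail the build when the drop exceeds a tolerance. The function name, baseline values, and tolerance below are illustrative assumptions, not part of DOWIS.

```python
# Hedged sketch of a CI regression gate over a spoken-prompt benchmark:
# fail the build when spoken-prompt accuracy drops more than `tolerance`
# below a stored baseline. All values here are illustrative.
def check_regression(current_acc, baseline_acc, tolerance=0.02):
    drop = baseline_acc - current_acc
    if drop > tolerance:
        raise AssertionError(
            f"spoken-prompt accuracy regressed by {drop:.3f} "
            f"(baseline {baseline_acc:.3f}, current {current_acc:.3f})"
        )
    return drop

check_regression(current_acc=0.72, baseline_acc=0.73)  # within tolerance
```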
Limitations & Future Work
- ASR dependency: The study isolates the SLLM but does not jointly optimise the speech recogniser and language model, which could mask potential gains from end‑to‑end training.
- Prompt diversity ceiling: While five styles cover many real‑world cases, they omit domain‑specific jargon (e.g., medical or legal speech) that could further stress models.
- Scalability: Extending DOWIS to more than 11 languages and to additional tasks (code generation, reasoning) remains an open engineering challenge.
- Future directions suggested by the authors include: (1) training SLLMs with multimodal prompt mixtures (text + audio), (2) exploring self‑supervised adaptation on user‑generated speech, and (3) integrating noise‑robust ASR front‑ends to shrink the spoken‑prompt performance gap.
Authors
- Maike Züfle
- Sara Papi
- Fabian Retkowski
- Szymon Mazurek
- Marek Kasztelnik
- Alexander Waibel
- Luisa Bentivogli
- Jan Niehues
Paper Information
- arXiv ID: 2603.09881v1
- Categories: cs.CL
- Published: March 10, 2026