[Paper] Reference-Driven Multi-Speaker Audio Scene Generation from In-the-Wild Priors
Source: arXiv - 2606.19325v1
Overview
Existing multi-speaker dialogue systems bind speakers to utterances through structured supervision: per-turn tags, multi-stream transcriptions, or learnable speaker embeddings. These systems operate within speech-only pipelines that produce clean vocal sequences without the ambient texture of real conversations. We take a different approach. Our method, ScenA, conditions a text-to-audio flow-matching foundation model, pretrained on large-scale in-the-wild data, directly on multiple reference voices and a free-form natural language prompt that describes an entire multi-speaker audio scene. Leveraging such a foundation model allows us to inherit its capacity for natural, non-studio audio (background noise, room acoustics, overlapping dialogue, and spontaneous paralinguistic events) while adding multi-speaker control without any per-turn structure. Concretely, reference latents are concatenated into the model’s token sequence and distinguished by lightweight identity-aware positional encodings. However, we identify a critical obstacle to this approach: the Reference Shortcut. During training under standard noise schedules, the model can identify the matching reference by its acoustic similarity to the noisy target, bypassing the text prompt entirely. We address this with a high-noise-biased timestep distribution that forces the model to rely on the text prompt for speaker assignment. We evaluate ScenA on the CoVoMix2-Dialogue benchmark, showing that it outperforms existing multi-speaker systems on speaker-binding metrics while generating rich conversational audio with overlapping speech, emotional vocalizations, and ambient sound. Our results demonstrate the advantage of using a general-purpose audio model conditioned on a free-form scene description, rather than passing structured dialogue scripts through a speech-only pipeline.
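The conditioning step described above (reference latents concatenated into the token sequence and told apart by identity-aware positional encodings) can be pictured with a small sketch. This is an illustration under stated assumptions, not the authors' implementation: the `Token` class, the `build_token_sequence` function, the choice of identity 0 for the target and 1..K for the references, and the decision to restart time positions for each stream are all hypothetical. The summary does not say how ScenA actually encodes identity.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Token:
    """One entry of the joint sequence (hypothetical structure)."""
    latent: List[float]   # latent frame, e.g. from an audio autoencoder
    position: int         # time index within its own stream
    identity: int         # 0 = target scene, k >= 1 = k-th reference voice


def build_token_sequence(
    target_latents: Sequence[Sequence[float]],
    reference_latents: Sequence[Sequence[Sequence[float]]],
) -> List[Token]:
    """Concatenate target and reference latents into one token sequence.

    Assumption: every token carries a (position, identity) pair that a real
    model would turn into a positional encoding; here we only keep the indices.
    """
    tokens: List[Token] = []
    streams = [target_latents, *reference_latents]
    for identity, stream in enumerate(streams):
        for position, latent in enumerate(stream):
            tokens.append(Token(list(latent), position, identity))
    return tokens


if __name__ == "__main__":
    # Toy data: a 3-frame target and two reference voices (2 frames and 1 frame).
    target = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
    references = [
        [[1.0, 1.0], [1.1, 1.1]],
        [[2.0, 2.0]],
    ]
    for token in build_token_sequence(target, references):
        print(token.identity, token.position, token.latent)
```

Keeping the identity index separate from the time index means that two frames at the same time position but from different speakers remain distinguishable, which is the property the text attributes to the identity-aware encodings.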
Key Contributions
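The abstract identifies the following contributions:
- ScenA, a method that conditions a pretrained text-to-audio flow-matching foundation model on multiple reference voices and a free-form scene description, without per-turn structure.
- Reference conditioning by concatenating reference latents into the model's token sequence, distinguished by lightweight identity-aware positional encodings.
- Identification of the Reference Shortcut, in which the model matches a reference to the noisy target by acoustic similarity and ignores the text prompt, together with a high-noise-biased timestep distribution that counters it.
- An evaluation on the CoVoMix2-Dialogue benchmark, where ScenA is reported to outperform existing multi-speaker systems on speaker-binding metrics.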
Methodology
Please refer to the full paper for detailed methodology.
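As a rough illustration of the high-noise-biased timestep distribution mentioned in the overview, the sketch below compares uniform timestep sampling with a sampler skewed toward high noise. Everything specific here is an assumption made for illustration: the convention that t = 1 is pure noise, the linear interpolation between clean target and noise, the Beta(3, 1) distribution, and all function names. The summary does not state which distribution the paper uses.

```python
import random
from typing import List, Sequence


def sample_uniform_timestep(rng: random.Random) -> float:
    """Baseline: t drawn uniformly from [0, 1)."""
    return rng.random()


def sample_high_noise_timestep(
    rng: random.Random, alpha: float = 3.0, beta: float = 1.0
) -> float:
    """Assumed stand-in: Beta(alpha, beta) with alpha > beta puts more
    probability mass near t = 1, i.e. on heavily noised targets."""
    return rng.betavariate(alpha, beta)


def noisy_target(
    clean: Sequence[float], noise: Sequence[float], t: float
) -> List[float]:
    """Linear flow-matching interpolation, assuming t = 1 means pure noise."""
    return [(1.0 - t) * c + t * n for c, n in zip(clean, noise)]


def fraction_above(samples: Sequence[float], threshold: float) -> float:
    """Share of sampled timesteps that exceed the threshold."""
    return sum(s > threshold for s in samples) / len(samples)


if __name__ == "__main__":
    rng = random.Random(0)
    uniform = [sample_uniform_timestep(rng) for _ in range(20000)]
    biased = [sample_high_noise_timestep(rng) for _ in range(20000)]
    print("uniform share with t > 0.8:", fraction_above(uniform, 0.8))
    print("biased share with t > 0.8:", fraction_above(biased, 0.8))
```

The intuition matches the text: when the target is mostly noise, little acoustic detail survives for the model to match against the reference voices, so it has to use the text prompt to decide which speaker says what.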
Practical Implications
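The reported results suggest that multi-speaker conversational audio, including overlapping speech, emotional vocalizations, and ambient sound, can be generated by conditioning a general-purpose audio model on reference voices and a free-form scene description, rather than by passing structured dialogue scripts through a speech-only pipeline.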
Authors
- Michael Finkelson
- Daniel Segal
- Eitan Richardson
- Shahar Armon
- Nani Goldring
- Poriya Panet
- Nir Zabari
- Benjamin Brazowski
- Or Patashnik
- Yoav HaCohen
Paper Information
- arXiv ID: 2606.19325v1
- Categories: cs.SD, cs.AI, cs.CV
- Published: June 17, 2026