[Paper] Reference-Driven Multi-Speaker Audio Scene Generation from In-the-Wild Priors

Published: June 17, 2026 at 01:51 PM EDT

Source: arXiv - 2606.19325v1

Overview

Existing multi-speaker dialogue systems bind speakers to utterances through structured supervision: per-turn tags, multi-stream transcriptions, or learnable speaker embeddings. These systems operate within speech-only pipelines that produce clean vocal sequences without the ambient texture of real conversations.

We take a different approach. Our method, ScenA, conditions a text-to-audio flow-matching foundation model, pretrained on large-scale in-the-wild data, directly on multiple reference voices and a free-form natural language prompt that describes an entire multi-speaker audio scene. Leveraging such a foundation model allows us to inherit its capacity for natural, non-studio audio (background noise, room acoustics, overlapping dialogue, and spontaneous paralinguistic events) while adding multi-speaker control without any per-turn structure. Concretely, reference latents are concatenated into the model’s token sequence and distinguished by lightweight identity-aware positional encodings.

However, we identify a critical obstacle to this approach: the Reference Shortcut. During training under standard noise schedules, the model can identify the matching reference by acoustic similarity to the noisy target, bypassing the text prompt entirely. We address this with a high-noise-biased timestep distribution that forces the model to rely on the text prompt for speaker assignment.

We evaluate ScenA on the CoVoMix2-Dialogue benchmark, showing that it outperforms existing multi-speaker systems on speaker-binding metrics while generating rich conversational audio with overlapping speech, emotional vocalizations, and ambient sound. Our results demonstrate the advantage of using a general-purpose audio model conditioned on a free-form scene description, rather than passing structured dialogue scripts through a speech-only pipeline.
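To make the Reference Shortcut fix more tangible, here is a minimal sketch of what a high-noise-biased timestep sampler could look like. Everything specific in it is an assumption rather than the paper's recipe: the abstract does not name the distribution, so the sketch uses a logit-normal with a shifted mean, and it assumes a linear interpolation path in which t = 0 is clean audio and t = 1 is pure noise. The names sample_timestep and noisy_target are hypothetical.

```python
import math
import random


def sample_timestep(rng, mean=1.0, std=1.0):
    """Draw t in (0, 1) from a logit-normal distribution.

    Assumption: t = 1 is the pure-noise end of the path, so a positive
    mean shifts probability mass toward high-noise timesteps.
    """
    z = rng.gauss(mean, std)
    return 1.0 / (1.0 + math.exp(-z))


def noisy_target(x_data, noise, t):
    """Linear interpolation between clean data (t = 0) and noise (t = 1)."""
    return [(1.0 - t) * d + t * n for d, n in zip(x_data, noise)]


rng = random.Random(0)

# Baseline: a logit-normal centered at 0 is symmetric around t = 0.5.
standard = [sample_timestep(rng, mean=0.0) for _ in range(10000)]
# Biased: shifting the mean upward favors high-noise timesteps.
biased = [sample_timestep(rng, mean=1.0) for _ in range(10000)]

frac_high_standard = sum(t > 0.5 for t in standard) / len(standard)
frac_high_biased = sum(t > 0.5 for t in biased) / len(biased)

print(f"share of t > 0.5, standard: {frac_high_standard:.2f}")
print(f"share of t > 0.5, biased:   {frac_high_biased:.2f}")
```

The intuition matches the abstract: at high noise the target carries little acoustic evidence about who is speaking, so matching a reference by similarity stops working and the text prompt becomes the remaining signal for speaker assignment.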

Key Contributions

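Based on the abstract, the paper makes the following contributions:

ScenA, a method that conditions a pretrained text-to-audio flow-matching foundation model on multiple reference voices and a free-form scene prompt, without any per-turn structure.
A conditioning scheme in which reference latents are concatenated into the model's token sequence and distinguished by lightweight identity-aware positional encodings.
Identification of the Reference Shortcut, where the model matches a reference to the noisy target by acoustic similarity and ignores the text prompt, together with a high-noise-biased timestep distribution that counters it.
An evaluation on the CoVoMix2-Dialogue benchmark in which ScenA outperforms existing multi-speaker systems on speaker-binding metrics.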

Methodology

Please refer to the full paper for detailed methodology.
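The abstract does describe one mechanism concretely: reference latents are concatenated into the model's token sequence and distinguished by identity-aware positional encodings. The sketch below illustrates that idea under loud assumptions. The sinusoidal identity code, the choice to add it to the reference tokens only, and the function names identity_encoding and build_sequence are all hypothetical, and the paper's actual encoding may differ.

```python
import math


def identity_encoding(speaker_index, dim):
    """Hypothetical sinusoidal code for one speaker slot (not the paper's)."""
    enc = []
    for i in range(dim):
        freq = 1.0 / (10000 ** (2 * (i // 2) / dim))
        angle = (speaker_index + 1) * freq
        enc.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return enc


def build_sequence(target_tokens, references):
    """Concatenate target tokens with identity-tagged reference tokens.

    target_tokens: list of token vectors (lists of floats).
    references: one list of token vectors per speaker, in speaker order.
    """
    dim = len(target_tokens[0])
    sequence = [list(tok) for tok in target_tokens]
    for speaker_index, ref_tokens in enumerate(references):
        code = identity_encoding(speaker_index, dim)
        for tok in ref_tokens:
            # Every token of a given reference receives the same identity code.
            sequence.append([v + c for v, c in zip(tok, code)])
    return sequence


# Toy example: 3 target tokens and two speakers with 2 reference tokens each.
target = [[0.1, 0.2, 0.3, 0.4] for _ in range(3)]
speaker_a = [[0.0, 0.0, 0.0, 0.0] for _ in range(2)]
speaker_b = [[0.0, 0.0, 0.0, 0.0] for _ in range(2)]

seq = build_sequence(target, [speaker_a, speaker_b])
print(len(seq))  # 3 target tokens + 4 reference tokens
```

Even though the two toy references hold identical latents, their tokens end up different in the joint sequence, which is what would let a text prompt refer to one speaker slot or the other.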

Practical Implications

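The results suggest that multi-speaker conversational audio, including overlapping speech, emotional vocalizations, and ambient sound, can be generated by conditioning a general-purpose audio model on a free-form scene description rather than by passing structured dialogue scripts through a speech-only pipeline.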

Authors

Michael Finkelson
Daniel Segal
Eitan Richardson
Shahar Armon
Nani Goldring
Poriya Panet
Nir Zabari
Benjamin Brazowski
Or Patashnik
Yoav HaCohen

Paper Information

arXiv ID: 2606.19325v1
Categories: cs.SD, cs.AI, cs.CV
Published: June 17, 2026
