[Paper] Reference-Driven Multi-Speaker Audio Scene Generation from In-the-Wild Priors

Published: June 17, 2026 at 01:51 PM EDT
Source: arXiv - 2606.19325v1

Overview

Existing multi-speaker dialogue systems bind speakers to utterances through structured supervision: per-turn tags, multi-stream transcriptions, or learnable speaker embeddings. These systems operate within speech-only pipelines that produce clean vocal sequences without the ambient texture of real conversations.

We take a different approach. Our method, ScenA, conditions a text-to-audio flow-matching foundation model, pretrained on large-scale in-the-wild data, directly on multiple reference voices and a free-form natural language prompt that describes an entire multi-speaker audio scene. Building on such a foundation model lets us inherit its capacity for natural, non-studio audio (background noise, room acoustics, overlapping dialogue, and spontaneous paralinguistic events) while adding multi-speaker control without any per-turn structure. Concretely, reference latents are concatenated into the model’s token sequence and distinguished by lightweight identity-aware positional encodings.

However, we identify a critical obstacle to this approach, the Reference Shortcut: during training under standard noise schedules, the model can identify the matching reference by acoustic similarity to the noisy target, bypassing the text prompt entirely. We address this with a high-noise-biased timestep distribution that forces the model to rely on the text prompt for speaker assignment.

We evaluate ScenA on the CoVoMix2-Dialogue benchmark, showing that it outperforms existing multi-speaker systems on speaker-binding metrics while generating rich conversational audio with overlapping speech, emotional vocalizations, and ambient sound. Our results demonstrate the advantage of using a general-purpose audio model conditioned on a free-form scene description, rather than passing structured dialogue scripts through a speech-only pipeline.
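To make the idea of a high-noise-biased timestep distribution concrete, here is a minimal sketch. The abstract does not specify the distribution the authors use, so everything here is an assumption for illustration: the convention that t = 1 means pure noise, the linear interpolation between clean target and noise, the power-law transform, and the names `sample_timestep`, `noisy_target`, and `bias`.

```python
import random


def sample_timestep(rng, bias=3.0):
    """Draw a timestep t in [0, 1], where t = 1 is ASSUMED to mean pure noise.

    With u uniform on [0, 1), t = u ** (1 / bias) has density bias * t ** (bias - 1),
    so larger `bias` puts more mass near t = 1. bias = 1.0 recovers uniform sampling.
    This is an illustrative choice, not the distribution from the paper.
    """
    if bias < 1.0:
        raise ValueError("bias must be >= 1.0")
    u = rng.random()
    return u ** (1.0 / bias)


def noisy_target(clean, noise, t):
    """Linear interpolation between clean target and noise (assumed convention)."""
    return [(1.0 - t) * c + t * n for c, n in zip(clean, noise)]


rng = random.Random(0)
uniform = [sample_timestep(rng, bias=1.0) for _ in range(10000)]
biased = [sample_timestep(rng, bias=3.0) for _ in range(10000)]

# Fraction of training steps where the target is mostly noise (t > 0.8).
frac_high_uniform = sum(t > 0.8 for t in uniform) / len(uniform)
frac_high_biased = sum(t > 0.8 for t in biased) / len(biased)
print(f"t > 0.8: uniform {frac_high_uniform:.2f}, biased {frac_high_biased:.2f}")
```

Under this sketch, a much larger share of training steps has a target that is mostly noise. At those steps the noisy target carries little acoustic information to match against the references, which is the intuition the abstract gives for why the model must fall back on the text prompt.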

Key Contributions

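The main contributions described in the overview are:

  • ScenA, a method that conditions a pretrained text-to-audio flow-matching foundation model on multiple reference voices and a free-form scene prompt, with no per-turn structure.
  • Reference latents concatenated into the model’s token sequence and distinguished by lightweight identity-aware positional encodings.
  • Identification of the Reference Shortcut, where the model matches a reference to the noisy target by acoustic similarity and bypasses the text prompt.
  • A high-noise-biased timestep distribution that forces the model to rely on the text prompt for speaker assignment.
  • Evaluation on the CoVoMix2-Dialogue benchmark, where ScenA outperforms existing multi-speaker systems on speaker-binding metrics.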

Methodology

Please refer to the full paper for detailed methodology.
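As a rough illustration of the conditioning step mentioned in the Overview (reference latents concatenated into the token sequence and distinguished by identity-aware positional encodings), the sketch below builds such a sequence in plain Python. The source gives no details of the encoding, so the specifics are assumptions: fixed sinusoidal codes instead of learned ones, identity 0 for the target, positions restarting at 0 for each reference, and the hypothetical names `sinusoid`, `identity_aware_code`, and `build_sequence`.

```python
import math


def sinusoid(index, dim, base=10000.0):
    """Standard sinusoidal code for an integer index, as a list of `dim` floats."""
    return [
        math.sin(index / base ** (2 * (i // 2) / dim)) if i % 2 == 0
        else math.cos(index / base ** (2 * (i // 2) / dim))
        for i in range(dim)
    ]


def identity_aware_code(identity, position, dim):
    """Hypothetical encoding: first half of the channels encodes the position
    in time, second half encodes which segment (speaker identity) the token is from."""
    half = dim // 2
    return sinusoid(position, half) + sinusoid(identity, dim - half)


def build_sequence(target, references):
    """Concatenate target tokens and all reference tokens into one sequence.

    Identity 0 is the (noisy) target; identity k >= 1 is the k-th reference voice.
    Each token is a list of floats and gets its identity-aware code added.
    """
    dim = len(target[0])
    sequence, identity_ids = [], []
    for identity, segment in enumerate([target] + list(references)):
        for position, token in enumerate(segment):
            code = identity_aware_code(identity, position, dim)
            sequence.append([x + c for x, c in zip(token, code)])
            identity_ids.append(identity)
    return sequence, identity_ids


# Toy latents: a 2-token target and two references of 3 and 1 tokens, 4 channels each.
zero = [0.0, 0.0, 0.0, 0.0]
target = [zero, zero]
references = [[zero, zero, zero], [zero]]
sequence, identity_ids = build_sequence(target, references)
print(len(sequence), identity_ids)
```

Keeping the position and identity codes in separate channels is a design choice of this sketch: a token at position 1 of the target and a token at position 0 of the first reference receive different codes, even when their latent content is identical.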

Practical Implications

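This research contributes to audio and speech generation (arXiv category cs.SD, Sound). The reported results suggest that conditioning a general-purpose audio model on a free-form scene description can produce multi-speaker audio with overlapping speech, emotional vocalizations, and ambient sound, without passing structured dialogue scripts through a speech-only pipeline.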

Authors

  • Michael Finkelson
  • Daniel Segal
  • Eitan Richardson
  • Shahar Armon
  • Nani Goldring
  • Poriya Panet
  • Nir Zabari
  • Benjamin Brazowski
  • Or Patashnik
  • Yoav HaCohen

Paper Information

  • arXiv ID: 2606.19325v1
  • Categories: cs.SD, cs.AI, cs.CV
  • Published: June 17, 2026