[Paper] RosettaSpeech: Zero-Shot Speech-to-Speech Translation from Monolingual Data

Published: November 25, 2025 at 09:02 PM EST
4 min read

Source: arXiv - 2511.20974v1

Overview

RosettaSpeech tackles one of the biggest bottlenecks in speech‑to‑speech translation (S2ST): the near‑absence of large parallel speech corpora. By training only on monolingual speech‑text data and leveraging existing text‑to‑text machine‑translation (MT) models, the authors build a zero‑shot, end‑to‑end S2ST system that translates directly from source speech to target speech while preserving the speaker’s voice. The result is a simpler pipeline that still hits state‑of‑the‑art performance on widely used benchmarks.

Key Contributions

  • Zero‑shot S2ST framework that requires no parallel speech‑to‑speech data, only monolingual speech‑text pairs plus a text‑based NMT model.
  • Unified end‑to‑end architecture: during inference the model maps source audio straight to target audio, eliminating intermediate text generation and separate TTS modules.
  • Many‑to‑one multilingual capability (French, Spanish, German → English) with a single model, demonstrating scalability across languages.
  • Comprehensive scaling analysis showing how increasing the amount of monolingual speech‑text data improves translation quality.
  • State‑of‑the‑art results on the CVSS‑C benchmark (ASR‑BLEU = 25.17 for DE→EN, 29.86 for ES→EN), outperforming prior multi‑stage pipelines by 14‑27 %.

Methodology

  1. Data Preparation

    • Collect large monolingual corpora of speech‑text pairs for each language (e.g., LibriSpeech, Common Voice).
    • Use a high‑quality text‑to‑text NMT system to generate pseudo‑parallel source‑target text pairs from the monolingual transcripts (a minimal sketch of this step follows the list).
  2. Model Architecture

    • Encoder: a self‑supervised speech encoder (e.g., wav2vec 2.0) converts raw audio into a language‑agnostic latent representation.
    • Cross‑modal bridge: a lightweight transformer aligns the speech latent space with the textual latent space learned by the NMT model.
    • Decoder: a neural vocoder‑style decoder (e.g., HiFi‑GAN) synthesizes target‑language speech directly from the aligned latent vectors, preserving speaker characteristics.
  3. Training Objective

    • Speech‑to‑text loss: the encoder is first fine‑tuned to predict the source transcript (standard ASR loss).
    • Text‑to‑speech loss: the bridge and decoder are trained to reconstruct the NMT‑generated target transcript as speech, using a combination of L1 spectrogram loss and adversarial vocoder loss (a loss sketch also follows the list).
    • The two stages are jointly optimized, but the text only appears as a supervisory signal; it never surfaces at inference time.
  4. Inference

    • Input: raw source audio.
    • Output: synthesized target audio, generated in a single forward pass—no intermediate transcription or separate TTS step.
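
Below is a minimal sketch of the pseudo‑parallel data‑preparation step (Step 1). It assumes an off‑the‑shelf Hugging Face translation model (Helsinki‑NLP/opus-mt-de-en) as the text‑to‑text NMT system and a generic (audio, transcript) corpus format; the paper does not prescribe these specific choices.

```python
# Hypothetical pseudo-parallel data preparation: pair each monolingual
# (audio, transcript) example with a machine-translated target transcript.
from transformers import pipeline

# Assumption: any strong text-to-text MT model can stand in here.
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-de-en")

def make_pseudo_pairs(speech_text_corpus):
    """speech_text_corpus: iterable of (audio_path, source_transcript) tuples
    drawn from a monolingual dataset such as Common Voice."""
    for audio_path, src_text in speech_text_corpus:
        tgt_text = translator(src_text, max_length=512)[0]["translation_text"]
        # Training triple: source audio, source transcript, pseudo target transcript.
        yield audio_path, src_text, tgt_text

# Example: one German utterance becomes a (speech, text, pseudo-translation) triple.
print(next(make_pseudo_pairs([("clip_0001.wav", "Guten Morgen, wie geht es Ihnen?")])))
```

The generated target transcript serves only as a supervisory signal for training the speech decoder; no target audio is ever required.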

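The training objective (Step 3) combines ASR supervision with spectrogram reconstruction and an adversarial vocoder term. The PyTorch sketch below shows one plausible way to wire such a composite loss; the loss weights, the CTC formulation of the ASR term, and the least‑squares adversarial term (as in HiFi‑GAN‑style vocoders) are assumptions, not the authors' exact formulation.

```python
# Illustrative composite loss, not the paper's exact objective.
import torch.nn.functional as F

def composite_loss(pred_mel, target_mel,        # (B, T, n_mels) mel spectrograms
                   disc_fake_logits,            # discriminator scores on generated audio
                   asr_log_probs, transcripts,  # (T, B, vocab) log-probs, padded target ids
                   input_lengths, target_lengths,
                   w_adv=0.1, w_asr=1.0):
    # Text-to-speech reconstruction: L1 distance between mel spectrograms.
    recon = F.l1_loss(pred_mel, target_mel)
    # Adversarial vocoder term (least-squares GAN generator loss, HiFi-GAN style).
    adv = ((disc_fake_logits - 1.0) ** 2).mean()
    # Speech-to-text supervision on the encoder (CTC as a stand-in ASR loss).
    asr = F.ctc_loss(asr_log_probs, transcripts, input_lengths, target_lengths)
    return recon + w_adv * adv + w_asr * asr
```
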
Results & Findings

| Language Pair | ASR‑BLEU | Relative Gain vs. Prior SOTA |
| --- | --- | --- |
| German → English | 25.17 | +27 % |
| Spanish → English | 29.86 | +14 % |
| French → English (multilingual model) | ≈27.4 | Comparable to dedicated bilingual models |
  • Speaker preservation: subjective listening tests reported higher speaker similarity scores than cascaded ASR‑MT‑TTS pipelines.
  • Data scaling: performance improves roughly logarithmically with the amount of monolingual speech‑text data, suggesting further gains as more public speech corpora become available.
  • Single‑model multilingualism: one RosettaSpeech model handled three source languages to English without any language‑specific fine‑tuning, simplifying deployment.
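
The ASR‑BLEU numbers above are computed by transcribing the synthesized target speech with an ASR model and scoring the transcripts against reference translations with BLEU. The sketch below uses Whisper and sacrebleu as stand‑ins; CVSS‑C evaluation conventionally fixes a specific ASR checkpoint, so treat this as an approximation of the protocol rather than the official scorer.

```python
# Approximate ASR-BLEU scoring: transcribe generated English audio, then BLEU.
import whisper     # pip install openai-whisper
import sacrebleu   # pip install sacrebleu

asr = whisper.load_model("base.en")  # assumption: any reasonable English ASR model

def asr_bleu(generated_wavs, reference_translations):
    hypotheses = [asr.transcribe(path)["text"].strip() for path in generated_wavs]
    return sacrebleu.corpus_bleu(hypotheses, [reference_translations]).score

# Example (hypothetical file and reference):
# score = asr_bleu(["translated_0001.wav"], ["Good morning, how are you?"])
```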

Practical Implications

  • Lower data barrier: Companies can now build S2ST services for low‑resource languages by relying on abundant monolingual speech recordings and existing MT models, sidestepping costly parallel speech collection.
  • Simplified stack: Deploying a single end‑to‑end model reduces latency, memory footprint, and engineering overhead compared to traditional cascaded ASR‑MT‑TTS pipelines.
  • Real‑time voice‑preserving translation: The direct speech‑to‑speech output keeps the speaker’s timbre, opening use‑cases in live conferencing, dubbing, and accessibility tools where voice identity matters.
  • Scalable multilingual products: A single model can be extended to additional source languages by adding more monolingual data, making it attractive for global platforms (e.g., video streaming, customer support).

Limitations & Future Work

  • Dependence on high‑quality text MT: The quality of the pseudo‑parallel text pairs caps the ultimate translation performance; errors in the MT step can propagate to the speech output.
  • Speaker variation handling: While voice preservation is better than cascaded systems, extreme accents or noisy recordings still degrade quality.
  • Evaluation scope: Benchmarks focus on European languages; further testing on truly low‑resource or tonal languages is needed.
  • Future directions suggested by the authors include integrating self‑training loops to refine the bridge without external MT, exploring multilingual vocoders for many‑to‑many translation, and extending the framework to handle code‑switching or multimodal inputs.

Authors

  • Zhisheng Zheng
  • Xiaohang Sun
  • Tuan Dinh
  • Abhishek Yanamandra
  • Abhinav Jain
  • Zhu Liu
  • Sunil Hadap
  • Vimal Bhat
  • Manoj Aggarwal
  • Gerard Medioni
  • David Harwath

Paper Information

  • arXiv ID: 2511.20974v1
  • Categories: eess.AS, cs.CL, cs.LG
  • Published: November 26, 2025