[Paper] RosettaSpeech: Zero-Shot Speech-to-Speech Translation from Monolingual Data
Source: arXiv - 2511.20974v1
Overview
RosettaSpeech tackles one of the biggest bottlenecks in speech‑to‑speech translation (S2ST): the near‑absence of large parallel speech corpora. By training only on monolingual speech‑text data and leveraging existing text‑to‑text machine‑translation (MT) models, the authors build a zero‑shot, end‑to‑end S2ST system that translates directly from source speech to target speech while preserving the speaker’s voice. The result is a simpler pipeline that still hits state‑of‑the‑art performance on widely used benchmarks.
Key Contributions
- Zero‑shot S2ST framework that requires no parallel speech‑to‑speech data, only monolingual speech‑text pairs plus a text‑based NMT model.
- Unified end‑to‑end architecture: during inference the model maps source audio straight to target audio, eliminating intermediate text generation and separate TTS modules.
- Many‑to‑one multilingual capability (French, Spanish, German → English) with a single model, demonstrating scalability across languages.
- Comprehensive scaling analysis showing how increasing the amount of monolingual speech‑text data improves translation quality.
- State‑of‑the‑art results on the CVSS‑C benchmark (ASR‑BLEU = 25.17 for DE→EN, 29.86 for ES→EN), outperforming prior multi‑stage pipelines by 14‑27 %.
Methodology
Data Preparation
- Collect large monolingual corpora of speech‑text pairs for each language (e.g., LibriSpeech, Common Voice).
- Use a high‑quality text‑to‑text NMT system to translate the monolingual transcripts into the target language, yielding pseudo‑parallel source‑target text pairs (a sketch of this step follows the list).
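A minimal sketch of this weak‑supervision step, assuming a generic Hugging Face MarianMT checkpoint (Helsinki-NLP/opus-mt-de-en) as the off‑the‑shelf MT model; the specific MT system, corpora, and data format used in the paper may differ.

```python
# Sketch: turn monolingual (audio, transcript) pairs into pseudo-parallel
# (source audio, source text, machine-translated target text) triples.
# The MT checkpoint below is illustrative, not necessarily the authors' choice.
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-de-en")

def build_pseudo_parallel(monolingual_pairs, batch_size=32):
    """monolingual_pairs: iterable of (audio_path, source_transcript)."""
    pairs = list(monolingual_pairs)
    triples = []
    for i in range(0, len(pairs), batch_size):
        chunk = pairs[i : i + batch_size]
        translations = translator([text for _, text in chunk], max_length=256)
        for (audio_path, src_text), out in zip(chunk, translations):
            triples.append((audio_path, src_text, out["translation_text"]))
    return triples
```

The resulting triples supply the source‑speech input and the target‑text supervision that the training objective below relies on.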
Model Architecture
- Encoder: a self‑supervised speech encoder (e.g., wav2vec 2.0) converts raw audio into a language‑agnostic latent representation.
- Cross‑modal bridge: a lightweight transformer aligns the speech latent space with the textual latent space learned by the NMT model.
- Decoder: a neural vocoder‑style decoder (e.g., HiFi‑GAN) synthesizes target‑language speech directly from the aligned latent vectors, preserving speaker characteristics (a simplified forward‑pass sketch follows this list).
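A highly simplified PyTorch sketch of that encoder → bridge → decoder chain. The module sizes, the convolutional stand‑in for a wav2vec 2.0‑style encoder, and the mel‑spectrogram output head are assumptions made for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class RosettaSpeechSketch(nn.Module):
    """Illustrative end-to-end S2ST skeleton: source audio in, target mel frames out."""

    def __init__(self, d_model=768, n_mels=80):
        super().__init__()
        # Stand-in for a self-supervised speech encoder (e.g., wav2vec 2.0).
        self.speech_encoder = nn.Sequential(
            nn.Conv1d(1, d_model, kernel_size=10, stride=5),
            nn.GELU(),
        )
        # Lightweight transformer "bridge" aligning speech features with the
        # text latent space learned by the NMT model.
        bridge_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
        self.bridge = nn.TransformerEncoder(bridge_layer, num_layers=4)
        # Decoder stand-in: predicts mel frames; a neural vocoder (e.g., HiFi-GAN)
        # would render these as the target-language waveform.
        self.mel_head = nn.Linear(d_model, n_mels)

    def forward(self, waveform):                 # waveform: (batch, samples)
        x = waveform.unsqueeze(1)                # (batch, 1, samples)
        feats = self.speech_encoder(x)           # (batch, d_model, frames)
        feats = feats.transpose(1, 2)            # (batch, frames, d_model)
        aligned = self.bridge(feats)             # cross-modal latent sequence
        return self.mel_head(aligned)            # (batch, frames, n_mels)

model = RosettaSpeechSketch()
dummy_audio = torch.randn(2, 16000)              # two 1-second clips at 16 kHz
print(model(dummy_audio).shape)                  # torch.Size([2, 3199, 80])
```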
Training Objective
- Speech‑to‑text loss: the encoder is first fine‑tuned to predict the source transcript (standard ASR loss).
- Text‑to‑speech loss: the bridge and decoder are trained to reconstruct the NMT‑generated target transcript as speech, using a combination of L1 spectrogram loss and adversarial vocoder loss.
- The two stages are jointly optimized, but text appears only as a supervisory signal; it never surfaces at inference time (see the combined‑objective sketch below).
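A sketch of how those two supervisory signals could be combined in one training step. The loss weights, the CTC formulation of the ASR term, and the `asr_head` / `generator_adversarial_loss` helpers are hypothetical names introduced here for illustration, not the authors' published interface.

```python
import torch.nn.functional as F

def training_step(model, batch, w_asr=1.0, w_spec=1.0, w_adv=0.1):
    """One illustrative step over pseudo-parallel triples.

    Assumed batch contents:
      src_audio      - source-language waveforms
      src_tokens     - source transcript token ids (speech-to-text supervision)
      src_token_lens - lengths of those token sequences
      tgt_mel        - a reference mel spectrogram for the NMT-generated target transcript
    """
    # 1) Speech-to-text term: encoder predicts the source transcript (CTC-style here).
    log_probs, input_lens = model.asr_head(batch["src_audio"])   # hypothetical helper
    asr_loss = F.ctc_loss(
        log_probs.transpose(0, 1),          # (time, batch, vocab)
        batch["src_tokens"],
        input_lens,
        batch["src_token_lens"],
    )

    # 2) Text-to-speech term: bridge + decoder reconstruct the target utterance.
    #    Shapes are assumed time-aligned for simplicity.
    pred_mel = model(batch["src_audio"])
    spec_loss = F.l1_loss(pred_mel, batch["tgt_mel"])

    # 3) Adversarial vocoder term (discriminator update omitted).
    adv_loss = model.generator_adversarial_loss(pred_mel)        # hypothetical helper

    return w_asr * asr_loss + w_spec * spec_loss + w_adv * adv_loss
```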
Inference
- Input: raw source audio.
- Output: synthesized target audio, generated in a single forward pass with no intermediate transcription and no separate TTS step (see the usage sketch below).
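A usage sketch of that single‑pass behavior, reusing the `RosettaSpeechSketch` class from the architecture sketch above; the file names and the vocoder call are placeholders, not a published API.

```python
import torch
import torchaudio

model = RosettaSpeechSketch()                     # trained weights would be loaded here
model.eval()

waveform, sample_rate = torchaudio.load("source_speech_de.wav")  # placeholder path

with torch.no_grad():
    mel = model(waveform)                         # one forward pass, no text emitted
    # target_audio = vocoder(mel)                 # e.g., a HiFi-GAN vocoder renders audio
    # torchaudio.save("translated_en.wav", target_audio, sample_rate)
```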
Results & Findings
| Language Pair | Metric (ASR‑BLEU) | Relative Gain vs. Prior SOTA |
|---|---|---|
| German → English | 25.17 | +27 % |
| Spanish → English | 29.86 | +14 % |
| French → English (multi‑lang) | 27.4 (approx.) | comparable to dedicated bilingual models |
- Speaker preservation: subjective listening tests gave higher speaker‑similarity scores than for cascaded ASR‑MT‑TTS pipelines.
- Data scaling: performance improves roughly logarithmically with the amount of monolingual speech‑text data, indicating that translation quality should keep improving as more public speech datasets become available.
- Single‑model multilingualism: one RosettaSpeech model handled three source languages to English without any language‑specific fine‑tuning, simplifying deployment.
Practical Implications
- Lower data barrier: Companies can now build S2ST services for low‑resource languages by relying on abundant monolingual speech recordings and existing MT models, sidestepping costly parallel speech collection.
- Simplified stack: Deploying a single end‑to‑end model reduces latency, memory footprint, and engineering overhead compared to traditional cascaded ASR‑MT‑TTS pipelines.
- Real‑time voice‑preserving translation: The direct speech‑to‑speech output keeps the speaker’s timbre, opening use‑cases in live conferencing, dubbing, and accessibility tools where voice identity matters.
- Scalable multilingual products: A single model can be extended to additional source languages by adding more monolingual data, making it attractive for global platforms (e.g., video streaming, customer support).
Limitations & Future Work
- Dependence on high‑quality text MT: The quality of the pseudo‑parallel text pairs caps the ultimate translation performance; errors in the MT step can propagate to the speech output.
- Speaker variation handling: While voice preservation is better than cascaded systems, extreme accents or noisy recordings still degrade quality.
- Evaluation scope: Benchmarks focus on European languages; further testing on truly low‑resource or tonal languages is needed.
- Future directions suggested by the authors include: integrating self‑training loops to refine the bridge without external MT, exploring multilingual vocoders for many‑to‑many translation, and extending the framework to handle code‑switching or multimodal inputs.
Authors
- Zhisheng Zheng
- Xiaohang Sun
- Tuan Dinh
- Abhishek Yanamandra
- Abhinav Jain
- Zhu Liu
- Sunil Hadap
- Vimal Bhat
- Manoj Aggarwal
- Gerard Medioni
- David Harwath
Paper Information
- arXiv ID: 2511.20974v1
- Categories: eess.AS, cs.CL, cs.LG
- Published: November 26, 2025