[Paper] RosettaSpeech: Zero-Shot Speech-to-Speech Translation from Monolingual Data
Source: arXiv - 2511.20974v1
Overview
RosettaSpeech tackles one of the biggest bottlenecks in speech‑to‑speech translation (S2ST): the near‑absence of large parallel speech corpora. By training only on monolingual speech‑text data and leveraging existing text‑to‑text machine‑translation (MT) models, the authors build a zero‑shot, end‑to‑end S2ST system that translates directly from source speech to target speech while preserving the speaker’s voice. The result is a simpler pipeline that still hits state‑of‑the‑art performance on widely used benchmarks.
Key Contributions
- Zero‑shot S2ST framework that requires no parallel speech‑to‑speech data, only monolingual speech‑text pairs plus a text‑based NMT model.
- Unified end‑to‑end architecture: during inference the model maps source audio straight to target audio, eliminating intermediate text generation and separate TTS modules.
- Many‑to‑one multilingual capability (French, Spanish, German → English) with a single model, demonstrating scalability across languages.
- Comprehensive scaling analysis showing how increasing the amount of monolingual speech‑text data improves translation quality.
- State‑of‑the‑art results on the CVSS‑C benchmark (ASR‑BLEU = 25.17 for DE→EN, 29.86 for ES→EN), outperforming prior multi‑stage pipelines by 14‑27 %.
Methodology
Data Preparation
- Collect large monolingual corpora of speech‑text pairs for each language (e.g., LibriSpeech, Common Voice).
- Use a high‑quality text‑to‑text NMT system to translate the monolingual transcripts into the target language, yielding pseudo‑parallel source‑target text pairs (a sketch of this step follows the list).
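A minimal sketch of this weak‑supervision step, assuming a generic Hugging Face MarianMT checkpoint (Helsinki-NLP/opus-mt-de-en) as the off‑the‑shelf MT model; the specific MT system, corpora, and data format used in the paper may differ.

```python
# Sketch: turn monolingual (audio, transcript) pairs into pseudo-parallel
# (source audio, source text, machine-translated target text) triples.
# The MT checkpoint below is illustrative, not necessarily the authors' choice.
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-de-en")

def build_pseudo_parallel(monolingual_pairs, batch_size=32):
    """monolingual_pairs: iterable of (audio_path, source_transcript)."""
    pairs = list(monolingual_pairs)
    triples = []
    for i in range(0, len(pairs), batch_size):
        chunk = pairs[i : i + batch_size]
        translations = translator([text for _, text in chunk], max_length=256)
        for (audio_path, src_text), out in zip(chunk, translations):
            triples.append((audio_path, src_text, out["translation_text"]))
    return triples
```

The resulting triples supply the source‑speech input and the target‑text supervision that the training objective below relies on.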
Model Architecture
- Encoder: a self‑supervised speech encoder (e.g., wav2vec 2.0) converts raw audio into a language‑agnostic latent representation.
- Cross‑modal bridge: a lightweight transformer aligns the speech latent space with the textual latent space learned by the NMT model.
- Decoder: a neural vocoder‑style decoder (e.g., HiFi‑GAN) synthesizes target‑language speech directly from the aligned latent vectors, preserving speaker characteristics (a simplified forward‑pass sketch follows this list).
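A highly simplified PyTorch sketch of that encoder → bridge → decoder chain. The module sizes, the convolutional stand‑in for a wav2vec 2.0‑style encoder, and the mel‑spectrogram output head are assumptions made for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class RosettaSpeechSketch(nn.Module):
    """Illustrative end-to-end S2ST skeleton: source audio in, target mel frames out."""

    def __init__(self, d_model=768, n_mels=80):
        super().__init__()
        # Stand-in for a self-supervised speech encoder (e.g., wav2vec 2.0).
        self.speech_encoder = nn.Sequential(
            nn.Conv1d(1, d_model, kernel_size=10, stride=5),
            nn.GELU(),
        )
        # Lightweight transformer "bridge" aligning speech features with the
        # text latent space learned by the NMT model.
        bridge_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
        self.bridge = nn.TransformerEncoder(bridge_layer, num_layers=4)
        # Decoder stand-in: predicts mel frames; a neural vocoder (e.g., HiFi-GAN)
        # would render these as the target-language waveform.
        self.mel_head = nn.Linear(d_model, n_mels)

    def forward(self, waveform):                 # waveform: (batch, samples)
        x = waveform.unsqueeze(1)                # (batch, 1, samples)
        feats = self.speech_encoder(x)           # (batch, d_model, frames)
        feats = feats.transpose(1, 2)            # (batch, frames, d_model)
        aligned = self.bridge(feats)             # cross-modal latent sequence
        return self.mel_head(aligned)            # (batch, frames, n_mels)

model = RosettaSpeechSketch()
dummy_audio = torch.randn(2, 16000)              # two 1-second clips at 16 kHz
print(model(dummy_audio).shape)                  # torch.Size([2, 3199, 80])
```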
Training Objective
- Speech‑to‑text loss: the encoder is first fine‑tuned to predict the source transcript (standard ASR loss).
- Text‑to‑speech loss: the bridge and decoder are trained to reconstruct the NMT‑generated target transcript as speech, using a combination of L1 spectrogram loss and adversarial vocoder loss.
- The two stages are jointly optimized, but text appears only as a supervisory signal; it never surfaces at inference time (see the combined‑objective sketch below).
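A sketch of how those two supervisory signals could be combined in one training step. The loss weights, the CTC formulation of the ASR term, and the `asr_head` / `generator_adversarial_loss` helpers are hypothetical names introduced here for illustration, not the authors' published interface.

```python
import torch.nn.functional as F

def training_step(model, batch, w_asr=1.0, w_spec=1.0, w_adv=0.1):
    """One illustrative step over pseudo-parallel triples.

    Assumed batch contents:
      src_audio      - source-language waveforms
      src_tokens     - source transcript token ids (speech-to-text supervision)
      src_token_lens - lengths of those token sequences
      tgt_mel        - a reference mel spectrogram for the NMT-generated target transcript
    """
    # 1) Speech-to-text term: encoder predicts the source transcript (CTC-style here).
    log_probs, input_lens = model.asr_head(batch["src_audio"])   # hypothetical helper
    asr_loss = F.ctc_loss(
        log_probs.transpose(0, 1),          # (time, batch, vocab)
        batch["src_tokens"],
        input_lens,
        batch["src_token_lens"],
    )

    # 2) Text-to-speech term: bridge + decoder reconstruct the target utterance.
    #    Shapes are assumed time-aligned for simplicity.
    pred_mel = model(batch["src_audio"])
    spec_loss = F.l1_loss(pred_mel, batch["tgt_mel"])

    # 3) Adversarial vocoder term (discriminator update omitted).
    adv_loss = model.generator_adversarial_loss(pred_mel)        # hypothetical helper

    return w_asr * asr_loss + w_spec * spec_loss + w_adv * adv_loss
```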
Inference
- Input: raw source audio.
- Output: synthesized target audio, generated in a single forward pass with no intermediate transcription and no separate TTS step (see the usage sketch below).
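A usage sketch of that single‑pass behavior, reusing the `RosettaSpeechSketch` class from the architecture sketch above; the file names and the vocoder call are placeholders, not a published API.

```python
import torch
import torchaudio

model = RosettaSpeechSketch()                     # trained weights would be loaded here
model.eval()

waveform, sample_rate = torchaudio.load("source_speech_de.wav")  # placeholder path

with torch.no_grad():
    mel = model(waveform)                         # one forward pass, no text emitted
    # target_audio = vocoder(mel)                 # e.g., a HiFi-GAN vocoder renders audio
    # torchaudio.save("translated_en.wav", target_audio, sample_rate)
```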
Results & Findings
| Language Pair | Metric (ASR‑BLEU) | Relative Gain vs. Prior SOTA |
|---|---|---|
| German → English | 25.17 | +27 % |
| Spanish → English | 29.86 | +14 % |
| French → English (multi‑lang) | 27.4 (approx.) | comparable to dedicated bilingual models |
- Speaker preservation: subjective listening tests gave higher speaker‑similarity scores than for cascaded ASR‑MT‑TTS pipelines.
- Data scaling: performance improves roughly logarithmically with the amount of monolingual speech‑text data, indicating that translation quality should keep improving as more public speech datasets become available.
- Single‑model multilingualism: one RosettaSpeech model handled three source languages to English without any language‑specific fine‑tuning, simplifying deployment.
Practical Implications
- Lower data barrier: Companies can now build S2ST services for low‑resource languages by relying on abundant monolingual speech recordings and existing MT models, sidestepping costly parallel speech collection.
- Simplified stack: Deploying a single end‑to‑end model reduces latency, memory footprint, and engineering overhead compared to traditional cascaded ASR‑MT‑TTS pipelines.
- Real‑time voice‑preserving translation: The direct speech‑to‑speech output keeps the speaker’s timbre, opening use‑cases in live conferencing, dubbing, and accessibility tools where voice identity matters.
- Scalable multilingual products: A single model can be extended to additional source languages by adding more monolingual data, making it attractive for global platforms (e.g., video streaming, customer support).
Limitations & Future Work
- Dependence on high‑quality text MT: The quality of the pseudo‑parallel text pairs caps the ultimate translation performance; errors in the MT step can propagate to the speech output.
- Speaker variation handling: While voice preservation is better than cascaded systems, extreme accents or noisy recordings still degrade quality.
- Evaluation scope: Benchmarks focus on European languages; further testing on truly low‑resource or tonal languages is needed.
- Future directions suggested by the authors include: integrating self‑training loops to refine the bridge without external MT, exploring multilingual vocoders for many‑to‑many translation, and extending the framework to handle code‑switching or multimodal inputs.
Authors
- Zhisheng Zheng
- Xiaohang Sun
- Tuan Dinh
- Abhishek Yanamandra
- Abhinav Jain
- Zhu Liu
- Sunil Hadap
- Vimal Bhat
- Manoj Aggarwal
- Gerard Medioni
- David Harwath
Paper Information
- arXiv ID: 2511.20974v1
- Categories: eess.AS, cs.CL, cs.LG
- Published: November 26, 2025