[Paper] KIT's Submission to Cross-Lingual Voice Cloning in IWSLT 2026

Published: June 5, 2026 at 09:09 AM EDT
Source: arXiv - 2606.07240v1

Overview

Cross-lingual voice cloning aims to generate speech in a target language while preserving speaker identity from a source-language reference. This task is central to speech translation and is the focus of the IWSLT 2026 Cross-Lingual Voice Cloning track. A key challenge is maintaining intelligibility and naturalness in the presence of accent variation and domain-specific vocabulary. We build on a multilingual text-to-speech model, FishAudio-S2-Pro, and introduce language tag prompting to improve language control and reduce accent leakage. We further apply reinforcement learning (RL) fine-tuning for task adaptation and observe improvements in intelligibility. Finally, we propose a reference-conditioned lexical matching method that improves pronunciation of domain-specific terms when lexical overlap is present. Results show that language prompting provides the largest gains, while lexical matching yields consistent improvements on matched subsets.
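To make the idea of language tag prompting more concrete, here is a minimal sketch that prepends a language tag to the text before it is passed to a text-to-speech model. The tag format (`<|de|>`), the `LANGUAGE_TAGS` table, and the `add_language_tag` helper are all assumptions made for illustration; the overview does not state which tags the authors use or how FishAudio-S2-Pro consumes them.

```python
# Hypothetical tag format: the paper's actual tags may look different.
LANGUAGE_TAGS = {
    "en": "<|en|>",
    "de": "<|de|>",
    "zh": "<|zh|>",
}


def add_language_tag(text: str, lang: str) -> str:
    """Prepend an explicit language tag to the text to be synthesized."""
    try:
        tag = LANGUAGE_TAGS[lang]
    except KeyError:
        raise ValueError(f"unsupported language code: {lang!r}") from None
    # Strip surrounding whitespace so the prompt has a single, predictable shape.
    return f"{tag} {text.strip()}"


if __name__ == "__main__":
    print(add_language_tag("Guten Tag, wie geht es Ihnen?", "de"))
```

The point of the explicit tag in this sketch is that the model no longer has to infer the target language from the text alone, which is one plausible way the reported reduction in accent leakage could come about.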

Key Contributions

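The paper makes three contributions:

  • Language tag prompting on top of FishAudio-S2-Pro to improve language control and reduce accent leakage.
  • Reinforcement learning (RL) fine-tuning for task adaptation, with observed improvements in intelligibility.
  • A reference-conditioned lexical matching method that improves pronunciation of domain-specific terms when lexical overlap is present.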

Methodology

Please refer to the full paper for detailed methodology.
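As a rough illustration of what "lexical overlap" between a reference and a target text could mean, the sketch below finds terms that appear in both the reference transcript and the text to be synthesized. This is not the authors' lexical matching method, which the overview does not describe; the helper names (`tokenize`, `shared_terms`, `is_matched`) and the `min_length` threshold are hypothetical.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def shared_terms(reference_transcript: str, target_text: str,
                 min_length: int = 4) -> set[str]:
    """Return target-text terms that also occur in the reference transcript.

    min_length is an assumed heuristic that skips short function words.
    """
    reference_vocab = set(tokenize(reference_transcript))
    return {
        token
        for token in tokenize(target_text)
        if len(token) >= min_length and token in reference_vocab
    }


def is_matched(reference_transcript: str, target_text: str) -> bool:
    """A pair counts as 'matched' here if at least one term is shared."""
    return bool(shared_terms(reference_transcript, target_text))


if __name__ == "__main__":
    reference = "We train a transformer encoder with LoRA adapters."
    target = "Wir trainieren einen Transformer mit LoRA."
    print(sorted(shared_terms(reference, target)))
```

Across languages, overlap of this kind tends to arise for domain-specific terms and names that are not translated, which fits the overview's note that gains appear on matched subsets.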

Practical Implications

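For speech translation systems that must preserve speaker identity, the results suggest that explicit language prompting is the most effective of the three techniques, while lexical matching helps when the reference and the target text share domain-specific terms.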

Authors

  • Seymanur Akti
  • Alexander Waibel

Paper Information

  • arXiv ID: 2606.07240v1
  • Categories: cs.CL, cs.SD
  • Published: June 5, 2026