[Paper] KIT's Submission to Cross-Lingual Voice Cloning in IWSLT 2026
Source: arXiv - 2606.07240v1
Overview
Cross-lingual voice cloning aims to generate speech in a target language while preserving speaker identity from a source-language reference. This task is central to speech translation and is the focus of the IWSLT 2026 Cross-Lingual Voice Cloning track. A key challenge is maintaining intelligibility and naturalness in the presence of accent variation and domain-specific vocabulary. We build on a multilingual text-to-speech model, FishAudio-S2-Pro, and introduce language tag prompting to improve language control and reduce accent leakage. We further apply reinforcement learning (RL) fine-tuning for task adaptation and observe improvements in intelligibility. Finally, we propose a reference-conditioned lexical matching method that improves pronunciation of domain-specific terms when lexical overlap is present. Results show that language prompting provides the largest gains, while lexical matching yields consistent improvements on matched subsets.
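As a rough illustration of two ideas named in the overview, the sketch below shows what language tag prompting and a reference-conditioned lexical match could look like at the text level. The abstract does not give the tag format or the matching rule, so everything here is an assumption: the `<|lang|>` tag syntax, the function names `add_language_tag` and `lexical_overlap`, and the `min_len` filter are all hypothetical. None of it is FishAudio-S2-Pro's API or the authors' implementation.

```python
import re


def add_language_tag(text: str, lang: str) -> str:
    """Prepend a language tag to the text to be synthesized.

    The "<|xx|>" tag syntax is a hypothetical placeholder, not the
    format used by the paper or by any specific TTS model.
    """
    return f"<|{lang}|> {text}"


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def lexical_overlap(reference_transcript: str, target_text: str,
                    min_len: int = 4) -> list[str]:
    """Return target-text words that also occur in the reference transcript.

    Assumption: a case-insensitive exact word match stands in for the
    paper's matching method. Words shorter than min_len are skipped as a
    crude filter against function words.
    """
    ref_tokens = set(tokenize(reference_transcript))
    seen: set[str] = set()
    matches: list[str] = []
    for tok in tokenize(target_text):
        if len(tok) >= min_len and tok in ref_tokens and tok not in seen:
            seen.add(tok)
            matches.append(tok)
    return matches


if __name__ == "__main__":
    reference = "We fine-tune the Transformer encoder."
    target = "Der Transformer Encoder wird angepasst."
    print(add_language_tag(target, "de"))
    print(lexical_overlap(reference, target))
```

In a setup like this, the tagged text would be passed to the synthesis model, and the overlapping terms would mark where the reference audio already contains a pronunciation of a domain-specific word.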
Key Contributions
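This paper makes the following contributions:
- Language tag prompting on top of the multilingual text-to-speech model FishAudio-S2-Pro, to improve language control and reduce accent leakage.
- Reinforcement learning (RL) fine-tuning for task adaptation, with observed improvements in intelligibility.
- A reference-conditioned lexical matching method that improves pronunciation of domain-specific terms when lexical overlap is present.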
Methodology
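The system builds on the multilingual text-to-speech model FishAudio-S2-Pro and combines language tag prompting, RL fine-tuning, and reference-conditioned lexical matching. Please refer to the full paper for detailed methodology.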
Practical Implications
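For speech translation systems that must preserve speaker identity across languages, the reported results indicate that language prompting provides the largest gains, while lexical matching yields consistent improvements on matched subsets.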
Authors
- Seymanur Akti
- Alexander Waibel
Paper Information
- arXiv ID: 2606.07240v1
- Categories: cs.CL, cs.SD
- Published: June 5, 2026