[Paper] KIT's Submission to Cross-Lingual Voice Cloning in IWSLT 2026

Published: 5 days ago (June 5, 2026 at 09:09 AM EDT)

1 min read

Source: arXiv

Source: arXiv - 2606.07240v1

Overview

Cross-lingual voice cloning aims to generate speech in a target language while preserving speaker identity from a source-language reference. This task is central to speech translation and is the focus of the IWSLT 2026 Cross-Lingual Voice Cloning track. A key challenge is maintaining intelligibility and naturalness in the presence of accent variation and domain-specific vocabulary. We build on a multilingual text-to-speech model, FishAudio-S2-Pro, and introduce language tag prompting to improve language control and reduce accent leakage. We further apply reinforcement learning (RL) fine-tuning for task adaptation and observe improvements in intelligibility. Finally, we propose a reference-conditioned lexical matching method that improves pronunciation of domain-specific terms when lexical overlap is present. Results show that language prompting provides the largest gains, while lexical matching yields consistent improvements on matched subsets.

Key Contributions

This paper presents research in the following areas:

cs.CL
cs.SD

Methodology

Please refer to the full paper for detailed methodology.

Practical Implications

This research contributes to the advancement of cs.CL.

Authors

Seymanur Akti
Alexander Waibel

Paper Information

arXiv ID: 2606.07240v1
Categories: cs.CL, cs.SD
Published: June 5, 2026
PDF: Download PDF

[Paper] KIT's Submission to Cross-Lingual Voice Cloning in IWSLT 2026

Overview

Key Contributions

Methodology

Practical Implications

Authors

Paper Information

Related posts

[Paper] How reliable are LLMs when it comes to playing dice?

[Paper] Agentopia: Long-Term Life Simulation and Learning in Agent Societies

[Paper] MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding via Hierarchical Graph Memory and Agentic Retrieval Mechanism

[Paper] Your UnEmbedding Matrix is Secretly a Feature Lens for Text Embeddings