[Paper] Multi-Faceted Interactivity Alignment in Full-Duplex Speech Models

Published: June 9, 2026 at 01:46 PM EDT
Source: arXiv - 2606.11167v1

Overview

Full-duplex spoken dialogue models can listen and speak simultaneously, making them a promising architecture for natural conversation. However, current models are trained solely with supervised learning through token-level likelihood maximization. This objective does not directly optimize interaction-level behaviors, which leads to interactivity issues such as excessive silence and ill-timed turn-taking. Recent work has applied reinforcement learning (RL) to improve interactivity, but existing methods address only a limited set of interactive behaviors in their rewards.

In this work, we propose a post-training alignment method that comprehensively improves the interactivity of full-duplex spoken dialogue models through RL. We address the four canonical axes of interactivity: pause handling, turn-taking, backchanneling, and user interruption. For each axis, we extract short audio segments from human conversation corpora and optimize the model with axis-specific reward functions. An extra LLM-based reward for response quality prevents semantic degradation.

We apply our method to two open-source models, Moshi and PersonaPlex, demonstrating consistent improvements in interactivity on both offline evaluation with pre-recorded audio and real-time multi-turn dialogue evaluation.

Key Contributions

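Based on the abstract, the paper contributes:

  • A post-training alignment method that uses RL to improve the interactivity of full-duplex spoken dialogue models.
  • Coverage of four axes of interactivity: pause handling, turn-taking, backchanneling, and user interruption.
  • Axis-specific reward functions, each optimized on short audio segments extracted from human conversation corpora.
  • An additional LLM-based reward for response quality that prevents semantic degradation.
  • Evaluation on two open-source models, Moshi and PersonaPlex, covering both offline evaluation with pre-recorded audio and real-time multi-turn dialogue.

The paper is listed under cs.CL (Computation and Language) and eess.AS (Audio and Speech Processing).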

Methodology

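The abstract outlines the approach at a high level: for each of the four interactivity axes, short audio segments are extracted from human conversation corpora, and the model is optimized through RL with an axis-specific reward function alongside an LLM-based response-quality reward. Please refer to the full paper for detailed methodology.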
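To make the idea of pairing an axis-specific reward with a response-quality reward concrete, here is a minimal Python sketch. It is not the paper's implementation: the summary does not give the actual reward definitions, so `Segment`, `frame_agreement`, `total_reward`, the per-frame speaking flags, and `quality_weight` are all hypothetical names and choices made for illustration. The frame-agreement score is a stand-in for whatever each axis really measures, and `quality_score` is assumed to be a number in [0, 1] produced elsewhere by an LLM judge.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Sequence

# The four interactivity axes named in the abstract.
AXES = ("pause_handling", "turn_taking", "backchanneling", "user_interruption")


@dataclass(frozen=True)
class Segment:
    """Hypothetical container for one short audio segment (illustration only)."""
    axis: str
    model_speaking: Sequence[bool]   # per frame: did the model produce speech?
    target_speaking: Sequence[bool]  # per frame: did the human reference speak?


def frame_agreement(segment: Segment) -> float:
    """Placeholder axis reward: fraction of frames where the model's
    speak/stay-silent decision matches the human reference."""
    n = len(segment.target_speaking)
    if n == 0 or n != len(segment.model_speaking):
        raise ValueError("segment needs equal-length, non-empty frame lists")
    matches = sum(m == t for m, t in zip(segment.model_speaking,
                                         segment.target_speaking))
    return matches / n


def total_reward(
    segment: Segment,
    quality_score: float,
    quality_weight: float = 0.5,  # assumed weighting, not taken from the paper
    axis_rewards: Optional[Dict[str, Callable[[Segment], float]]] = None,
) -> float:
    """Combine the axis-specific reward with a response-quality score."""
    if axis_rewards is None:
        # Assumption: every axis reuses the same placeholder reward here;
        # in practice each axis would have its own function.
        axis_rewards = {axis: frame_agreement for axis in AXES}
    if segment.axis not in axis_rewards:
        raise KeyError(f"no reward function for axis {segment.axis!r}")
    return axis_rewards[segment.axis](segment) + quality_weight * quality_score
```

A caller would swap in a dedicated function per axis through `axis_rewards`, for example one that penalizes speaking during a mid-turn user pause. The additive combination is only one possible design; the quality term is there so that a policy cannot raise its interactivity score by producing semantically empty responses.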

Practical Implications

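The method is a post-training step applied to existing open-source full-duplex models (Moshi and PersonaPlex). It targets behaviors that supervised training does not directly optimize, such as excessive silence and ill-timed turn-taking, while an LLM-based reward keeps response quality from degrading.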

Authors

  • Atsumoto Ohashi
  • Neil Zeghidour
  • Alexandre Défossez
  • Eugene Kharitonov

Paper Information

  • arXiv ID: 2606.11167v1
  • Categories: cs.CL, eess.AS
  • Published: June 9, 2026