[Paper] Which Speech Representation Better Matches Text-Native Reasoning? A Study of Speech-Text Alignment on Frame Rate and Representation

Published: June 10, 2026 at 11:19 AM EDT
Source: arXiv

Overview

Spoken dialogue models typically start from text LLM backbones, yet reasoning often degrades when conditioning on speech instead of text. We attribute part of this modality gap to a temporal-granularity mismatch: speech tokens are temporally redundant and far longer than text under matched semantics, diluting per-token semantic density and weakening text-native reasoning dynamics. We study speech token design as a representation selection problem and sweep frame rates under a frozen LLM backbone with a fixed information rate. To make low frame rates feasible, we introduce factorized FSQ and a lightweight non-autoregressive audio LM head, scaling capacity to nearly 300 bits/frame without sacrificing efficient prediction. With the bottleneck removed, we sweep frame rates (50 → 2.08 Hz) and alignment depth, and observe a consistent best regime for speech QA at 4.17 Hz with intermediate-layer representation alignment.
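The trade-off in the abstract between frame rate and per-frame capacity can be made concrete with a little arithmetic. The sketch below is illustrative only and is not taken from the paper: the 600 bits/s budget is a hypothetical value, and reading 4.17 Hz and 2.08 Hz as 50/12 and 50/24 is an assumption based on the rounded figures. It shows that, at a fixed information rate, each frame must carry more bits as the frame rate drops, while the sequence seen by the LLM gets shorter.

```python
import math

def bits_per_frame(info_rate_bps: float, frame_rate_hz: float) -> float:
    """Bits each frame must carry to hold a fixed information rate."""
    return info_rate_bps / frame_rate_hz

def num_frames(duration_s: float, frame_rate_hz: float) -> int:
    """Number of speech frames (tokens) needed to cover an utterance."""
    return math.ceil(duration_s * frame_rate_hz)

# Hypothetical budget, chosen only for illustration (not from the paper).
INFO_RATE_BPS = 600.0

# 50 Hz is stated in the abstract; 50/12 and 50/24 are assumed readings
# of the rounded 4.17 Hz and 2.08 Hz values.
FRAME_RATES_HZ = [50.0, 50.0 / 12, 50.0 / 24]

if __name__ == "__main__":
    for rate in FRAME_RATES_HZ:
        print(
            f"{rate:6.2f} Hz: "
            f"{bits_per_frame(INFO_RATE_BPS, rate):6.1f} bits/frame, "
            f"{num_frames(10.0, rate):4d} frames per 10 s"
        )
```

Under these assumed numbers, a 10-second utterance shrinks from 500 frames at 50 Hz to a few dozen frames at the lower rates, which is closer to the length of a text transcript, at the cost of a much larger per-frame codebook.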

Key Contributions

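Based on the abstract, the paper's main contributions are:

  • A framing of speech token design as a representation selection problem, studied by sweeping frame rates under a frozen LLM backbone with a fixed information rate.
  • Factorized FSQ and a lightweight non-autoregressive audio LM head, which scale capacity to nearly 300 bits/frame without sacrificing efficient prediction.
  • A sweep over frame rate (50 → 2.08 Hz) and alignment depth, which shows a consistent best regime for speech QA at 4.17 Hz with intermediate-layer representation alignment.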

Methodology

Please refer to the full paper for detailed methodology.

Practical Implications

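For spoken dialogue models built on text LLM backbones, the reported results suggest that a low frame rate (4.17 Hz in the paper's sweep), combined with intermediate-layer representation alignment, is a better match for text-native reasoning on speech QA than the other settings tested.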

Authors

  • Zhen Ye
  • Xu Tan
  • Yiming Li
  • Guangyan Zhang
  • Chimin Chan
  • Haohe Liu
  • Zhengxi Liu
  • Hongzhan Lin
  • Zheqi Dai
  • Xinshen Zhang
  • Peiwen Sun
  • Qiuqiang Kong
  • Wei Xue

Paper Information

  • arXiv ID: 2606.12199v1
  • Categories: eess.AS, cs.CL, cs.SD
  • Published: June 10, 2026