[Paper] Which Speech Representation Better Matches Text-Native Reasoning? A Study of Speech-Text Alignment on Frame Rate and Representation

Published: June 10, 2026 at 11:19 AM EDT

Source: arXiv - 2606.12199v1

Overview

Spoken dialogue models typically start from text LLM backbones, yet reasoning often degrades when conditioning on speech instead of text. We attribute part of this modality gap to a temporal-granularity mismatch: speech tokens are temporally redundant and far longer than text under matched semantics, diluting per-token semantic density and weakening text-native reasoning dynamics. We study speech token design as a representation selection problem and sweep frame rates under a frozen LLM backbone with a fixed information rate. To make low frame rates feasible, we introduce factorized FSQ and a lightweight non-autoregressive audio LM head, scaling capacity to nearly 300 bits/frame without sacrificing efficient prediction. With the bottleneck removed, we sweep frame rates (50 → 2.08 Hz) and alignment depth, and observe a consistent best regime for speech QA at 4.17 Hz with intermediate-layer representation alignment.
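The trade-off in the abstract can be made concrete with a little arithmetic: at a fixed information rate, lowering the frame rate means each frame has to carry more bits. The sketch below is illustrative only. The 600 bit/s information rate, the intermediate frame rates, the per-group FSQ levels, and the group count are all assumptions chosen so the numbers land near the "nearly 300 bits/frame" regime. None of them come from the paper, and the grouping shown is just one plausible reading of "factorized", not the authors' design.

```python
import math

def bits_per_frame(info_rate_bps: float, frame_rate_hz: float) -> float:
    """Bits each frame must carry to sustain a fixed information rate."""
    return info_rate_bps / frame_rate_hz

def fsq_bits(levels: list[int]) -> float:
    """Capacity of one FSQ code: a dimension with L levels adds log2(L) bits."""
    return sum(math.log2(n) for n in levels)

# Hypothetical information rate, for illustration only (not from the paper).
INFO_RATE_BPS = 600.0

# The endpoints and 4.17 Hz appear in the abstract (50/12 and 50/24 round to
# 4.17 and 2.08); the intermediate rates are assumptions.
frame_rates_hz = [50.0, 25.0, 12.5, 50.0 / 12, 50.0 / 24]

for hz in frame_rates_hz:
    print(f"{hz:6.2f} Hz -> {bits_per_frame(INFO_RATE_BPS, hz):6.1f} bits/frame")

# Hypothetical factorization: 32 independent groups, each an FSQ code with
# levels [8, 8, 8], i.e. 9 bits and 512 codes per group.
GROUP_LEVELS = [8, 8, 8]
NUM_GROUPS = 32

total_bits = NUM_GROUPS * fsq_bits(GROUP_LEVELS)
codes_per_group = math.prod(GROUP_LEVELS)

# A head that predicts each group separately needs NUM_GROUPS softmaxes of
# size codes_per_group, instead of one softmax over 2**total_bits joint codes.
factorized_logits = NUM_GROUPS * codes_per_group

print(f"capacity: {total_bits:.0f} bits/frame")
print(f"factorized head logits per frame: {factorized_logits}")
```

Under these assumed numbers, a joint codebook at the lowest frame rate would need 2^288 entries, while a per-group head needs only 16,384 logits per frame. This is one way to see why some form of factorization is needed before very low frame rates become practical.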

Key Contributions

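As described in the abstract, the paper's main contributions are:

An account of part of the speech-text modality gap as a temporal-granularity mismatch, in which speech tokens are temporally redundant and far longer than text under matched semantics.
Factorized FSQ and a lightweight non-autoregressive audio LM head, which scale capacity to nearly 300 bits/frame without sacrificing efficient prediction.
A sweep over frame rates (50 → 2.08 Hz) and alignment depth under a frozen LLM backbone, with a consistent best regime for speech QA at 4.17 Hz with intermediate-layer representation alignment.

The paper is listed under the arXiv categories eess.AS, cs.CL, and cs.SD.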

Methodology

Please refer to the full paper for detailed methodology.

Practical Implications

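For spoken dialogue models built on text LLM backbones, the reported results point to low-frame-rate speech tokens (around 4.17 Hz) with intermediate-layer representation alignment as a promising design for speech QA. The work falls under audio and speech processing (eess.AS).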

Authors

Zhen Ye
Xu Tan
Yiming Li
Guangyan Zhang
Chimin Chan
Haohe Liu
Zhengxi Liu
Hongzhan Lin
Zheqi Dai
Xinshen Zhang
Peiwen Sun
Qiuqiang Kong
Wei Xue

Paper Information

arXiv ID: 2606.12199v1
Categories: eess.AS, cs.CL, cs.SD
Published: June 10, 2026
