[Paper] UniAudio-Token: Empowering Semantic Speech Tokenizers with General Audio Perception

Published: May 29, 2026 at 12:36 PM EDT

Source: arXiv - 2605.31521v1

Overview

The paper introduces UniAudio‑Token, a universal audio tokenizer that bridges the gap between speech‑focused semantic tokenizers and general audio. By augmenting the traditional single‑codebook design with structured acoustic supervision, UniAudio‑Token retains strong linguistic alignment while also capturing non‑speech sounds, vocal timbre, and environmental cues, which makes it a more versatile front‑end for Audio‑LLMs.

Key Contributions

Semantic‑Acoustic Primitives (SAP): A three‑part decomposition (linguistic content, vocal attributes, auditory‑scene primitives) that supplies explicit, structured supervision during training.
Semantic‑Acoustic Equilibrium (SAE): A content‑aware gating mechanism that dynamically pulls fine‑grained acoustic details from shallow encoder layers, restoring information lost in the semantic bottleneck.
Unified Representation: Demonstrates that a single‑codebook tokenizer can simultaneously excel at speech transcription, speaker/style modeling, and general audio scene understanding.
Empirical Superiority: Outperforms all existing single‑codebook baselines on both audio understanding (e.g., classification, retrieval) and generation (e.g., speech synthesis, sound effect synthesis) when paired with downstream LLMs.
Open‑Source Release: Full training/inference scripts and pretrained checkpoints are made publicly available, encouraging reproducibility and community extensions.

Methodology

Base Architecture: Starts from a conventional semantic speech tokenizer (single codebook, transformer encoder).
SAP Supervision: During pre‑training, each audio segment is annotated with three primitive targets:
- Linguistic content (phoneme‑level transcription),
- Vocal attributes (speaker identity, pitch, emotion),
- Auditory‑scene primitives (background noises, music, environmental sounds).
  These targets are derived from off‑the‑shelf models (ASR, speaker verification, sound event detectors) and fed to separate heads that guide the encoder.
SAE Gating: A lightweight gating network evaluates the semantic richness of each token. For tokens deemed “speech‑heavy,” the gate suppresses shallow‑layer acoustic features; for “acoustic‑rich” tokens (e.g., music, noise), the gate opens, allowing high‑resolution features from early layers to be merged into the final token embedding.
Training Objective: A combined loss that balances semantic reconstruction (via a VQ‑VAE decoder) with the three primitive prediction losses, ensuring the model learns a compact yet information‑dense representation.
Integration with LLMs: The resulting token stream is fed to a language model (e.g., GPT‑style) that has been fine‑tuned on multimodal tasks, enabling both understanding (classification, retrieval) and generation (text‑to‑audio) through a single interface.
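
To make the SAE gating and the combined training objective described above more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names (gated_fusion, combined_loss), the sigmoid gate computed from the deep-layer features, the additive fusion rule, and the loss weights are all assumptions chosen for illustration. The summary only states that a gate suppresses shallow-layer features for speech-heavy tokens, opens for acoustic-rich tokens, and that the loss balances reconstruction against three primitive prediction losses.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def sigmoid(x: float) -> float:
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(
    deep: Sequence[Vector],
    shallow: Sequence[Vector],
    gate_weights: Vector,
    gate_bias: float = 0.0,
) -> List[List[float]]:
    """Blend shallow-layer features into deep-layer features, frame by frame.

    Assumed form (hypothetical, not taken from the paper):
        g_t   = sigmoid(gate_weights . deep_t + gate_bias)
        out_t = deep_t + g_t * shallow_t
    A gate near 0 leaves the semantic features untouched; a gate near 1
    adds the full shallow-layer acoustic detail.
    """
    fused = []
    for d, s in zip(deep, shallow):
        # Scalar gate per frame, computed from the deep (semantic) features.
        g = sigmoid(sum(w * x for w, x in zip(gate_weights, d)) + gate_bias)
        fused.append([d_i + g * s_i for d_i, s_i in zip(d, s)])
    return fused


def combined_loss(
    reconstruction: float,
    linguistic: float,
    vocal: float,
    scene: float,
    weights: Sequence[float] = (1.0, 1.0, 1.0),
) -> float:
    """Weighted sum of the reconstruction loss and the three primitive losses.

    The weights are hypothetical hyperparameters; the summary does not
    say how the terms are balanced.
    """
    w_ling, w_vocal, w_scene = weights
    return reconstruction + w_ling * linguistic + w_vocal * vocal + w_scene * scene
```

Calling gated_fusion with a strongly negative gate pre-activation returns the deep features almost unchanged, while a strongly positive one adds nearly all of the shallow features. A learned gate would replace the fixed gate_weights used here.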

Results & Findings

Task	Baseline (single‑codebook)	UniAudio‑Token	Relative Gain
Speech transcription (WER)	7.8%	6.2%	↓20%
Speaker identification (Acc.)	84.1%	90.3%	+7%
Audio event classification (mAP)	62.4	71.8	+15%
Text‑to‑audio generation (MOS)	3.9	4.5	+15%
Multimodal QA (accuracy)	71.2%	78.6%	+10%
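
The last column of the table can be recomputed from the two score columns. The short sketch below assumes the column reports relative change with respect to the baseline, i.e. (new - baseline) / baseline; the helper names are illustrative and the numbers are copied from the table.

```python
from typing import Dict, Tuple


def relative_change(baseline: float, new: float) -> float:
    """Percentage change of `new` relative to `baseline`."""
    return (new - baseline) / baseline * 100.0


# (baseline, UniAudio-Token) pairs copied from the results table.
RESULTS: Dict[str, Tuple[float, float]] = {
    "Speech transcription (WER, lower is better)": (7.8, 6.2),
    "Speaker identification (Acc.)": (84.1, 90.3),
    "Audio event classification (mAP)": (62.4, 71.8),
    "Text-to-audio generation (MOS)": (3.9, 4.5),
    "Multimodal QA (accuracy)": (71.2, 78.6),
}


def summarize(results: Dict[str, Tuple[float, float]]) -> Dict[str, float]:
    """Relative change per task, rounded to one decimal place."""
    return {
        task: round(relative_change(baseline, new), 1)
        for task, (baseline, new) in results.items()
    }


if __name__ == "__main__":
    for task, change in summarize(RESULTS).items():
        print(f"{task}: {change:+.1f}%")
```

Under this reading, every row matches the table to within about one percentage point. The WER row works out to a reduction of roughly 20.5%, which the table lists as 20%. Note that the speaker identification gain is 6.2 percentage points in absolute terms, which corresponds to about 7% in relative terms.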

Universal Representation: t‑SNE visualizations show that tokens from speech, music, and environmental sounds occupy distinct yet smoothly connected regions, confirming that the model learns a shared latent space.
Ablation: Removing SAE drops acoustic‑scene classification mAP by ~8%, while omitting SAP reduces speaker‑style fidelity, highlighting the complementary role of both innovations.
Efficiency: Despite the added gating, inference latency increases by <15% compared to the vanilla tokenizer, preserving real‑time applicability.

Practical Implications

Unified Audio Front‑End: Developers can replace multiple specialized tokenizers (speech‑only, music‑only, sound‑event) with a single UniAudio‑Token module, simplifying pipeline architecture for voice assistants, podcast editors, and AR/VR audio engines.
Better LLM Interaction: Audio‑LLMs equipped with UniAudio‑Token can understand mixed‑modality inputs (e.g., a spoken command over background music) and generate richer outputs (e.g., speech with appropriate ambient sound), opening up more natural human‑computer interactions.
Enhanced Personalization: The vocal‑attribute primitive enables fine‑grained speaker or style control without extra conditioning signals, useful for custom voice avatars or adaptive narration.
Edge Deployment: The modest latency overhead and single‑codebook footprint make it feasible to run on modern mobile or embedded GPUs, bringing sophisticated audio perception to on‑device applications.
Open‑Source Ecosystem: With the released code and checkpoints, teams can fine‑tune UniAudio‑Token on domain‑specific audio (e.g., medical auscultation, industrial monitoring) without rebuilding the entire training stack.

Limitations & Future Work

Dependency on External Primitive Labels: SAP relies on pre‑trained ASR, speaker, and sound‑event models; errors in these upstream systems can propagate into the tokenizer.
Single‑Codebook Capacity: While SAE mitigates information loss, extremely dense acoustic scenes (e.g., orchestral music) may still exceed the representational bandwidth of a single codebook.
Scalability to Very Long Audio: The current transformer encoder handles up to ~30 seconds of audio efficiently; longer sequences would benefit from hierarchical or streaming extensions.
Future Directions: The authors suggest exploring multi‑codebook hybrids, self‑supervised primitive discovery (reducing reliance on external annotators), and tighter integration with multimodal LLMs that jointly process video and text.

Authors

Yuhan Song
Linhao Zhang
Aiwei Liu
Chuhan Wu
Sijun Zhang
Wei Jia
Yuan Liu
Houfeng Wang
Xiao Zhou

Paper Information

arXiv ID: 2605.31521v1
Categories: cs.CL, cs.SD
Published: May 29, 2026
