[Paper] UniAudio-Token: Empowering Semantic Speech Tokenizers with General Audio Perception

Published: May 29, 2026 at 12:36 PM EDT
Source: arXiv - 2605.31521v1

Overview

The paper introduces UniAudio‑Token, a universal audio tokenizer that extends speech‑focused semantic tokenizers to general audio. It augments the usual single‑codebook design with structured acoustic supervision, so the tokens keep strong linguistic alignment while also capturing non‑speech sounds, vocal timbre, and environmental cues. The authors position it as a more versatile front end for Audio‑LLMs.

Key Contributions

  • Semantic‑Acoustic Primitives (SAP): A three‑part decomposition (linguistic content, vocal attributes, auditory‑scene primitives) that supplies explicit, structured supervision during training.
  • Semantic‑Acoustic Equilibrium (SAE): A content‑aware gating mechanism that dynamically pulls fine‑grained acoustic details from shallow encoder layers, restoring information lost in the semantic bottleneck.
  • Unified Representation: Demonstrates that a single‑codebook tokenizer can simultaneously excel at speech transcription, speaker/style modeling, and general audio scene understanding.
  • Empirical Superiority: Outperforms all existing single‑codebook baselines on both audio understanding (e.g., classification, retrieval) and generation (e.g., speech synthesis, sound effect synthesis) when paired with downstream LLMs.
  • Open‑Source Release: Full training/inference scripts and pretrained checkpoints are made publicly available, encouraging reproducibility and community extensions.
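
To make the SAE idea more concrete, here is a minimal sketch of a content-aware gate that mixes shallow-layer acoustic features into a deep semantic feature for one token. This is not the authors' implementation: the function name `gated_fusion`, the scalar sigmoid gate, and the additive fusion rule are all assumptions chosen for illustration, and the feature vectors are toy values.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(deep, shallow, gate_weights, gate_bias=0.0):
    """Fuse one token's deep (semantic) and shallow (acoustic) features.

    Hypothetical rule: gate = sigmoid(gate_weights . deep + gate_bias),
    fused = deep + gate * shallow. The gate depends only on the deep
    features, which is one simple way to make it "content-aware".
    """
    if not (len(deep) == len(shallow) == len(gate_weights)):
        raise ValueError("deep, shallow and gate_weights must have equal length")
    score = sum(w * d for w, d in zip(gate_weights, deep)) + gate_bias
    gate = sigmoid(score)
    fused = [d + gate * s for d, s in zip(deep, shallow)]
    return fused, gate


# Toy example: dimension 0 stands for "speech-like", dimension 1 for "acoustic-rich".
weights = [-4.0, 4.0]
shallow = [0.5, 0.5]
fused_speech, gate_speech = gated_fusion([1.0, 0.0], shallow, weights)
fused_acoustic, gate_acoustic = gated_fusion([0.0, 1.0], shallow, weights)
print(f"speech-like token gate:   {gate_speech:.3f}")
print(f"acoustic-rich token gate: {gate_acoustic:.3f}")
```

With these toy weights the gate stays close to 0 for the speech-like token and close to 1 for the acoustic-rich one, so shallow detail is mostly added only where the content calls for it.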

Methodology

  1. Base Architecture: Starts from a conventional semantic speech tokenizer (single codebook, transformer encoder).
  2. SAP Supervision: During pre‑training, each audio segment is annotated with three primitive targets:
    • Linguistic content (phoneme‑level transcription),
    • Vocal attributes (speaker identity, pitch, emotion),
    • Auditory‑scene primitives (background noises, music, environmental sounds).
      These labels are derived from off‑the‑shelf models (ASR, speaker verification, sound event detectors) and supervise separate prediction heads that guide the encoder.
  3. SAE Gating: A lightweight gating network evaluates the semantic richness of each token. For tokens deemed “speech‑heavy,” the gate suppresses shallow‑layer acoustic features; for “acoustic‑rich” tokens (e.g., music, noise), the gate opens, allowing high‑resolution features from early layers to be merged into the final token embedding.
  4. Training Objective: A combined loss that balances semantic reconstruction (via a VQ‑VAE decoder) with the three primitive prediction losses, ensuring the model learns a compact yet information‑dense representation.
  5. Integration with LLMs: The resulting token stream is fed to a language model (e.g., GPT‑style) that has been fine‑tuned on multimodal tasks, enabling both understanding (classification, retrieval) and generation (text‑to‑audio) through a single interface.
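
The training objective in step 4 can be sketched as a weighted sum of one reconstruction term and three primitive prediction terms. The summary does not give the exact form or the weights, so the `LossWeights` class, the default weights of 1.0, and the plain linear combination below are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LossWeights:
    """Hypothetical balancing weights; the paper's actual values are not given here."""
    reconstruction: float = 1.0
    linguistic: float = 1.0
    vocal: float = 1.0
    scene: float = 1.0


def combined_loss(reconstruction: float,
                  linguistic: float,
                  vocal: float,
                  scene: float,
                  weights: Optional[LossWeights] = None) -> float:
    """Weighted sum of the reconstruction loss and the three SAP head losses."""
    w = weights if weights is not None else LossWeights()
    return (w.reconstruction * reconstruction
            + w.linguistic * linguistic
            + w.vocal * vocal
            + w.scene * scene)


# Toy per-batch loss values, not taken from the paper.
total = combined_loss(reconstruction=0.5, linguistic=0.2, vocal=0.3, scene=0.1)
scene_heavy = combined_loss(0.5, 0.2, 0.3, 0.1, LossWeights(scene=2.0))
print(total, scene_heavy)
```

Raising one weight, as in `scene_heavy`, shifts the balance toward that primitive. This is the kind of trade-off the phrase "balances semantic reconstruction with the three primitive prediction losses" points at.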

Results & Findings

Task                              | Baseline (single‑codebook) | UniAudio‑Token | Relative Gain
Speech transcription (WER)        | 7.8%                       | 6.2%           | ↓20%
Speaker identification (Acc.)     | 84.1%                      | 90.3%          | +7%
Audio event classification (mAP)  | 62.4                       | 71.8           | +15%
Text‑to‑audio generation (MOS)    | 3.9                        | 4.5            | +15%
Multimodal QA (accuracy)          | 71.2%                      | 78.6%          | +10%
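
The Relative Gain column appears to be the change relative to the baseline score rather than an absolute difference. The short check below recomputes it from the reported numbers. The helper name `relative_change` is hypothetical, and the values are copied from the rows above.

```python
def relative_change(baseline: float, new: float) -> float:
    """Percentage change of `new` relative to `baseline`."""
    return (new - baseline) / baseline * 100.0


# (task, baseline, UniAudio-Token) as reported in the results table.
rows = [
    ("Speech transcription (WER)", 7.8, 6.2),        # lower is better
    ("Speaker identification (Acc.)", 84.1, 90.3),
    ("Audio event classification (mAP)", 62.4, 71.8),
    ("Text-to-audio generation (MOS)", 3.9, 4.5),
    ("Multimodal QA (accuracy)", 71.2, 78.6),
]

changes = {task: round(relative_change(b, n), 1) for task, b, n in rows}
for task, pct in changes.items():
    print(f"{task}: {pct:+.1f}%")
```

The recomputed values (about -20.5%, +7.4%, +15.1%, +15.4%, and +10.4%) match the rounded figures in the table, with the WER entry shown as a reduction because lower is better.
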
  • Universal Representation: t‑SNE visualizations show that tokens from speech, music, and environmental sounds occupy distinct yet smoothly connected regions, which is consistent with the model learning a shared latent space.
  • Ablation: Removing SAE drops acoustic‑scene classification mAP by ~8%, while omitting SAP reduces speaker‑style fidelity, highlighting the complementary role of both innovations.
  • Efficiency: Despite the added gating, inference latency increases by less than 15% compared to the vanilla tokenizer, preserving real‑time applicability.

Practical Implications

  • Unified Audio Front‑End: Developers can replace multiple specialized tokenizers (speech‑only, music‑only, sound‑event) with a single UniAudio‑Token module, simplifying pipeline architecture for voice assistants, podcast editors, and AR/VR audio engines.
  • Better LLM Interaction: Audio‑LLMs equipped with UniAudio‑Token can understand mixed‑content audio (e.g., a spoken command over background music) and generate richer outputs (e.g., speech with appropriate ambient sound), enabling more natural human‑computer interaction.
  • Enhanced Personalization: The vocal‑attribute primitive enables fine‑grained speaker or style control without extra conditioning signals, useful for custom voice avatars or adaptive narration.
  • Edge Deployment: The modest latency overhead and single‑codebook footprint make it feasible to run on modern mobile or embedded GPUs, bringing sophisticated audio perception to on‑device applications.
  • Open‑Source Ecosystem: With the released code and checkpoints, teams can fine‑tune UniAudio‑Token on domain‑specific audio (e.g., medical auscultation, industrial monitoring) without rebuilding the entire training stack.

Limitations & Future Work

  • Dependency on External Primitive Labels: SAP relies on pre‑trained ASR, speaker, and sound‑event models; errors in these upstream systems can propagate into the tokenizer.
  • Single‑Codebook Capacity: While SAE mitigates information loss, extremely dense acoustic scenes (e.g., orchestral music) may still exceed the representational bandwidth of a single codebook.
  • Scalability to Very Long Audio: The current transformer encoder handles up to ~30 seconds of audio efficiently; longer sequences would benefit from hierarchical or streaming extensions.
  • Future Directions: The authors suggest exploring multi‑codebook hybrids, self‑supervised primitive discovery (reducing reliance on external annotators), and tighter integration with multimodal LLMs that jointly process video and text.
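
Until hierarchical or streaming variants exist, one naive workaround for the ~30 second limit is to split long recordings into fixed windows and tokenize each window separately. The sketch below only computes the window boundaries. It is not something the authors propose, the function name `split_into_windows` is hypothetical, and independent windows lose any context that crosses a boundary.

```python
def split_into_windows(num_samples: int, sample_rate: int, max_seconds: float = 30.0):
    """Return (start, end) sample-index pairs that cover [0, num_samples).

    Every window holds at most `max_seconds` of audio; the last one may be shorter.
    """
    if num_samples < 0 or sample_rate <= 0 or max_seconds <= 0:
        raise ValueError("num_samples must be >= 0; sample_rate and max_seconds must be > 0")
    window = int(sample_rate * max_seconds)
    if window == 0:
        raise ValueError("max_seconds is too small for this sample rate")
    return [(start, min(start + window, num_samples))
            for start in range(0, num_samples, window)]


# A 70-second clip at 16 kHz splits into two full windows and one 10-second tail.
windows = split_into_windows(num_samples=16000 * 70, sample_rate=16000)
print(windows)
```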

Authors

  • Yuhan Song
  • Linhao Zhang
  • Aiwei Liu
  • Chuhan Wu
  • Sijun Zhang
  • Wei Jia
  • Yuan Liu
  • Houfeng Wang
  • Xiao Zhou

Paper Information

  • arXiv ID: 2605.31521v1
  • Categories: cs.CL, cs.SD
  • Published: May 29, 2026