[Paper] USAD 2.0: Scaling Representation Distillation for Universal Audio Understanding

Published: June 4, 2026 at 01:42 PM EDT

Source: arXiv - 2606.06444v1

Overview

Audio encoders are critical to modern audio applications as large language models (LLMs) increasingly rely on a single encoder for diverse inputs. While self‑supervised learning (SSL) has yielded strong domain‑specific encoders like speech or music experts, multi‑domain approaches like USAD and SPEAR remain limited in coverage and evaluation. Recent studies also suggest supervised encoders align better with audio LLMs.

We present USAD 2.0, a universal encoder integrating knowledge from both SSL and supervised foundation models. USAD 2.0 introduces domain‑aware distillation to address teacher mismatch, extends coverage to the music domain, and adds second‑stage supervised distillation for downstream use. We further scale the model to one billion parameters via depth scaling. Experiments show USAD 2.0 achieves strong or state‑of‑the‑art performance across probing and LLM‑based evaluations.
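The summary does not say how domain‑aware distillation is implemented. As a minimal sketch of one plausible reading, the snippet below assumes each training sample carries a domain label and is matched only against the teacher for that domain, using a plain mean‑squared‑error feature loss. The function name, the domain labels, the per‑teacher feature layout, and the loss itself are all hypothetical illustrations, not the paper's formulation.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def domain_aware_distill_loss(
    student_feats: List[Vector],
    domains: List[str],
    teacher_feats_by_domain: Dict[str, List[Vector]],
) -> float:
    """Average MSE between each student feature and its domain-matched teacher.

    ASSUMPTION: every teacher has produced features for the whole batch, and
    only the teacher whose domain matches the sample's label is used as target.
    This routing rule is illustrative, not taken from the paper.
    """
    if len(student_feats) != len(domains):
        raise ValueError("student_feats and domains must have the same length")

    total = 0.0
    for i, (feat, domain) in enumerate(zip(student_feats, domains)):
        # Pick the target from the teacher that matches this sample's domain.
        target = teacher_feats_by_domain[domain][i]
        # Per-sample mean squared error over the feature dimension.
        total += sum((s - t) ** 2 for s, t in zip(feat, target)) / len(feat)
    return total / len(student_feats)


# Toy batch: sample 0 is speech, sample 1 is music (hypothetical labels).
student = [[1.0, 2.0], [0.0, 0.0]]
labels = ["speech", "music"]
teachers = {
    "speech": [[1.0, 2.0], [9.0, 9.0]],  # second row is ignored (wrong domain)
    "music": [[5.0, 5.0], [1.0, 1.0]],   # first row is ignored (wrong domain)
}
loss = domain_aware_distill_loss(student, labels, teachers)
print(loss)
```

In this toy batch the speech sample matches its teacher exactly and the music sample is off by 1.0 in each dimension, so the mismatched teacher rows never contribute to the loss.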

Key Contributions

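Domain‑aware distillation to address teacher mismatch
Extended coverage to the music domain
Second‑stage supervised distillation for downstream use
Scaling to one billion parameters via depth scaling
Strong or state‑of‑the‑art performance across probing and LLM‑based evaluations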

Methodology

Please refer to the full paper for detailed methodology.

Practical Implications

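This research advances audio and speech processing (eess.AS). A single universal encoder of this kind could serve audio LLMs that rely on one encoder for diverse inputs such as speech and music.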

Authors

Heng‑Jui Chang
Alexander H. Liu
Saurabhchand Bhati
Mrudula Athi
Anton Ratnarajah
Amit Chhetri
James Glass

Paper Information

arXiv ID: 2606.06444v1
Categories: eess.AS, cs.CL, cs.SD
Published: June 4, 2026
