[Paper] USAD 2.0: Scaling Representation Distillation for Universal Audio Understanding

Published: June 4, 2026 at 01:42 PM EDT
Source: arXiv - 2606.06444v1

Overview

Audio encoders are critical to modern audio applications as large language models (LLMs) increasingly rely on a single encoder for diverse inputs. While self‑supervised learning (SSL) has yielded strong domain‑specific encoders like speech or music experts, multi‑domain approaches like USAD and SPEAR remain limited in coverage and evaluation. Recent studies also suggest supervised encoders align better with audio LLMs.

We present USAD 2.0, a universal encoder integrating knowledge from both SSL and supervised foundation models. USAD 2.0 introduces domain‑aware distillation to address teacher mismatch, extends coverage to the music domain, and adds second‑stage supervised distillation for downstream use. We further scale the model to one billion parameters via depth scaling. Experiments show USAD 2.0 achieves strong or state‑of‑the‑art performance across probing and LLM‑based evaluations.
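This post does not describe how domain‑aware distillation is computed, so the following is only a minimal sketch of one possible reading: each training sample is distilled solely from the teacher that matches its domain, which avoids comparing a music clip against a speech teacher. The function names, the dictionary layout, the L1 distance, and the toy teachers are all hypothetical and are not taken from the paper.

```python
# Hypothetical sketch of per-domain teacher routing for feature distillation.
# Nothing here is taken from the USAD 2.0 paper: the names, the batch layout,
# and the L1 distance are assumptions made purely for illustration.


def l1_distance(student_feat, teacher_feat):
    """Mean absolute difference between two equal-length feature vectors."""
    if len(student_feat) != len(teacher_feat):
        raise ValueError("feature vectors must have the same length")
    total = sum(abs(s - t) for s, t in zip(student_feat, teacher_feat))
    return total / len(student_feat)


def domain_aware_loss(batch, teachers):
    """Average distillation loss with one teacher per domain.

    batch: list of dicts with keys "domain", "input", "student_feat".
    teachers: dict mapping a domain name to a callable input -> features.
    """
    losses = []
    for sample in batch:
        teacher = teachers.get(sample["domain"])
        if teacher is None:
            # No matching teacher: skip instead of using a mismatched one.
            continue
        target = teacher(sample["input"])
        losses.append(l1_distance(sample["student_feat"], target))
    return sum(losses) / len(losses) if losses else 0.0


# Toy callables standing in for frozen speech and music teacher encoders.
toy_teachers = {
    "speech": lambda x: [v * 2.0 for v in x],
    "music": lambda x: [v + 1.0 for v in x],
}

toy_batch = [
    {"domain": "speech", "input": [1.0, 2.0], "student_feat": [2.0, 3.0]},
    {"domain": "music", "input": [0.0, 1.0], "student_feat": [1.0, 2.0]},
]

loss = domain_aware_loss(toy_batch, toy_teachers)
```

In a real setup the teachers would be frozen pretrained encoders and the distance would be computed on frame‑level features; the routing by domain label is the only idea this sketch is meant to show.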

Key Contributions

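  • Domain‑aware distillation to address teacher mismatch
  • Coverage extended to the music domain
  • Second‑stage supervised distillation for downstream use
  • Scaling to one billion parameters via depth scaling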

Methodology

Please refer to the full paper for detailed methodology.

Practical Implications

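According to the abstract, USAD 2.0 targets settings where an LLM relies on a single audio encoder for diverse inputs, and its second‑stage supervised distillation is aimed at downstream use. The paper is listed under eess.AS (Audio and Speech Processing).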

Authors

  • Heng‑Jui Chang
  • Alexander H. Liu
  • Saurabhchand Bhati
  • Mrudula Athi
  • Anton Ratnarajah
  • Amit Chhetri
  • James Glass

Paper Information

  • arXiv ID: 2606.06444v1
  • Categories: eess.AS, cs.CL, cs.SD
  • Published: June 4, 2026
  • PDF: Download PDF