[Paper] How Auditory Knowledge in LLM Backbones Shapes Audio Language Models: A Holistic Evaluation
Source: arXiv - 2603.19195v1
Overview
Large language models (LLMs) are widely used as the knowledge backbones of Large Audio Language Models (LALMs), yet how much auditory knowledge they acquire from text-only pre-training, and how that knowledge affects downstream performance, remains unclear. We study this gap by comparing LLMs under two text-only settings and one audio-grounded setting: (1) direct probing on AKB-2000, a curated benchmark testing the breadth and depth of auditory knowledge; (2) cascade evaluation, where LLMs reason over text descriptions produced by an audio captioner; and (3) audio-grounded evaluation, where each LLM is fine-tuned into an LALM with an audio encoder. Our findings reveal that auditory knowledge varies substantially across model families and that text-only results correlate strongly with audio-grounded performance. This work provides empirical grounding for understanding the role of LLM backbones in audio research.
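To make the first setting concrete, here is a minimal sketch of direct knowledge probing on a text-only LLM. The summary does not specify AKB-2000's item format, so the multiple-choice layout, the item fields, and the `llm` prompt-to-text callable are all illustrative assumptions, not the paper's actual protocol.

```python
from typing import Callable, Iterable

def probe_accuracy(
    llm: Callable[[str], str],   # hypothetical prompt -> completion interface
    items: Iterable[dict],       # assumed fields: "question", "choices", "answer"
) -> float:
    """Score a text-only LLM on auditory-knowledge questions (no audio input)."""
    correct = total = 0
    for ex in items:
        choices = "\n".join(
            f"({chr(ord('A') + i)}) {c}" for i, c in enumerate(ex["choices"])
        )
        prompt = f"{ex['question']}\n{choices}\nAnswer with a single letter:"
        pred = llm(prompt).strip().upper()[:1]  # keep first letter of the reply
        correct += int(pred == ex["answer"].strip().upper())
        total += 1
    return correct / max(total, 1)
```

Because no audio is involved, any difference in this score across backbones reflects auditory knowledge absorbed purely from text pre-training.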
Key Contributions
- AKB-2000, a curated benchmark for probing the breadth and depth of auditory knowledge encoded in text-only LLMs.
- A three-way evaluation protocol spanning direct text probing, cascade evaluation over audio captions, and audio-grounded evaluation of fine-tuned LALMs.
- Empirical evidence that auditory knowledge varies substantially across LLM families and that text-only results correlate strongly with audio-grounded performance.
Methodology
The study compares LLM backbones under three settings. First, direct probing evaluates each text-only LLM on AKB-2000 to measure how much auditory knowledge it encodes. Second, cascade evaluation has an audio captioner convert audio into text descriptions, over which each LLM then reasons. Third, audio-grounded evaluation fine-tunes each LLM into an LALM by attaching an audio encoder, so the model processes audio directly. Comparing performance across these settings isolates how much of an LALM's behavior is inherited from its text-only backbone; see the full paper for details. A sketch of the cascade setting follows.
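The cascade pipeline can be sketched in a few lines, assuming the captioner and the LLM are both available as plain text-in/text-out callables; the function names, prompt template, and exact-match scoring below are illustrative assumptions rather than the paper's implementation.

```python
from typing import Callable, Iterable

def cascade_eval(
    captioner: Callable[[str], str],  # hypothetical: audio path -> text caption
    llm: Callable[[str], str],        # hypothetical: prompt -> answer
    items: Iterable[dict],            # assumed fields: "audio", "question", "answer"
) -> float:
    """Accuracy of an LLM reasoning over captions instead of raw audio."""
    correct = total = 0
    for ex in items:
        caption = captioner(ex["audio"])  # audio enters only as text
        prompt = (
            f"Audio description: {caption}\n"
            f"Question: {ex['question']}\nAnswer:"
        )
        pred = llm(prompt).strip().lower()
        correct += int(pred == ex["answer"].strip().lower())
        total += 1
    return correct / max(total, 1)
```

The design point of this setting is that perception is delegated to the captioner, so the score measures how well the LLM's textual knowledge supports reasoning about audio it never hears.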
Practical Implications
Because text-only results correlate strongly with audio-grounded performance, probing an LLM's auditory knowledge with text alone offers a cheap signal for choosing a backbone before committing to expensive audio fine-tuning. The substantial variation across model families also suggests that backbone selection, not just the audio encoder or training data, materially shapes LALM quality.
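A text-to-audio correlation of this kind can be checked with a few lines of SciPy; the sketch below assumes one accuracy per backbone in each setting, and the numbers in the usage example are dummy placeholders, not scores from the paper.

```python
from scipy.stats import pearsonr, spearmanr

def backbone_correlation(text_scores, audio_scores):
    """Correlate text-only probing accuracy with audio-grounded accuracy,
    one score per LLM backbone."""
    r, p_r = pearsonr(text_scores, audio_scores)
    rho, p_rho = spearmanr(text_scores, audio_scores)
    return {"pearson_r": r, "pearson_p": p_r,
            "spearman_rho": rho, "spearman_p": p_rho}

# Dummy illustrative values only -- not results from the paper.
print(backbone_correlation([0.41, 0.55, 0.62, 0.70],
                           [0.38, 0.52, 0.60, 0.66]))
```

A high rank correlation here would mean the cheap text-only probe preserves the ordering of backbones, which is usually what matters for model selection.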
Authors
- Ke-Han Lu
- Szu-Wei Fu
- Chao-Han Huck Yang
- Zhehuai Chen
- Sung-Feng Huang
- Chih-Kai Yang
- Yi-Cheng Lin
- Chi-Yuan Hsiao
- Wenze Ren
- En-Pei Hu
- Yu-Han Huang
- An-Yu Cheng
- Cheng-Han Chiang
- Yu Tsao
- Yu-Chiang Frank Wang
- Hung-yi Lee
Paper Information
- arXiv ID: 2603.19195v1
- Categories: eess.AS, cs.CL, cs.SD
- Published: March 19, 2026
- PDF: https://arxiv.org/pdf/2603.19195v1