[Paper] How Auditory Knowledge in LLM Backbones Shapes Audio Language Models: A Holistic Evaluation
Source: arXiv - 2603.19195v1
Overview
Large language models (LLMs) are widely used as the knowledge backbones of Large Audio Language Models (LALMs), yet how much auditory knowledge they acquire from text-only pre-training, and how that knowledge affects downstream performance, remains unclear. We study this gap by comparing LLMs under two text-only settings and one audio-grounded setting: (1) direct probing on AKB-2000, a curated benchmark testing the breadth and depth of auditory knowledge; (2) cascade evaluation, where LLMs reason over text descriptions produced by an audio captioner; and (3) audio-grounded evaluation, where each LLM is fine-tuned into an LALM with an audio encoder. Our findings reveal that auditory knowledge varies substantially across model families and that text-only results correlate strongly with audio-grounded performance. This work provides empirical grounding for understanding the role of LLM backbones in audio research.
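To make the first setting concrete, here is a minimal sketch of direct knowledge probing on a text-only LLM. The summary does not specify AKB-2000's item format, so the multiple-choice layout, the item fields, and the `llm` prompt-to-text callable are all illustrative assumptions, not the paper's actual protocol.

```python
from typing import Callable, Iterable

def probe_accuracy(
    llm: Callable[[str], str],   # hypothetical prompt -> completion interface
    items: Iterable[dict],       # assumed fields: "question", "choices", "answer"
) -> float:
    """Score a text-only LLM on auditory-knowledge questions (no audio input)."""
    correct = total = 0
    for ex in items:
        choices = "\n".join(
            f"({chr(ord('A') + i)}) {c}" for i, c in enumerate(ex["choices"])
        )
        prompt = f"{ex['question']}\n{choices}\nAnswer with a single letter:"
        pred = llm(prompt).strip().upper()[:1]  # keep first letter of the reply
        correct += int(pred == ex["answer"].strip().upper())
        total += 1
    return correct / max(total, 1)
```

Because no audio is involved, any difference in this score across backbones reflects auditory knowledge absorbed purely from text pre-training.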
Key Contributions
- AKB-2000, a curated benchmark for probing the breadth and depth of auditory knowledge encoded in text-only LLMs.
- A three-way evaluation protocol spanning direct text probing, cascade evaluation over audio captions, and audio-grounded evaluation of fine-tuned LALMs.
- Empirical evidence that auditory knowledge varies substantially across LLM families and that text-only results correlate strongly with audio-grounded performance.
Methodology
The study compares LLM backbones under three settings. First, direct probing evaluates each text-only LLM on AKB-2000 to measure how much auditory knowledge it encodes. Second, cascade evaluation has an audio captioner convert audio into text descriptions, over which each LLM then reasons. Third, audio-grounded evaluation fine-tunes each LLM into an LALM by attaching an audio encoder, so the model processes audio directly. Comparing performance across these settings isolates how much of an LALM's behavior is inherited from its text-only backbone; see the full paper for details. A sketch of the cascade setting follows.
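The cascade pipeline can be sketched in a few lines, assuming the captioner and the LLM are both available as plain text-in/text-out callables; the function names, prompt template, and exact-match scoring below are illustrative assumptions rather than the paper's implementation.

```python
from typing import Callable, Iterable

def cascade_eval(
    captioner: Callable[[str], str],  # hypothetical: audio path -> text caption
    llm: Callable[[str], str],        # hypothetical: prompt -> answer
    items: Iterable[dict],            # assumed fields: "audio", "question", "answer"
) -> float:
    """Accuracy of an LLM reasoning over captions instead of raw audio."""
    correct = total = 0
    for ex in items:
        caption = captioner(ex["audio"])  # audio enters only as text
        prompt = (
            f"Audio description: {caption}\n"
            f"Question: {ex['question']}\nAnswer:"
        )
        pred = llm(prompt).strip().lower()
        correct += int(pred == ex["answer"].strip().lower())
        total += 1
    return correct / max(total, 1)
```

The design point of this setting is that perception is delegated to the captioner, so the score measures how well the LLM's textual knowledge supports reasoning about audio it never hears.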
Practical Implications
Because text-only results correlate strongly with audio-grounded performance, probing an LLM's auditory knowledge with text alone offers a cheap signal for choosing a backbone before committing to expensive audio fine-tuning. The substantial variation across model families also suggests that backbone selection, not just the audio encoder or training data, materially shapes LALM quality.
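A text-to-audio correlation of this kind can be checked with a few lines of SciPy; the sketch below assumes one accuracy per backbone in each setting, and the numbers in the usage example are dummy placeholders, not scores from the paper.

```python
from scipy.stats import pearsonr, spearmanr

def backbone_correlation(text_scores, audio_scores):
    """Correlate text-only probing accuracy with audio-grounded accuracy,
    one score per LLM backbone."""
    r, p_r = pearsonr(text_scores, audio_scores)
    rho, p_rho = spearmanr(text_scores, audio_scores)
    return {"pearson_r": r, "pearson_p": p_r,
            "spearman_rho": rho, "spearman_p": p_rho}

# Dummy illustrative values only -- not results from the paper.
print(backbone_correlation([0.41, 0.55, 0.62, 0.70],
                           [0.38, 0.52, 0.60, 0.66]))
```

A high rank correlation here would mean the cheap text-only probe preserves the ordering of backbones, which is usually what matters for model selection.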
Authors
- Ke-Han Lu
- Szu-Wei Fu
- Chao-Han Huck Yang
- Zhehuai Chen
- Sung-Feng Huang
- Chih-Kai Yang
- Yi-Cheng Lin
- Chi-Yuan Hsiao
- Wenze Ren
- En-Pei Hu
- Yu-Han Huang
- An-Yu Cheng
- Cheng-Han Chiang
- Yu Tsao
- Yu-Chiang Frank Wang
- Hung-yi Lee
Paper Information
- arXiv ID: 2603.19195v1
- Categories: eess.AS, cs.CL, cs.SD
- Published: March 19, 2026
- PDF: https://arxiv.org/pdf/2603.19195v1