[Paper] AuRA: Internalizing Audio Understanding into LLMs as LoRA

Published: June 9, 2026 at 12:05 PM EDT

Source: arXiv - 2606.11033v1

Overview

Recent efforts to extend large language models (LLMs) to speech inputs typically rely on cascaded ASR-LLM pipelines, end-to-end speech-language models, or bridge/distillation-based adaptation. While these routes respectively reuse strong pretrained components, enable native speech-language interaction, or offer lightweight adaptation, they often suffer from transcript-interface latency, costly multimodal training, or sequential speech-language coupling. To address these limitations, we present AuRA, a method that distills audio encoding capability into the LLM. Specifically, AuRA feeds the same speech input to an ASR encoder (as a teacher) and a LoRA-adapted LLM (as a student) through a lightweight audio embedding layer, and uses layer-wise distillation to align the student’s hidden states with corresponding teacher representations, thereby internalizing speech representations into lightweight LLM-side adaptations. Compared with cascaded and serial bridge methods, AuRA enables tighter speech-language joint modeling and efficient parallel end-to-end inference, while also reusing pretrained speech and language models rather than requiring large-scale multimodal training. On multiple speech-language benchmarks, AuRA consistently outperforms cascaded systems, speech-to-LLM adaptation baselines, and large-scale speech-language and multimodal models in both effectiveness and efficiency.
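The layer-wise distillation described above can be illustrated with a minimal sketch. The abstract does not state the loss function, which layers are paired, or how sequence lengths and hidden sizes are matched, so all of that is assumed here: mean squared error as the distance, a hypothetical `layer_map` that pairs student layers with teacher layers, and hidden states that already share the same shape. The names `mse`, `layerwise_distill_loss`, and `layer_map` are illustrative and do not come from the paper.

```python
from typing import Dict, List

Vector = List[float]
Sequence = List[Vector]  # one hidden-state vector per audio frame


def mse(a: Sequence, b: Sequence) -> float:
    """Mean squared error between two equally shaped frame sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same number of frames")
    total, count = 0.0, 0
    for va, vb in zip(a, b):
        if len(va) != len(vb):
            raise ValueError("hidden sizes must match")
        for x, y in zip(va, vb):
            total += (x - y) ** 2
            count += 1
    return total / count


def layerwise_distill_loss(
    student_states: List[Sequence],
    teacher_states: List[Sequence],
    layer_map: Dict[int, int],
) -> float:
    """Average the MSE over the (student layer -> teacher layer) pairs.

    layer_map is an assumed pairing; the paper's actual mapping is not
    given in the abstract.
    """
    losses = [
        mse(student_states[s], teacher_states[t]) for s, t in layer_map.items()
    ]
    return sum(losses) / len(losses)


# Toy data: 2 layers each, 2 frames, hidden size 2.
teacher = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 2.0], [2.0, 2.0]],
]
student = [
    [[1.0, 0.0], [0.0, 1.0]],  # identical to teacher layer 0
    [[1.0, 2.0], [2.0, 2.0]],  # differs from teacher layer 1 in one entry
]

loss = layerwise_distill_loss(student, teacher, {0: 0, 1: 1})
print(loss)  # 0.125
```

In a real training setup the student states would come from the LoRA-adapted LLM and the teacher states from the frozen ASR encoder, with gradients flowing only into the student-side parameters; the toy lists above only stand in for those tensors.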

Key Contributions

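Based on the abstract, the paper contributes:

AuRA, a method that distills the audio encoding capability of an ASR encoder into lightweight LoRA adaptations on the LLM side.
Reported results on multiple speech-language benchmarks in which AuRA outperforms cascaded systems, speech-to-LLM adaptation baselines, and large-scale speech-language and multimodal models in both effectiveness and efficiency.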

Methodology

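AuRA feeds the same speech input to an ASR encoder, which acts as the teacher, and through a lightweight audio embedding layer to a LoRA-adapted LLM, which acts as the student. Layer-wise distillation then aligns the student's hidden states with the corresponding teacher representations. Please refer to the full paper for the detailed methodology.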
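As a rough illustration of the LoRA-adapted student mentioned in the Overview, the sketch below applies the standard LoRA update y = W x + (alpha / r) B A x, where W stays frozen and only the low-rank matrices A and B would be trained. The source does not say which LLM weights AuRA adapts, nor the rank or scaling it uses, so the shapes, values, and names here (`lora_forward`, `alpha`) are assumptions chosen for readability.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def lora_forward(W: Matrix, A: Matrix, B: Matrix, x: Vector, alpha: float) -> Vector:
    """Frozen base projection plus a scaled low-rank correction."""
    r = len(A)                       # rank = number of rows in A
    base = matvec(W, x)              # frozen pretrained weight
    delta = matvec(B, matvec(A, x))  # low-rank path: B (A x)
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]


W = [[1.0, 0.0], [0.0, 1.0]]  # frozen 2x2 weight (identity, for clarity)
A = [[1.0, 1.0]]              # r x d_in with r = 1
B_init = [[0.0], [0.0]]       # d_out x r, zero-initialized as in standard LoRA
x = [3.0, 4.0]

# With B at zero the adapter is a no-op, so the output equals W x.
y_init = lora_forward(W, A, B_init, x, alpha=2.0)
print(y_init)  # [3.0, 4.0]

# After (hypothetical) training, B is non-zero and shifts the output.
B_trained = [[0.5], [-0.5]]
y_adapted = lora_forward(W, A, B_trained, x, alpha=2.0)
print(y_adapted)  # [10.0, -3.0]
```

Because only A and B change, the pretrained LLM weights stay intact, which is what makes this kind of adaptation lightweight.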

Practical Implications

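According to the abstract, AuRA reuses pretrained speech and language models instead of requiring large-scale multimodal training, and it supports efficient parallel end-to-end inference rather than routing speech through a transcript interface or a serial bridge.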

Authors

Bo Cheng
Lei Shi
Zhanyu Ma
Yuan Wu
Jun Xu
Jiuchong Gao
Jinghua Hao
Renqing He

Paper Information

arXiv ID: 2606.11033v1
Categories: cs.LG, cs.AI, cs.CL
Published: June 9, 2026
