[Paper] Towards Audio Token Compression in Large Audio Language Models
Source: arXiv - 2511.20973v1
Overview
Large Audio Language Models (LALMs) have become the go‑to architecture for tasks that blend speech and general audio understanding: transcription, translation, and audio‑based assistants. The catch? Their attention mechanisms scale quadratically with the number of audio tokens, and audio encoders typically emit tens of tokens per second of audio, so sequences grow long quickly. This paper tackles that bottleneck by compressing the audio token stream before it reaches the language model, showing that token counts can be cut by up to 3× with only a modest hit to accuracy.
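To make the scaling concrete, here is a back‑of‑the‑envelope calculation (illustrative assumptions, not figures from the paper), assuming an encoder frame rate of 50 tokens per second, which is typical for wav2vec 2.0‑style models:

```python
# Back-of-the-envelope token arithmetic (illustrative assumptions, not figures from the paper).
FRAME_RATE_HZ = 50   # assumed encoder output rate; typical for wav2vec 2.0-style models
COMPRESSION = 3      # token reduction reported in the paper (up to 3x)

def attention_cost(num_tokens: int) -> int:
    """Self-attention work grows with the square of the sequence length."""
    return num_tokens ** 2

for seconds in (30, 300, 1800):  # a short clip, a five-minute call, a half-hour meeting
    raw = seconds * FRAME_RATE_HZ
    compressed = raw // COMPRESSION
    speedup = attention_cost(raw) / attention_cost(compressed)
    print(f"{seconds:>5} s audio: {raw:>6} -> {compressed:>6} tokens "
          f"(attention cost down ~{speedup:.0f}x)")
```

Even a 30‑second clip already yields on the order of 1,500 frame‑level tokens under these assumptions, which is why compressing the stream before the LLM pays off.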
Key Contributions
- Token‑level compression pipeline: Introduces unsupervised segmentation and uniform average pooling to reduce the number of audio tokens emitted by the encoder.
- Adapter‑based finetuning: Uses low‑rank adapters to recover performance lost during compression, keeping the bulk of the pretrained LALM frozen.
- Empirical validation on two downstream tasks: Demonstrates the approach on Automatic Speech Recognition (ASR) and Speech‑to‑Speech Translation (S2ST), both of which are highly sensitive to lexical fidelity.
- Scalability gains: Achieves up to a 3× reduction in token count, translating directly into lower memory footprints and faster inference on edge hardware.
Methodology
- Audio Encoding → Token Generation
- A pretrained audio encoder (e.g., wav2vec‑2.0 or HuBERT) processes raw waveform and outputs a dense frame‑wise representation.
- Compression Stage (pre‑LLM)
- Unsupervised segmentation: Detects natural boundaries (silences, speaker changes, acoustic events) and groups consecutive frames into segments.
- Uniform average pooling: Within each segment, frames are averaged to produce a single “compressed token”. This reduces the sequence length while preserving the overall acoustic gist (a toy sketch of this stage follows the list below).
- Adapter Finetuning
- Instead of retraining the whole LALM, the authors insert lightweight low‑rank adapters (tiny linear layers) between the encoder output and the LLM input; a minimal adapter sketch appears below.
- The adapters are trained on task‑specific data (ASR or S2ST) to adapt the compressed token distribution back to the LLM’s expectations.
- LLM Decoding
- The compressed token stream, now enriched by adapters, is fed into a large language model (e.g., GPT‑style transformer) that generates text or translated speech tokens.
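Since this summary does not pin down the exact boundary detector, the following is only a minimal sketch of the compression stage: a toy cosine‑distance heuristic stands in for the unsupervised segmentation, followed by uniform average pooling within each segment. The shapes, threshold, and boundary rule are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the pre-LLM compression stage (illustrative, not the authors' code).
# `frames` stands for the encoder output: a (T, D) array of frame-wise embeddings.
import numpy as np

def segment_boundaries(frames: np.ndarray, threshold: float = 0.3) -> list[int]:
    """Toy unsupervised segmentation: open a new segment wherever adjacent frames
    differ strongly (cosine distance above a threshold)."""
    unit = frames / (np.linalg.norm(frames, axis=1, keepdims=True) + 1e-8)
    cos_dist = 1.0 - np.sum(unit[1:] * unit[:-1], axis=1)
    return [0] + [i + 1 for i, d in enumerate(cos_dist) if d > threshold]

def compress(frames: np.ndarray, threshold: float = 0.3) -> np.ndarray:
    """Uniform average pooling within each detected segment: one token per segment."""
    bounds = segment_boundaries(frames, threshold) + [len(frames)]
    pooled = [frames[s:e].mean(axis=0) for s, e in zip(bounds[:-1], bounds[1:]) if e > s]
    return np.stack(pooled)

# Synthetic demo: 1500 frames (~30 s at 50 Hz) made of 5 piecewise-constant "events".
rng = np.random.default_rng(0)
events = rng.normal(size=(5, 768))
frames = np.repeat(events, 300, axis=0) + 0.05 * rng.normal(size=(1500, 768))
print(frames.shape, "->", compress(frames).shape)   # (1500, 768) -> (5, 768)
```

In the paper's pipeline, the pooled tokens then pass through the adapters before reaching the LLM.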
The pipeline is deliberately modular: you can swap in different encoders, segmentation heuristics, or pooling strategies without touching the massive LLM backbone.
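As a rough illustration of the adapter idea (the module name, dimensions, and residual placement below are assumptions, not details from the paper), a low‑rank adapter can be written as a small bottleneck layer applied to the compressed tokens before they enter the frozen LLM:

```python
# Sketch of a low-rank adapter between encoder output and LLM input
# (dimensions and placement are illustrative assumptions).
import torch
import torch.nn as nn

class LowRankAdapter(nn.Module):
    """Bottleneck adapter: project down to a small rank, apply a nonlinearity,
    project back up, and add a residual so the frozen LLM still sees inputs
    close to what it was pretrained on."""
    def __init__(self, dim: int = 768, rank: int = 16):
        super().__init__()
        self.down = nn.Linear(dim, rank)
        self.up = nn.Linear(rank, dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))

adapter = LowRankAdapter(dim=768, rank=16)
compressed_tokens = torch.randn(1, 500, 768)       # (batch, compressed sequence, hidden)
llm_inputs = adapter(compressed_tokens)            # fed to the frozen LLM backbone
print(sum(p.numel() for p in adapter.parameters()))  # ~25k trainable parameters
```

The appeal of this design is that only these ~25k parameters are updated per task, which is what makes the domain adaptation described later so cheap.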
Results & Findings
| Task | Baseline (frame‑level) | Compressed (3× fewer tokens) | Change (absolute, relative) |
|---|---|---|---|
| ASR | 7.8 % WER | 8.4 % WER | +0.6 % (≈ 8 % relative) |
| S2ST | 23.1 BLEU | 22.5 BLEU | –0.6 BLEU (≈ 3 % relative) |
- Token reduction: Up to 3× fewer tokens reach the LLM; since self‑attention cost grows quadratically with sequence length, that is in principle up to a ~9× cut in attention compute and a ~3× smaller token memory footprint.
- Performance trade‑off: The adapter‑finetuned compressed models stay within 0.6 % absolute WER for ASR and 0.6 BLEU for translation, well within typical production tolerances.
- Speedup: Inference latency dropped by ~30 % on a single‑GPU setup; on a low‑power edge accelerator, the gains were even more pronounced due to reduced memory bandwidth.
Practical Implications
- Edge deployment: Developers can now run LALM‑style speech interfaces on smartphones, wearables, or IoT devices without needing a full‑scale GPU.
- Long‑form audio processing: Podcast transcription, meeting summarization, or continuous listening agents become feasible because the quadratic attention cost no longer explodes with minutes‑long inputs.
- Cost‑effective scaling: Cloud providers can serve more concurrent audio streams per GPU, lowering operational expenses for services like real‑time translation or voice assistants.
- Plug‑and‑play adapters: Since only a few adapter parameters need finetuning, teams can quickly adapt a compressed LALM to new domains (medical dictation, legal proceedings) with minimal data and compute.
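A minimal sketch of what adapter‑only domain adaptation looks like in practice, assuming a wrapper model that exposes an `adapter` attribute (the names and recipe here are hypothetical, not from the paper):

```python
# Hypothetical adapter-only finetuning setup: freeze the pretrained encoder and LLM,
# optimize only the adapter parameters (attribute names below are illustrative).
import torch

def configure_adapter_finetuning(model, lr: float = 1e-4):
    # Freeze everything in the pretrained backbone ...
    for p in model.parameters():
        p.requires_grad = False
    # ... then unfreeze just the adapter.
    for p in model.adapter.parameters():
        p.requires_grad = True
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=lr)

# optimizer = configure_adapter_finetuning(lalm)   # `lalm` is the wrapped audio-LLM
# Training then proceeds with the usual task loss (e.g., cross-entropy for ASR or
# a sequence loss for translation), back-propagated only into the adapter weights.
```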
Limitations & Future Work
- Segmentation quality: The unsupervised boundary detection can mis‑group rapid speech or overlapping speakers, leading to occasional token‑level information loss.
- Adapter capacity: Low‑rank adapters recover most—but not all—of the performance gap; larger adapters improve accuracy but erode the memory savings.
- Task scope: Experiments focus on ASR and S2ST; other audio‑centric tasks (sound event detection, music transcription) may react differently to token compression.
- Future directions: The authors suggest exploring learnable pooling (e.g., attention‑based downsampling), hierarchical token compression, and joint training of encoder‑adapter‑LLM to further close the performance gap while pushing token reduction beyond 3×.
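For a sense of what the proposed learnable pooling might look like, here is a generic query‑based cross‑attention downsampler (a common pattern, not something the paper specifies); the learned queries fix the compressed sequence length:

```python
# Generic sketch of attention-based downsampling (query-based cross-attention),
# one possible form of the "learnable pooling" mentioned as future work.
import torch
import torch.nn as nn

class AttentionDownsampler(nn.Module):
    def __init__(self, dim: int = 768, num_queries: int = 500, num_heads: int = 8):
        super().__init__()
        # Learned queries define the compressed sequence length.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, T, dim) -> (batch, num_queries, dim)
        q = self.queries.unsqueeze(0).expand(frames.size(0), -1, -1)
        out, _ = self.attn(q, frames, frames)
        return out

pooled = AttentionDownsampler()(torch.randn(2, 1500, 768))
print(pooled.shape)  # torch.Size([2, 500, 768])
```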
Authors
- Saurabhchand Bhati
- Samuel Thomas
- Hilde Kuehne
- Rogerio Feris
- James Glass
Paper Information
- arXiv ID: 2511.20973v1
- Categories: eess.AS, cs.AI, cs.CL
- Published: November 26, 2025