[Paper] Towards Audio Token Compression in Large Audio Language Models
Source: arXiv - 2511.20973v1
Overview
Large Audio Language Models (LALMs) have become the go‑to architecture for tasks that blend speech and general audio understanding: transcription, translation, and audio‑based assistants. The catch? Their attention mechanisms scale quadratically with the number of audio tokens, and audio encoders typically emit tens of tokens per second of audio, so sequences grow long quickly. This paper tackles that bottleneck by compressing the audio token stream before it reaches the language model, showing that token counts can be cut by up to 3× with only a modest hit to accuracy.
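To make the scaling concrete, here is a back‑of‑the‑envelope calculation (illustrative assumptions, not figures from the paper), assuming an encoder frame rate of 50 tokens per second, which is typical for wav2vec 2.0‑style models:

```python
# Back-of-the-envelope token arithmetic (illustrative assumptions, not figures from the paper).
FRAME_RATE_HZ = 50   # assumed encoder output rate; typical for wav2vec 2.0-style models
COMPRESSION = 3      # token reduction reported in the paper (up to 3x)

def attention_cost(num_tokens: int) -> int:
    """Self-attention work grows with the square of the sequence length."""
    return num_tokens ** 2

for seconds in (30, 300, 1800):  # a short clip, a five-minute call, a half-hour meeting
    raw = seconds * FRAME_RATE_HZ
    compressed = raw // COMPRESSION
    speedup = attention_cost(raw) / attention_cost(compressed)
    print(f"{seconds:>5} s audio: {raw:>6} -> {compressed:>6} tokens "
          f"(attention cost down ~{speedup:.0f}x)")
```

Even a 30‑second clip already yields on the order of 1,500 frame‑level tokens under these assumptions, which is why compressing the stream before the LLM pays off.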
Key Contributions
- Token‑level compression pipeline: Introduces unsupervised segmentation and uniform average pooling to reduce the number of audio tokens emitted by the encoder.
- Adapter‑based finetuning: Uses low‑rank adapters to recover performance lost during compression, keeping the bulk of the pretrained LALM frozen.
- Empirical validation on two downstream tasks: Demonstrates the approach on Automatic Speech Recognition (ASR) and Speech‑to‑Speech Translation (S2ST), both of which are highly sensitive to lexical fidelity.
- Scalability gains: Achieves up to a 3× reduction in token count, translating directly into lower memory footprints and faster inference on edge hardware.
Methodology
- Audio Encoding → Token Generation
- A pretrained audio encoder (e.g., wav2vec‑2.0 or HuBERT) processes raw waveform and outputs a dense frame‑wise representation.
- Compression Stage (pre‑LLM)
- Unsupervised segmentation: Detects natural boundaries (silences, speaker changes, acoustic events) and groups consecutive frames into segments.
- Uniform average pooling: Within each segment, frames are averaged to produce a single “compressed token”. This reduces the sequence length while preserving the overall acoustic gist (a toy sketch of this stage follows the list below).
- Adapter Finetuning
- Instead of retraining the whole LALM, the authors insert lightweight low‑rank adapters (tiny linear layers) between the encoder output and the LLM input; a minimal adapter sketch appears below.
- The adapters are trained on task‑specific data (ASR or S2ST) to adapt the compressed token distribution back to the LLM’s expectations.
- LLM Decoding
- The compressed token stream, now enriched by adapters, is fed into a large language model (e.g., GPT‑style transformer) that generates text or translated speech tokens.
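Since this summary does not pin down the exact boundary detector, the following is only a minimal sketch of the compression stage: a toy cosine‑distance heuristic stands in for the unsupervised segmentation, followed by uniform average pooling within each segment. The shapes, threshold, and boundary rule are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the pre-LLM compression stage (illustrative, not the authors' code).
# `frames` stands for the encoder output: a (T, D) array of frame-wise embeddings.
import numpy as np

def segment_boundaries(frames: np.ndarray, threshold: float = 0.3) -> list[int]:
    """Toy unsupervised segmentation: open a new segment wherever adjacent frames
    differ strongly (cosine distance above a threshold)."""
    unit = frames / (np.linalg.norm(frames, axis=1, keepdims=True) + 1e-8)
    cos_dist = 1.0 - np.sum(unit[1:] * unit[:-1], axis=1)
    return [0] + [i + 1 for i, d in enumerate(cos_dist) if d > threshold]

def compress(frames: np.ndarray, threshold: float = 0.3) -> np.ndarray:
    """Uniform average pooling within each detected segment: one token per segment."""
    bounds = segment_boundaries(frames, threshold) + [len(frames)]
    pooled = [frames[s:e].mean(axis=0) for s, e in zip(bounds[:-1], bounds[1:]) if e > s]
    return np.stack(pooled)

# Synthetic demo: 1500 frames (~30 s at 50 Hz) made of 5 piecewise-constant "events".
rng = np.random.default_rng(0)
events = rng.normal(size=(5, 768))
frames = np.repeat(events, 300, axis=0) + 0.05 * rng.normal(size=(1500, 768))
print(frames.shape, "->", compress(frames).shape)   # (1500, 768) -> (5, 768)
```

In the paper's pipeline, the pooled tokens then pass through the adapters before reaching the LLM.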
The pipeline is deliberately modular: you can swap in different encoders, segmentation heuristics, or pooling strategies without touching the massive LLM backbone.
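As a rough illustration of the adapter idea (the module name, dimensions, and residual placement below are assumptions, not details from the paper), a low‑rank adapter can be written as a small bottleneck layer applied to the compressed tokens before they enter the frozen LLM:

```python
# Sketch of a low-rank adapter between encoder output and LLM input
# (dimensions and placement are illustrative assumptions).
import torch
import torch.nn as nn

class LowRankAdapter(nn.Module):
    """Bottleneck adapter: project down to a small rank, apply a nonlinearity,
    project back up, and add a residual so the frozen LLM still sees inputs
    close to what it was pretrained on."""
    def __init__(self, dim: int = 768, rank: int = 16):
        super().__init__()
        self.down = nn.Linear(dim, rank)
        self.up = nn.Linear(rank, dim)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))

adapter = LowRankAdapter(dim=768, rank=16)
compressed_tokens = torch.randn(1, 500, 768)       # (batch, compressed sequence, hidden)
llm_inputs = adapter(compressed_tokens)            # fed to the frozen LLM backbone
print(sum(p.numel() for p in adapter.parameters()))  # ~25k trainable parameters
```

The appeal of this design is that only these ~25k parameters are updated per task, which is what makes the domain adaptation described later so cheap.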
Results & Findings
| Task | Baseline (frame‑level) | Compressed (3× fewer tokens) | Change (absolute, relative) |
|---|---|---|---|
| ASR | 7.8 % WER | 8.4 % WER | +0.6 % (≈ 8 % relative) |
| S2ST | 23.1 BLEU | 22.5 BLEU | –0.6 BLEU (≈ 3 % relative) |
- Token reduction: Up to 3× fewer tokens reach the LLM; since self‑attention cost grows quadratically with sequence length, that is in principle up to a ~9× cut in attention compute and a ~3× smaller token memory footprint.
- Performance trade‑off: The adapter‑finetuned compressed models stay within 0.6 % absolute WER for ASR and 0.6 BLEU for translation, well within typical production tolerances.
- Speedup: Inference latency dropped by ~30 % on a single‑GPU setup; on a low‑power edge accelerator, the gains were even more pronounced due to reduced memory bandwidth.
Practical Implications
- Edge deployment: Developers can now run LALM‑style speech interfaces on smartphones, wearables, or IoT devices without needing a full‑scale GPU.
- Long‑form audio processing: Podcast transcription, meeting summarization, or continuous listening agents become feasible because the quadratic attention cost no longer explodes with minutes‑long inputs.
- Cost‑effective scaling: Cloud providers can serve more concurrent audio streams per GPU, lowering operational expenses for services like real‑time translation or voice assistants.
- Plug‑and‑play adapters: Since only a few adapter parameters need finetuning, teams can quickly adapt a compressed LALM to new domains (medical dictation, legal proceedings) with minimal data and compute.
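A minimal sketch of what adapter‑only domain adaptation looks like in practice, assuming a wrapper model that exposes an `adapter` attribute (the names and recipe here are hypothetical, not from the paper):

```python
# Hypothetical adapter-only finetuning setup: freeze the pretrained encoder and LLM,
# optimize only the adapter parameters (attribute names below are illustrative).
import torch

def configure_adapter_finetuning(model, lr: float = 1e-4):
    # Freeze everything in the pretrained backbone ...
    for p in model.parameters():
        p.requires_grad = False
    # ... then unfreeze just the adapter.
    for p in model.adapter.parameters():
        p.requires_grad = True
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=lr)

# optimizer = configure_adapter_finetuning(lalm)   # `lalm` is the wrapped audio-LLM
# Training then proceeds with the usual task loss (e.g., cross-entropy for ASR or
# a sequence loss for translation), back-propagated only into the adapter weights.
```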
Limitations & Future Work
- Segmentation quality: The unsupervised boundary detection can mis‑group rapid speech or overlapping speakers, leading to occasional token‑level information loss.
- Adapter capacity: Low‑rank adapters recover most—but not all—of the performance gap; larger adapters improve accuracy but erode the memory savings.
- Task scope: Experiments focus on ASR and S2ST; other audio‑centric tasks (sound event detection, music transcription) may react differently to token compression.
- Future directions: The authors suggest exploring learnable pooling (e.g., attention‑based downsampling), hierarchical token compression, and joint training of encoder‑adapter‑LLM to further close the performance gap while pushing token reduction beyond 3×.
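For a sense of what the proposed learnable pooling might look like, here is a generic query‑based cross‑attention downsampler (a common pattern, not something the paper specifies); the learned queries fix the compressed sequence length:

```python
# Generic sketch of attention-based downsampling (query-based cross-attention),
# one possible form of the "learnable pooling" mentioned as future work.
import torch
import torch.nn as nn

class AttentionDownsampler(nn.Module):
    def __init__(self, dim: int = 768, num_queries: int = 500, num_heads: int = 8):
        super().__init__()
        # Learned queries define the compressed sequence length.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, T, dim) -> (batch, num_queries, dim)
        q = self.queries.unsqueeze(0).expand(frames.size(0), -1, -1)
        out, _ = self.attn(q, frames, frames)
        return out

pooled = AttentionDownsampler()(torch.randn(2, 1500, 768))
print(pooled.shape)  # torch.Size([2, 500, 768])
```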
Authors
- Saurabhchand Bhati
- Samuel Thomas
- Hilde Kuehne
- Rogerio Feris
- James Glass
Paper Information
- arXiv ID: 2511.20973v1
- Categories: eess.AS, cs.AI, cs.CL
- Published: November 26, 2025