[Paper] Benchmarking Language Modeling for Lossless Compression of Full-Fidelity Audio

Published: March 9, 2026 at 01:52 PM EDT
4 min read
Source: arXiv


Overview

The paper investigates whether modern autoregressive “language” models—originally designed for text—can be used to compress raw audio without any loss of quality. While earlier work showed promise on 8‑bit audio, this study pushes the idea to full‑fidelity 16‑ and 24‑bit recordings, covering music, speech, and even bioacoustic signals. The authors introduce a new tokenization scheme that makes high‑bit‑depth audio tractable for neural compression and benchmark the results against industry‑standard codecs like FLAC.

Key Contributions

  • Trilobyte tokenization: A byte‑level representation that keeps the vocabulary size constant regardless of audio bit depth, enabling efficient modeling of 16‑ and 24‑bit waveforms.
  • Comprehensive benchmark: Evaluation across multiple domains (music, speech, bioacoustics), sampling rates (16 kHz–48 kHz), and bit depths (8, 16, 24 bit).
  • State‑of‑the‑art compression: Demonstrates that language‑model‑based compressors consistently beat FLAC at 8‑ and 16‑bit and achieve competitive results at 24‑bit.
  • Analysis of scaling behavior: Shows how compression gains diminish as bit depth increases, highlighting the limits of current LM architectures for ultra‑high‑resolution audio.

Methodology

  1. Data preparation: Raw audio waveforms are split into fixed‑length segments. For 8‑bit audio, each sample can be treated directly as a token (0–255). For higher bit depths, the naïve approach explodes the token vocabulary (65,536 symbols for 16‑bit, ~16.7 million for 24‑bit), making training infeasible.
  2. Trilobyte tokenization: Instead of treating each sample as a single token, every sample is split into its constituent bytes, up to three per sample for 24‑bit audio (hence “Trilobyte”). This yields a constant 256‑symbol vocabulary regardless of the original bit depth, while preserving full numeric precision when the bytes are recombined during decoding.
  3. Model architecture: Autoregressive transformer‑style language models are trained to predict the next byte given the previous context, analogous to next‑word prediction in text. The models are optimized with cross‑entropy loss, which directly corresponds to the number of bits needed for entropy coding.
  4. Compression pipeline: After training, the model’s probability distribution over the next byte is fed into an arithmetic coder (or range coder) to produce a lossless bitstream. Decompression simply runs the model in inference mode to reconstruct the exact original waveform.
  5. Evaluation: Compression ratios (original size / compressed size) are measured and compared against FLAC, a widely used lossless audio codec. Experiments span diverse datasets to ensure the results are not domain‑specific.
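The paper's exact byte layout isn't specified in this summary, but the lossless split-and-recombine idea behind Trilobyte tokenization can be sketched as follows. This is a minimal illustration assuming little-endian byte order and signed two's-complement 24-bit samples; the function names are ours, not the paper's:

```python
def samples_to_bytes(samples, bytes_per_sample=3):
    """Split each signed integer sample into its constituent bytes
    (little-endian), yielding a token stream over a fixed 256-symbol
    vocabulary regardless of bit depth."""
    stream = []
    mask = (1 << (8 * bytes_per_sample)) - 1
    for s in samples:
        u = s & mask  # two's-complement wrap to an unsigned value
        for i in range(bytes_per_sample):
            stream.append((u >> (8 * i)) & 0xFF)
    return stream

def bytes_to_samples(stream, bytes_per_sample=3):
    """Recombine byte groups into the original signed samples."""
    samples = []
    half = 1 << (8 * bytes_per_sample - 1)
    for j in range(0, len(stream), bytes_per_sample):
        u = 0
        for i in range(bytes_per_sample):
            u |= stream[j + i] << (8 * i)
        samples.append(u - 2 * half if u >= half else u)  # undo the wrap
    return samples

# Round trip covers the full signed 24-bit range, losslessly.
audio = [0, 1, -1, 8388607, -8388608]
tokens = samples_to_bytes(audio)
assert bytes_to_samples(tokens) == audio
assert all(0 <= t <= 255 for t in tokens)  # constant 256-symbol vocabulary
```

Because decoding is a pure bit-level recombination, no information is lost; the language model only ever sees bytes, so its embedding table stays the same size whether the source audio is 8-, 16-, or 24-bit.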

Results & Findings

  • 8‑bit audio: The LM‑based compressor outperforms FLAC by a noticeable margin (≈ 10–15 % better compression ratio) across all tested domains.
  • 16‑bit audio: Gains persist but shrink to around 5–8 % over FLAC, still establishing LM‑based methods as competitive.
  • 24‑bit audio: The Trilobyte tokenization makes training feasible; the resulting compressor matches FLAC’s performance, with only modest improvements (≈ 2–3 %).
  • Domain robustness: Music, speech, and bioacoustic recordings all exhibit similar trends, indicating the approach is not limited to a single type of audio content.
  • Scaling insight: As the bit depth grows, the entropy of the raw signal increases faster than the model’s capacity to capture long‑range dependencies, explaining the diminishing returns.
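The link between the training objective and the reported compression ratios can be made concrete: an arithmetic coder driven by the model's predictions approaches the model's cross-entropy, so each byte with predicted probability p costs about -log2(p) bits. A toy sketch (the numbers here are illustrative, not the paper's):

```python
import math

def ideal_compressed_bits(probs):
    """Shannon code length an arithmetic coder approaches: a byte the
    model assigns probability p costs -log2(p) bits."""
    return sum(-math.log2(p) for p in probs)

def compression_ratio(n_bytes, probs):
    """Raw size (8 bits per byte) over the model-achievable coded size."""
    return (8 * n_bytes) / ideal_compressed_bits(probs)

# A model that assigns probability 0.25 to every observed byte spends
# 2 bits/byte, i.e. a 4x ratio; a near-uniform model (p = 1/256)
# spends 8 bits/byte and compresses nothing.
confident = [0.25] * 1000
uniform = [1 / 256] * 1000
assert abs(compression_ratio(1000, confident) - 4.0) < 1e-9
assert abs(compression_ratio(1000, uniform) - 1.0) < 1e-9
```

This also makes the scaling insight intuitive: as bit depth grows, the low-order bytes look increasingly noise-like, the model's per-byte probabilities drift toward uniform, and the achievable ratio shrinks toward 1.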

Practical Implications

  • Next‑gen lossless codecs: Developers of audio streaming platforms or archival services could integrate LM‑based compressors to squeeze extra bandwidth out of 8‑ and 16‑bit legacy content, especially where storage cost is a bottleneck.
  • Unified pipelines: Since the same model architecture works for music, speech, and scientific recordings, a single service could handle heterogeneous audio streams without switching codecs.
  • Edge deployment: The byte‑level tokenization keeps the model’s input size modest, making it feasible to run inference on modern GPUs or even specialized inference accelerators at the edge (e.g., on‑device music apps).
  • Research foundation: The Trilobyte scheme opens the door for future work on neural audio compression at even higher resolutions, potentially combined with perceptual weighting or hybrid coding strategies.

Limitations & Future Work

  • Diminishing returns at high bit depth: The modest gains for 24‑bit audio suggest current autoregressive models struggle to capture the extra information density; larger or more expressive architectures may be needed.
  • Inference speed: Autoregressive generation is inherently sequential, leading to slower compression/decompression compared to block‑based codecs like FLAC. Optimizations (e.g., parallel sampling, distillation) are required for real‑time use.
  • Energy consumption: Running large transformer models can be power‑hungry, which may limit applicability on battery‑constrained devices.
  • Future directions: The authors propose exploring non‑autoregressive or diffusion‑based models, integrating perceptual loss functions to prioritize audible features, and extending the benchmark to multi‑channel (e.g., surround) audio.

Authors

  • Phillip Long
  • Zachary Novack
  • Chris Donahue

Paper Information

  • arXiv ID: 2603.08683v1
  • Categories: cs.SD, cs.AI, cs.LG, eess.AS
  • Published: March 9, 2026