[Paper] Benchmarking Language Modeling for Lossless Compression of Full-Fidelity Audio

Published: March 9, 2026 at 01:52 PM EDT
4 min read
Source: arXiv


Overview

The paper investigates whether modern autoregressive “language” models—originally designed for text—can be used to compress raw audio without any loss of quality. While earlier work showed promise on 8‑bit audio, this study pushes the idea to full‑fidelity 16‑ and 24‑bit recordings, covering music, speech, and even bioacoustic signals. The authors introduce a new tokenization scheme that makes high‑bit‑depth audio tractable for neural compression and benchmark the results against industry‑standard codecs like FLAC.

Key Contributions

  • Trilobyte tokenization: A byte‑level representation that keeps the vocabulary size constant regardless of audio bit depth, enabling efficient modeling of 16‑ and 24‑bit waveforms.
  • Comprehensive benchmark: Evaluation across multiple domains (music, speech, bioacoustics), sampling rates (16 kHz–48 kHz), and bit depths (8, 16, 24 bit).
  • State‑of‑the‑art compression: Demonstrates that language‑model‑based compressors consistently beat FLAC at 8‑ and 16‑bit and achieve competitive results at 24‑bit.
  • Analysis of scaling behavior: Shows how compression gains diminish as bit depth increases, highlighting the limits of current LM architectures for ultra‑high‑resolution audio.

Methodology

  1. Data preparation: Raw audio waveforms are split into fixed‑length segments. For 8‑bit audio, each sample can be treated directly as a token (0–255). For higher bit depths, the naïve approach explodes the token vocabulary (65,536 symbols for 16‑bit, ~16.7 million for 24‑bit), making training infeasible.
  2. Trilobyte tokenization: Instead of treating each sample as a single token, every sample is split into its constituent bytes, up to three per sample for 24‑bit audio (hence “Trilobyte”). This yields a constant 256‑symbol vocabulary regardless of the original bit depth, while preserving full numeric precision when the bytes are recombined during decoding.
  3. Model architecture: Autoregressive transformer‑style language models are trained to predict the next byte given the previous context, analogous to next‑word prediction in text. The models are optimized with cross‑entropy loss, which directly corresponds to the number of bits needed for entropy coding.
  4. Compression pipeline: After training, the model’s probability distribution over the next byte is fed into an arithmetic coder (or range coder) to produce a lossless bitstream. Decompression simply runs the model in inference mode to reconstruct the exact original waveform.
  5. Evaluation: Compression ratios (original size / compressed size) are measured and compared against FLAC, a widely used lossless audio codec. Experiments span diverse datasets to ensure the results are not domain‑specific.
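The paper's exact byte layout isn't specified in this summary, but the lossless split-and-recombine idea behind Trilobyte tokenization can be sketched as follows. This is a minimal illustration assuming little-endian byte order and signed two's-complement 24-bit samples; the function names are ours, not the paper's:

```python
def samples_to_bytes(samples, bytes_per_sample=3):
    """Split each signed integer sample into its constituent bytes
    (little-endian), yielding a token stream over a fixed 256-symbol
    vocabulary regardless of bit depth."""
    stream = []
    mask = (1 << (8 * bytes_per_sample)) - 1
    for s in samples:
        u = s & mask  # two's-complement wrap to an unsigned value
        for i in range(bytes_per_sample):
            stream.append((u >> (8 * i)) & 0xFF)
    return stream

def bytes_to_samples(stream, bytes_per_sample=3):
    """Recombine byte groups into the original signed samples."""
    samples = []
    half = 1 << (8 * bytes_per_sample - 1)
    for j in range(0, len(stream), bytes_per_sample):
        u = 0
        for i in range(bytes_per_sample):
            u |= stream[j + i] << (8 * i)
        samples.append(u - 2 * half if u >= half else u)  # undo the wrap
    return samples

# Round trip covers the full signed 24-bit range, losslessly.
audio = [0, 1, -1, 8388607, -8388608]
tokens = samples_to_bytes(audio)
assert bytes_to_samples(tokens) == audio
assert all(0 <= t <= 255 for t in tokens)  # constant 256-symbol vocabulary
```

Because decoding is a pure bit-level recombination, no information is lost; the language model only ever sees bytes, so its embedding table stays the same size whether the source audio is 8-, 16-, or 24-bit.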

Results & Findings

  • 8‑bit audio: The LM‑based compressor outperforms FLAC by a noticeable margin (≈ 10–15 % better compression ratio) across all tested domains.
  • 16‑bit audio: Gains persist but shrink to around 5–8 % over FLAC, still establishing LM‑based methods as competitive.
  • 24‑bit audio: The Trilobyte tokenization makes training feasible; the resulting compressor matches FLAC’s performance, with only modest improvements (≈ 2–3 %).
  • Domain robustness: Music, speech, and bioacoustic recordings all exhibit similar trends, indicating the approach is not limited to a single type of audio content.
  • Scaling insight: As the bit depth grows, the entropy of the raw signal increases faster than the model’s capacity to capture long‑range dependencies, explaining the diminishing returns.
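The link between the training objective and the reported compression ratios can be made concrete: an arithmetic coder driven by the model's predictions approaches the model's cross-entropy, so each byte with predicted probability p costs about -log2(p) bits. A toy sketch (the numbers here are illustrative, not the paper's):

```python
import math

def ideal_compressed_bits(probs):
    """Shannon code length an arithmetic coder approaches: a byte the
    model assigns probability p costs -log2(p) bits."""
    return sum(-math.log2(p) for p in probs)

def compression_ratio(n_bytes, probs):
    """Raw size (8 bits per byte) over the model-achievable coded size."""
    return (8 * n_bytes) / ideal_compressed_bits(probs)

# A model that assigns probability 0.25 to every observed byte spends
# 2 bits/byte, i.e. a 4x ratio; a near-uniform model (p = 1/256)
# spends 8 bits/byte and compresses nothing.
confident = [0.25] * 1000
uniform = [1 / 256] * 1000
assert abs(compression_ratio(1000, confident) - 4.0) < 1e-9
assert abs(compression_ratio(1000, uniform) - 1.0) < 1e-9
```

This also makes the scaling insight intuitive: as bit depth grows, the low-order bytes look increasingly noise-like, the model's per-byte probabilities drift toward uniform, and the achievable ratio shrinks toward 1.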

Practical Implications

  • Next‑gen lossless codecs: Developers of audio streaming platforms or archival services could integrate LM‑based compressors to squeeze extra bandwidth out of 8‑ and 16‑bit legacy content, especially where storage cost is a bottleneck.
  • Unified pipelines: Since the same model architecture works for music, speech, and scientific recordings, a single service could handle heterogeneous audio streams without switching codecs.
  • Edge deployment: The byte‑level tokenization keeps the model’s input size modest, making it feasible to run inference on modern GPUs or even specialized inference accelerators at the edge (e.g., on‑device music apps).
  • Research foundation: The Trilobyte scheme opens the door for future work on neural audio compression at even higher resolutions, potentially combined with perceptual weighting or hybrid coding strategies.

Limitations & Future Work

  • Diminishing returns at high bit depth: The modest gains for 24‑bit audio suggest current autoregressive models struggle to capture the extra information density; larger or more expressive architectures may be needed.
  • Inference speed: Autoregressive generation is inherently sequential, leading to slower compression/decompression compared to block‑based codecs like FLAC. Optimizations (e.g., parallel sampling, distillation) are required for real‑time use.
  • Energy consumption: Running large transformer models can be power‑hungry, which may limit applicability on battery‑constrained devices.
  • Future directions: The authors propose exploring non‑autoregressive or diffusion‑based models, integrating perceptual loss functions to prioritize audible features, and extending the benchmark to multi‑channel (e.g., surround) audio.

Authors

  • Phillip Long
  • Zachary Novack
  • Chris Donahue

Paper Information

  • arXiv ID: 2603.08683v1
  • Categories: cs.SD, cs.AI, cs.LG, eess.AS
  • Published: March 9, 2026