[Paper] Scaling Open Discrete Audio Foundation Models with Interleaved Semantic, Acoustic, and Text Tokens

Published: February 18, 2026
5 min read
Source: arXiv - 2602.16687v1

Overview

A new study tackles a long‑standing bottleneck in audio AI: most “audio‑language” models treat sound as a sidekick to text, either grafting a text‑only LLM onto audio features or relying on purely semantic audio tokens. The authors present SODA (Scaling Open Discrete Audio), the first systematic exploration of native audio foundation models that predict the next token in a single stream of interleaved semantic, acoustic, and text tokens. Their work shows how to scale such models efficiently and demonstrates a real‑world use case: voice‑preserving speech‑to‑speech translation.

Key Contributions

  • Unified tokenization scheme that interleaves three modalities (semantic audio, raw acoustic, and text) into a single discrete sequence.
  • Comprehensive design‑space study covering data sources, text‑to‑audio mixing ratios, and token‑type compositions, yielding a reproducible training recipe.
  • First scaling‑law analysis for discrete audio models (IsoFLOP study) across 64 model‑size / data‑size combinations (≈ 3×10¹⁸ – 3×10²⁰ FLOPs).
  • Empirical rule: optimal training data volume should grow ~1.6× faster than model size for best performance.
  • SODA model suite (135 M – 4 B parameters, 500 B tokens) that matches or exceeds prior state‑of‑the‑art audio models on generation and cross‑modal tasks.
  • Proof‑of‑concept fine‑tuning for voice‑preserving speech‑to‑speech translation, showing the same backbone can handle both generation and downstream tasks without architectural changes.

Methodology

  1. Tokenization – Audio is first passed through a pretrained acoustic encoder (e.g., EnCodec) to obtain discrete acoustic tokens. A separate semantic encoder (e.g., HuBERT) extracts higher‑level “meaning” tokens. Text is tokenized with a standard byte‑pair encoding. The three streams are interleaved (e.g., acoustic‑semantic‑text‑acoustic‑…) to form a single sequence that a transformer can ingest.

  2. Model Architecture – A vanilla decoder‑only transformer (similar to GPT) predicts the next token in this mixed sequence. No modality‑specific heads are required; the model learns to attend across token types automatically.

  3. Training Recipe Exploration – The authors vary:

    • Data sources (speech corpora, environmental sounds, music, multilingual text).
    • Text‑audio mixing ratios (e.g., 30 % text, 70 % audio).
    • Token composition (how many acoustic vs. semantic tokens per time step).

    Each configuration is evaluated on a held‑out validation set using audio‑generation quality (FAD, KL‑divergence) and cross‑modal retrieval metrics.
  4. Scaling Law Study – Using the “IsoFLOP” method, they keep total FLOPs constant while varying model size vs. data size, fitting a power‑law curve to predict optimal data‑model balance.

  5. Fine‑tuning – After pretraining, the same backbone is fine‑tuned on a parallel speech‑translation dataset, with a lightweight adapter that forces the model to preserve speaker identity while altering language.
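
The interleaving in steps 1–2 can be sketched as a single shared vocabulary in which each modality occupies a disjoint ID range. The vocabulary sizes, offsets, and chunk size below are illustrative assumptions, not values from the paper:

```python
# Sketch: merge acoustic, semantic, and text token streams into one sequence
# by offsetting each modality into a disjoint ID range of a shared vocabulary.
# All sizes below are illustrative assumptions, not values from the paper.

ACOUSTIC_VOCAB = 1024   # e.g., one EnCodec codebook
SEMANTIC_VOCAB = 500    # e.g., HuBERT cluster IDs
TEXT_VOCAB = 32000      # e.g., a BPE vocabulary

SEM_OFFSET = ACOUSTIC_VOCAB                    # semantic IDs start at 1024
TEXT_OFFSET = ACOUSTIC_VOCAB + SEMANTIC_VOCAB  # text IDs start at 1524

def interleave(acoustic, semantic, text, chunk=4):
    """Merge three token streams chunk-by-chunk into the single
    discrete sequence a decoder-only transformer would consume."""
    seq, a, s, t = [], 0, 0, 0
    while a < len(acoustic) or s < len(semantic) or t < len(text):
        seq.extend(acoustic[a:a + chunk])
        a += chunk
        seq.extend(x + SEM_OFFSET for x in semantic[s:s + chunk])
        s += chunk
        seq.extend(x + TEXT_OFFSET for x in text[t:t + chunk])
        t += chunk
    return seq

merged = interleave([1, 2, 3, 4], [10, 11], [7])
```

Because every token lands in a disjoint ID range, the transformer needs no modality‑specific heads: a single softmax over the combined vocabulary covers all three streams.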

Results & Findings

| Model | Params | Training Tokens | FLOPs (≈) | Audio Generation (FAD ↓) | Text‑Audio Retrieval (Recall@1 ↑) |
| --- | --- | --- | --- | --- | --- |
| SODA‑135M | 135 M | 50 B | 3×10¹⁸ | 4.2 | 31 % |
| SODA‑1B | 1 B | 200 B | 1×10¹⁹ | 2.8 | 38 % |
| SODA‑4B | 4 B | 500 B | 3×10²⁰ | 2.1 | 45 % |
  • The empirical scaling curve matches the theoretical IsoFLOP prediction within 5 % error, confirming the 1.6× data‑vs‑model growth rule.
  • Adding semantic tokens improves downstream tasks (e.g., audio captioning) by ~12 % relative gain without hurting raw audio fidelity.
  • Fine‑tuned SODA‑4B achieves voice‑preserving speech‑to‑speech translation with MOS (Mean Opinion Score) of 4.3/5, outperforming a baseline cascade of ASR → MT → TTS (MOS 3.9).
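
The exponent behind the 1.6× rule can be recovered with an ordinary least‑squares fit on log‑log axes. The sketch below uses synthetic (model size, optimal tokens) pairs constructed to follow the reported power law; in the paper these pairs come from the IsoFLOP sweep:

```python
# Sketch: fit the data-vs-model power law D = k * N ** b on log-log axes.
# The points are synthetic, built to follow the paper's ~1.6 exponent;
# the constant k is an illustrative assumption.
import math

k = 1e-3  # illustrative constant, not from the paper
points = [(n, k * n ** 1.6) for n in (1e8, 1e9, 1e10)]

# Least-squares slope of log D against log N
xs = [math.log(n) for n, _ in points]
ys = [math.log(d) for _, d in points]
m = len(points)
mx, my = sum(xs) / m, sum(ys) / m
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
    sum((x - mx) ** 2 for x in xs)
print(f"fitted exponent b = {b:.2f}")  # recovers the 1.6 exponent exactly here
```

With real IsoFLOP measurements the fit would carry noise; the paper reports agreement with the theoretical prediction within 5 %.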

Practical Implications

  • One‑size‑fits‑all audio backbone: Developers can adopt SODA as a drop‑in model for any task that involves sound—music generation, podcast editing, environmental audio synthesis, or multimodal assistants—without building separate pipelines for acoustic and semantic processing.
  • Reduced engineering overhead: Because the model consumes a single token stream, you no longer need to stitch together separate speech‑recognition, language‑model, and vocoder components. This simplifies deployment on edge devices or cloud services.
  • Scalable recipe: The paper’s scaling law gives a concrete formula for budgeting compute vs. data. Teams can estimate how many hours of audio they need to collect to justify a larger model, avoiding over‑ or under‑training.
  • Voice preservation: The fine‑tuning experiment shows that SODA can keep speaker characteristics intact, opening doors for real‑time translation in video conferencing, dubbing, or accessibility tools.
  • Open‑source potential: The authors release the tokenizers, training scripts, and several pretrained checkpoints, enabling rapid prototyping and community‑driven extensions (e.g., adding new languages or sound categories).
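
As a minimal sketch of that budgeting rule: under D ∝ N¹·⁶, scaling the model by a factor r scales the token budget by r¹·⁶. The reference point below reuses the SODA‑1B row from the results table purely for illustration; the paper's actual fitted constants are not given in this summary:

```python
# Sketch: extrapolate a training-token budget from a reference point
# under the rule D ∝ N ** 1.6. The reference (1 B params, 200 B tokens)
# is taken from the results table for illustration only.

def optimal_tokens(params, ref_params=1e9, ref_tokens=200e9, exponent=1.6):
    """Scale a reference token budget by (N / N_ref) ** 1.6."""
    return ref_tokens * (params / ref_params) ** exponent

# Quadrupling the model implies roughly 4 ** 1.6 ≈ 9.2x more tokens
ratio = optimal_tokens(4e9) / optimal_tokens(1e9)
```

Note that the released SODA‑4B was trained on 500 B tokens, fewer than this extrapolation suggests; compute‑optimal budgets depend on the fitted constants and the available FLOP budget.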

Limitations & Future Work

  • Compute‑heavy pretraining: Even the “small” 135 M model demands substantial GPU resources to pretrain; smaller labs may need to rely on the released checkpoints.
  • Token granularity trade‑off: Interleaving three token types inflates sequence length, which can strain memory on very long audio clips. Future work could explore hierarchical or chunked attention mechanisms.
  • Domain bias: The training mix leans heavily toward speech and music; performance on niche audio (e.g., industrial machinery, wildlife) may degrade without additional data.
  • Evaluation breadth: While the paper covers generation and translation, tasks like sound event detection or audio‑driven control for robotics remain untested. Extending SODA to these domains is a natural next step.

Bottom line: SODA demonstrates that a single, scalable transformer can natively understand and generate audio at both the acoustic and semantic levels while still handling text. For developers building next‑generation voice‑first products, this work offers a practical, data‑driven roadmap to harnessing truly multimodal audio models.

Authors

  • Potsawee Manakul
  • Woody Haosheng Gan
  • Martijn Bartelds
  • Guangzhi Sun
  • William Held
  • Diyi Yang

Paper Information

  • arXiv ID: 2602.16687v1
  • Categories: cs.SD, cs.CL, eess.AS
  • Published: February 18, 2026