Stable Audio 3

Published: May 20, 2026 at 11:10 AM EDT

Source: Hacker News

Abstract

Stable Audio 3 is a family of fast latent diffusion models (small, medium, large) for variable-length audio generation and editing. Because our models can generate several minutes of audio, variable-length generation is key to avoiding the cost of producing a full-length output for short sounds. We also support inpainting, which enables targeted audio editing and the continuation of short recordings. Our latent diffusion models operate on top of a novel semantic-acoustic autoencoder that projects audio into a compact latent space, enabling efficient diffusion-based generation while preserving audio fidelity and encouraging semantic structure in the latent space. Finally, we run adversarial post-training to both accelerate inference and improve generation quality: it reduces the number of inference steps while improving fidelity and prompt adherence. Stable Audio 3 models are trained on licensed and Creative Commons data and generate music and sounds in under 2 s on an H200 GPU and within a few seconds on a MacBook Pro M4. We release the weights of the small and medium models, which can run on consumer-grade hardware, together with their training and inference pipelines.
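To make the variable-length and inpainting ideas concrete, here is a minimal sketch of the bookkeeping involved. It is not the authors' implementation: the abstract does not state the sample rate, the autoencoder's compression factor, or how the inpainting mask is defined, so `SAMPLE_RATE`, `DOWNSAMPLE`, `latent_frames`, and `inpaint_mask` are all hypothetical names and values chosen for illustration.

```python
# Illustrative only: both constants are assumptions, not values from the paper.
SAMPLE_RATE = 44_100   # audio samples per second (assumed)
DOWNSAMPLE = 2_048     # audio samples covered by one latent frame (assumed)


def latent_frames(num_samples: int) -> int:
    """Latent frames needed to cover `num_samples` audio samples (ceiling division)."""
    return -(-num_samples // DOWNSAMPLE)


def inpaint_mask(total_samples: int, edit_start: int, edit_end: int) -> list[int]:
    """Per-frame mask: 1 = frame to regenerate, 0 = frame kept from the source audio.

    `edit_start` and `edit_end` are sample offsets; every latent frame that
    overlaps the half-open range [edit_start, edit_end) is marked for regeneration.
    """
    n = latent_frames(total_samples)
    lo = edit_start // DOWNSAMPLE
    hi = min(n, latent_frames(edit_end))
    return [1 if lo <= i < hi else 0 for i in range(n)]


# Variable length: a 2 s sound needs far fewer latent frames than a 3 min track.
short_len = latent_frames(2 * SAMPLE_RATE)
full_len = latent_frames(180 * SAMPLE_RATE)

# Continuation: keep the first 4 frames of a recording, generate the next 6.
continuation = inpaint_mask(10 * DOWNSAMPLE, 4 * DOWNSAMPLE, 10 * DOWNSAMPLE)
```

Under these assumed constants, the short request occupies roughly 1% of the latent frames of the three-minute one, which is the saving that variable-length generation targets. Continuation is the special case of inpainting in which the masked region is everything after the recorded prefix.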

Resources

View PDF
HTML (experimental)
Training code: GitHub
Inference and weights: GitHub

Subjects

Sound (cs.SD)
Artificial Intelligence (cs.AI)

Citation

arXiv: 2605.17991 (cs.SD)

Submission history

v1 – Mon, 18 May 2026 07:47:03 UTC (67 KB) – submitted by Jordi Pons
