Stability AI brings new clarity and power to gen AI audio with Stable Audio 2.0

Stability AI is continuing to push forward its vision for generative AI with today's release of its Stable Audio 2.0 audio model.

Stability AI is perhaps best known for its text-to-image Stable Diffusion models, but that’s only one of many models the company has been working on. Stable Audio had its initial release in Sept. 2023, introducing the ability for users to generate short audio clips with a simple text prompt. With Stable Audio 2.0, users can generate high-quality audio tracks of up to 3 minutes, double the 90 seconds the initial Stable Audio release enabled.

In addition to supporting text-to-audio, Stable Audio 2.0 will also support audio-to-audio generation, where users upload a sample they want to use as a prompt. Stability AI is making Stable Audio free for limited use on the Stable Audio website, with API access available soon so developers can build services.

The new Stable Audio 2.0 release is the first major model drop from Stability AI since the company’s former CEO and founder Emad Mostaque abruptly resigned at the end of March. According to the company, it’s still very much business as usual and the Stable Audio 2.0 update is a testament to that.

Lessons learned from Stable Audio 1.0 informed version 2.0

Stability AI iterated on its initial experience of developing Stable Audio in 2023.

Zach Evans, head of audio research at Stability AI, told VentureBeat that for the initial release of Stable Audio 1.0, the focus was on launching a groundbreaking text-to-audio generative model with exceptional audio fidelity and a meaningful output duration.

“Since the initial release, we have dedicated ourselves to advancing its musicality, extending the output duration, and honing its ability to respond accurately to detailed prompts,” Evans said. “These improvements are aimed at optimizing the technology for practical, real-world applications.”

Stable Audio 2.0 introduces the ability to produce complete musical tracks with coherent musical structure. Using latent diffusion technology, the model can generate compositions up to 3 minutes long containing distinct intro, development and outro sections. This is an advancement over the prior Stable Audio release, which could only create short loops or fragments rather than full-length songs.

Looking at the machine learning (ML) science behind Stable Audio 2.0, the model still relies on what is known as a latent diffusion model (LDM). Evans explained that since the Stable Audio 1.1 beta update released in December, Stable Audio has had a transformer backbone, making it what he referred to as a "diffusion transformer" model.

“We also increased the amount of data compression we apply to the audio data during training, allowing us to scale the model outputs to three minutes and beyond while maintaining reasonable inference times,” Evans said.
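The interplay Evans describes, diffusion over compressed latents rather than raw samples, can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the `denoiser` stand-in, the 100-step schedule, and the latent shapes are toy placeholders, not Stability AI's architecture or code. The point is simply that heavier compression shrinks the latent sequence the model must denoise, which is what keeps inference times reasonable as output length grows.

```python
import numpy as np

rng = np.random.default_rng(0)
# Compressed latent sequence (illustrative shapes): a few dozen latent steps
# stand in for minutes of audio that would otherwise be millions of samples.
LATENT_STEPS, LATENT_DIM = 64, 16

def denoiser(z, t):
    # Toy stand-in for the diffusion transformer: the real model is a
    # transformer that predicts the noise in z at timestep t; here we use
    # a fixed linear map so the sketch stays self-contained and runnable.
    return 0.9 * z

def sample(num_steps=100):
    # Start from pure Gaussian noise in the compressed latent space and
    # iteratively denoise it (a heavily simplified update rule).
    z = rng.standard_normal((LATENT_STEPS, LATENT_DIM))
    for t in reversed(range(num_steps)):
        eps = denoiser(z, t)
        z = z - eps / num_steps
    return z  # in a real system, an autoencoder decodes this to a waveform

latents = sample()
print(latents.shape)  # (64, 16)
```

Because the loop runs over 64 latent steps instead of raw audio samples, doubling the audio duration only doubles a small latent sequence, which is the scaling property Evans alludes to.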

Transforming audio samples with text prompts

In addition to generating audio from text prompts, Stable Audio 2.0 enables audio-to-audio transitions. 

Users can upload audio samples and use natural language instructions to transform the sounds into new variations. This opens up creative workflows like iteratively refining and editing audio by providing textual guidance.

Stable Audio 2.0 also significantly increases the range of sound effects and textures that can be produced through AI generation. Users can prompt the system to generate immersive environments, ambient textures, crowds, cityscapes and more. The model also allows modifying the style and tone of generated or uploaded audio samples.
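One common way latent diffusion models support this kind of sample-to-sample transformation is the "img2img"-style technique: partially noise the encoding of the uploaded audio, then denoise it under the new text prompt. The sketch below is a hypothetical illustration of that general technique; the `encode` and `denoise` functions are toy stand-ins, and nothing here is confirmed to match Stability AI's actual pipeline.

```python
import numpy as np

rng = np.random.default_rng(1)

def encode(audio):
    # Stand-in for the autoencoder that maps a waveform to compressed latents.
    return audio.reshape(64, 16)

def denoise(z, text_prompt, steps):
    # Stand-in for a text-conditioned denoiser; just shrinks toward zero here.
    for _ in range(steps):
        z = 0.99 * z
    return z

def audio_to_audio(audio, text_prompt, strength=0.5, num_steps=100):
    # Partially corrupt the source latents, then denoise under the prompt.
    # Lower strength keeps more of the uploaded sample; higher strength
    # lets the text prompt reshape it more freely. (Hypothetical sketch.)
    z = encode(audio)
    t = int(strength * num_steps)
    noise = rng.standard_normal(z.shape)
    z_noisy = (1 - strength) * z + strength * noise
    return denoise(z_noisy, text_prompt, steps=t)

out = audio_to_audio(rng.standard_normal(64 * 16), "lo-fi drum loop")
print(out.shape)  # (64, 16)
```

The `strength` knob is what makes iterative refinement possible: a user can apply small, text-guided nudges repeatedly rather than regenerating from scratch.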

An ongoing concern across the gen AI landscape is the proper use of source material to train models.

Stability AI has prioritized intellectual property protections with its new audio model. To address copyright concerns, Stable Audio 2.0 was trained exclusively on licensed data from AudioSparx, with opt-out requests honored. Audio uploads are monitored using content recognition to prevent copyrighted material from being processed.
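At its core, content recognition means comparing a compact fingerprint of an upload against a registry of protected works. The sketch below is a deliberately toy version of that idea (hashing the strongest frequency bin per frame); real fingerprinting systems are far more robust to pitch shifts and noise, and this is not a description of the specific system Stable Audio uses.

```python
import hashlib
import numpy as np

def fingerprint(audio, frame=256):
    # Toy content-recognition fingerprint: hash the index of the strongest
    # frequency bin in each frame. Illustrative only; production systems
    # use robust landmark-based matching rather than exact hashes.
    frames = audio[: len(audio) // frame * frame].reshape(-1, frame)
    peaks = np.abs(np.fft.rfft(frames, axis=1)).argmax(axis=1)
    return hashlib.sha256(peaks.tobytes()).hexdigest()

# Registry of fingerprints for protected works (hypothetical data).
rng = np.random.default_rng(2)
protected = rng.standard_normal(4096)
registry = {fingerprint(protected)}

print(fingerprint(protected) in registry)                  # True: flagged
print(fingerprint(rng.standard_normal(4096)) in registry)  # False: passes
```

An upload whose fingerprint matches the registry would be blocked before it ever reaches the model, which is the safeguard the paragraph above describes.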

Protecting copyright is critical to making sure that Stability AI can commercialize Stable Audio and the technology can be used safely by organizations. Stable Audio is currently monetized through subscriptions to the Stable Audio web application and will soon be available on the Stable Audio API.

Stable Audio is not, however, an open model, at least not yet.

“The weights for Stable Audio 2.0 will not be available for download; however, we’re working on open audio models to be released later in the year,” Evans said.