stability/stable-audio
Generate music and sound effects with Stability AI's Stable Audio.
stable-audio by Stability — AI Model Family
Stable Audio by Stability AI is a cutting-edge family of generative AI models designed to create high-quality music, sound effects, and audio from text descriptions or input audio. This family addresses the challenge of producing professional-grade audio content quickly and accessibly, empowering creators to generate cinematic soundscapes, tracks, and effects without traditional production tools. Trained on nearly 800,000 labeled audio files from sources like AudioSparx, it excels in text-to-audio synthesis, delivering coherent musical compositions and realistic sound design. The family includes three key models under Stable Audio 2.5: Text to Audio (Text to Voice) for generating audio from prompts, Audio to Audio (Voice to Voice) for transforming existing audio, and Inpaint (Voice to Voice) for precise audio editing and extension.
stable-audio Capabilities and Use Cases
The Stable Audio family offers versatile capabilities across text-to-audio generation, audio transformation, and inpainting, supporting a range of creative workflows from music production to sound design.
- Stable Audio 2.5 | Text to Audio (Text to Voice): This model converts natural language prompts into full music tracks or sound effects, producing high-fidelity audio with coherent structure and instrumentation. Ideal for rapid prototyping, use it to generate "a dreamy ambient track with finger-picked electric guitar, reverb, and subtle delay at 90 BPM" as an atmospheric background for videos or games (see the sketch after this list). It outputs high-quality waveforms at a 44.1 kHz sampling rate for professional results.
- Stable Audio 2.5 | Audio to Audio (Voice to Voice): This model transforms input audio by applying new styles, moods, or instruments while preserving core elements like key, BPM, and harmony. Content creators use it for remixing stems: feed in a basic melody and request an "upbeat orchestral version with strings and brass dynamics" to evolve a piece of music without starting from scratch.
- Stable Audio 2.5 | Inpaint (Voice to Voice): This model fills gaps or extends audio seamlessly, maintaining consistency in timbre and flow. Filmmakers apply it to extend sound effects, such as inpainting missing sections of a Foley track to match "cinematic explosion with spatial reverberation and spectral fusion," ensuring polished continuity.
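As an illustration of the Text to Audio workflow above, here is a minimal sketch of a generation request. The endpoint path, authentication header, model identifier, and payload fields are assumptions made for illustration, not the documented each::labs API; consult the official docs for the exact names.

```python
# Hypothetical sketch of a Text to Audio call; endpoint, headers, and field
# names are illustrative assumptions, not the documented each::labs API.
import os
import requests

API_URL = "https://api.eachlabs.ai/v1/predictions"       # hypothetical endpoint
HEADERS = {"X-API-Key": os.environ["EACHLABS_API_KEY"]}   # hypothetical auth header

payload = {
    "model": "stable-audio-2.5-text-to-audio",  # hypothetical model identifier
    "input": {
        "prompt": ("A dreamy ambient track with finger-picked electric guitar, "
                   "reverb, and subtle delay at 90 BPM"),
        "duration_seconds": 30,  # hypothetical parameter
    },
}

resp = requests.post(API_URL, json=payload, headers=HEADERS, timeout=60)
resp.raise_for_status()
audio_url = resp.json().get("output")  # assumed: response carries an audio URL
print("Generated track:", audio_url)
```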
These models integrate into pipelines for end-to-end production: start with Text to Audio for a base track, refine with Audio to Audio for stylistic variations, and use Inpaint for final tweaks. Technical specs include 44.1 kHz sampling, long-sequence generation that stays coherent across full-length tracks, and latent-space processing via a specialized VAE for efficient, high-fidelity reconstruction. Output formats focus on waveforms suitable for stems, music, and effects.
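That pipeline can be expressed as three chained calls, where each step's output audio feeds the next. The sketch below builds on the same assumed endpoint as the previous example; the model names, field names, and response shape are hypothetical.

```python
# Hypothetical end-to-end pipeline sketch: Text to Audio -> Audio to Audio ->
# Inpaint. Endpoint, model names, and fields are illustrative assumptions.
import os
import requests

API_URL = "https://api.eachlabs.ai/v1/predictions"       # hypothetical endpoint
HEADERS = {"X-API-Key": os.environ["EACHLABS_API_KEY"]}   # hypothetical auth header

def run_model(model: str, inputs: dict) -> str:
    """Submit one generation request and return the output audio URL (assumed shape)."""
    resp = requests.post(API_URL, json={"model": model, "input": inputs},
                         headers=HEADERS, timeout=120)
    resp.raise_for_status()
    return resp.json()["output"]

# 1. Base track from a prompt.
base = run_model("stable-audio-2.5-text-to-audio",
                 {"prompt": "minimal piano motif at 90 BPM"})

# 2. Stylistic variation that keeps key and tempo.
remix = run_model("stable-audio-2.5-audio-to-audio",
                  {"audio_url": base,
                   "prompt": "upbeat orchestral version with strings and brass"})

# 3. Final tweak: regenerate a short masked region for continuity.
final = run_model("stable-audio-2.5-inpaint",
                  {"audio_url": remix, "mask_start": 20.0, "mask_end": 24.0,
                   "prompt": "smooth transition, matching timbre"})
print("Final track:", final)
```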
What Makes stable-audio Stand Out
Stable Audio distinguishes itself through superior audio quality and structural coherence, outperforming peers in perceptual fidelity and naturalness. Its VAE-based architecture, operating at 44.1 kHz with a 64-dimensional latent space, reconstructs high-fidelity waveforms even from lossy inputs, achieving strong scores on metrics such as STOI (intelligibility), WB-PESQ (perceptual quality), and FAD (distribution fidelity, where lower is better). Unlike earlier text-to-music tools with weak conditioning, Stable Audio delivers cinematic-grade output with precise control over elements like timbre, spatial reverberation, and harmonic structure.
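For context on the FAD figure cited above, here is a minimal sketch of how Fréchet Audio Distance is computed: a Fréchet distance between Gaussians fitted to embeddings of reference and generated audio. Extracting the embeddings (e.g., with a pretrained audio encoder) is assumed to happen upstream.

```python
# Minimal sketch of Fréchet Audio Distance (FAD) from precomputed embeddings.
# Embedding extraction is assumed upstream; inputs are (n_samples, dim) arrays.
import numpy as np
from scipy.linalg import sqrtm

def frechet_audio_distance(ref_emb: np.ndarray, gen_emb: np.ndarray) -> float:
    mu_r, mu_g = ref_emb.mean(axis=0), gen_emb.mean(axis=0)
    sigma_r = np.cov(ref_emb, rowvar=False)
    sigma_g = np.cov(gen_emb, rowvar=False)
    covmean = sqrtm(sigma_r @ sigma_g)
    if np.iscomplexobj(covmean):  # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    # ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^(1/2)); lower is better.
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))
```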
Key strengths include speed and consistency in generating full tracks from prompts, robustness to complex descriptions (e.g., multi-instrument arrangements with dynamics like pizzicato or legato), and extensibility to stems and effects. Its Transformer-based backbone handles long-context modeling effectively, avoiding discontinuities in extended audio. This makes it ideal for music producers, game audio designers, filmmakers, and podcasters seeking professional results without a studio, offering granular control via natural language while maintaining acoustic richness in mixing, effects, and atmosphere.
Access stable-audio Models via each::labs API
each::labs is the premier platform for integrating the full stable-audio family through a unified API, giving developers and creators instant access to all Stable Audio 2.5 models (Text to Audio, Audio to Audio, and Inpaint) behind one seamless endpoint. Experiment in the interactive Playground to test prompts and preview results, or deploy at scale with the SDK, which supports custom pipelines and high-volume generation. Sign up at eachlabs.ai to explore the full stable-audio model family.
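For longer generations, production integrations typically submit a job and poll for completion rather than blocking on one request. The snippet below sketches that pattern under the same assumed endpoint as the earlier examples; the prediction id and status fields are hypothetical, and the real SDK may handle this for you.

```python
# Hypothetical polling sketch for asynchronous generation; the endpoint,
# prediction id, and status fields are assumptions, not the documented API.
import os
import time
import requests

API_URL = "https://api.eachlabs.ai/v1/predictions"       # hypothetical endpoint
HEADERS = {"X-API-Key": os.environ["EACHLABS_API_KEY"]}   # hypothetical auth header

resp = requests.post(API_URL, headers=HEADERS, timeout=60, json={
    "model": "stable-audio-2.5-text-to-audio",  # hypothetical model identifier
    "input": {"prompt": "cinematic explosion with spatial reverberation"},
})
resp.raise_for_status()
prediction_id = resp.json()["id"]  # assumed: the API returns a prediction id

while True:  # poll until the job settles
    status = requests.get(f"{API_URL}/{prediction_id}",
                          headers=HEADERS, timeout=60).json()
    if status.get("status") in ("succeeded", "failed"):
        break
    time.sleep(2)

print(status.get("status"), status.get("output"))
```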