STABLE-AUDIO
Stable Audio 2.5 generates high-quality music and sound effects from text prompts with realistic instruments and sounds.
Avg Run Time: 15.000s
Model Slug: stable-audio-2-5-text-to-audio
API & SDK
Create a Prediction
Send a POST request to create a new prediction. This will return a prediction ID that you'll use to check the result. The request should include your model inputs and API key.
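The create step can be sketched in Python using only the standard library. Everything specific here is an assumption for illustration: the endpoint URL, the bearer-token header, and the payload field names (`model`, `input`, `prompt`, `duration`) are placeholders — check the provider's API reference for the real values.

```python
import json
import urllib.request

API_URL = "https://api.example.com/v1/predictions"  # hypothetical endpoint
API_KEY = "YOUR_API_KEY"

def build_create_request(prompt: str, duration_seconds: int = 90) -> urllib.request.Request:
    """Build the POST request that creates a prediction.

    The payload field names are illustrative; the real schema may differ.
    """
    payload = {
        "model": "stable-audio-2-5-text-to-audio",
        "input": {"prompt": prompt, "duration": duration_seconds},
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# Sending this request (urllib.request.urlopen(req)) would return a JSON body
# containing the prediction ID used in the next step.
req = build_create_request("uplifting cinematic score with lush synthesizers")
```

The response to this request carries the prediction ID you pass to the result endpoint below.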
Get Prediction Result
Poll the prediction endpoint with the prediction ID until the result is ready. The API uses long-polling: each request waits briefly for a result, so repeat the call until you receive a success status.
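The polling loop can be written once and reused. In this sketch the HTTP call is injected as a plain callable so the loop itself is testable; the terminal status names (`succeeded`, `failed`, `canceled`) are assumptions modeled on common prediction APIs, not confirmed values for this service.

```python
import time
from typing import Callable

def poll_prediction(fetch: Callable[[], dict],
                    interval: float = 1.0,
                    timeout: float = 120.0) -> dict:
    """Call `fetch` (a function that GETs the prediction by ID and returns
    its parsed JSON) until the status is terminal or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = fetch()
        if result.get("status") in ("succeeded", "failed", "canceled"):
            return result
        time.sleep(interval)
    raise TimeoutError("prediction did not finish in time")

# Demonstration with a stand-in fetcher; a real fetcher would issue the
# authenticated GET request and return response JSON.
responses = iter([{"status": "processing"}, {"status": "succeeded"}])
final = poll_prediction(lambda: next(responses), interval=0.0)
```

Injecting the fetcher also makes it easy to add per-call logging or retry logic without touching the loop.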
Readme
Overview
Stable Audio 2.5 is an advanced text-to-audio generation model developed by Stability AI, designed to produce high-quality music and sound effects from natural language prompts. The model is targeted at professional sound production and creative teams, enabling the rapid creation of complex, customizable audio content at scale. It is capable of generating realistic instrument sounds and intricate musical structures, including multi-part compositions with intros, developments, and outros.
Key features of Stable Audio 2.5 include its ability to interpret nuanced mood and genre cues, such as "uplifting" or "lush synthesizers," and to generate audio tracks up to three minutes in length within seconds. The model leverages a post-training technique called Adversarial Relativistic-Contrastive (ARC), which enhances both the speed and quality of generation. This approach allows Stable Audio 2.5 to deliver professional-grade results with low latency, making it suitable for both desktop and mobile environments. Its unique strengths lie in its rapid processing, high fidelity, and improved alignment between textual prompts and generated audio, setting it apart from earlier text-to-audio systems.
Technical Specifications
- Architecture: Diffusion-based generative model with ARC (Adversarial Relativistic-Contrastive) post-training
- Parameters: Not publicly disclosed
- Audio output: Stereo, up to three minutes in length; the compact version supports up to eleven seconds on mobile
- Input/Output formats: Text prompts as input; output is high-quality stereo audio (common formats include WAV and MP3, though specifics may vary)
- Performance metrics: Generation time under two seconds for three-minute tracks on Nvidia H100 GPUs; improved perceptual quality and text-audio alignment as measured by CLAP similarity, Production Quality (PQ), and AQAScore metrics
Key Considerations
- The model excels at generating complex musical structures and realistic instrument sounds, but prompt specificity greatly influences output quality
- For best results, use detailed prompts that specify mood, genre, instrumentation, and structure
- Overly vague or conflicting prompts may yield less coherent or generic audio
- There is a trade-off between generation speed and audio complexity; more intricate prompts may require slightly longer processing times
- Prompt engineering is crucial: clear, descriptive language leads to better alignment between text and audio
- Iterative refinement of prompts can help achieve desired results, especially for nuanced or experimental audio requests
Tips & Tricks
- Use explicit genre and mood descriptors (e.g., "cinematic orchestral score with uplifting strings and deep percussion") to guide the model toward specific styles
- Specify structure elements such as "intro," "build-up," "climax," and "outro" for multi-part compositions
- For sound effects, describe the source, environment, and intended emotional impact (e.g., "gentle rain on a tin roof, calming and ambient")
- Adjust prompt length and detail based on desired complexity; concise prompts yield simpler outputs, while detailed prompts enable richer audio
- If initial results are unsatisfactory, iteratively refine the prompt by adding or removing descriptors and re-generating
- Experiment with prompt variations to explore the model's creative range and discover unexpected audio possibilities
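The tips above amount to assembling a prompt from mood, genre, instrumentation, and structure. A minimal helper makes that composition explicit; the function and its output format are purely illustrative, not part of the model's API.

```python
def compose_prompt(genre: str, mood: str,
                   instruments: list[str],
                   structure: list[str]) -> str:
    """Assemble a detailed text prompt from the elements the tips recommend."""
    parts = [f"{mood} {genre}", "featuring " + ", ".join(instruments)]
    if structure:
        parts.append("structure: " + " -> ".join(structure))
    return "; ".join(parts)

prompt = compose_prompt(
    "orchestral score", "cinematic, uplifting",
    ["strings", "deep percussion"],
    ["intro", "build-up", "climax", "outro"],
)
# → "cinematic, uplifting orchestral score; featuring strings, deep percussion;
#    structure: intro -> build-up -> climax -> outro"
```

Keeping the elements separate also makes iterative refinement easy: vary one field at a time and regenerate.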
Capabilities
- Generates high-fidelity music and sound effects from natural language prompts
- Supports complex musical arrangements with multiple sections and transitions
- Accurately interprets mood, genre, and instrumentation cues
- Produces stereo audio suitable for professional use
- Fast generation times enable rapid prototyping and iteration
- Adaptable for both desktop and mobile environments (compact version available)
- Strong alignment between text prompts and generated audio content
What Can I Use It For?
- Professional music production for film, games, and advertising
- Rapid prototyping of soundtracks and sound effects for multimedia projects
- Creative exploration and ideation for composers, producers, and sound designers
- Generating background music for podcasts, videos, and live streams
- Personal creative projects, such as custom ringtones or ambient soundscapes
- Educational applications, including music theory demonstrations and interactive learning tools
- Industry-specific uses, such as branded audio for marketing or immersive audio for virtual environments
Things to Be Aware Of
- Some users report that highly abstract or ambiguous prompts may result in generic or less engaging audio
- The model's performance is best with well-structured, descriptive prompts; minimal prompts can lead to repetitive or uninspired outputs
- Resource requirements are significant for full-length tracks; high-end GPUs (e.g., Nvidia H100) are recommended for optimal speed
- Consistency across generations is generally strong, but minor variations may occur with repeated prompts
- Positive feedback emphasizes the model's speed, audio quality, and ability to handle complex musical requests
- Some users note occasional artifacts or unnatural transitions in highly experimental or unconventional prompts
- The compact version for mobile devices is limited to shorter audio durations and may have reduced fidelity compared to the full model
Limitations
- The model may struggle with extremely abstract, contradictory, or underspecified prompts, leading to less coherent audio
- High resource requirements for generating long, high-fidelity tracks may limit accessibility for users without powerful hardware
- Not optimized for speech synthesis or highly detailed vocal performances; best suited for instrumental music and sound effects
Pricing
Pricing Detail
This model runs at a cost of $0.20 per execution.
Pricing Type: Fixed
The cost is the same for every run: prompt length, audio duration, and generation time do not affect the price. This makes budgeting simple and predictable, since each execution costs exactly $0.20.
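Fixed per-run pricing makes cost estimation a single multiplication, as this small helper shows (the $0.20 rate comes from the pricing detail above).

```python
COST_PER_RUN = 0.20  # USD, fixed per execution

def estimate_cost(runs: int) -> float:
    """Total cost in USD for a given number of executions at the fixed rate."""
    return round(runs * COST_PER_RUN, 2)

estimate_cost(50)  # 50 generations -> 10.0 (i.e., $10.00)
```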
