STABLE-AUDIO
Stable Audio 2.5 generates high-quality music and sound effects from text prompts with realistic instruments and sounds.
Avg Run Time: 15.000s
Model Slug: stable-audio-2-5-text-to-audio
API & SDK
Create a Prediction
Send a POST request to create a new prediction. This will return a prediction ID that you'll use to check the result. The request should include your model inputs and API key.
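The create step can be sketched in Python using only the standard library. Everything specific here is an assumption for illustration: the endpoint URL, the bearer-token header, and the payload field names (`model`, `input`, `prompt`, `duration`) are placeholders — check the provider's API reference for the real values.

```python
import json
import urllib.request

API_URL = "https://api.example.com/v1/predictions"  # hypothetical endpoint
API_KEY = "YOUR_API_KEY"

def build_create_request(prompt: str, duration_seconds: int = 90) -> urllib.request.Request:
    """Build the POST request that creates a prediction.

    The payload field names are illustrative; the real schema may differ.
    """
    payload = {
        "model": "stable-audio-2-5-text-to-audio",
        "input": {"prompt": prompt, "duration": duration_seconds},
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# Sending this request (urllib.request.urlopen(req)) would return a JSON body
# containing the prediction ID used in the next step.
req = build_create_request("uplifting cinematic score with lush synthesizers")
```

The response to this request carries the prediction ID you pass to the result endpoint below.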
Get Prediction Result
Poll the prediction endpoint with the prediction ID until the result is ready. The API uses long-polling: each request waits briefly for a result, so repeat the call until you receive a success status.
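The polling loop can be written once and reused. In this sketch the HTTP call is injected as a plain callable so the loop itself is testable; the terminal status names (`succeeded`, `failed`, `canceled`) are assumptions modeled on common prediction APIs, not confirmed values for this service.

```python
import time
from typing import Callable

def poll_prediction(fetch: Callable[[], dict],
                    interval: float = 1.0,
                    timeout: float = 120.0) -> dict:
    """Call `fetch` (a function that GETs the prediction by ID and returns
    its parsed JSON) until the status is terminal or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = fetch()
        if result.get("status") in ("succeeded", "failed", "canceled"):
            return result
        time.sleep(interval)
    raise TimeoutError("prediction did not finish in time")

# Demonstration with a stand-in fetcher; a real fetcher would issue the
# authenticated GET request and return response JSON.
responses = iter([{"status": "processing"}, {"status": "succeeded"}])
final = poll_prediction(lambda: next(responses), interval=0.0)
```

Injecting the fetcher also makes it easy to add per-call logging or retry logic without touching the loop.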
Readme
Overview
Stable Audio 2.5 is an advanced text-to-audio generation model developed by Stability AI, designed to produce high-quality music and sound effects from natural language prompts. The model is targeted at professional sound production and creative teams, enabling the rapid creation of complex, customizable audio content at scale. It is capable of generating realistic instrument sounds and intricate musical structures, including multi-part compositions with intros, developments, and outros.
Key features of Stable Audio 2.5 include its ability to interpret nuanced mood and genre cues, such as "uplifting" or "lush synthesizers," and to generate audio tracks up to three minutes in length within seconds. The model leverages a post-training technique called Adversarial Relativistic-Contrastive (ARC), which enhances both the speed and quality of generation. This approach allows Stable Audio 2.5 to deliver professional-grade results with low latency, making it suitable for both desktop and mobile environments. Its unique strengths lie in its rapid processing, high fidelity, and improved alignment between textual prompts and generated audio, setting it apart from earlier text-to-audio systems.
Technical Specifications
- Architecture: Diffusion-based generative model with ARC (Adversarial Relativistic-Contrastive) post-training
- Parameters: Not publicly disclosed
- Audio output: Stereo, up to three minutes in length; the compact version supports up to eleven seconds on mobile
- Input/Output formats: Text prompts as input; output is high-quality stereo audio (common formats include WAV and MP3, though specifics may vary)
- Performance metrics: Generation time under two seconds for three-minute tracks on Nvidia H100 GPUs; improved perceptual quality and text-audio alignment as measured by CLAP similarity, Production Quality (PQ), and AQAScore metrics
Key Considerations
- The model excels at generating complex musical structures and realistic instrument sounds, but prompt specificity greatly influences output quality
- For best results, use detailed prompts that specify mood, genre, instrumentation, and structure
- Overly vague or conflicting prompts may yield less coherent or generic audio
- There is a trade-off between generation speed and audio complexity; more intricate prompts may require slightly longer processing times
- Prompt engineering is crucial: clear, descriptive language leads to better alignment between text and audio
- Iterative refinement of prompts can help achieve desired results, especially for nuanced or experimental audio requests
Tips & Tricks
- Use explicit genre and mood descriptors (e.g., "cinematic orchestral score with uplifting strings and deep percussion") to guide the model toward specific styles
- Specify structure elements such as "intro," "build-up," "climax," and "outro" for multi-part compositions
- For sound effects, describe the source, environment, and intended emotional impact (e.g., "gentle rain on a tin roof, calming and ambient")
- Adjust prompt length and detail based on desired complexity; concise prompts yield simpler outputs, while detailed prompts enable richer audio
- If initial results are unsatisfactory, iteratively refine the prompt by adding or removing descriptors and re-generating
- Experiment with prompt variations to explore the model's creative range and discover unexpected audio possibilities
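The tips above amount to assembling a prompt from mood, genre, instrumentation, and structure. A minimal helper makes that composition explicit; the function and its output format are purely illustrative, not part of the model's API.

```python
def compose_prompt(genre: str, mood: str,
                   instruments: list[str],
                   structure: list[str]) -> str:
    """Assemble a detailed text prompt from the elements the tips recommend."""
    parts = [f"{mood} {genre}", "featuring " + ", ".join(instruments)]
    if structure:
        parts.append("structure: " + " -> ".join(structure))
    return "; ".join(parts)

prompt = compose_prompt(
    "orchestral score", "cinematic, uplifting",
    ["strings", "deep percussion"],
    ["intro", "build-up", "climax", "outro"],
)
# → "cinematic, uplifting orchestral score; featuring strings, deep percussion;
#    structure: intro -> build-up -> climax -> outro"
```

Keeping the elements separate also makes iterative refinement easy: vary one field at a time and regenerate.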
Capabilities
- Generates high-fidelity music and sound effects from natural language prompts
- Supports complex musical arrangements with multiple sections and transitions
- Accurately interprets mood, genre, and instrumentation cues
- Produces stereo audio suitable for professional use
- Fast generation times enable rapid prototyping and iteration
- Adaptable for both desktop and mobile environments (compact version available)
- Strong alignment between text prompts and generated audio content
What Can I Use It For?
- Professional music production for film, games, and advertising
- Rapid prototyping of soundtracks and sound effects for multimedia projects
- Creative exploration and ideation for composers, producers, and sound designers
- Generating background music for podcasts, videos, and live streams
- Personal creative projects, such as custom ringtones or ambient soundscapes
- Educational applications, including music theory demonstrations and interactive learning tools
- Industry-specific uses, such as branded audio for marketing or immersive audio for virtual environments
Things to Be Aware Of
- Some users report that highly abstract or ambiguous prompts may result in generic or less engaging audio
- The model's performance is best with well-structured, descriptive prompts; minimal prompts can lead to repetitive or uninspired outputs
- Resource requirements are significant for full-length tracks; high-end GPUs (e.g., Nvidia H100) are recommended for optimal speed
- Consistency across generations is generally strong, but minor variations may occur with repeated prompts
- Positive feedback emphasizes the model's speed, audio quality, and ability to handle complex musical requests
- Some users note occasional artifacts or unnatural transitions in highly experimental or unconventional prompts
- The compact version for mobile devices is limited to shorter audio durations and may have reduced fidelity compared to the full model
Limitations
- The model may struggle with extremely abstract, contradictory, or underspecified prompts, leading to less coherent audio
- High resource requirements for generating long, high-fidelity tracks may limit accessibility for users without powerful hardware
- Not optimized for speech synthesis or highly detailed vocal performances; best suited for instrumental music and sound effects
Pricing
Pricing Detail
This model runs at a cost of $0.20 per execution.
Pricing Type: Fixed
The cost is the same for every run: prompt length, audio duration, and generation time do not affect the price. This makes budgeting simple and predictable, since each execution costs exactly $0.20.
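Fixed per-run pricing makes cost estimation a single multiplication, as this small helper shows (the $0.20 rate comes from the pricing detail above).

```python
COST_PER_RUN = 0.20  # USD, fixed per execution

def estimate_cost(runs: int) -> float:
    """Total cost in USD for a given number of executions at the fixed rate."""
    return round(runs * COST_PER_RUN, 2)

estimate_cost(50)  # 50 generations -> 10.0 (i.e., $10.00)
```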
