stable-audio-2-5-text-to-audio


From text prompts, Stable Audio 2.5 generates high-quality music and sound effects with realistic instruments and sounds.

Avg Run Time: ~15s

Model Slug: stable-audio-2-5-text-to-audio


Each execution costs $0.20. With $1 you can run this model about 5 times.

API & SDK

Create a Prediction

Send a POST request to create a new prediction. This will return a prediction ID that you'll use to check the result. The request should include your model inputs and API key.
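A minimal sketch of the create call in Python, assuming conventional REST semantics. The base URL, auth header name, and request/response field names below (`BASE_URL`, `X-API-Key`, `predictionID`) are illustrative assumptions, not confirmed API details; check the Eachlabs API reference for the exact values.

```python
import requests

API_KEY = "your-api-key"                 # your Eachlabs API key
BASE_URL = "https://api.eachlabs.ai/v1"  # assumed base URL; confirm in the API reference

# Assumed payload shape: the model slug plus a dict of model inputs.
payload = {
    "model": "stable-audio-2-5-text-to-audio",
    "input": {
        "prompt": "cinematic orchestral score with uplifting strings and deep percussion",
    },
}

resp = requests.post(
    f"{BASE_URL}/prediction/",
    json=payload,
    headers={"X-API-Key": API_KEY},  # assumed auth header name
)
resp.raise_for_status()
prediction_id = resp.json()["predictionID"]  # assumed response field name
print(prediction_id)
```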

Get Prediction Result

Poll the prediction endpoint with the prediction ID until the result is ready. Results are returned asynchronously, so you'll need to check repeatedly until you receive a success status.
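A matching polling loop, continuing the same assumptions; the endpoint shape, header name, and status values ("success" / "error") are illustrative, not confirmed:

```python
import time

import requests

API_KEY = "your-api-key"                 # same key as in the create step
BASE_URL = "https://api.eachlabs.ai/v1"  # assumed base URL; confirm in the API reference

def wait_for_result(prediction_id: str, poll_interval: float = 2.0) -> dict:
    """Poll the prediction endpoint (assumed URL shape) until it succeeds or fails."""
    while True:
        resp = requests.get(
            f"{BASE_URL}/prediction/{prediction_id}",
            headers={"X-API-Key": API_KEY},  # assumed auth header name
        )
        resp.raise_for_status()
        result = resp.json()
        status = result.get("status")  # assumed status values: "success" / "error"
        if status == "success":
            return result
        if status == "error":
            raise RuntimeError(f"prediction failed: {result}")
        time.sleep(poll_interval)  # avg run time is ~15s, so a short interval is fine

result = wait_for_result("your-prediction-id")  # ID returned by the create call
print(result.get("output"))  # assumed field holding the generated audio URL
```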

Readme

Table of Contents
Overview
Technical Specifications
Key Considerations
Tips & Tricks
Capabilities
What Can I Use It For?
Things to Be Aware Of
Limitations

Overview

Stable Audio 2.5 is an advanced text-to-audio generation model developed by Stability AI, designed to produce high-quality music and sound effects from natural language prompts. The model is targeted at professional sound production and creative teams, enabling the rapid creation of complex, customizable audio content at scale. It is capable of generating realistic instrument sounds and intricate musical structures, including multi-part compositions with intros, developments, and outros.

Key features of Stable Audio 2.5 include its ability to interpret nuanced mood and genre cues, such as "uplifting" or "lush synthesizers," and to generate audio tracks up to three minutes in length within seconds. The model leverages a post-training technique called Adversarial Relativistic-Contrastive (ARC), which enhances both the speed and quality of generation. This approach allows Stable Audio 2.5 to deliver professional-grade results with low latency, making it suitable for both desktop and mobile environments. Its unique strengths lie in its rapid processing, high fidelity, and improved alignment between textual prompts and generated audio, setting it apart from earlier text-to-audio systems.

Technical Specifications

  • Architecture: Diffusion-based generative model with ARC (Adversarial Relativistic-Contrastive) post-training
  • Parameters: Not publicly disclosed
  • Output: Stereo audio, up to three minutes in length; the compact version supports up to eleven seconds on mobile
  • Input/Output formats: Text prompts as input; output is high-quality stereo audio (common formats include WAV and MP3, though specifics may vary)
  • Performance metrics: Generation time under two seconds for three-minute tracks on Nvidia H100 GPUs; improved perceptual quality and text-audio alignment as measured by CLAP similarity, Production Quality (PQ), and AQAScore metrics

Key Considerations

  • The model excels at generating complex musical structures and realistic instrument sounds, but prompt specificity greatly influences output quality
  • For best results, use detailed prompts that specify mood, genre, instrumentation, and structure
  • Overly vague or conflicting prompts may yield less coherent or generic audio
  • There is a trade-off between generation speed and audio complexity; more intricate prompts may require slightly longer processing times
  • Prompt engineering is crucial: clear, descriptive language leads to better alignment between text and audio
  • Iterative refinement of prompts can help achieve desired results, especially for nuanced or experimental audio requests

Tips & Tricks

  • Use explicit genre and mood descriptors (e.g., "cinematic orchestral score with uplifting strings and deep percussion") to guide the model toward specific styles; a payload sketch follows this list
  • Specify structure elements such as "intro," "build-up," "climax," and "outro" for multi-part compositions
  • For sound effects, describe the source, environment, and intended emotional impact (e.g., "gentle rain on a tin roof, calming and ambient")
  • Adjust prompt length and detail based on desired complexity; concise prompts yield simpler outputs, while detailed prompts enable richer audio
  • If initial results are unsatisfactory, iteratively refine the prompt by adding or removing descriptors and re-generating
  • Experiment with prompt variations to explore the model's creative range and discover unexpected audio possibilities
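To make these tips concrete, here is a sketch of how a structured prompt might be assembled into the model input. The parameter names (prompt, duration) are illustrative assumptions; the playground's Input panel lists the parameters this model actually accepts.

```python
# Illustrative only: compose a structured prompt from mood, genre,
# instrumentation, and structure cues, as recommended above.
mood = "uplifting, cinematic"
genre = "orchestral score"
instrumentation = "lush strings, deep percussion, soft brass"
structure = "gentle intro, gradual build-up, soaring climax, quiet outro"

prompt = f"{mood} {genre} with {instrumentation}; {structure}"

model_input = {
    "prompt": prompt,  # assumed parameter name
    "duration": 180,   # assumed parameter: target length in seconds (up to 3 minutes)
}
```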

Capabilities

  • Generates high-fidelity music and sound effects from natural language prompts
  • Supports complex musical arrangements with multiple sections and transitions
  • Accurately interprets mood, genre, and instrumentation cues
  • Produces stereo audio suitable for professional use
  • Fast generation times enable rapid prototyping and iteration
  • Adaptable for both desktop and mobile environments (compact version available)
  • Strong alignment between text prompts and generated audio content

What Can I Use It For?

  • Professional music production for film, games, and advertising
  • Rapid prototyping of soundtracks and sound effects for multimedia projects
  • Creative exploration and ideation for composers, producers, and sound designers
  • Generating background music for podcasts, videos, and live streams
  • Personal creative projects, such as custom ringtones or ambient soundscapes
  • Educational applications, including music theory demonstrations and interactive learning tools
  • Industry-specific uses, such as branded audio for marketing or immersive audio for virtual environments

Things to Be Aware Of

  • Some users report that highly abstract or ambiguous prompts may result in generic or less engaging audio
  • The model's performance is best with well-structured, descriptive prompts; minimal prompts can lead to repetitive or uninspired outputs
  • Resource requirements are significant for full-length tracks; high-end GPUs (e.g., Nvidia H100) are recommended for optimal speed
  • Consistency across generations is generally strong, but minor variations may occur with repeated prompts
  • Positive feedback emphasizes the model's speed, audio quality, and ability to handle complex musical requests
  • Some users note occasional artifacts or unnatural transitions in highly experimental or unconventional prompts
  • The compact version for mobile devices is limited to shorter audio durations and may have reduced fidelity compared to the full model

Limitations

  • The model may struggle with extremely abstract, contradictory, or underspecified prompts, leading to less coherent audio
  • High resource requirements for generating long, high-fidelity tracks may limit accessibility for users without powerful hardware
  • Not optimized for speech synthesis or highly detailed vocal performances; best suited for instrumental music and sound effects

Pricing

Pricing Detail

This model runs at a cost of $0.20 per execution.

Pricing Type: Fixed

The cost remains the same regardless of your inputs or how long the run takes. There are no variables affecting the price; it is a set, fixed amount per run, as the name suggests. This makes budgeting simple and predictable because you pay the same fee every time you execute the model.
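Budgeting is therefore a straight division; a trivial sketch:

```python
COST_CENTS_PER_RUN = 20  # $0.20 per execution, in cents to avoid float rounding error

def runs_for_budget(budget_usd: float) -> int:
    """How many executions a given budget covers at the fixed per-run price."""
    return int(budget_usd * 100) // COST_CENTS_PER_RUN

print(runs_for_budget(1))   # 5, matching the "about 5 runs per $1" note above
print(runs_for_budget(10))  # 50
```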