STABLE-AUDIO
Stable Audio 2.5 Inpaint allows you to edit and replace parts of existing audio using text prompts, making it easy to refine or transform music and sound effects with high-quality results.
Avg Run Time: 15.000s
Model Slug: stable-audio-2-5-inpaint
Playground
Input
Output
Example Result
Preview and download your result.
API & SDK
Create a Prediction
Send a POST request to create a new prediction. This will return a prediction ID that you'll use to check the result. The request should include your model inputs and API key.
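The request flow described above can be sketched in Python using only the standard library. The endpoint URL, input field names (`audio`, `prompt`, `mask_start`, `mask_end`), and response schema below are illustrative assumptions, not the platform's documented API; consult the API reference for the exact values.

```python
import json
import urllib.request

# Hypothetical endpoint; replace with the platform's documented URL.
API_URL = "https://api.example.com/v1/predictions"

def build_request(api_key, audio_url, prompt, mask_start, mask_end):
    """Assemble the JSON payload and headers for a prediction request.

    The input field names here are placeholders for illustration;
    check the model's API documentation for the real schema.
    """
    payload = {
        "model": "stable-audio-2-5-inpaint",
        "input": {
            "audio": audio_url,        # URL of the source audio to edit
            "prompt": prompt,          # text instruction guiding the inpaint
            "mask_start": mask_start,  # start of the segment to replace (s)
            "mask_end": mask_end,      # end of the segment to replace (s)
        },
    }
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    return payload, headers

def create_prediction(api_key, **inputs):
    """POST the request and return the prediction ID used for polling."""
    payload, headers = build_request(api_key, **inputs)
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["id"]
```

`build_request` is separated from the network call so the payload can be inspected or logged before anything is sent.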
Get Prediction Result
Poll the prediction endpoint with the prediction ID until the result is ready, repeating the request until you receive a success status.
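The polling loop can be sketched as follows; the URL pattern and status names (`succeeded`, `failed`, `canceled`) are assumptions for illustration, so check the platform's API reference for the actual values.

```python
import json
import time
import urllib.request

def is_terminal(status):
    """Terminal statuses end the polling loop (assumed names)."""
    return status in ("succeeded", "failed", "canceled")

def poll_prediction(api_key, prediction_id, interval=2.0, timeout=120.0):
    """Fetch the prediction repeatedly until it reaches a terminal status.

    `interval` is the pause between requests; `timeout` bounds the total
    wait so a stuck prediction does not hang the caller forever.
    """
    # Hypothetical URL pattern; replace with the documented endpoint.
    url = f"https://api.example.com/v1/predictions/{prediction_id}"
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        req = urllib.request.Request(
            url, headers={"Authorization": f"Bearer {api_key}"}
        )
        with urllib.request.urlopen(req) as resp:
            result = json.load(resp)
        if is_terminal(result["status"]):
            return result
        time.sleep(interval)
    raise TimeoutError("prediction did not finish before the timeout")
```

Treating failure states as terminal (rather than retrying them) avoids burning the timeout on a prediction that has already errored.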
Readme
Overview
Stable Audio 2.5 Inpaint is an advanced AI model designed for high-fidelity audio editing, specifically enabling users to modify, replace, or refine segments of existing audio tracks using natural language prompts. Developed by Stability AI, this model leverages state-of-the-art latent diffusion techniques to achieve precise, high-quality inpainting of music and sound effects. The model is particularly notable for its ability to maintain the coherence and quality of the original audio while introducing new elements or transformations as specified by the user.
Key features include text-guided audio inpainting, support for stereo audio, and the ability to handle complex editing tasks such as adding, removing, or transforming specific sounds within a track. The model is built on a latent diffusion architecture, which encodes audio into a compressed latent space, allowing for efficient and effective manipulation. This approach enables the model to generate semantically meaningful and temporally coherent edits, setting it apart from traditional audio editing tools that rely on manual waveform manipulation or spectrogram-based approaches.
What makes Stable Audio 2.5 Inpaint unique is its combination of high-quality output, semantic alignment with user prompts, and versatility across a wide range of audio editing scenarios. The model’s architecture supports fine-grained control over edits, spatial audio effects, and robust handling of both music and general sound effects, making it a powerful tool for musicians, sound designers, and content creators seeking advanced audio editing capabilities.
Technical Specifications
- Architecture: Latent Diffusion Model (LDM) with a 1D-CNN-based Variational Autoencoder (VAE) for audio encoding and decoding
- Parameters: Not publicly specified in available documentation
- Resolution: Supports high-fidelity audio up to 48 kHz and potentially higher (e.g., 96 kHz or 192 kHz in research benchmarks)
- Input/Output formats: Stereo audio, typically in WAV or similar uncompressed formats; accepts text prompts for editing instructions
- Performance metrics: Evaluated using Fréchet Distance (FD), Fréchet Audio Distance (FAD), Log-Spectral Distance (LSD), Structural Similarity Index (SSIM), Inception Score (IS), and CLAP score for semantic alignment
Key Considerations
- The quality of inpainting is highly dependent on the clarity and specificity of the text prompt; ambiguous prompts may yield less predictable results
- Best results are achieved when editing relatively clean, well-segmented audio; heavily layered or noisy tracks may present challenges
- For optimal performance, ensure input audio is properly pre-processed (e.g., normalized, trimmed to relevant sections)
- There is a trade-off between edit speed and output quality; higher fidelity settings may increase inference time
- Prompt engineering is crucial: detailed, context-aware prompts lead to more accurate and semantically aligned edits
- Overlapping or conflicting instructions in prompts can confuse the model and degrade output quality
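The pre-processing advice above (normalizing the input before submission) can be illustrated with a minimal pure-Python peak-normalization sketch; a real pipeline would operate on arrays loaded via an audio library rather than plain lists.

```python
def peak_normalize(samples, target_peak=0.95):
    """Scale samples so the loudest absolute value hits target_peak.

    `samples` is a list of floats in [-1.0, 1.0]. Leaving a little
    headroom below 1.0 avoids clipping after the edit is rendered.
    """
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)  # silence: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]
```

Trimming the input to just the section being edited, then normalizing it this way, gives the model a clean, consistent signal to work from.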
Tips & Tricks
- Use concise, descriptive prompts specifying both the target sound and the desired change (e.g., “replace the guitar solo with a piano melody”)
- For spatial edits, include explicit spatial cues (e.g., “move the drum sound to the left channel”)
- When refining results, iteratively adjust the prompt and re-run the model to converge on the desired outcome
- For subtle edits, specify the time range or segment to be modified to avoid unintended changes elsewhere in the audio
- Combine multiple editing steps by chaining prompts and edits, allowing for complex transformations in stages
- For best results with music, reference genre, instrument, or mood in the prompt to guide the model’s creative choices
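The tip about chaining multiple editing steps can be sketched as a simple loop: each edit's output becomes the next edit's input. `run_inpaint` here is a hypothetical callable standing in for one model execution (for example, a create-then-poll round trip); it is not part of any documented SDK.

```python
def chain_edits(audio_url, prompts, run_inpaint):
    """Apply a sequence of prompt-guided edits in stages.

    `run_inpaint(audio_url, prompt)` is assumed to return the URL of
    the edited audio, which is fed into the next step. Staging edits
    this way keeps each prompt focused on a single change.
    """
    current = audio_url
    for prompt in prompts:
        current = run_inpaint(current, prompt)
    return current
```

Running one focused prompt per stage, rather than packing several instructions into one prompt, also sidesteps the conflicting-instruction problem noted under Key Considerations.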
Capabilities
- Performs high-quality audio inpainting, enabling seamless replacement or modification of audio segments
- Supports text-guided editing for both music and general sound effects
- Maintains temporal and spatial coherence in stereo audio edits
- Handles complex editing tasks such as adding, removing, or transforming specific audio events
- Delivers outputs with strong semantic alignment to user instructions, as validated by both objective metrics and human evaluations
- Adaptable to a wide range of audio types, including speech, music, and environmental sounds
What Can I Use It For?
- Professional music production: refining instrument tracks, replacing solos, or adding new layers based on creative direction
- Sound design for film, games, and multimedia: transforming sound effects, adjusting spatial properties, or inpainting missing audio segments
- Podcast and voice editing: removing unwanted noises, replacing spoken phrases, or enhancing audio clarity
- Restoration of damaged or incomplete audio recordings by intelligently filling gaps
- Creative remixing and mashups, enabling rapid prototyping of new musical ideas
- Educational projects: demonstrating audio editing concepts or generating examples for teaching sound engineering
Things to Be Aware Of
- Some experimental features, such as advanced spatial audio manipulation, may not be fully stable and could produce inconsistent results
- Users have reported occasional artifacts or unnatural transitions, especially when editing highly complex or layered audio
- Performance benchmarks indicate that higher quality settings can significantly increase inference time, requiring more computational resources
- The model’s effectiveness can vary depending on the genre and complexity of the input audio; simpler tracks yield more reliable edits
- Consistency across multiple edits is generally strong, but edge cases (e.g., overlapping sounds, rapid transients) may challenge the model
- Positive feedback highlights the model’s ability to produce musically coherent and high-fidelity edits with minimal manual intervention
- Some users note that achieving very specific or nuanced edits may require multiple iterations and careful prompt tuning
Limitations
- The model may struggle with extremely dense or noisy audio, leading to artifacts or loss of detail in the edited segments
- Not optimal for real-time or low-latency applications due to inference speed at high quality settings
- Fine-grained control over micro-level audio features (e.g., subtle timbral changes) is limited compared to manual editing by expert engineers
Pricing
Pricing Detail
This model runs at a cost of $0.20 per execution.
Pricing Type: Fixed
The cost remains the same regardless of input size or how long the run takes: a set, fixed amount per execution, with no variables affecting the price. This makes budgeting simple and predictable, because you pay the same fee every time you run the model.
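Because pricing is fixed per run, total cost is just the run count times the per-run price, as in this one-function sketch:

```python
def estimated_cost(runs, price_per_run=0.20):
    """Fixed pricing: total cost scales linearly with the run count."""
    return runs * price_per_run
```

For example, a batch of 100 edits costs 100 × $0.20 = $20.00, regardless of audio length or quality settings.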
Related AI Models
You can seamlessly integrate advanced AI capabilities into your applications without the hassle of managing complex infrastructure.