STABLE-AUDIO
Stable Audio 2.5 Inpaint allows you to edit and replace parts of existing audio using text prompts, making it easy to refine or transform music and sound effects with high-quality results.
Avg Run Time: 15.000s
Model Slug: stable-audio-2-5-inpaint
Playground
Input
Output
Example Result
Preview and download your result.
API & SDK
Create a Prediction
Send a POST request to create a new prediction. This will return a prediction ID that you'll use to check the result. The request should include your model inputs and API key.
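The request flow described above can be sketched in Python using only the standard library. The endpoint URL, input field names (`audio`, `prompt`, `mask_start`, `mask_end`), and response schema below are illustrative assumptions, not the platform's documented API; consult the API reference for the exact values.

```python
import json
import urllib.request

# Hypothetical endpoint; replace with the platform's documented URL.
API_URL = "https://api.example.com/v1/predictions"

def build_request(api_key, audio_url, prompt, mask_start, mask_end):
    """Assemble the JSON payload and headers for a prediction request.

    The input field names here are placeholders for illustration;
    check the model's API documentation for the real schema.
    """
    payload = {
        "model": "stable-audio-2-5-inpaint",
        "input": {
            "audio": audio_url,        # URL of the source audio to edit
            "prompt": prompt,          # text instruction guiding the inpaint
            "mask_start": mask_start,  # start of the segment to replace (s)
            "mask_end": mask_end,      # end of the segment to replace (s)
        },
    }
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
    return payload, headers

def create_prediction(api_key, **inputs):
    """POST the request and return the prediction ID used for polling."""
    payload, headers = build_request(api_key, **inputs)
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["id"]
```

`build_request` is separated from the network call so the payload can be inspected or logged before anything is sent.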
Get Prediction Result
Poll the prediction endpoint with the prediction ID until the result is ready, repeating the request until you receive a success status.
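The polling loop can be sketched as follows; the URL pattern and status names (`succeeded`, `failed`, `canceled`) are assumptions for illustration, so check the platform's API reference for the actual values.

```python
import json
import time
import urllib.request

def is_terminal(status):
    """Terminal statuses end the polling loop (assumed names)."""
    return status in ("succeeded", "failed", "canceled")

def poll_prediction(api_key, prediction_id, interval=2.0, timeout=120.0):
    """Fetch the prediction repeatedly until it reaches a terminal status.

    `interval` is the pause between requests; `timeout` bounds the total
    wait so a stuck prediction does not hang the caller forever.
    """
    # Hypothetical URL pattern; replace with the documented endpoint.
    url = f"https://api.example.com/v1/predictions/{prediction_id}"
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        req = urllib.request.Request(
            url, headers={"Authorization": f"Bearer {api_key}"}
        )
        with urllib.request.urlopen(req) as resp:
            result = json.load(resp)
        if is_terminal(result["status"]):
            return result
        time.sleep(interval)
    raise TimeoutError("prediction did not finish before the timeout")
```

Treating failure states as terminal (rather than retrying them) avoids burning the timeout on a prediction that has already errored.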
Readme
Overview
Stable Audio 2.5 Inpaint is an advanced AI model designed for high-fidelity audio editing, specifically enabling users to modify, replace, or refine segments of existing audio tracks using natural language prompts. Developed by Stability AI, this model leverages state-of-the-art latent diffusion techniques to achieve precise, high-quality inpainting of music and sound effects. The model is particularly notable for its ability to maintain the coherence and quality of the original audio while introducing new elements or transformations as specified by the user.
Key features include text-guided audio inpainting, support for stereo audio, and the ability to handle complex editing tasks such as adding, removing, or transforming specific sounds within a track. The model is built on a latent diffusion architecture, which encodes audio into a compressed latent space, allowing for efficient and effective manipulation. This approach enables the model to generate semantically meaningful and temporally coherent edits, setting it apart from traditional audio editing tools that rely on manual waveform manipulation or spectrogram-based approaches.
What makes Stable Audio 2.5 Inpaint unique is its combination of high-quality output, semantic alignment with user prompts, and versatility across a wide range of audio editing scenarios. The model’s architecture supports fine-grained control over edits, spatial audio effects, and robust handling of both music and general sound effects, making it a powerful tool for musicians, sound designers, and content creators seeking advanced audio editing capabilities.
Technical Specifications
- Architecture: Latent Diffusion Model (LDM) with a 1D-CNN-based Variational Autoencoder (VAE) for audio encoding and decoding
- Parameters: Not publicly specified in available documentation
- Resolution: Supports high-fidelity audio up to 48 kHz and potentially higher (e.g., 96 kHz or 192 kHz in research benchmarks)
- Input/Output formats: Stereo audio, typically in WAV or similar uncompressed formats; accepts text prompts for editing instructions
- Performance metrics: Evaluated using Fréchet Distance (FD), Fréchet Audio Distance (FAD), Log-Spectral Distance (LSD), Structural Similarity Index (SSIM), Inception Score (IS), and CLAP score for semantic alignment
Key Considerations
- The quality of inpainting is highly dependent on the clarity and specificity of the text prompt; ambiguous prompts may yield less predictable results
- Best results are achieved when editing relatively clean, well-segmented audio; heavily layered or noisy tracks may present challenges
- For optimal performance, ensure input audio is properly pre-processed (e.g., normalized, trimmed to relevant sections)
- There is a trade-off between edit speed and output quality; higher fidelity settings may increase inference time
- Prompt engineering is crucial: detailed, context-aware prompts lead to more accurate and semantically aligned edits
- Overlapping or conflicting instructions in prompts can confuse the model and degrade output quality
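The pre-processing advice above (normalizing the input before submission) can be illustrated with a minimal pure-Python peak-normalization sketch; a real pipeline would operate on arrays loaded via an audio library rather than plain lists.

```python
def peak_normalize(samples, target_peak=0.95):
    """Scale samples so the loudest absolute value hits target_peak.

    `samples` is a list of floats in [-1.0, 1.0]. Leaving a little
    headroom below 1.0 avoids clipping after the edit is rendered.
    """
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)  # silence: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]
```

Trimming the input to just the section being edited, then normalizing it this way, gives the model a clean, consistent signal to work from.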
Tips & Tricks
- Use concise, descriptive prompts specifying both the target sound and the desired change (e.g., “replace the guitar solo with a piano melody”)
- For spatial edits, include explicit spatial cues (e.g., “move the drum sound to the left channel”)
- When refining results, iteratively adjust the prompt and re-run the model to converge on the desired outcome
- For subtle edits, specify the time range or segment to be modified to avoid unintended changes elsewhere in the audio
- Combine multiple editing steps by chaining prompts and edits, allowing for complex transformations in stages
- For best results with music, reference genre, instrument, or mood in the prompt to guide the model’s creative choices
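The tip about chaining multiple editing steps can be sketched as a simple loop: each edit's output becomes the next edit's input. `run_inpaint` here is a hypothetical callable standing in for one model execution (for example, a create-then-poll round trip); it is not part of any documented SDK.

```python
def chain_edits(audio_url, prompts, run_inpaint):
    """Apply a sequence of prompt-guided edits in stages.

    `run_inpaint(audio_url, prompt)` is assumed to return the URL of
    the edited audio, which is fed into the next step. Staging edits
    this way keeps each prompt focused on a single change.
    """
    current = audio_url
    for prompt in prompts:
        current = run_inpaint(current, prompt)
    return current
```

Running one focused prompt per stage, rather than packing several instructions into one prompt, also sidesteps the conflicting-instruction problem noted under Key Considerations.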
Capabilities
- Performs high-quality audio inpainting, enabling seamless replacement or modification of audio segments
- Supports text-guided editing for both music and general sound effects
- Maintains temporal and spatial coherence in stereo audio edits
- Handles complex editing tasks such as adding, removing, or transforming specific audio events
- Delivers outputs with strong semantic alignment to user instructions, as validated by both objective metrics and human evaluations
- Adaptable to a wide range of audio types, including speech, music, and environmental sounds
What Can I Use It For?
- Professional music production: refining instrument tracks, replacing solos, or adding new layers based on creative direction
- Sound design for film, games, and multimedia: transforming sound effects, adjusting spatial properties, or inpainting missing audio segments
- Podcast and voice editing: removing unwanted noises, replacing spoken phrases, or enhancing audio clarity
- Restoration of damaged or incomplete audio recordings by intelligently filling gaps
- Creative remixing and mashups, enabling rapid prototyping of new musical ideas
- Educational projects: demonstrating audio editing concepts or generating examples for teaching sound engineering
Things to Be Aware Of
- Some experimental features, such as advanced spatial audio manipulation, may not be fully stable and could produce inconsistent results
- Users have reported occasional artifacts or unnatural transitions, especially when editing highly complex or layered audio
- Performance benchmarks indicate that higher quality settings can significantly increase inference time, requiring more computational resources
- The model’s effectiveness can vary depending on the genre and complexity of the input audio; simpler tracks yield more reliable edits
- Consistency across multiple edits is generally strong, but edge cases (e.g., overlapping sounds, rapid transients) may challenge the model
- Positive feedback highlights the model’s ability to produce musically coherent and high-fidelity edits with minimal manual intervention
- Some users note that achieving very specific or nuanced edits may require multiple iterations and careful prompt tuning
Limitations
- The model may struggle with extremely dense or noisy audio, leading to artifacts or loss of detail in the edited segments
- Not optimal for real-time or low-latency applications due to inference speed at high quality settings
- Fine-grained control over micro-level audio features (e.g., subtle timbral changes) is limited compared to manual editing by expert engineers
Pricing
Pricing Detail
This model runs at a cost of $0.20 per execution.
Pricing Type: Fixed
The cost remains the same regardless of input size or how long the run takes: a set, fixed amount per execution, with no variables affecting the price. This makes budgeting simple and predictable, because you pay the same fee every time you run the model.
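Because pricing is fixed per run, total cost is just the run count times the per-run price, as in this one-function sketch:

```python
def estimated_cost(runs, price_per_run=0.20):
    """Fixed pricing: total cost scales linearly with the run count."""
    return runs * price_per_run
```

For example, a batch of 100 edits costs 100 × $0.20 = $20.00, regardless of audio length or quality settings.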
Related AI Models
You can seamlessly integrate advanced AI capabilities into your applications without the hassle of managing complex infrastructure.