Example input

format: "mp3"
prompt: "Abandoned subway emergency broadcast audio: deep underground at night, screeching metal rails, flickering lights buzzing, water rushing through tunnels, alarms blaring, distant train brakes screaming with no train in sight. A panicked male dispatcher speaks through heavy radio static: “All passengers must leave platform nine immediately. Do not look into the tunnel. Something is coming up the tracks.” Realistic radio voice, chaotic, tense, cinematic, loud sound effects, no music."
pitch_rate: 0
sample_rate: 24000
speech_rate: 0
loudness_rate: 0

ByteDance Seed Audio 1.0

Audio · seed-audio · by ByteDance

ByteDance Seed Audio 1.0 turns natural-language prompts into speech and audio, with control over format, sample rate, speed, volume, and pitch.

Runtime (p50): 1m
Estimated price: $0.0025 / unit

Call the API

prediction.sh

curl -X POST \
  -H "X-API-Key: $EACHLABS_API_KEY" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "bytedance-seed-audio-1-0",
    "version": "0.0.1",
    "input": {
        "format": "mp3",
        "prompt": "Abandoned subway emergency broadcast audio: deep underground at night, screeching metal rails, flickering lights buzzing, water rushing through tunnels, alarms blaring, distant train brakes screaming with no train in sight. A panicked male dispatcher speaks through heavy radio static: “All passengers must leave platform nine immediately. Do not look into the tunnel. Something is coming up the tracks.” Realistic radio voice, chaotic, tense, cinematic, loud sound effects, no music.",
        "pitch_rate": 0,
        "sample_rate": 24000,
        "speech_rate": 0,
        "loudness_rate": 0
    },
    "webhook_url": ""
}' \
  https://api.eachlabs.ai/v1/prediction/
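
For teams calling the endpoint from application code rather than a shell, the same request could be assembled roughly as follows. This is a minimal sketch, not an official client: the URL, header name, model slug, version, and input fields are copied from the curl example above, while the function name `build_prediction_request` and its defaults are illustrative assumptions. It only builds the request object and does not send it.

```python
import json
import urllib.request

# Endpoint copied from the curl example above.
API_URL = "https://api.eachlabs.ai/v1/prediction/"


def build_prediction_request(api_key, prompt, *, audio_format="mp3",
                             sample_rate=24000, speech_rate=0,
                             pitch_rate=0, loudness_rate=0, webhook_url=""):
    """Return an unsent POST request mirroring the curl example's JSON body.

    The function name and keyword defaults are hypothetical; the field
    names and default values are taken from the example input.
    """
    payload = {
        "model": "bytedance-seed-audio-1-0",
        "version": "0.0.1",
        "input": {
            "format": audio_format,
            "prompt": prompt,
            "pitch_rate": pitch_rate,
            "sample_rate": sample_rate,
            "speech_rate": speech_rate,
            "loudness_rate": loudness_rate,
        },
        "webhook_url": webhook_url,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )
```

The request can then be passed to `urllib.request.urlopen`. The response schema is not shown on this page, so the sketch stops short of parsing a result.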

Documentation (8 sections)

ByteDance Seed Audio 1.0 Overview

ByteDance Seed Audio 1.0 is a text-to-voice model that generates natural-sounding speech and audio from written prompts. Built by ByteDance as part of the bytedance-test family, it emphasizes configurable output: users can set the sample rate, speed, volume, and pitch to suit different applications. It is aimed at developers and creators who need flexible voice synthesis without heavy manual audio editing. Through the ByteDance Seed Audio 1.0 API, teams can embed text-to-audio capabilities into their products and workflows. When hosted on each::labs, the model offers a straightforward way to prototype, test, and deploy text-driven audio experiences with consistent, reproducible output.
Capabilities
- Generates speech from natural-language text for interfaces, narration, and voiceovers.
- Supports configurable audio format, allowing developers to choose suitable encoding and file types per workflow.
- Provides sample rate control so audio can match streaming, mobile, or desktop playback constraints.
- Offers speed, volume, and pitch adjustment to fine-tune prosody and perceived character of the voice.
- Works well for short prompts and segment-based generation, enabling modular assembly of longer scripts.
- Integrates via the ByteDance Seed Audio 1.0 API, making text-to-voice available in back-end services and applications.
- Suitable for programmatic audio generation at scale when consistent style and parameter control are required.
- Can be orchestrated on each::labs alongside other ByteDance models in the bytedance-test family for multi-modal workflows.
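
As an illustration of segment-based, programmatic generation with consistent parameters, the sketch below builds one input object per script line while sharing a single settings dictionary. The helper name `batch_inputs` is hypothetical, and the setting keys are assumed to match the fields in the example input; nothing here calls the API.

```python
def batch_inputs(lines, shared_settings):
    """Build one input dict per script line, all sharing the same audio settings.

    `shared_settings` is assumed to hold keys such as format, sample_rate,
    speech_rate, pitch_rate, and loudness_rate, as in the example input.
    """
    inputs = []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip blank lines in the script
        item = dict(shared_settings)  # copy so each request stays independent
        item["prompt"] = text
        inputs.append(item)
    return inputs
```

Keeping the settings in one place makes it easier to hold style constant across many short segments.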
Use Cases for ByteDance Seed Audio 1.0

Creators can use ByteDance Seed Audio 1.0 to generate draft voiceovers for videos by setting slower speed and neutral pitch, then refining the text until the narration flows naturally. For example: "Calm, explanatory voiceover, slow speed: In this tutorial, we will walk through setting up your new device step by step."

Marketers can produce audio snippets for product pages or ads by tuning volume and pitch to stand out in mobile playback: "Energetic promo line, medium speed, slightly higher volume: Discover a faster way to manage your tasks today."

Developers can integrate the ByteDance Seed Audio 1.0 API into apps for on-demand voice prompts, specifying sample rate to match in-app sound design: "Neutral assistant voice at 16 kHz sample rate: Your download has completed successfully."

Designers working on interactive experiences can prototype voice UI responses by varying speed for different scenarios, such as alerts vs. guidance: "Alert voice, fast speed, higher pitch: Warning, your session is about to expire."
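
The example prompts in this section share a "direction: script" shape. A small helper could assemble prompts in that shape; this is only a sketch of the pattern seen in the examples, not a format the model is documented to require, and the name `compose_prompt` is hypothetical.

```python
def compose_prompt(direction, script):
    """Join a style direction and the spoken script as 'direction: script'."""
    direction = direction.strip().rstrip(":").strip()
    script = " ".join(script.split())  # collapse stray whitespace and newlines
    if not script:
        raise ValueError("script must not be empty")
    # With no direction, fall back to the bare script.
    return f"{direction}: {script}" if direction else script
```

For instance, passing "Alert voice, fast speed, higher pitch" and "Warning, your session is about to expire." reproduces the last example prompt above.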
Tips and Tricks

To get the most out of ByteDance Seed Audio 1.0, treat the prompt as both script and direction. Include the desired tone, pacing, and emphasis directly in the text, and then refine with speed, volume, and pitch parameters. Start with moderate values and adjust in small steps to avoid unnatural prosody. When using the ByteDance Seed Audio 1.0 API, keep prompts short and focused, especially for iterative testing.
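
One way to follow the "small steps" advice is to generate a short sweep of candidate values around a starting point and listen to each result. The sketch below is an assumption-laden illustration: the page does not state the valid range or scale of speed, volume, or pitch values, so the bounds and step size passed in are placeholders the caller must supply.

```python
def sweep(center, step, steps_each_side, lower, upper):
    """Return values around `center` in increments of `step`.

    Values are clamped to [lower, upper] and de-duplicated. The bounds are
    hypothetical; the real parameter ranges are not documented here.
    """
    values = []
    for k in range(-steps_each_side, steps_each_side + 1):
        value = max(lower, min(upper, center + k * step))
        if value not in values:
            values.append(value)
    return values
```

Each value can then be tried as a single parameter change while the prompt and the other settings stay fixed.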

Example prompts:

"Read this product description in a friendly, confident voice, medium speed and neutral pitch: Introducing our new smart lamp that adapts to your mood."

"Generate an instructional voiceover with slow speed, slightly higher volume, and steady pitch: First, open the app, then tap the settings icon."

"Create a dynamic intro line for a podcast, faster speed and lower pitch for a more relaxed style: Welcome to our weekly deep dive into developer tools."
Technical Specifications
- Model category: text-to-voice / text-to-audio generation
- Input type: natural-language text prompt plus optional configuration parameters
- Output type: synthesized speech or audio waveform (e.g., common PCM formats)
- Configurable audio format: user-selectable encoding and container (e.g., typical compressed or uncompressed audio formats, depending on integration)
- Sample rate control: adjustable sample rate for matching streaming, mobile, or studio workflows
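- Prosody controls: tunable speed, volume, and pitch for expressive or utility-oriented speech (these appear to correspond to the speech_rate, loudness_rate, and pitch_rate inputs in the example request)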
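- Latency expectations: optimized for short- to medium-length prompts; the listing above reports a p50 runtime of 1m, so latency-sensitive integrations should measure actual response times rather than assume near-real-time generation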
- Integration: accessible via ByteDance Seed Audio 1.0 API endpoints when deployed on each::labs
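
To see how the sample rate affects payload size, the arithmetic for uncompressed PCM audio can be sketched as below. The 16-bit depth and single channel are assumptions, since the page does not state them, and compressed formats such as mp3 will be smaller than this estimate.

```python
def pcm_size_bytes(duration_seconds, sample_rate=24000,
                   bits_per_sample=16, channels=1):
    """Estimate uncompressed PCM size: frames x bytes per frame.

    Bit depth and channel count are assumed defaults, not documented values.
    """
    if bits_per_sample % 8 != 0:
        raise ValueError("bits_per_sample must be a multiple of 8")
    bytes_per_frame = (bits_per_sample // 8) * channels
    frames = int(round(duration_seconds * sample_rate))
    return frames * bytes_per_frame
```

Under these assumptions, one minute at 24000 Hz is 2,880,000 bytes, and the same minute at 16000 Hz is 1,920,000 bytes.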
Things to Be Aware Of

ByteDance Seed Audio 1.0 may require prompt tuning and parameter adjustment to avoid robotic or overly flat delivery, especially for emotional content. Very long scripts generated in one pass can lead to inconsistent pacing; splitting text into smaller segments generally yields more reliable prosody. Extreme settings for speed, volume, or pitch may produce unnatural audio, so incremental changes are usually better. Since the model focuses on configurable synthesis rather than ultra-realistic performance, users seeking highly lifelike voices might need post-processing or complementary tools. When calling the ByteDance Seed Audio 1.0 API through each::labs, it is important to handle audio file size and streaming constraints in client applications.
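
Splitting a long script into smaller segments can be done on sentence boundaries before any request is made. The following is a rough sketch: the function name `split_script` and the 300-character default are hypothetical, and the sentence detection is a simple punctuation rule rather than a full tokenizer.

```python
import re


def split_script(text, max_chars=300):
    """Greedily pack whole sentences into segments of at most max_chars.

    A single sentence longer than max_chars is kept whole rather than cut.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    segments, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            segments.append(current)  # close the full segment
            current = sentence        # start a new one with this sentence
        else:
            current = candidate
    if current:
        segments.append(current)
    return segments
```

Each segment can then be submitted as its own prompt with identical settings, and the resulting clips joined in order.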
Key Considerations

ByteDance Seed Audio 1.0 works best with clear, well-structured text and prompts that specify the desired speaking style and audio parameters. Users typically provide a text prompt along with preferred sample rate, speed, volume, and pitch values to get predictable results. The model is suitable for product UI voices, quick voiceover drafts, and programmatic audio generation where control matters more than human-level voice acting. For long-form narration or highly emotive dialog, users may need to experiment with parameter settings and break content into segments. When accessed through each::labs, cost and performance depend on prompt length and output duration, so concise prompts and targeted audio lengths are more efficient.
Limitations

ByteDance Seed Audio 1.0 does not aim to replicate specific human speakers or advanced emotional acting, and its output may sound more synthetic than top-tier human-voice models in some scenarios. It is primarily optimized for clear speech, not for complex soundscapes, background effects, or music generation. Very long or highly expressive narratives can require manual segmentation and multiple passes to maintain consistent quality. Users should also note that the model is focused on text-to-voice; tasks such as audio transcription, translation, or voice cloning are outside its core design.