
ACE-Step 1.5 · Text to Music

Audio · eachlabs · by eachlabs

ACE-Step 1.5 is a text-to-music system built on a diffusion model and a language model. It generates music with vocals from natural-language prompts and optional custom lyrics. It supports Chain-of-Thought reasoning ("thinking" mode) for higher quality, multi-output batches, multilingual vocals, and automatic detection of BPM, musical key, and time signature. Use markers such as [verse], [chorus], and [bridge] to structure songs, and [inst] or [instrumental] for instrumental sections. The model outputs FLAC audio of a user-defined duration and is billed per output second, with thinking mode charged at double the rate.

Runtime (p50): 30s
Estimated price: Usage-based
Call the API
prediction.sh
sh
curl -X POST \
  -H "X-API-Key: $EACHLABS_API_KEY" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "ace-step-1-5-text-to-music",
    "version": "0.0.1",
    "input": {
        "bpm": 30,
        "shift": 3,
        "lyrics": "[verse 1]\nWhere ideas spark and the future begins,\nEachlabs is the place where creation wins.\nFrom text to image, from sound to light,\nWe turn imagination into something bright.\n\n[chorus]\nEachlabs, the power, the speed, the way,\nFrom a single thought to a product in a day.\nBuild it, shape it, let it rise,\nBring your vision to life before your eyes.\n\n[verse 2]\nModels in motion, connected as one,\nAPIs flowing till the work is done.\nCreators and builders, side by side,\nTurning bold dreams into tools with pride.",
        "prompt": "Energetic R&B track with smooth groove, punchy bass, modern drums, and emotional yet powerful vocals.",
        "duration": 60,
        "thinking": true,
        "num_outputs": 1,
        "infer_method": "ode",
        "lm_cfg_scale": 1,
        "guidance_scale": 7,
        "lm_temperature": 0.85,
        "vocal_language": "unknown",
        "lm_negative_prompt": "NO USER INPUT",
        "num_inference_steps": 8,
        "use_constrained_decoding": true
    },
    "webhook_url": ""
}' \
  https://api.eachlabs.ai/v1/prediction/
Documentation (8 sections)
  • Overview

    ACE-Step 1.5 | Text to Music Overview

    ACE-Step 1.5 | Text to Music turns natural-language prompts into full songs with vocals, using optional custom lyrics and structured sections. It is hosted on each::labs and combines a diffusion model with a language model. Its Chain-of-Thought mode reasons through the composition step by step, which is intended to improve output quality. Tracks are delivered in FLAC format, with multilingual vocals and automatic detection of BPM, key, and time signature.

    The model suits creators who want music without instruments or a studio. Simple markers such as [verse] and [chorus] define the song sections. It can generate several outputs per request, and the duration is user-defined. Billing is per output second, so shorter tracks cost less. It can be used both for prototyping ideas and for producing final tracks from text alone.

  • Capabilities

    Capabilities

    • Generates complete songs with vocals from text prompts and optional custom lyrics
    • Supports song structure via markers: [verse], [chorus], [bridge], [inst], [instrumental]
    • Chain-of-Thought reasoning for improved musical coherence and quality
    • Automatic detection of BPM, musical key, and time signature
    • Multilingual vocal generation for global music creation
    • Multi-output batches to produce variations efficiently
    • User-defined track durations with FLAC output for professional use
    • Accessible via each::labs music-generation API for seamless integration
  • Use cases

    Use Cases for ACE-Step 1.5 | Text to Music

    Content Creators: Produce custom background tracks for YouTube videos. Example: "[intro] Calm ambient [verse] Exploring new worlds [chorus] Adventure calls, cinematic orchestral with soft vocals, 80 BPM". The automatically detected BPM can help when syncing the track to video.

    Marketers: Generate branded jingles quickly. Example: "[chorus] each::labs AI magic, upbeat electronic pop 120 BPM, male rap vocals". Structure markers keep the hook in a clearly defined chorus.

    Music Producers: Prototype song ideas with vocals. Enable Chain-of-Thought for a prompt such as "[bridge] Emotional guitar solo [outro] Fade with echoes, indie rock 100 BPM", then generate batches and refine.

    Developers: Integrate the model into apps through the ACE-Step 1.5 | Text to Music API. Example prompt for user-generated multilingual tracks: "Spanish flamenco [verse] Noche de pasión, detect key". Note that the request example above sends the style description in the prompt field and the marked-up lyrics in the lyrics field.
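
    A minimal sketch of how an app might assemble the request body for the flamenco example, keeping style text and lyrics in separate fields. The helper name build_payload is hypothetical, and only three input fields are set; whether the API applies defaults for the omitted inputs is not stated on this page, so treat that as an assumption.

```sh
#!/bin/sh
# build_payload: hypothetical helper, not part of the each::labs API.
# Prints a JSON request body. Arguments must already be JSON-escaped.
#   $1 = prompt (style description)
#   $2 = lyrics (with structure markers, newlines written as \n)
#   $3 = vocal_language (defaults to "unknown", the only value shown
#        in the request example on this page)
build_payload() {
  printf '{"model": "ace-step-1-5-text-to-music", "version": "0.0.1", "input": {"prompt": "%s", "lyrics": "%s", "vocal_language": "%s"}}\n' \
    "$1" "$2" "${3:-unknown}"
}

# Print the body; it could be piped to curl with `--data @-`.
build_payload "Spanish flamenco, passionate vocals" '[verse]\nNoche de pasión'
```

    Because printf substitutes the arguments with %s, the literal \n in the lyrics argument is passed through unchanged and ends up as a JSON newline escape.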

  • Tips & tricks

    Tips and Tricks

    For best results with ACE-Step 1.5 | Text to Music, structure songs explicitly with markers such as [verse], [chorus], and [bridge], and use [inst] or [instrumental] for instrumental sections. Include genre, mood, tempo hints, and custom lyrics. Enable Chain-of-Thought for complex tracks to take advantage of step-by-step reasoning.
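
    Lyrics with markers are easier to write as plain multi-line text, but the request body needs them as a single JSON string. The sketch below shows one way to do that conversion in shell. The helper json_escape is hypothetical and deliberately small: it handles backslashes, double quotes, and newlines only.

```sh
#!/bin/sh
# json_escape: hypothetical helper. Reads text on stdin and prints it as
# the inside of a JSON string. Handles backslashes, double quotes, and
# newlines; tabs and other control characters are NOT handled.
json_escape() {
  sed -e 's/\\/\\\\/g' -e 's/"/\\"/g' |
    awk 'NR > 1 { printf "%s", "\\n" } { printf "%s", $0 }'
}

# Write the lyrics as readable multi-line text with section markers.
lyrics=$(json_escape <<'EOF'
[verse]
Waking up early

[chorus]
Coffee and dreams
EOF
)

# Prints one line with literal \n sequences between the original lines.
printf '%s\n' "$lyrics"
```

    The resulting string can be placed directly inside the "lyrics" field of the request body.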

    Set the duration up front and start with batches of 2-4 outputs to compare variations. For multilingual vocals, name the language at the start of the prompt, e.g., "French pop ballad." Example prompts:

    • "[intro] Soft piano [verse] Heartbreak in the rain, she left me alone [chorus] I'll never love again, 90 BPM pop ballad with female vocals"
    • "[instrumental] Epic orchestral build-up to [drop] heavy EDM synths, 128 BPM, no lyrics"
    • "[verse 1] Waking up early [chorus] Coffee and dreams, upbeat folk with male vocals, detect key"

    Iterate by refining the prompt based on the BPM and key that the model detects.

  • Technical spec

    Technical Specifications

    • Model Type: Diffusion + language model for text-to-music generation
    • Output Format: High-fidelity FLAC audio files
    • Max Duration: User-defined; no fixed maximum is stated here, and cost grows with length because billing is per second of output
    • Audio Features: Automatic BPM, musical key, and time signature detection; multilingual vocals
    • Batch Support: Multi-output generation for efficiency
    • Reasoning Mode: Chain-of-Thought for enhanced quality (billed at double rate)
    • Input: Natural-language prompts with optional custom lyrics and structure markers
    • Processing Time: Varies by duration and complexity; the listed median (p50) runtime is 30s
    • API Access: Available via each::labs ACE-Step 1.5 | Text to Music API
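
    The billing rules in the list above (per second of output, double rate in Chain-of-Thought mode) can be turned into a rough estimate. This is a sketch under two assumptions: every output in a batch is billed for its full duration, and the per-second rate is a placeholder, since this page lists pricing only as usage-based. The helper estimate_cost is hypothetical.

```sh
#!/bin/sh
# estimate_cost: hypothetical helper.
# Usage: estimate_cost DURATION_SECONDS NUM_OUTPUTS THINKING RATE_PER_SECOND
# RATE_PER_SECOND is a placeholder value; look up the real rate first.
# Assumes each output in a batch is billed for its full duration.
estimate_cost() {
  LC_ALL=C awk -v d="$1" -v n="$2" -v t="$3" -v r="$4" 'BEGIN {
    mult = (t == "true") ? 2 : 1   # thinking mode is billed at double rate
    printf "%.4f\n", d * n * r * mult
  }'
}

estimate_cost 60 1 true 0.001    # 60 s, one output, thinking on
estimate_cost 60 4 false 0.001   # 60 s, four outputs, thinking off
```

    With the placeholder rate, a single 60-second track in thinking mode costs half as much as a batch of four 60-second tracks without it, which is the kind of trade-off worth checking before running large batches.
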
  • Things to be aware of

    Things to Be Aware Of

    ACE-Step 1.5 | Text to Music works best with clear, structured prompts; overly abstract descriptions may produce inconsistent vocals. Edge cases such as extreme genres (e.g., avant-garde noise) or very long durations (>5 minutes) increase processing time and variability. A common mistake is omitting markers, which leads to unstructured output, so always specify [verse]/[chorus].
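
    Since missing markers are called out as a common mistake, a client could check lyrics before sending them. The sketch below is one possible pre-flight check. The helper has_section_markers is hypothetical, and it only looks for a bracketed tag at the start of a line; it does not verify that the tag is one the model recognizes.

```sh
#!/bin/sh
# has_section_markers: hypothetical pre-flight check. Succeeds if at least
# one line of the lyrics on stdin starts with a bracketed tag such as
# [verse], [chorus], or [verse 1].
has_section_markers() {
  grep -Eq '^\[[A-Za-z][A-Za-z0-9 ]*\]'
}

if printf '[verse]\nWaking up early\n' | has_section_markers; then
  echo "markers found"
else
  echo "no markers: add [verse]/[chorus] tags before sending" >&2
fi
```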

    The model runs in the cloud, so it needs no local hardware, including in Chain-of-Thought mode. Monitor costs for iterative workflows on each::labs.

  • Key considerations

    Key Considerations

    ACE-Step 1.5 | Text to Music is billed per output second, and Chain-of-Thought mode doubles the cost. Use Chain-of-Thought for final productions and consider basic mode for drafts. The only prerequisite is an each::labs account. Prompts work best in English, although vocals can be generated in multiple languages. Choose this model over simpler generators when you need structured songs with vocals and automatic analysis of musical elements.

    Output quality may vary with prompt complexity. Test short durations first to keep costs down in each::labs music-generation pipelines. The model is aimed at users who value vocal coherence and song structure without manual editing.
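
    One way to apply the draft-then-final advice is to keep two sets of input values and switch between them. The sketch below prints the "input" fields that would differ between the two passes. The helper run_settings is hypothetical, and the specific numbers (15-second drafts, three outputs, a 60-second default for finals) are illustrative choices rather than recommendations from each::labs.

```sh
#!/bin/sh
# run_settings: hypothetical helper. Prints the "input" fields that differ
# between a cheap draft pass and a final pass. The numbers are illustrative.
#   run_settings draft
#   run_settings final [DURATION_SECONDS]
run_settings() {
  case "$1" in
    draft) printf '"duration": 15, "thinking": false, "num_outputs": 3\n' ;;
    final) printf '"duration": %s, "thinking": true, "num_outputs": 1\n' "${2:-60}" ;;
    *) echo "usage: run_settings draft|final [duration_seconds]" >&2; return 1 ;;
  esac
}

run_settings draft       # short, several variations, no thinking mode
run_settings final 120   # full length, one output, thinking mode on
```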

  • Limitations

    Limitations

    ACE-Step 1.5 | Text to Music cannot import existing audio or MIDI for remixing; it is purely text-driven. Vocals may lack pitch accuracy in complex polyphony. Output length is fixed by the requested duration; there is no looping or open-ended generation. Ambiguous prompts can occasionally cause genre drift. Generation is not real-time, so allow for processing time. Multilingual support varies with how prominent the language is.
