xtts-v2

XTTS

XTTS is a voice generation model that lets you clone a voice into different languages using just a 6-second audio clip.

Avg Run Time: 20.000s

Model Slug: xtts-v2

The total cost depends on how long the model runs. It costs $0.001540 per second. Based on an average runtime of 20 seconds, each run costs about $0.0308. With a $1 budget, you can run the model around 32 times.

API & SDK

Create a Prediction

Send a POST request to create a new prediction. This will return a prediction ID that you'll use to check the result. The request should include your model inputs and API key.
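
For illustration, a minimal Python sketch of the request is shown below; the endpoint URL, auth header, and response field names are assumptions, so consult the Eachlabs API reference for the exact schema.

    # Minimal sketch of creating a prediction with Python's requests library.
    # The endpoint URL, auth header, and response field names are assumptions;
    # check the Eachlabs API reference for the exact schema.
    import requests

    API_KEY = "YOUR_API_KEY"  # placeholder

    response = requests.post(
        "https://api.eachlabs.ai/v1/prediction",  # assumed endpoint
        headers={"X-API-Key": API_KEY},
        json={
            "model": "xtts-v2",
            "input": {
                "text": "Hello from a cloned voice.",
                "language": "en",
                "speaker_wav": "https://example.com/reference_6s.wav",
            },
        },
    )
    response.raise_for_status()
    prediction_id = response.json()["predictionID"]  # assumed field name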

Get Prediction Result

Poll the prediction endpoint with the prediction ID until the result is ready: check repeatedly at short intervals until the response reports a success status.
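
Continuing the sketch above, a simple polling loop could look like this (the endpoint and field names remain assumptions):

    # Continuation of the sketch above: poll until the status is terminal.
    import time

    while True:
        result = requests.get(
            f"https://api.eachlabs.ai/v1/prediction/{prediction_id}",  # assumed
            headers={"X-API-Key": API_KEY},
        ).json()
        status = result.get("status")
        if status == "success":
            print(result["output"])  # URL of the generated WAV
            break
        if status == "error":
            raise RuntimeError(result)
        time.sleep(1)  # brief pause between checks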

Readme

Table of Contents
Overview
Technical Specifications
Key Considerations
Tips & Tricks
Capabilities
What Can I Use It For?
Things to Be Aware Of
Limitations

Overview

xtts-v2: Voice-to-Voice AI Model

xtts-v2, developed by Coqui as part of the XTTS family, is a voice-to-voice AI model that clones voices into different languages using just a 6-second audio clip, enabling high-quality multilingual speech synthesis without extensive training data.

This zero-shot voice cloning capability sets xtts-v2 apart in the Coqui voice-to-voice landscape, supporting 17 languages, including English, Spanish, Hindi, Dutch, and Russian, for seamless cross-language transfer.

Ideal for developers seeking xtts-v2 API integration or creators exploring voice-to-voice AI models, it delivers expressive output with emotion and style preservation, making it a go-to for efficient audio production.

Technical Specifications

What Sets xtts-v2 Apart

xtts-v2 excels at zero-shot voice cloning from a mere 6-second audio sample, an edge over models that require longer reference recordings or phoneme alignment. This enables instant multilingual voice generation, ideal for rapid prototyping in Coqui voice-to-voice applications without fine-tuning delays.

Unlike many TTS systems limited to fewer languages, xtts-v2 natively handles 17 languages with cross-lingual transfer and emotion/style replication, including whispers and laughter. Users gain authentic, expressive speech in diverse tongues, streamlining global content creation.

It supports low-latency streaming at under 200ms, outperforming bulkier alternatives in real-time scenarios. This facilitates live voice conversion demos and interactive apps via the xtts-v2 API, with output in standard WAV format (see the streaming sketch after the list below).

  • Multilingual zero-shot cloning: 17 languages from 6s clip, no alignment needed.
  • Expressive transfer: Captures emotion, prosody for natural inflection.
  • Streaming inference: <200ms latency on GPU.
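
To illustrate the streaming path, here is a hedged sketch using the open-source Coqui XTTS classes rather than the Eachlabs endpoint; the call signatures follow Coqui's published examples, and the file paths are placeholders.

    # Streaming sketch with the open-source Coqui XTTS model classes.
    # File paths are placeholders; a CUDA GPU is assumed for the quoted latency.
    from TTS.tts.configs.xtts_config import XttsConfig
    from TTS.tts.models.xtts import Xtts

    config = XttsConfig()
    config.load_json("xtts_v2/config.json")                   # downloaded model config
    model = Xtts.init_from_config(config)
    model.load_checkpoint(config, checkpoint_dir="xtts_v2/")  # downloaded weights
    model.cuda()

    # Condition on the 6-second reference once, then decode audio chunk by chunk.
    gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
        audio_path=["reference_6s.wav"]
    )
    for i, chunk in enumerate(
        model.inference_stream(
            "This audio arrives chunk by chunk.",
            "en",
            gpt_cond_latent,
            speaker_embedding,
        )
    ):
        print(f"chunk {i}: {chunk.shape}")  # each chunk is a tensor of samples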

Technical specs include GPU-accelerated inference (CUDA/ROCm), fine-tuning in 3-5 hours on an RTX 4090, and a Python API for voice conversion with source/target WAV inputs.
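
As a sketch of that Python API, the snippet below uses the upstream Coqui TTS package for both zero-shot cloning and source/target voice conversion; it is not the Eachlabs API, and the file paths are placeholders.

    # Sketch using the open-source Coqui TTS package (pip install TTS);
    # this is the upstream library, not the Eachlabs API. Paths are placeholders.
    import torch
    from TTS.api import TTS

    device = "cuda" if torch.cuda.is_available() else "cpu"
    tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)

    # Zero-shot cloning: speak new text in the voice of a 6-second reference clip.
    tts.tts_to_file(
        text="Hola, ¿cómo estás?",
        speaker_wav="reference_6s.wav",
        language="es",
        file_path="cloned_es.wav",
    )

    # Voice conversion: re-render existing speech in the target speaker's voice.
    tts.voice_conversion_to_file(
        source_wav="source_speech.wav",
        target_wav="reference_6s.wav",
        file_path="converted.wav",
    )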

Key Considerations

Language-Specific Nuances: Ensure the text input aligns with the selected language to avoid unnatural pronunciation.

Speaker File Quality: Poor-quality or noisy speaker files can negatively impact the generated output. Use clean recordings for better results.

Output Clarity: Long or overly complex text inputs may produce less natural results.

Tips & Tricks

How to Use xtts-v2 on Eachlabs

Access xtts-v2 through the Eachlabs Playground for instant testing: upload a 6-second reference WAV, enter your text, select the language and speaker (17 languages are supported), and generate WAV output with streaming latency under 200ms. For integration, call the Eachlabs API or SDK with parameters such as speaker_wav, text, and language to write converted audio to a file; fine-tuned models additionally support custom emotion transfer on GPU.
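
As a quick reference, an input payload might look like the following sketch; text, language, and speaker_wav are the parameters named above, cleanup_voice is the polishing option described under "Things to Be Aware Of", and the exact schema may differ.

    # Illustrative input payload; field names follow the parameters named on
    # this page, but the exact schema may differ.
    payload = {
        "text": "Welcome to our service, how may I assist?",
        "language": "en",  # one of the 17 supported language codes
        "speaker_wav": "https://example.com/host_6s.wav",  # ~6 s clean reference
        "cleanup_voice": True,  # optional post-processing of the output audio
    }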

Capabilities

  • Narration for audiobooks or educational content.
  • Voiceovers for videos and presentations.
  • Real-time communication in multilingual scenarios.

What Can I Use It For?

Use Cases for xtts-v2

Content creators building multilingual podcasts can upload a 6-second host clip and generate episodes in Spanish or Hindi, preserving the original speaker's emotional tone for authentic listener engagement without hiring translators.

Developers integrating voice-to-voice AI models into apps use xtts-v2's API to clone customer service voices across languages; for example, input "Welcome to our service, how may I assist?" with a reference WAV to output natural replies in Russian, enabling scalable global support bots.

Marketers producing localized ads feed a brand spokesperson's short audio and text prompt like "Promote our new eco-friendly shoes with excitement and a smile" to create dubbed versions in Dutch or Arabic, maintaining style for consistent campaigns.

Gaming studios leverage xtts-v2 for dynamic NPC voices: provide a character sample and script to clone into Portuguese, capturing whispers for stealth scenes or laughter for dialogues, accelerating localization without voice actor recuts.

Things to Be Aware Of

Multilingual Speech:

  • Input: "Bonjour, comment allez-vous?"
    Language: fr
    Output: High-quality French speech.

Voice Personalization:

  • Provide a custom speaker file to replicate a specific voice style.

Enhanced Cleanup:

  • Enable the cleanup_voice feature to polish the generated audio.

Limitations

Accent and Dialect Variations: The model may not fully replicate regional accents or dialects within a language.

Speaker Diversity: The quality of voice mimicry depends heavily on the provided speaker file's clarity and characteristics.

Complex Text Handling: Highly technical or domain-specific jargon may result in inconsistent pronunciation.

Output Format: WAV

Pricing

Pricing Detail

This model runs at a cost of $0.001540 per second.

The average execution time is 20 seconds, but this may vary depending on your input data.

The average cost per run is $0.030800.

Pricing Type: Execution Time

Cost Per Second means the total cost is calculated based on how long the model runs. Instead of paying a fixed fee per run, you are charged for every second the model is actively processing. This pricing method provides flexibility, especially for models with variable execution times, because you only pay for the actual time used.
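
To make the arithmetic concrete, a few lines of Python reproduce the figures above.

    # Reproduce the pricing figures quoted above.
    cost_per_second = 0.001540
    avg_runtime_s = 20

    cost_per_run = cost_per_second * avg_runtime_s
    print(f"${cost_per_run:.4f} per run")                     # $0.0308
    print(f"{1.00 // cost_per_run:.0f} runs per $1 budget")   # ~32 runs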