XTTS
xtts-v2
XTTS is a voice generation model that lets you clone a voice into different languages from just a 6-second audio clip.
Prerequisites
- Create an API Key from the Eachlabs Console
- Install the required dependencies for your chosen language (e.g., requests for Python)
API Integration Steps
1. Create a Prediction
Send a POST request to create a new prediction. This will return a prediction ID that you'll use to check the result. The request should include your model inputs and API key.
```python
import requests
import time

API_KEY = "YOUR_API_KEY"  # Replace with your API key

HEADERS = {
    "X-API-Key": API_KEY,
    "Content-Type": "application/json"
}

def create_prediction():
    response = requests.post(
        "https://api.eachlabs.ai/v1/prediction/",
        headers=HEADERS,
        json={
            "model": "xtts-v2",
            "version": "0.0.1",
            "input": {
                "text": "Hello, you are now at Eachlabs AI. If you need any support, just contact us.",
                "speaker": "your_file.audio/mp3",
                "language": "en",
                "cleanup_voice": "true"
            }
        }
    )
    prediction = response.json()
    if prediction["status"] != "success":
        raise Exception(f"Prediction failed: {prediction}")
    return prediction["predictionID"]
```
2. Get Prediction Result
Poll the prediction endpoint with the prediction ID until the result is ready. Each request returns the current status immediately, so you'll need to check repeatedly until you receive a success status.
```python
def get_prediction(prediction_id):
    while True:
        result = requests.get(
            f"https://api.eachlabs.ai/v1/prediction/{prediction_id}",
            headers=HEADERS
        ).json()
        if result["status"] == "success":
            return result
        elif result["status"] == "error":
            raise Exception(f"Prediction failed: {result}")
        time.sleep(1)  # Wait before polling again
```
3. Complete Example
Here's a complete example that puts it all together, including error handling and result processing. This shows how to create a prediction and wait for the result in a production environment.
```python
try:
    # Create prediction
    prediction_id = create_prediction()
    print(f"Prediction created: {prediction_id}")

    # Get result
    result = get_prediction(prediction_id)
    print(f"Output URL: {result['output']}")
    print(f"Processing time: {result['metrics']['predict_time']}s")
except Exception as e:
    print(f"Error: {e}")
```
Additional Information
- The API uses a two-step process: create prediction and poll for results
- Response time: ~20 seconds
- Rate limit: 60 requests/minute
- Concurrent requests: 10 maximum
- Poll the prediction endpoint repeatedly until the status is success
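To stay within the rate and concurrency limits above, requests can be spaced out client-side. The sketch below is illustrative only: the Throttle class is not part of any Eachlabs SDK, and the interval is simply derived from the 60 requests/minute figure.

```python
import time

class Throttle:
    """Spaces out calls so they stay under a requests-per-minute limit."""

    def __init__(self, requests_per_minute=60):
        self.min_interval = 60.0 / requests_per_minute  # seconds between calls
        self.last_call = 0.0

    def wait(self):
        """Blocks until enough time has passed since the previous call."""
        elapsed = time.monotonic() - self.last_call
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self.last_call = time.monotonic()

throttle = Throttle(requests_per_minute=60)
# Call throttle.wait() before each API request, e.g. inside the polling loop.
```

Calling `throttle.wait()` before every `requests.get`/`requests.post` keeps a single-threaded client under the documented limit; concurrent workers would need a shared throttle.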
Overview
XTTS is a state-of-the-art text-to-speech (TTS) model that enables high-quality, natural-sounding voice generation in multiple languages. The model is designed for generating lifelike speech while maintaining clarity, emotion, and linguistic precision. It supports a wide range of languages and offers fine-tuned controls to customize voice output to suit various use cases.
Technical Specifications
Multilingual Support: The model supports the following languages:
- English (en), Spanish (es), French (fr), German (de), Italian (it), Portuguese (pt), Polish (pl), Turkish (tr), Russian (ru), Dutch (nl), Czech (cs), Arabic (ar), Chinese (zh), Hungarian (hu), Korean (ko), Hindi (hi).
Speaker Personalization: Allows the use of external speaker files to mimic specific voice profiles or styles.
Voice Cleanup: A refinement process to enhance the smoothness and quality of generated speech.
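The language codes listed above can be validated before a request is sent. This is a minimal sketch: SUPPORTED_LANGUAGES simply mirrors the list in this section, and build_input is a hypothetical helper, not an official API.

```python
# Mirrors the supported-language list in this document.
SUPPORTED_LANGUAGES = {
    "en", "es", "fr", "de", "it", "pt", "pl", "tr",
    "ru", "nl", "cs", "ar", "zh", "hu", "ko", "hi",
}

def build_input(text, speaker, language="en", cleanup_voice=True):
    """Builds the `input` payload for an xtts-v2 prediction request."""
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"Unsupported language code: {language!r}")
    return {
        "text": text,
        "speaker": speaker,
        "language": language,
        "cleanup_voice": "true" if cleanup_voice else "false",
    }
```

Failing fast on an unknown code avoids wasting a prediction request on input the model cannot handle.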
Key Considerations
Language-Specific Nuances: Ensure the text input aligns with the selected language to avoid unnatural pronunciation.
Speaker File Quality: Poor-quality or noisy speaker files can negatively impact the generated output. Use clean recordings for better results.
Output Clarity: Long or overly complex text inputs may produce less natural results.
Tips & Tricks
Text:
- Keep sentences concise and grammatically correct.
- Avoid abbreviations or symbols that may confuse the model.
- Example: Use "Please proceed to the next step." instead of "Pls proc nxt step."
Speaker:
- Use high-resolution audio files for better mimicry.
- Ensure the recording has a neutral tone without excessive background noise or distortion.
Language:
- Select the correct code for the desired language (e.g., en for English, fr for French).
- Match the text language with the selected language code for natural intonation.
Cleanup Voice:
- Enable this option for smoother and artifact-free outputs, especially when working with synthesized or noisy speaker profiles.
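The text tips can be partially automated with a small pre-processing step. The helper and abbreviation map below are illustrative only and not part of the API; extend the map to match your own content.

```python
import re

# Illustrative expansions; not an official list.
ABBREVIATIONS = {
    "pls": "please",
    "proc": "proceed",
    "nxt": "next",
}

def normalize_text(text):
    """Expands common abbreviations so the model receives full words."""
    for short, full in ABBREVIATIONS.items():
        text = re.sub(rf"\b{re.escape(short)}\b", full, text, flags=re.IGNORECASE)
    return text
```

For example, `normalize_text("Pls proc nxt step.")` turns the abbreviated form from the tip above into full words before it reaches the model.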
Capabilities
Narration for audiobooks or educational content.
Voiceovers for videos and presentations.
Real-time communication in multilingual scenarios.
What can I use it for?
Creating customized voice profiles for specific use cases.
Generating speech in multiple languages with high clarity and natural tone.
Refining synthesized speech using advanced cleanup features.
Things to be aware of
Multilingual Speech:
- Input: "Bonjour, comment allez-vous?"
- Language: fr
- Output: High-quality French speech.
Voice Personalization:
- Provide a custom speaker file to replicate a specific voice style.
Enhanced Cleanup:
- Enable the cleanup_voice feature to polish the generated audio.
Limitations
Accent and Dialect Variations: The model may not fully replicate regional accents or dialects within a language.
Speaker Diversity: The quality of voice mimicry depends heavily on the provided speaker file's clarity and characteristics.
Complex Text Handling: Highly technical or domain-specific jargon may result in inconsistent pronunciation.
Output Format: WAV
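Since the output is a URL to a WAV file, the result can be fetched and sanity-checked locally. This is a minimal sketch assuming result['output'] holds the file URL as in the polling example; is_wav and download_output are hypothetical helpers, not part of the API.

```python
import requests

def is_wav(data: bytes) -> bool:
    """Checks the RIFF/WAVE header of downloaded audio bytes."""
    return len(data) >= 12 and data[:4] == b"RIFF" and data[8:12] == b"WAVE"

def download_output(output_url, path="output.wav"):
    """Downloads the generated audio and verifies it looks like a WAV file."""
    response = requests.get(output_url)
    response.raise_for_status()
    if not is_wav(response.content):
        raise ValueError("Downloaded file does not look like a WAV file")
    with open(path, "wb") as f:
        f.write(response.content)
    return path
```

The header check catches truncated downloads or error pages saved in place of audio before the file is handed to a player or pipeline.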