Google · Text to Speech

Audio · google-tts · by Google

Google Text to Speech converts your written text into natural-sounding speech. Simply type your text, choose a voice, and generate high-quality audio instantly.

Runtime (p50): 10s
Estimated price: usage-based
Call the API
prediction.sh
```sh
# Requires EACHLABS_API_KEY to be exported in your shell.
curl -X POST \
  -H "X-API-Key: $EACHLABS_API_KEY" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "google-text-to-speech",
    "version": "0.0.1",
    "input": {
        "mode": "single",
        "text": "Say in a natural and confident tone: Hello! Google Text to speech model is now live on Eachlabs. This is a voice test of the model. Try it yourself today.",
        "voice": "Despina",
        "speaker1_name": "Speaker 1",
        "speaker2_name": "Speaker 2",
        "speaker1_voice": "Zephyr",
        "speaker2_voice": "Puck"
    },
    "webhook_url": ""
}' \
  https://api.eachlabs.ai/v1/prediction/
```
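The inline `--data '{...}'` body above breaks as soon as the text contains a double quote, which every SSML attribute does (for example `time="500ms"`). A minimal POSIX sh sketch that JSON-escapes the text before sending it. `json_escape`, `build_payload` and `send_prediction` are hypothetical helper names, and the field values are copied from the request above. The escaping assumes single-line text and handles only backslashes and double quotes, not tabs or other control characters.

```sh
# Escape backslashes first, then double quotes, so the text is a valid JSON
# string. Rejects multi-line input, which this sketch does not handle.
json_escape() {
  nl='
'
  case $1 in
    *"$nl"*) echo "json_escape: multi-line text is not supported" >&2; return 1 ;;
  esac
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Print the request body; field names and defaults mirror prediction.sh.
build_payload() {
  text=$(json_escape "$1") || return 1
  voice=$(json_escape "${2:-Despina}") || return 1
  printf '{"model":"google-text-to-speech","version":"0.0.1","input":{"mode":"single","text":"%s","voice":"%s","speaker1_name":"Speaker 1","speaker2_name":"Speaker 2","speaker1_voice":"Zephyr","speaker2_voice":"Puck"},"webhook_url":""}\n' "$text" "$voice"
}

# Send the body with the same headers and endpoint as prediction.sh.
send_prediction() {
  body=$(build_payload "$1" "$2") || return 1
  curl -X POST \
    -H "X-API-Key: $EACHLABS_API_KEY" \
    -H "Content-Type: application/json" \
    --data "$body" \
    https://api.eachlabs.ai/v1/prediction/
}
```

Usage: source the functions, then run `send_prediction 'Please be advised <break time="500ms"/> that maintenance begins at midnight.' Despina`. The example request does not show whether the endpoint interprets SSML in the `text` field, so try a short input first.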
Documentation
  • Overview

    Google Text to Speech converts written text into natural-sounding speech using advanced AI models, meeting the need for realistic audio in apps, content creation, and accessibility tools. Provided by Google as part of the google-tts family, it stands out with over 380 voices across 75+ languages and variants, including premium Neural2 and WaveNet voices for expressive, human-like output.

    This Google text-to-speech service offers low-latency synthesis, with control over pitch, speaking rate, volume, and pauses through SSML markup. Developers can integrate it over REST or gRPC, which makes it suitable for both real-time and batch audio generation on each::labs.

    Whether you are building voice-enabled apps or enhancing e-learning, Google Text to Speech produces fluid, intelligible speech for global audiences.

  • Capabilities
    • Generates realistic speech with 380+ voices in 75+ languages using Neural2, WaveNet, and Studio models.
    • Customizes audio via SSML: pauses, emphasis, pitch, speaking rate, and pronunciation.
    • Supports multiple formats including MP3, WAV, OGG for web, telephony, or high-fidelity playback.
    • Offers adjustable parameters: speaking rate (0.25-4x), pitch (±20 semitones), and volume gain (-96 to +16 dB).
    • Enables custom voice training from studio audio for branded speech.
    • Provides synchronous/streaming synthesis for real-time apps or batch processing.
    • Integrates via REST/gRPC APIs with low-latency enterprise scalability.
    • Produces expressive, well-paced audio for e-learning, IVR, and content creation.
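    The output formats above correspond to Google Cloud Text-to-Speech AudioEncoding values (MP3, LINEAR16, OGG_OPUS, MULAW, ALAW, as named in the Technical spec section). A small sketch of choosing the encoding from a target file type; `audio_encoding_for` is a hypothetical helper, and since the each::labs request at the top of this page has no encoding field, it applies when calling the Google Cloud API directly.

```sh
# Hypothetical helper: map a target file type to the Google Cloud
# Text-to-Speech AudioEncoding value named in the technical spec.
audio_encoding_for() {
  case $1 in
    mp3)        echo MP3 ;;
    wav)        echo LINEAR16 ;;   # 16-bit PCM, returned with a WAV header
    ogg|opus)   echo OGG_OPUS ;;   # Opus audio in an Ogg container
    ulaw|mulaw) echo MULAW ;;      # 8-bit G.711 mu-law, narrowband telephony
    alaw)       echo ALAW ;;       # 8-bit G.711 A-law, narrowband telephony
    *) echo "unsupported format: $1" >&2; return 1 ;;
  esac
}
```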
  • Use cases

    Developers building apps: Integrate the Google Text to Speech API for real-time voice feedback. Example prompt, using a Neural2 voice for natural app narration: "<speak>Welcome. Your balance is <break time="300ms"/> $150.</speak>"

    Content creators for e-learning: Generate multilingual lessons with WaveNet voices. Example prompt: "<speak><prosody rate="90%">Photosynthesis converts light to energy.</prosody></speak>", exported as MP3 for videos; support for 75+ languages lets the same workflow voice lessons in many languages.

    Marketers for announcements: Create branded IVR messages or ads with a custom pitch. Example: "<speak><prosody pitch="-2st" volume="+3dB">Sale ends soon. Shop now!</prosody></speak>", exported as OGG_OPUS for web streaming.

    Accessibility designers: Add read-aloud features with adjustable speed and pitch, deployed through each::labs, to make apps and documents more inclusive.
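    The prompts above use SSML. Google's SSML reference wraps each document in a single <speak> root element, and plain-text characters such as & and < have to be written as XML entities or the markup is invalid. A POSIX sh sketch of both steps, assuming the endpoint accepts SSML in its text field; `ssml_escape` and `ssml_wrap` are hypothetical helper names, and the escaping covers only &, <, > and double quotes.

```sh
# Replace &, <, > and " with XML entities. & goes first so the entities
# produced by the later substitutions are not escaped a second time.
ssml_escape() {
  printf '%s' "$1" | sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g' -e 's/"/\&quot;/g'
}

# Wrap an already-valid SSML fragment in the <speak> root element.
ssml_wrap() {
  printf '<speak>%s</speak>\n' "$1"
}
```

    Escape only the plain-text pieces and leave the tags you add yourself untouched, for example `ssml_wrap "Welcome to $(ssml_escape 'Q&A') <break time=\"300ms\"/> day."`.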

  • Tips & tricks

    Use SSML for precise control: <break time="1s"/> inserts a one-second pause, and <prosody rate="slow" pitch="-2st"> gives a slower, more formal tone, helping Google Text to Speech output flow naturally.

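    Select Neural2 voices such as "en-US-Neural2-F" (female) or "en-US-Neural2-D" (male). Set speaking_rate to about 0.85 for announcements or 1.1 for upbeat notifications; note that speaking_rate is a Google Cloud Text-to-Speech audio setting and does not appear in the each::labs request example at the top of this page.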

    Example 1: "<speak>Please be advised <break time="500ms"/> that system maintenance begins at midnight.</speak>" yields a professional pause.

    Example 2: "<speak><prosody rate="110%" pitch="+1st">Great news! Your order shipped.</prosody></speak>" creates an energetic delivery.

    Example 3: Request a 24kHz sample rate for high-quality WAV exports. When integrating through each::labs, start with short texts and iterate before sending longer scripts.
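    The break and prosody tags in these tips can also be generated instead of typed by hand, which avoids unbalanced quotes and unclosed tags. A sketch with hypothetical helpers `ssml_break` and `ssml_prosody`; the caller is assumed to pass text that is already escaped for XML.

```sh
# <break> with an explicit duration, e.g. ssml_break 500ms
ssml_break() {
  printf '<break time="%s"/>' "$1"
}

# <prosody> with rate and pitch attributes around a piece of text,
# e.g. ssml_prosody 110% +1st "Great news!"
ssml_prosody() {
  printf '<prosody rate="%s" pitch="%s">%s</prosody>' "$1" "$2" "$3"
}
```

    Example 1 then becomes `printf '<speak>Please be advised %s that system maintenance begins at midnight.</speak>\n' "$(ssml_break 500ms)"`.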

  • Technical spec
    • Voices: 380+ across 75+ languages and variants, including Standard, WaveNet, Neural2, and Studio tiers for varying quality levels.
    • Input: Plain text or SSML for pauses, pronunciation, tone, and emotion control.
    • Output Formats: MP3, WAV (LINEAR16), OGG_OPUS, MULAW, ALAW; sample rates up to 24kHz.
    • Customization: Speaking rate (0.25-4.0), pitch (-20 to +20 semitones), volume gain (-96 to +16 dB).
    • API Support: REST, gRPC; synchronous and streaming synthesis for real-time or batch use.
    • Processing: Low latency; scales to enterprise request volumes within per-project quotas.
    • Custom Voices: Train models with studio-quality audio recordings.
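    A sketch that checks the three numeric settings against the customization ranges above before a request is sent, so out-of-range values fail locally. `check_audio_config` is a hypothetical helper taking speaking rate, pitch (semitones) and volume gain (dB); the each::labs example request does not expose these fields, so this applies to direct Google Cloud API calls.

```sh
# Exit status 0 when all three values are numeric and in range; otherwise
# print one message per bad value to stderr and exit with status 1.
check_audio_config() {
  awk -v rate="$1" -v pitch="$2" -v gain="$3" '
    # true when x looks like a plain decimal number such as 0.85, -2 or +3
    function isnum(x) { return x ~ /^[-+]?([0-9]+\.?[0-9]*|\.[0-9]+)$/ }
    function fail(msg) { print msg > "/dev/stderr"; bad = 1 }
    BEGIN {
      bad = 0
      if (!isnum(rate)  || rate + 0 < 0.25 || rate + 0 > 4.0) fail("speaking rate must be 0.25 to 4.0: " rate)
      if (!isnum(pitch) || pitch + 0 < -20 || pitch + 0 > 20) fail("pitch must be -20 to +20 semitones: " pitch)
      if (!isnum(gain)  || gain + 0 < -96  || gain + 0 > 16)  fail("volume gain must be -96 to +16 dB: " gain)
      exit bad
    }'
}
```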

    Access via Google Cloud Text-to-Speech API on each::labs for high-fidelity audio output.

  • Things to be aware of

    Google Text to Speech may sound monotone in extended passages, especially with Standard voices; Neural2 or Studio voices mitigate this.

    Overly complex or malformed SSML can cause synthesis errors, so add markup incrementally and test each step. High request volumes need quota monitoring to avoid throttling.
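    When a request is throttled, retrying immediately tends to hit the same limit; waiting progressively longer between attempts is the usual remedy. A sketch of exponential backoff around any command; `retry_with_backoff` is a hypothetical helper, and the attempt count and base delay are illustrative, not documented quotas.

```sh
# Retry a command with exponential backoff: wait BASE_DELAY seconds after the
# first failure, then double the wait each time, up to MAX_ATTEMPTS tries.
# Usage: retry_with_backoff MAX_ATTEMPTS BASE_DELAY command [args...]
retry_with_backoff() {
  max=$1
  delay=$2
  shift 2
  attempt=1
  while :; do
    "$@" && return 0
    if [ "$attempt" -ge "$max" ]; then
      echo "retry_with_backoff: giving up after $attempt attempts" >&2
      return 1
    fi
    sleep "$delay"
    delay=$((delay * 2))      # whole seconds only: 1, 2, 4, 8, ...
    attempt=$((attempt + 1))
  done
}
```

    With curl, add `--fail` so that HTTP error responses such as 429 produce a non-zero exit status and count as failures. Be careful retrying a POST that may already have been accepted, since it can create a duplicate prediction.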

    Common mistakes: ignoring language codes leads to mismatched accents, so always specify the full voice name, such as "en-US-Neural2-F". Local resource needs are low, but API calls require a stable internet connection and authentication.
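    Most Google Cloud voice names follow the pattern language-REGION-Tier-Variant, so the language code can be read straight from the name and kept consistent with the voice. A sketch with hypothetical helpers `voice_language` and `voice_tier`. The single-word voices in the each::labs example (Despina, Zephyr, Puck) do not follow this pattern and are rejected, so this applies only to full Google Cloud voice names.

```sh
# Language code, e.g. en-US-Neural2-F -> en-US
voice_language() {
  case $1 in
    *-*-*-*) printf '%s\n' "$1" | cut -d- -f1-2 ;;
    *) echo "not a language-REGION-Tier-Variant voice name: $1" >&2; return 1 ;;
  esac
}

# Voice tier, e.g. en-US-Neural2-F -> Neural2
voice_tier() {
  case $1 in
    *-*-*-*) printf '%s\n' "$1" | cut -d- -f3 ;;
    *) echo "not a language-REGION-Tier-Variant voice name: $1" >&2; return 1 ;;
  esac
}
```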

    On Android, third-party speech engines can override the default; verify that Google's Speech Recognition and Synthesis engine is active.

  • Key considerations

    Before using Google Text to Speech, set up a Google Cloud account for API access. Free tiers are available, and paid pricing ranges from $0.004 to $0.016 per 1k characters, with WaveNet and Neural2 voices at the higher end for their superior quality.
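    With per-1k-character pricing, cost = characters ÷ 1000 × rate. Using the range quoted above, 250,000 characters cost $1.00 at $0.004 per 1k and $4.00 at $0.016 per 1k. A sketch of the same arithmetic; `estimate_cost` and `char_count` are hypothetical helpers, the rate is whatever currently applies to your voice tier, and whether SSML tags are billed as characters should be checked against current pricing.

```sh
# cost = characters / 1000 * price per 1k characters, rounded to cents
estimate_cost() {
  awk -v chars="$1" -v rate="$2" 'BEGIN { printf "%.2f\n", chars / 1000 * rate }'
}

# Characters in a text as wc -m counts them (locale-dependent; bytes in the C locale)
char_count() {
  printf '%s' "$1" | wc -m | tr -d ' '
}
```

    For example, `estimate_cost "$(char_count "$script")" 0.016` gives an upper-bound estimate for a script at the top of the quoted range.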

    Best for developer APIs, accessibility, and multilingual projects where integration with Google services matters; choose Standard voices for cost savings or Neural2/Studio for professional media.

    Prerequisites include API keys and basic coding knowledge for the Python or Node.js client libraries. The main tradeoff is that long-form output favors clarity over emotional depth compared with specialized TTS tools.

  • Limitations

    Google Text to Speech lacks deep emotional prosody in longer content and sounds less dynamic than specialized tools; Studio voices help but cost 10x more and support fewer languages.

    Zero-shot voice cloning is not available; custom voices require training data. Character limits apply per request, so split long texts across multiple requests or use streaming synthesis.
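    Splitting long plain text at word boundaries keeps each request under the per-request limit. A sketch assuming a limit of 5,000 bytes (Google Cloud's documented limit for synchronous requests at the time of writing; check the limits of the endpoint you call), with 4,500 as a default that leaves headroom. `chunk_text` is a hypothetical helper built on the POSIX `fold` utility. It is for plain text only, because splitting SSML can break tags; a single word longer than the limit is cut mid-word, and existing line breaks also start new chunks.

```sh
# Print one chunk per line, each at most WIDTH bytes (default 4500).
# fold -b counts bytes instead of columns; -s breaks after the last blank
# that fits, so words stay whole whenever possible.
chunk_text() {
  printf '%s\n' "$1" | fold -b -s -w "${2:-4500}"
}
```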

    Telephony formats like MULAW suit narrowband but reduce quality. Free tier has usage caps; enterprise scale requires paid plans.

     

FAQ

About Google · Text to Speech

What is Google Text to Speech?

Google Text to Speech is a text-to-speech model developed by Google that converts written text into natural-sounding audio. It supports multiple languages, accents, and voice types, including Standard and WaveNet voices, and produces high-quality audio suitable for applications, notifications, and accessibility tools.