inference · 1.6s

Kokoro 82M

Kokoro 82M is an advanced text-to-speech AI model designed to convert written text into natural-sounding voice output.

API reference

Runtime (p50): 21s
Estimated price: $0.000247 / sec
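
As a rough worked example of the two figures above, the sketch below multiplies the listed per-second price by a runtime. It assumes cost scales linearly with runtime in seconds, which this page does not state explicitly, and `estimate_cost` is an illustrative helper rather than part of any Eachlabs SDK.

```python
# Illustrative arithmetic only: assumes cost = price per second x runtime in seconds.
PRICE_PER_SECOND_USD = 0.000247  # "Estimated price" listed above
P50_RUNTIME_SECONDS = 21         # "Runtime (p50)" listed above


def estimate_cost(runtime_seconds, price_per_second=PRICE_PER_SECOND_USD):
    """Return the estimated cost in USD for a run of the given length."""
    if runtime_seconds < 0:
        raise ValueError("runtime_seconds must be non-negative")
    return runtime_seconds * price_per_second


if __name__ == "__main__":
    # A median-length run works out to roughly half a cent under this assumption.
    print(f"${estimate_cost(P50_RUNTIME_SECONDS):.6f}")
```

Actual billing may differ, so treat the result as an order-of-magnitude estimate.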

Call the API

prediction.sh

curl -X POST \
  -H "X-API-Key: $EACHLABS_API_KEY" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "kokoro-82m",
    "version": "0.0.1",
    "input": {
        "text": "Hi, welcome to Eachlabs AI! We are here to help you discover the power of artificial intelligence and provide you with the best experience.",
        "speed": 0.95,
        "voice": "am_michael"
    },
    "webhook_url": ""
}' \
  https://api.eachlabs.ai/v1/prediction/
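
The same call can be sketched in Python using only the standard library. The endpoint, header names, and payload fields are copied from the curl example above; the helper names `build_prediction_request` and `send_prediction` are hypothetical, and because the response schema is not documented on this page, the sketch returns the raw response body instead of parsing assumed fields.

```python
import json
import os
import urllib.request

API_URL = "https://api.eachlabs.ai/v1/prediction/"  # endpoint from the curl example


def build_prediction_request(text, voice="am_michael", speed=0.95, api_key=None):
    """Build (but do not send) a POST request mirroring the curl example."""
    payload = {
        "model": "kokoro-82m",
        "version": "0.0.1",
        "input": {"text": text, "speed": speed, "voice": voice},
        "webhook_url": "",
    }
    headers = {
        # Falls back to the same environment variable the curl example uses.
        "X-API-Key": api_key or os.environ.get("EACHLABS_API_KEY", ""),
        "Content-Type": "application/json",
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def send_prediction(request, timeout=60):
    """Send the request and return the raw response body as text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With a valid key in `EACHLABS_API_KEY`, `send_prediction(build_prediction_request("Hi, welcome to Eachlabs AI!"))` would issue the request. Keeping construction and sending separate makes the payload easy to inspect without touching the network.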

Documentation

Overview
kokoro-82m — Text-to-Voice AI Model

kokoro-82m from Kokoro is a compact text-to-speech model that converts written text into natural-sounding audio and is efficient enough to run on edge devices. The 82-million-parameter model reaches a reported 1,100 tokens per second of inference throughput on NVIDIA Jetson T4000 hardware, which allows real-time voice generation where larger TTS systems struggle. Part of the Kokoro family, kokoro-82m suits low-latency applications such as robotics and embedded systems and can be integrated through the kokoro-82m API. It was trained on under 100 hours of audio and supports multiple languages.

For users looking for open source text-to-speech software or a text-to-voice AI model, kokoro-82m prioritizes speed and naturalness in resource-constrained environments, which makes it well suited to on-device voice output without cloud dependency.
Capabilities
- High-Quality Synthesis: Produces lifelike, natural-sounding speech that closely mimics human intonation and rhythm.
- Flexible Parameter Control: Enables users to tailor outputs with adjustable speed and diverse voice options.
Use cases
Use Cases for kokoro-82m

Robotics developers integrate kokoro-82m for real-time voice responses, feeding prompts like "Status: battery at 75%, navigation complete" to generate natural alerts on NVIDIA Jetson edge devices, leveraging its 1,100 tokens/sec speed for lag-free interaction.

App builders creating open source text-to-speech software for mobile run kokoro-82m on ONNX Runtime to read notes aloud in multiple languages, converting e-books or user input into audio without cloud latency. The model's small size reflects its training on under 100 hours of audio.

Embedded system designers for industrial IoT use kokoro-82m in voice-enabled inspectors, synthesizing multilingual instructions from short text inputs to guide workers hands-free, capitalizing on its compact size for low-power deployment.

Content creators searching "TTS with kokoro" embed it in tools for quick audiobook prototypes, turning scripts into natural speech for testing narration styles across languages before full production.
Tips & tricks
How to Use kokoro-82m on Eachlabs

Access kokoro-82m through the Eachlabs Playground for instant text-to-voice testing, the API for production-scale apps, or the SDK for custom integrations. Supply a plain text prompt with language options and receive high-quality WAV audio, which suits developers building low-latency Kokoro text-to-voice solutions.
Technical spec
What Sets kokoro-82m Apart

kokoro-82m differentiates itself in the text-to-voice landscape by pairing a compact 82M-parameter size with high inference throughput: a reported 1,100 tokens/second on NVIDIA Jetson T4000, well above typical TTS models in edge AI benchmarks. This enables real-time synthesis on power-limited hardware, so developers can deploy Kokoro text-to-voice capabilities in robotics without giving up performance.

Unlike bulkier TTS systems that require extensive training data, kokoro-82m produces natural-sounding speech after training on under 100 hours of audio, and it supports multiple languages in a lightweight footprint compatible with ONNX Runtime. This allows quick deployment in local neural TTS systems, including "TTS with kokoro and onnx runtime" setups that prioritize efficiency over scale.
- Edge-Optimized Speed: Delivers 1,100 tokens/sec on Jetson T4000, enabling live voice feedback in robots or IoT devices, a benchmark edge over larger models like Qwen or Nemotron.
- Minimal Training Data: Achieves high-quality, multilingual output with <100 hours of audio, perfect for custom fine-tuning in open source text to speech projects.
- ONNX Compatibility: Runs efficiently via ONNX runtime, supporting fast local inference for "text-to-speech AI model" integrations without heavy dependencies.
Input accepts plain text prompts with optional language tags; outputs standard audio formats like WAV, with average processing under 1 second for short phrases on optimized hardware.
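
To put the throughput figure in perspective, the sketch below converts a token count into an estimated synthesis time. It assumes constant throughput at the quoted 1,100 tokens/sec and ignores model loading and other overhead; this page does not define what counts as a token, so the token count is left as a caller-supplied input, and `estimated_synthesis_seconds` is an illustrative name.

```python
REPORTED_TOKENS_PER_SECOND = 1100  # figure quoted above for Jetson T4000


def estimated_synthesis_seconds(token_count, tokens_per_second=REPORTED_TOKENS_PER_SECOND):
    """Estimate synthesis time, assuming constant throughput and no overhead."""
    if token_count < 0:
        raise ValueError("token_count must be non-negative")
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return token_count / tokens_per_second


if __name__ == "__main__":
    for tokens in (50, 550, 1100):
        print(tokens, "tokens ->", estimated_synthesis_seconds(tokens), "s")
```

Under these assumptions any input below 1,100 tokens finishes in under one second, which is consistent with the sub-second figure given for short phrases.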
Things to be aware of
- Create a fast-paced announcement by setting the speed to 1.3 and using concise text.
- Generate an audiobook snippet by selecting a steady speed (e.g., 1.0) and a calm voice.
- Test how punctuation affects output by trying variations like pauses (commas) or emphasis (exclamation points).
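
The three experiments above can be expressed as `input` objects for the API. This is a minimal sketch: the field names follow the curl example, `make_input` is a hypothetical helper, and `am_michael` is used throughout only because it is the one voice identifier shown on this page (a calmer voice would need a different, unlisted identifier).

```python
def make_input(text, voice="am_michael", speed=1.0):
    """Build the "input" object used in the prediction payload."""
    return {"text": text, "speed": speed, "voice": voice}


# Fast-paced announcement: concise text, speed raised to 1.3.
announcement = make_input("Doors close in two minutes.", speed=1.3)

# Audiobook snippet: steady speed of 1.0.
audiobook = make_input("The rain had stopped by morning.", speed=1.0)

# Punctuation experiment: same words, different punctuation.
punctuation_variants = [
    make_input(text)
    for text in (
        "Wait is that right",
        "Wait, is that right?",
        "Wait! Is that right?!",
    )
]

if __name__ == "__main__":
    for item in [announcement, audiobook, *punctuation_variants]:
        print(item)
```

Comparing the audio generated from each variant shows how pauses and emphasis change with punctuation alone.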
Key considerations
- Text Structure Matters: Ensure that the input text is grammatically correct and well-structured to produce the best audio output.
- Speed Extremes: Setting the speed parameter too high or too low may reduce intelligibility. Moderate adjustments are recommended.
- Output Consistency: Shorter sentences and clear punctuation improve clarity and reduce the risk of unnatural pauses.
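
One way to apply these considerations before calling the API is to split long text into sentences and keep speed within a moderate band. The sketch below is illustrative only: the speed bounds are hypothetical values chosen for the example, since this page gives no official range, and the sentence splitter is a simple punctuation heuristic.

```python
import re

# Hypothetical bounds for a "moderate" speed; not an official range.
MIN_SPEED = 0.8
MAX_SPEED = 1.3


def clamp_speed(speed):
    """Keep speed within the assumed moderate band."""
    return max(MIN_SPEED, min(MAX_SPEED, speed))


def split_sentences(text):
    """Split after ., ! or ? when followed by whitespace (a simple heuristic)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


if __name__ == "__main__":
    text = "Battery at 75%. Navigation complete! Ready for the next task?"
    for sentence in split_sentences(text):
        print(clamp_speed(1.6), sentence)
```

The heuristic will mis-split abbreviations such as "e.g. this", so text with many abbreviations may need a more careful splitter.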
Limitations
- Text Complexity: While highly capable, overly intricate or poorly formatted text may result in suboptimal audio.
- Speed and Comprehension: Extreme speed settings can hinder clarity and make the output difficult to understand.
- Voice Availability: The pre-trained voices, while diverse, might not cover every niche use case or accent preference.
Output Format: WAV
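
Since the output is WAV, a downloaded file can be inspected with Python's standard `wave` module. The sketch below assumes the audio is PCM-encoded, which is the only encoding `wave` reads, and this page does not confirm the encoding or sample rate. `make_silent_wav` only generates stand-in data for the demonstration, and its 24000 Hz default is an arbitrary choice.

```python
import io
import wave


def wav_duration_seconds(wav_bytes):
    """Return the duration of PCM WAV data in seconds."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def make_silent_wav(seconds, sample_rate=24000):
    """Create mono 16-bit silent WAV bytes, used here only as stand-in data."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    return buffer.getvalue()


if __name__ == "__main__":
    print(wav_duration_seconds(make_silent_wav(2.0)), "seconds")
```

For real output, pass the bytes of the downloaded file to `wav_duration_seconds` in place of the stand-in data.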