ELEVENLABS
Accurately converts spoken audio into written text. Fast, reliable, and ideal for transcripts, captions, and voice-based input.
Official Partner
Avg Run Time: 10.000s
Model Slug: elevenlabs-speech-to-text
Playground
Input
Enter a URL or choose a file from your computer.
(Max 50MB)
Output
Example Result
Preview and download your result.
API & SDK
Create a Prediction
Send a POST request to create a new prediction. This will return a prediction ID that you'll use to check the result. The request should include your model inputs and API key.
Get Prediction Result
Poll the prediction endpoint with the prediction ID until the result is ready, repeating the request until it returns a success status.
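A minimal polling loop might look like the sketch below. It takes the fetch step as a callable so the loop itself is transport-agnostic; the status values `"success"` and `"error"` are assumptions, so confirm the real status names in the Eachlabs API reference.

```python
import time
from typing import Callable

def poll_prediction(fetch: Callable[[], dict],
                    interval_s: float = 1.0,
                    timeout_s: float = 120.0) -> dict:
    """Call `fetch` (a function that GETs the prediction by ID and
    returns the decoded JSON) until the status settles or we time out.

    The terminal status names ("success", "error") are assumed here.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        result = fetch()
        if result.get("status") in ("success", "error"):
            return result
        time.sleep(interval_s)
    raise TimeoutError("prediction did not finish in time")
```

In production you would pass a `fetch` that issues the GET request for your prediction ID and add backoff or jitter to the sleep interval.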
Readme
Overview
elevenlabs-speech-to-text — Voice-to-Text AI Model
elevenlabs-speech-to-text, powered by ElevenLabs' Scribe v2 architecture, delivers ultra-accurate speech-to-text transcription across 90+ languages, solving the real-time audio processing challenges developers face when building voice agents and live captioning tools. This voice-to-text AI model excels at handling natural speech nuances such as pauses, filler words, and accents, making it ideal for transcripts, subtitles, and conversational AI. With support for PCM audio from 8 kHz to 48 kHz and μ-law encoding, elevenlabs-speech-to-text is compatible with telephony, web, and professional setups, enabling fast, reliable conversion of spoken audio into written text.
Technical Specifications
What Sets elevenlabs-speech-to-text Apart
elevenlabs-speech-to-text stands out in the voice-to-text AI models comparison with its Scribe v2 real-time capabilities, achieving 93.5% accuracy on the FLEURS benchmark and outperforming competitors like Google's Gemini Flash in low-latency scenarios. This enables developers to build responsive apps for live subtitling or agents without transcription delays.
Keyterm prompting biases the model toward specific terms like product names or jargon using up to 100 contextual cues, far surpassing basic vocabularies in other models. Users gain precise handling of domain-specific vocabulary in technical transcripts or medical dictations.
Predictive transcription and Voice Activity Detection (VAD) anticipate words, detect speech boundaries, and filter noise, supporting multichannel up to 5 channels without diarization. This powers robust Elevenlabs voice-to-text for noisy environments or multi-speaker telephony.
- 90+ Languages & Accents: Covers English, Hindi, Mandarin, and more with adaptive accuracy (high-accuracy tier: <5% WER; others up to 10%).
- Real-Time Low Latency: Processes streaming audio with manual commit control and text conditioning for interruptions.
- Audio Formats: PCM 8-48kHz, μ-law; outputs words, spacing, and audio events like laughter.
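Telephony audio typically arrives as G.711 μ-law, which packs 14-bit samples into 8 bits on a logarithmic scale. If you need to inspect or convert such audio before sending it, the standard decode step looks like this (a sketch of the G.711 expansion, not part of any Eachlabs SDK):

```python
def ulaw_to_pcm16(byte: int) -> int:
    """Decode one G.711 mu-law byte to a signed 16-bit PCM sample."""
    u = ~byte & 0xFF                 # mu-law bytes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude
```

Since the model accepts μ-law directly, this conversion is only needed when your own pipeline requires linear PCM (e.g. for local analysis or resampling).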
Key Considerations
- Ensure high-quality audio input for optimal transcription accuracy; background noise and low sample rates can reduce performance
- Use language and accent settings to improve recognition for multilingual or accented speakers
- For real-time applications, leverage the streaming API for low-latency transcription
- Advanced features like voice cloning and AI dubbing require additional configuration and may impact processing speed
- Balance quality and speed by selecting appropriate model variants (e.g., Flash v2.5 for low latency)
- Avoid overloading the model with long, unsegmented audio files; segment audio for better results
- Prompt engineering: Provide clear context or speaker labels when transcribing multi-speaker audio
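The segmentation advice above can be as simple as splitting raw PCM into fixed-length, frame-aligned chunks. The sketch below does exactly that; a production pipeline would instead cut at silence boundaries (e.g. using a VAD) to avoid splitting words mid-chunk.

```python
def chunk_pcm(data: bytes, sample_rate: int, chunk_seconds: float = 30.0,
              sample_width: int = 2, channels: int = 1) -> list[bytes]:
    """Split raw PCM bytes into fixed-length chunks aligned to whole frames.

    chunk_seconds=30.0 is an arbitrary example value, not a documented limit.
    """
    frame = sample_width * channels          # bytes per audio frame
    step = int(sample_rate * chunk_seconds) * frame
    return [data[i:i + step] for i in range(0, len(data), step)]
```

Each chunk can then be submitted as its own prediction, and the transcripts concatenated in order.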
Tips & Tricks
How to Use elevenlabs-speech-to-text on Eachlabs
Access elevenlabs-speech-to-text on Eachlabs via the Playground for instant testing: upload audio files (PCM 8-48 kHz or μ-law), then set keyterms, language, and multichannel options to get text output with words, spacing, and audio events. For production apps, integrate through the API or SDK, specifying keyterm prompts for biasing and webhooks for asynchronous results, to deliver high-accuracy transcripts optimized for real-time or batch use.
Capabilities
- Converts spoken audio to highly accurate written text across 90+ languages
- Supports real-time transcription with low latency (as low as 75ms)
- Handles diverse accents and speech patterns with high context awareness
- Offers voice cloning and AI dubbing for customized voice outputs
- Provides a large library of voice profiles for expressive and emotive speech synthesis
- Delivers high-fidelity outputs suitable for professional transcripts, captions, and voice-based input
- Adaptable to various domains, including media, education, customer service, and accessibility
What Can I Use It For?
Use Cases for elevenlabs-speech-to-text
Developers building conversational AI agents can feed live audio streams into elevenlabs-speech-to-text via the elevenlabs-speech-to-text API, leveraging predictive transcription and keyterm prompting for names like "Scribe v2" to ensure accurate, context-aware responses even with accents or noise. This creates seamless voice assistants handling interruptions without losing coherence.
Content creators producing multilingual podcasts or videos use it for automated subtitling, uploading batch audio in supported formats to generate timed transcripts across 90+ languages, with VAD filtering background sounds for clean outputs. For example, input a Hindi interview clip with the prompt keyterms "AI transcription, low latency" to bias toward technical accuracy.
Enterprise teams in healthcare or call centers apply multichannel support for up to 5 lines, transcribing telephony μ-law audio independently per channel to log customer interactions reliably. It captures jargon like medication names precisely, streamlining compliance and analysis.
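To transcribe each telephony line independently, multichannel audio first needs to be de-interleaved into one stream per channel. A minimal sketch over raw PCM frames (function name and defaults are illustrative, not part of any SDK):

```python
def split_channels(frames: bytes, channels: int, sample_width: int = 2) -> list[bytes]:
    """De-interleave raw PCM frames into one byte string per channel."""
    out = [bytearray() for _ in range(channels)]
    frame = channels * sample_width          # bytes per interleaved frame
    for i in range(0, len(frames), frame):
        for c in range(channels):
            start = i + c * sample_width
            out[c] += frames[start:start + sample_width]
    return [bytes(b) for b in out]
```

Each resulting channel can then be submitted as a separate prediction, keeping caller and agent transcripts cleanly separated without diarization.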
Marketers running global campaigns integrate elevenlabs-speech-to-text for real-time translation demos, processing live web audio with automatic language detection to produce captions in languages like Japanese or Swahili.
Things to Be Aware Of
- Some advanced features (e.g., low-latency models, voice cloning) may require higher-tier access or additional configuration
- Occasional synthetic artifacts or misrecognition in challenging audio conditions (e.g., heavy background noise)
- Users report best results with clean, high-quality audio and explicit language settings
- Streaming API enables real-time transcription but may require robust infrastructure for large-scale deployments
- Resource requirements can be significant for high-volume or high-fidelity applications
- Positive feedback highlights naturalness, emotional range, and multilingual versatility
- Common concerns include pricing for advanced features and occasional latency spikes in heavy usage scenarios
Limitations
- Requires high-quality audio input for optimal accuracy; performance degrades with noisy or low-resolution audio
- Not designed for deep knowledge base integration or post-call analytics; primarily focused on transcription and voice synthesis
- May not be optimal for highly specialized domains requiring domain-specific vocabulary or context-aware conversation management
Pricing
Pricing Detail
This model runs at a cost of $0.005500 per execution.
Pricing Type: Fixed
The cost remains the same regardless of your inputs or how long the run takes; there are no variables affecting the price. It is a set, fixed amount per execution, which makes budgeting simple and predictable: you pay the same fee every time you run the model.
Related AI Models
You can seamlessly integrate advanced AI capabilities into your applications without the hassle of managing complex infrastructure.
