ELEVENLABS
ElevenLabs Speech-to-Text Scribe v2 is a high-accuracy speech recognition model that converts audio into text with strong precision and multilingual support.
Avg Run Time: 20.000s
Model Slug: elevenlabs-speech-to-text-scribe-v2
API & SDK
Create a Prediction
Send a POST request to create a new prediction. This will return a prediction ID that you'll use to check the result. The request should include your model inputs and API key.
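As a sketch of that flow, the snippet below assembles the creation request. The endpoint URL, the `X-API-Key` header name, and the input field names are assumptions for illustration; check the Eachlabs API reference for the exact schema.

```python
import json
import urllib.request

# Hypothetical endpoint -- confirm against the Eachlabs API docs.
API_URL = "https://api.eachlabs.ai/v1/prediction/"

def build_create_request(api_key: str, audio_url: str, keyterms=None):
    """Assemble the POST request that creates a new prediction."""
    body = {
        "model": "elevenlabs-speech-to-text-scribe-v2",
        "input": {"audio": audio_url},
    }
    if keyterms:  # optional vocabulary biasing
        body["input"]["keyterms"] = keyterms
    headers = {
        "Content-Type": "application/json",
        "X-API-Key": api_key,  # assumed header name
    }
    return urllib.request.Request(
        API_URL, data=json.dumps(body).encode(), headers=headers, method="POST"
    )

req = build_create_request(
    "YOUR_API_KEY",
    "https://example.com/meeting.mp3",
    keyterms=["CRISPR", "quantum entanglement"],
)
# urllib.request.urlopen(req) would return JSON containing the prediction ID.
```

The network call itself is left to you so the payload construction can be inspected and tested in isolation.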
Get Prediction Result
Poll the prediction endpoint with the prediction ID until the result is ready. The API uses long-polling, so you'll need to repeatedly check until you receive a success status.
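A minimal polling loop might look like the following. The status names ("success", "error") are assumptions based on typical prediction APIs; the fetch callable is injected so any HTTP client can be plugged in.

```python
import time

def poll_prediction(fetch, prediction_id, interval=2.0, timeout=120.0):
    """Repeatedly fetch the prediction until it succeeds, fails, or times out.

    `fetch` is any callable that takes a prediction ID and returns the
    decoded JSON status dict (wrap your HTTP client of choice).
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = fetch(prediction_id)
        status = result.get("status")
        if status == "success":
            return result
        if status == "error":
            raise RuntimeError(result.get("error", "prediction failed"))
        time.sleep(interval)
    raise TimeoutError(f"prediction {prediction_id} not ready after {timeout}s")
```

Using `time.monotonic()` for the deadline keeps the timeout correct even if the system clock is adjusted mid-poll.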
Readme
Overview
elevenlabs-speech-to-text-scribe-v2 — Voice-to-Text AI Model
elevenlabs-speech-to-text-scribe-v2, ElevenLabs' official Scribe v2 model, delivers the most accurate batch speech-to-text transcription for long-form audio, achieving the lowest word error rate on industry benchmarks across 90+ languages. This voice-to-text AI model excels at handling complex real-world conditions like pauses, tone changes, diverse accents, and extended silences, making it ideal for subtitling, captioning, and large-scale audio processing. Developers seeking ElevenLabs voice-to-text solutions for multilingual workflows find unmatched reliability in elevenlabs-speech-to-text-scribe-v2, which automates precise text conversion from audio inputs without manual segmentation.
Technical Specifications
What Sets elevenlabs-speech-to-text-scribe-v2 Apart
elevenlabs-speech-to-text-scribe-v2 stands out in the voice-to-text AI model landscape with its optimization for batch processing of long, complex recordings, outperforming predecessors in stability and accuracy for diverse speakers and noisy environments. This enables enterprises to transcribe hours of audio reliably, scaling subtitling for media libraries or compliance reviews without quality drops.
Keyterm prompting allows up to 100 custom words or phrases, using transcript context for precise insertion in technical domains like brand names or jargon—far beyond basic custom vocabulary. Users gain accurate handling of industry-specific terms, streamlining transcription for research or training content.
Native entity detection across 56 categories, including PII, health data, and payment details with exact timestamps, supports secure audio analysis. This facilitates automated redaction and compliance in global workflows, a critical edge for developers building ElevenLabs speech-to-text API integrations.
- Smart multi-language detection transcribes mixed-language audio files automatically, supporting 90+ languages with low WER for high-accuracy-tier languages such as Hindi, Mandarin, and Spanish.
- Multichannel support for up to 5 channels assigns speaker IDs independently, ideal for meetings or podcasts.
- Output includes timed words, spacing, and audio events like laughter, in JSON format for easy parsing.
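To make the output format concrete, here is a small parsing sketch. The field names (`words`, `start`, `type`, `speaker_id`) are assumptions about the JSON shape described above; inspect a real response before relying on them.

```python
# Hypothetical response fragment matching the description above.
sample = {
    "words": [
        {"text": "Hello", "start": 0.00, "end": 0.42, "type": "word", "speaker_id": "speaker_0"},
        {"text": " ",     "start": 0.42, "end": 0.50, "type": "spacing", "speaker_id": "speaker_0"},
        {"text": "there", "start": 0.50, "end": 0.90, "type": "word", "speaker_id": "speaker_0"},
        {"text": "(laughter)", "start": 1.10, "end": 1.80, "type": "audio_event", "speaker_id": "speaker_1"},
    ]
}

def plain_text(result):
    """Join word and spacing tokens, skipping non-speech audio events."""
    return "".join(w["text"] for w in result["words"] if w["type"] != "audio_event")

def events(result):
    """Collect (start_time, label) pairs for audio events like laughter."""
    return [(w["start"], w["text"]) for w in result["words"] if w["type"] == "audio_event"]

print(plain_text(sample))  # -> Hello there
print(events(sample))      # -> [(1.1, '(laughter)')]
```

Because every token carries timestamps and a speaker ID, the same word list can drive subtitles, diarized transcripts, or event indexes without re-running the model.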
Key Considerations
- Ensure high-quality audio input for optimal word error rate performance, as noisy environments may impact accuracy
- Use keyterm prompting for domain-specific vocabulary like medical or brand terms to improve precision
- Prefer batch processing over the realtime variant when latency is not critical, as batch mode delivers higher accuracy in non-live scenarios
- Test across languages early, as automatic multi-language detection works best with clear speaker separation
- Avoid common pitfalls like overloading prompts with too many keyterms, which can dilute focus
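Since the model caps keyterm prompts at 100 entries and overloaded prompts dilute focus, it can help to sanitize the list client-side before submitting. This is an illustrative helper, not part of any official SDK:

```python
MAX_KEYTERMS = 100  # documented limit for keyterm prompting

def prepare_keyterms(terms):
    """Strip whitespace, drop case-insensitive duplicates, enforce the cap."""
    cleaned, seen = [], set()
    for t in terms:
        t = t.strip()
        if t and t.lower() not in seen:
            seen.add(t.lower())
            cleaned.append(t)
    if len(cleaned) > MAX_KEYTERMS:
        raise ValueError(
            f"keyterm list has {len(cleaned)} entries; limit is {MAX_KEYTERMS}"
        )
    return cleaned

print(prepare_keyterms(["CRISPR", " crispr ", "BrandX"]))  # -> ['CRISPR', 'BrandX']
```

Failing fast on an oversized list is cheaper than discovering a rejected or degraded transcription after the job runs.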
Tips & Tricks
How to Use elevenlabs-speech-to-text-scribe-v2 on Eachlabs
Access elevenlabs-speech-to-text-scribe-v2 seamlessly on Eachlabs via the Playground for instant testing, API for batch integrations, or SDK for custom apps. Upload audio files, set parameters like keyterms, entity detection categories, or multichannel mode, and receive JSON outputs with timed transcripts, words, and events in high-accuracy text.
Capabilities
- High-accuracy transcription with lowest industry benchmark word error rates
- Realtime processing at sub-150ms latency with 93.5% accuracy for conversational AI
- Multilingual support for 90+ languages with automatic detection in single files
- Keyterm prompting for customized vocabulary biasing (e.g., product names, medical terms)
- Entity detection across 56 categories for structured output
- Versatile for both live and batch transcription with strong precision
What Can I Use It For?
Use Cases for elevenlabs-speech-to-text-scribe-v2
Media teams managing video libraries use elevenlabs-speech-to-text-scribe-v2 for batch subtitling, feeding long-form content with mixed accents and pauses to generate precise, timestamped transcripts across 90+ languages—perfect for global distribution without manual edits.
Developers building compliance tools leverage its entity detection, uploading audio files to automatically flag and timestamp sensitive data like SSNs or medical terms, enabling secure redaction in enterprise pipelines via the elevenlabs-speech-to-text-scribe-v2 API.
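A redaction step downstream of entity detection might look like this sketch. The span shape (`start`/`end` character offsets plus a `category`) is an assumption for illustration; the real response carries timestamps and categories as described above.

```python
def redact(text, entities):
    """Replace detected entity spans with their category tag.

    `entities` is a list of {"start": int, "end": int, "category": str}
    spans over `text`. Spans are applied right-to-left so earlier
    character offsets remain valid as the string shrinks or grows.
    """
    for e in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[: e["start"]] + f"[{e['category']}]" + text[e["end"]:]
    return text

line = "My SSN is 123-45-6789 and I take metformin."
spans = [
    {"start": 10, "end": 21, "category": "US_SSN"},
    {"start": 33, "end": 42, "category": "MEDICATION"},
]
print(redact(line, spans))
# -> My SSN is [US_SSN] and I take [MEDICATION].
```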
Researchers transcribing interviews apply keyterm prompting; for example, provide an audio file with the prompt specifying terms like "quantum entanglement" or "CRISPR variants," and the model contextually inserts them accurately, handling technical jargon in multilingual discussions.
Marketers creating multilingual campaigns process podcast episodes with smart language detection, converting diverse speaker audio into structured text for captioning, supporting use cases like automated content localization for international audiences.
Things to Be Aware Of
- Realtime variant excels in low-latency scenarios but may require optimized conditions for 30-80ms performance
- Strong positive feedback on benchmark-leading accuracy across 90+ languages in recent announcements
- Users note reliable entity detection for 56 categories, enhancing structured outputs
- Resource needs are efficient for realtime use, suitable for conversational applications
- Community feedback highlights that it consistently outperforms established models like Whisper
- Some discussions emphasize testing in noisy real-world audio for edge case robustness
Limitations
- Specific parameter counts and full architectural details not publicly available, limiting custom fine-tuning insights
- Performance in extremely noisy or accented speech may vary, though benchmarks show overall superiority; real-world testing recommended
- Primarily optimized for transcription accuracy and speed, with less emphasis on advanced post-processing features in current docs
Pricing
Pricing Type: Dynamic
$0.22/hour base rate
Current Pricing
Pricing Rules
| Condition | Pricing |
|---|---|
| keyterms and entity_detection both enabled | $0.33/hour (base $0.22 + 20% keyterm premium + 30% entity premium, additive) |
| entity_detection enabled | $0.286/hour (base $0.22 + 30% entity premium) |
| keyterms enabled | $0.264/hour (base $0.22 + 20% keyterm premium) |
| Default (Rule 4, active) | $0.22/hour base rate |
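The additive premium structure can be sketched as a small rate calculator, using only the figures from the pricing rules above:

```python
BASE_RATE = 0.22  # $/hour base rate

def hourly_rate(keyterms=False, entity_detection=False):
    """Effective $/hour: premiums are additive, +20% for keyterms
    and +30% for entity detection, applied to the base rate."""
    multiplier = 1.0
    if keyterms:
        multiplier += 0.20
    if entity_detection:
        multiplier += 0.30
    return round(BASE_RATE * multiplier, 3)

print(hourly_rate())                                       # 0.22
print(hourly_rate(keyterms=True))                          # 0.264
print(hourly_rate(entity_detection=True))                  # 0.286
print(hourly_rate(keyterms=True, entity_detection=True))   # 0.33
```

Note the premiums stack additively (base × 1.5), not multiplicatively (base × 1.2 × 1.3), which is why the combined rate is $0.33 rather than $0.3432.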
Dev questions, real answers.
ElevenLabs Scribe v2 is an advanced speech-to-text model by ElevenLabs that delivers highly accurate audio transcription across multiple languages. It supports speaker diarization, punctuation, and timestamp generation, producing clean, structured transcripts suitable for professional and enterprise use.
ElevenLabs Scribe v2 is accessible through the Eachlabs unified API. Submit an audio file; the model returns a structured JSON transcript with speaker labels and timestamps. Billing is pay-as-you-go through Eachlabs; no separate ElevenLabs subscription is required.
ElevenLabs Scribe v2 is best suited for podcast transcription, meeting notes generation, and multilingual audio indexing. Its speaker diarization capability makes it particularly valuable for interview transcription and multi-speaker content where attribution accuracy is important.
