WHISPER
Whisper Large V3 Turbo delivers blazing-fast audio transcription with speaker diarization, converting conversations into accurate text with word- and sentence-level timestamps.
Avg Run Time: 8.000s
Model Slug: whisper-diarization
API & SDK
Create a Prediction
Send a POST request to create a new prediction. This will return a prediction ID that you'll use to check the result. The request should include your model inputs and API key.
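For illustration, here is a minimal Python sketch of the creation step using the requests library. The base URL, header name, and response fields are placeholders, not this platform's documented API; substitute the values from the actual API reference.

```python
# Sketch of creating a prediction; endpoint URL, header, and payload/response
# fields below are illustrative placeholders, not documented values.
import requests

API_KEY = "your-api-key"                  # use your real API key
BASE_URL = "https://api.example.com/v1"   # placeholder base URL

response = requests.post(
    f"{BASE_URL}/predictions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "whisper-diarization",
        "input": {"audio": "https://example.com/meeting.mp3"},  # URL or uploaded file
    },
    timeout=30,
)
response.raise_for_status()
prediction_id = response.json()["id"]  # assumed response field
print("created prediction:", prediction_id)
```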
Get Prediction Result
Poll the prediction endpoint with the prediction ID until the result is ready. The API uses long-polling, so you'll need to repeatedly check until you receive a success status.
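A matching polling loop might look like the sketch below; the endpoint path and status values are assumptions in the same spirit as the creation sketch above.

```python
# Polling-loop sketch; endpoint path and status strings are assumptions.
import time
import requests

API_KEY = "your-api-key"
BASE_URL = "https://api.example.com/v1"   # placeholder base URL

def wait_for_result(prediction_id: str, poll_interval: float = 2.0) -> dict:
    """Repeatedly fetch the prediction until it reaches a terminal status."""
    while True:
        resp = requests.get(
            f"{BASE_URL}/predictions/{prediction_id}",
            headers={"Authorization": f"Bearer {API_KEY}"},
            timeout=30,
        )
        resp.raise_for_status()
        body = resp.json()
        if body.get("status") == "success":   # assumed status name
            return body["output"]
        if body.get("status") == "failed":    # assumed status name
            raise RuntimeError(body.get("error", "prediction failed"))
        time.sleep(poll_interval)
```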
Readme
Overview
whisper-diarization is an open-source project that pairs OpenAI's Whisper speech-to-text model with speaker diarization, automatically transcribing audio conversations with speaker labels and timestamps. Developed by the open-source community, it builds on Whisper Large V3 Turbo, an optimized variant of OpenAI's Whisper Large V3 that trims the decoder from 32 layers to 4, yielding 6x faster inference while staying within 1-2% of the full model's accuracy. The project has gained significant traction, with over 5.2k stars on GitHub, reflecting its popularity for robust speech recognition and diarization tasks.
Key features include blazing-fast audio transcription supporting 99+ languages, speaker diarization to distinguish multiple speakers in conversations, and word- and sentence-level timestamps for precise alignment. It leverages Whisper's transformer encoder-decoder architecture trained on 680,000 hours of multilingual web audio, making it robust to background noise, accents, and technical jargon. What makes it unique is the combination of Whisper Large V3 Turbo's efficiency (809 million parameters, ~6GB VRAM, 216x real-time speed on optimized hardware) with diarization, allowing seamless conversion of multi-speaker audio into structured, timestamped text without significant accuracy loss.
The underlying technology uses Whisper for transcription and integrates frameworks like Emilia for speaker diarization, source separation, and voice activity detection, processing audio into clean single-speaker clips with transcriptions. This makes whisper-diarization ideal for real-world applications requiring both speed and speaker attribution in diverse acoustic environments.
Technical Specifications
- Architecture: Transformer encoder-decoder (Whisper Large V3 Turbo with 4 decoder layers, integrated diarization); a loading sketch follows this list
- Parameters: 809 million (Whisper Large V3 Turbo base)
- Feature resolution: 128-bin Mel spectrogram input
- Input/Output formats: Audio input (supports 99+ languages), output as timestamped text with speaker labels
- Performance metrics: WER 7.75% on mixed benchmarks, 216x real-time inference speed, ~6GB VRAM, within 1-2% of Whisper Large V3 accuracy
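To make the specifications above concrete, here is a minimal sketch of running the underlying Turbo checkpoint locally with Hugging Face transformers. Note this reproduces only the transcription half; whisper-diarization's speaker-labeling layer sits on top of it.

```python
# Transcription-only sketch using the public openai/whisper-large-v3-turbo
# checkpoint; the hosted model adds speaker diarization on top of this.
import torch
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3-turbo",
    torch_dtype=torch.float16,   # fits in ~6GB VRAM as noted above
    device="cuda:0",             # use "cpu" if no GPU is available
    chunk_length_s=30,           # Whisper's native 30-second window
)

result = asr("meeting.wav", return_timestamps=True)
print(result["text"])
for chunk in result["chunks"]:
    print(chunk["timestamp"], chunk["text"])
```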
Key Considerations
- Important factors: Ensure high-quality input audio; use GPU with at least 6GB VRAM for optimal speed
- Best practices: Apply voice cleaning and source separation preprocessing for better diarization accuracy (a toy voice-activity sketch follows this list)
- Common pitfalls: Avoid low-quality or noisy audio without preprocessing, as it can increase insertion errors
- Quality vs speed trade-offs: Turbo variant sacrifices minimal accuracy (1-2%) for 6x speed over full Large V3
- Prompt engineering tips: Not applicable as it's transcription-focused; focus on audio chunking for long-form content to reduce repetitions
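To illustrate why the preprocessing advice above matters, here is a toy energy-based voice-activity check. It is a deliberately simple stand-in for the dedicated VAD and source-separation models a real pipeline would use; all names and thresholds are illustrative.

```python
# Toy energy-based voice-activity check: noisy or silent frames below the
# RMS threshold would be dropped before transcription/diarization.
import numpy as np

def active_frames(samples: np.ndarray, sr: int, frame_ms: int = 30,
                  threshold: float = 0.01) -> list[bool]:
    """Flag frames whose RMS energy exceeds a threshold."""
    frame_len = int(sr * frame_ms / 1000)
    flags = []
    for start in range(0, len(samples) - frame_len, frame_len):
        frame = samples[start:start + frame_len]
        flags.append(float(np.sqrt(np.mean(frame ** 2))) > threshold)
    return flags

# Synthetic demo: 1s of near-silence followed by 1s of "speech" (a tone).
sr = 16000
silence = np.random.randn(sr).astype(np.float32) * 0.001
speech = np.sin(2 * np.pi * 220 * np.arange(sr) / sr).astype(np.float32) * 0.1
flags = active_frames(np.concatenate([silence, speech]), sr)
print(f"{sum(flags)}/{len(flags)} frames flagged as speech")
```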
Tips & Tricks
- Optimal parameter settings: Use quantization for up to 6x faster inference with minimal accuracy loss
- Prompt structuring advice: Segment long audio into 30-second clips so processing matches the model's training window (see the splitting sketch after this list)
- How to achieve specific results: Integrate with Emilia framework for source separation and VAD to improve diarization on multi-speaker audio
- Iterative refinement strategies: Post-process transcriptions with filtering (e.g., WER-based or DNSMOS >3) to ensure quality
- Advanced techniques: Combine with hybrid models for real-time streaming, using faster-whisper engines followed by diarization
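As one concrete take on the 30-second segmentation tip, the sketch below splits a long recording into clips with pydub; the library choice and file names are illustrative assumptions.

```python
# Split long-form audio into 30-second clips (the window Whisper was
# trained on) using pydub; any audio library would work equally well.
from pydub import AudioSegment

CLIP_MS = 30 * 1000  # 30-second windows, matching Whisper's training data

audio = AudioSegment.from_file("long_recording.mp3")
clips = [audio[start:start + CLIP_MS] for start in range(0, len(audio), CLIP_MS)]

for i, clip in enumerate(clips):
    clip.export(f"clip_{i:04d}.wav", format="wav")
```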
Capabilities
- Accurate transcription of multilingual audio (99+ languages) with robustness to noise, accents, and jargon
- Speaker diarization to label and separate multiple speakers in conversations
- Word- and sentence-level timestamps for precise alignment and searchability (a formatting sketch follows this list)
- High-speed inference (216x real-time on optimized setups, 6x faster than base Whisper Large V3)
- Versatile handling of diverse audio from web-scale training data (680k hours)
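As an illustration of consuming timestamped, speaker-labeled output, the sketch below renders segments as a readable transcript. The segment shape used here (start/end seconds, speaker label, text) is a hypothetical example, so check it against the actual response schema.

```python
# Render diarized segments as a readable transcript; the segment dicts
# below are hypothetical sample output, not the documented schema.
segments = [
    {"start": 0.0, "end": 3.2, "speaker": "SPEAKER_00", "text": "Hi, thanks for joining."},
    {"start": 3.4, "end": 6.1, "speaker": "SPEAKER_01", "text": "Happy to be here."},
]

def fmt(seconds: float) -> str:
    """Format seconds as MM:SS for display."""
    m, s = divmod(int(seconds), 60)
    return f"{m:02d}:{s:02d}"

for seg in segments:
    print(f"[{fmt(seg['start'])}-{fmt(seg['end'])}] {seg['speaker']}: {seg['text']}")
```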
What Can I Use It For?
- Professional applications: Enterprise workflows for transcription, diarization, and summarization in multi-model pipelines
- Creative projects: Real-time voice-to-text tools with speaker attribution for content creation
- Business use cases: Offline batch processing of meetings or calls for analysis
- Personal projects: Local transcription setups on laptops (e.g., ~500ms latency on M4 MacBook with large-v3-turbo)
- Industry-specific applications: Speech role evaluation datasets with diarization and quality scoring
Things to Be Aware Of
- Experimental features: Integration with frameworks like Emilia for automated voice cleaning and annotation
- Known quirks: May produce fewer repetitions on long-form audio compared to base models
- Performance considerations: Achieves low latency (~500ms on modern hardware) but benefits from GPUs
- Resource requirements: ~6GB VRAM minimum; scales well with quantization
- Consistency factors: High robustness from diverse training data, performs well on out-of-distribution audio
- Positive user feedback themes: Praised for speed-accuracy balance and multilingual support in benchmarks
- Common concerns: Some optimized variants are English-focused; multilingual support relies on the Turbo model itself
Limitations
- Primarily optimized for batch processing; streaming requires additional engineering for low latency
- Diarization accuracy depends on preprocessing; raw multi-speaker audio without separation may degrade performance
- Hardware-dependent speed; edge devices need lighter variants or distillation

