NOVA-3
Deepgram Nova-3 accurately transcribes pre-recorded and streaming audio with word-level timestamps, speaker diarization, and automatic language detection.
Model Slug: deepgram-nova-3-speech-to-text
API & SDK
Create a Prediction
Send a POST request to create a new prediction. This will return a prediction ID that you'll use to check the result. The request should include your model inputs and API key.
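A minimal sketch of the create-prediction call, using only the standard library. The endpoint URL, payload fields ("model", "input"), and header names here are assumptions for illustration; consult the provider's API reference for the exact schema.

```python
import json
import urllib.request

API_URL = "https://api.example.com/v1/predictions"  # hypothetical endpoint

def build_prediction_request(api_key: str, audio_url: str) -> urllib.request.Request:
    """Assemble a POST request that creates a new prediction.

    The payload shape and auth header are assumptions; the API key is sent
    as a Bearer token, a common but not universal convention.
    """
    payload = {
        "model": "deepgram-nova-3-speech-to-text",
        "input": {"audio_url": audio_url},
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# Sending the request returns a JSON body containing the prediction ID:
# with urllib.request.urlopen(build_prediction_request(key, url)) as resp:
#     prediction_id = json.loads(resp.read())["id"]
```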
Get Prediction Result
Poll the prediction endpoint with the prediction ID until the result is ready. The API uses long-polling, so you'll need to repeatedly check until you receive a success status.
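The polling loop can be sketched as below. The `fetch` argument stands in for whatever function GETs the prediction by ID (the exact endpoint path is an assumption); terminal status names ("succeeded", "failed") are also illustrative.

```python
import time

def poll_prediction(fetch, interval: float = 1.0, timeout: float = 60.0) -> dict:
    """Repeatedly call `fetch` until the prediction reaches a terminal status.

    `fetch` is any callable returning the prediction as a dict with a
    "status" field, e.g. a function that GETs /predictions/{id}.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = fetch()
        if result.get("status") in ("succeeded", "failed"):
            return result
        time.sleep(interval)  # back off between polls
    raise TimeoutError("prediction did not finish before the timeout")
```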
Readme
Overview
Deepgram | Nova-3 | Speech to Text is a high-performance speech recognition API that converts audio into accurate text transcriptions with word-level timestamps and speaker identification. Built by Deepgram, Nova-3 represents the latest generation of their speech-to-text technology, designed specifically for real-time and streaming applications where latency matters. The model's primary differentiator is its exceptional speed: Nova-3 achieves 441.6x real-time processing with sub-300ms streaming latency, making it the fastest commercial speech-to-text solution available. This combination of accuracy and speed makes Deepgram | Nova-3 | Speech to Text ideal for voice agents, IVR systems, and any application requiring conversational responsiveness.
Technical Specifications
- Audio input: PCM int16 format at 16kHz sample rate, streamed in 20ms chunks (640 bytes)
- Processing speed: 441.6x real-time performance
- Streaming latency: sub-300ms end-to-end pipeline latency
- Connection: Persistent WebSocket for continuous audio streaming
- Output: Partial transcripts (real-time feedback) and final transcripts (confirmed results) with speech-final detection
- Word Error Rate (WER): 5.26% on standard benchmarks
- Supported features: Automatic language detection, word-level timestamps, speaker diarization
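The 640-byte chunk size above follows directly from the audio format: 16,000 samples/s × 2 bytes per int16 sample × 0.020 s = 640 bytes. A minimal sketch of splitting a raw PCM buffer into streaming chunks:

```python
SAMPLE_RATE = 16_000       # Hz, per the spec above
BYTES_PER_SAMPLE = 2       # PCM int16
CHUNK_MS = 20              # chunk duration in milliseconds

# 16,000 * 2 * 20 / 1000 = 640 bytes per chunk
CHUNK_BYTES = SAMPLE_RATE * BYTES_PER_SAMPLE * CHUNK_MS // 1000

def chunk_pcm(audio: bytes, chunk_bytes: int = CHUNK_BYTES):
    """Yield fixed-size chunks of a raw PCM int16 buffer for streaming."""
    for i in range(0, len(audio), chunk_bytes):
        yield audio[i : i + chunk_bytes]
```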
Key Considerations
Deepgram | Nova-3 | Speech to Text excels in scenarios requiring immediate response times, where latency below 300ms is critical for natural conversation flow. The model is optimized for streaming audio rather than batch processing of pre-recorded files. Cost-effectiveness is a major advantage at $4.30 per 1,000 minutes of transcription, making it suitable for high-volume applications. However, if maximum accuracy is the sole priority and latency is not a constraint, higher-accuracy alternatives such as ElevenLabs Scribe v2 (2.3% WER) may be a better fit. For voice agents and IVR systems, Nova-3's latency advantage typically outweighs the roughly 3-point WER gap versus such competitors.
Tips & Tricks
To maximize Deepgram | Nova-3 | Speech to Text performance, leverage its streaming capability by sending audio in consistent 20ms chunks; this allows transcription to begin before the user finishes speaking, reducing perceived latency. Use partial transcripts (is_final=False) for real-time UI feedback, and reserve final transcripts (is_final=True) for LLM processing. Implement Voice Activity Detection (VAD) alongside Nova-3 to determine when users stop speaking, enabling efficient turn-taking in conversational applications. For example, design the pipeline to stream audio continuously, surface partial results as UI updates, and forward only final, confirmed transcripts to the language model. Pair Nova-3 with a low-latency TTS system such as Cartesia Sonic 3 (40ms TTFA) to achieve round-trip latency under 500ms for truly conversational voice agents.
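The partial/final routing described above can be sketched as a small dispatcher. Each event is assumed to carry a "transcript" string and an "is_final" flag, mirroring Deepgram's partial/final result distinction; the exact event schema is an assumption.

```python
def route_transcripts(events):
    """Split streaming transcript events into UI updates and LLM input.

    Partials (is_final=False) drive live display; only finals
    (is_final=True) are accumulated and handed to the language model.
    """
    ui_updates = []   # every partial, for real-time UI feedback
    confirmed = []    # finals only, for downstream LLM processing
    for event in events:
        text = event["transcript"]
        if event["is_final"]:
            confirmed.append(text)
        else:
            ui_updates.append(text)
    return ui_updates, " ".join(confirmed)
```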
Capabilities
- Real-time streaming speech-to-text with sub-300ms latency
- Automatic language detection across multiple languages
- Word-level timestamps for precise audio-text alignment
- Speaker diarization to identify and separate multiple speakers
- Partial and final transcript modes for flexible application design
- Speech-final detection to identify when users stop speaking
- Persistent WebSocket connections for continuous audio streaming
- Intent recognition, sentiment analysis, and summarization through Deepgram's extended API
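Streaming features like those above are typically enabled via query parameters on the WebSocket URL. A minimal sketch of building such a URL; the base URL and parameter names (detect_language, diarize) are illustrative assumptions, so check the provider's streaming reference for the real ones.

```python
from urllib.parse import urlencode

def build_stream_url(base: str = "wss://api.example.com/v1/listen", **params) -> str:
    """Append feature flags to a streaming WebSocket URL as query params.

    Boolean values are lowercased ("true"/"false"), a common convention
    for query-string flags.
    """
    query = urlencode({k: str(v).lower() for k, v in params.items()})
    return f"{base}?{query}" if query else base

# A WebSocket client would then connect to the resulting URL and stream
# 20ms PCM chunks over the persistent connection.
```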
What Can I Use It For?
Voice Agents and Chatbots: Developers building ChatGPT-style voice assistants can pair Deepgram | Nova-3 | Speech to Text with an LLM and TTS system to create conversational AI with sub-500ms round-trip latency. The streaming capability means transcription begins before users finish speaking, creating a natural back-and-forth interaction. Example: "Build a customer service voice agent that transcribes caller speech in real-time and responds within 500ms."
Interactive Voice Response (IVR) Systems: Telecommunications and healthcare providers can replace traditional DTMF-based systems with natural speech understanding. Nova-3's accuracy and speed enable callers to speak naturally without awkward pauses. Example: "Deploy an IVR system for appointment scheduling that understands spoken dates, times, and patient names with minimal latency."
Live Transcription and Accessibility: Content creators and event organizers can use Deepgram | Nova-3 | Speech to Text to generate real-time captions for live streams, meetings, and presentations. Word-level timestamps enable precise synchronization with video. Example: "Transcribe a live webinar with speaker identification and generate searchable, timestamped transcripts."
Contact Center Analytics: Enterprise call centers can process billions of call minutes annually through Nova-3 to extract insights, improve agent training, and ensure compliance. The model's accuracy improvement (2-4x better alphanumeric transcription) directly impacts quality assurance. Example: "Analyze recorded customer calls to identify common issues and measure agent performance."
Things to Be Aware Of
Deepgram | Nova-3 | Speech to Text is optimized for streaming audio and real-time applications; batch processing of large pre-recorded files may not fully leverage its speed advantages. Audio quality significantly impacts accuracy—background noise, poor microphone quality, or heavy accents may reduce WER performance. The model requires consistent 16kHz PCM audio input; format conversion overhead can add latency if not handled efficiently. WebSocket connections must remain persistent for optimal streaming performance; connection drops require re-establishment. While Nova-3 achieves 5.26% WER, this represents average performance; specialized domains (medical terminology, technical jargon) may require fine-tuning or post-processing for production accuracy.
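One common piece of the format-conversion work mentioned above is downmixing stereo int16 audio to mono before streaming. A minimal pure-Python sketch using only the standard library; production pipelines would typically use ffmpeg or a DSP library, and must also resample to 16kHz when the source rate differs.

```python
import array

def stereo_to_mono_int16(raw: bytes) -> bytes:
    """Downmix interleaved stereo PCM int16 to mono by averaging channels.

    `raw` is assumed to hold native-endian int16 samples interleaved as
    L, R, L, R, ...; the output holds one averaged sample per frame.
    """
    samples = array.array("h")       # 'h' = signed 16-bit
    samples.frombytes(raw)
    mono = array.array("h", (
        (samples[i] + samples[i + 1]) // 2
        for i in range(0, len(samples), 2)
    ))
    return mono.tobytes()
```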
Limitations
Deepgram | Nova-3 | Speech to Text cannot guarantee perfect accuracy in all scenarios, particularly with heavy accents, multiple simultaneous speakers, or extreme background noise. The model's 5.26% WER, while competitive, trails specialized high-accuracy alternatives in controlled environments. Speaker diarization works best with clearly separated speakers and may struggle with overlapping speech. The model does not perform real-time speaker identification (matching speakers to known identities); it only separates different speakers. Language detection is automatic but may misidentify mixed-language audio. Processing latency, while sub-300ms for streaming, cannot be reduced below hardware and network constraints.