Eachlabs | AI Workflows for app builders
deepgram-nova-3-speech-to-text

NOVA-3

Deepgram Nova-3 accurately transcribes pre-recorded audio with word-level timestamps, speaker diarization, and automatic language detection.

Avg Run Time: 0.000s

Model Slug: deepgram-nova-3-speech-to-text

Playground

Input

Enter a URL or choose a file from your computer.

Advanced Controls

Output

Example Result

Preview and download your result.

{
  "output": "{\"metadata\":{\"duration\":25.9},\"results\":{\"channels\":[{\"detected_language\":\"en\",\"alternatives\":[{\"transcript\":\"Yeah. As as much as, it's worth celebrating...\",\"confidence\":0.99,\"words\":[{\"word\":\"yeah\",\"start\":0.0,\"end\":0.48,\"confidence\":0.99,\"speaker\":0}]}]}]}}"
}
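Note that the `output` field above arrives as a JSON-encoded string, so it needs a second decode before you can reach the transcript. A minimal parsing sketch, assuming the field names shown in the example result:

```python
import json

def parse_result(response: dict) -> list:
    """Decode the double-encoded `output` field and flatten word-level data.

    Returns (speaker, word, start, end) tuples for every word in every channel.
    """
    payload = json.loads(response["output"])  # second decode: output is a string
    words = []
    for channel in payload["results"]["channels"]:
        alt = channel["alternatives"][0]      # highest-confidence alternative
        for w in alt["words"]:
            words.append((w["speaker"], w["word"], w["start"], w["end"]))
    return words
```

The word-level tuples are what you would feed into caption alignment or diarized transcript rendering.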
Nova-3 Monolingual: $0.00435/min (per-second, from provider response)

API & SDK

Create a Prediction

Send a POST request to create a new prediction. This will return a prediction ID that you'll use to check the result. The request should include your model inputs and API key.
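A minimal sketch of the create step using only the Python standard library. The endpoint URL, the `X-API-Key` header, the `input` field name, and the `predictionID` response key are assumptions based on the description above; confirm the exact names in your dashboard or API reference.

```python
import json
import urllib.request

API_URL = "https://api.eachlabs.ai/v1/prediction/"  # assumed endpoint

def build_payload(audio_url: str) -> dict:
    """Model slug plus inputs, as the create-prediction call expects."""
    return {
        "model": "deepgram-nova-3-speech-to-text",
        "input": {"audio_url": audio_url},  # input field name is an assumption
    }

def create_prediction(api_key: str, audio_url: str) -> str:
    """POST the inputs and return the prediction ID used for polling."""
    body = json.dumps(build_payload(audio_url)).encode()
    req = urllib.request.Request(
        API_URL,
        data=body,
        headers={"Content-Type": "application/json", "X-API-Key": api_key},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)["predictionID"]  # response key is an assumption

if __name__ == "__main__":
    print(create_prediction("YOUR_API_KEY", "https://example.com/audio.wav"))
```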

Get Prediction Result

Poll the prediction endpoint with the prediction ID until the result is ready. The API uses long-polling, so you'll need to repeatedly check until you receive a success status.
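The polling loop can be sketched as follows; the result endpoint shape and the status names (`success`, `error`) are assumptions taken from the description above.

```python
import json
import time
import urllib.request

RESULT_URL = "https://api.eachlabs.ai/v1/prediction/{id}"  # assumed endpoint

TERMINAL_STATUSES = {"success", "error"}  # status names are assumptions

def is_done(status: str) -> bool:
    """True once the prediction has reached a terminal status."""
    return status in TERMINAL_STATUSES

def get_result(api_key: str, prediction_id: str, interval: float = 1.0) -> dict:
    """Repeatedly fetch the prediction until it is ready, then return it."""
    while True:
        req = urllib.request.Request(
            RESULT_URL.format(id=prediction_id),
            headers={"X-API-Key": api_key},
        )
        with urllib.request.urlopen(req, timeout=30) as resp:
            body = json.load(resp)
        if is_done(body.get("status", "")):
            return body
        time.sleep(interval)  # pause between polls to avoid hammering the API
```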

Readme

Table of Contents
Overview
Technical Specifications
Key Considerations
Tips & Tricks
Capabilities
What Can I Use It For?
Things to Be Aware Of
Limitations

Overview


Deepgram | Nova-3 | Speech to Text is a high-performance speech recognition API that converts audio into accurate text transcriptions with word-level timestamps and speaker identification. Built by Deepgram, Nova-3 is the latest generation of the company's speech-to-text technology, designed for real-time and streaming applications where latency matters. The model's primary differentiator is its speed: Nova-3 achieves 441.6x real-time processing with sub-300ms streaming latency, placing it among the fastest commercial speech-to-text solutions available. This combination of accuracy and speed makes Deepgram | Nova-3 | Speech to Text well suited to voice agents, IVR systems, and any application requiring conversational responsiveness.

Technical Specifications

  • Audio input: PCM int16 format at 16kHz sample rate, streamed in 20ms chunks (640 bytes)
  • Processing speed: 441.6x real-time performance
  • Streaming latency: Sub-300ms pipeline latency for transcription start-to-finish
  • Connection: Persistent WebSocket for continuous audio streaming
  • Output: Partial transcripts (real-time feedback) and final transcripts (confirmed results) with speech-final detection
  • Word Error Rate (WER): 5.26% accuracy on standard benchmarks
  • Supported features: Automatic language detection, word-level timestamps, speaker diarization
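The 640-byte chunk size in the specification follows directly from the audio format: a 20ms window at 16kHz contains 320 samples, and each PCM int16 sample is 2 bytes. A quick sanity check:

```python
SAMPLE_RATE = 16_000  # Hz, per the spec above
SAMPLE_WIDTH = 2      # bytes per PCM int16 sample
CHUNK_MS = 20         # streaming chunk duration

def chunk_bytes(sample_rate: int = SAMPLE_RATE,
                width: int = SAMPLE_WIDTH,
                ms: int = CHUNK_MS,
                channels: int = 1) -> int:
    """Bytes per streaming chunk: samples in the window times bytes per sample."""
    samples = sample_rate * ms // 1000  # 320 samples per 20 ms window
    return samples * width * channels

print(chunk_bytes())  # 640, matching the spec above
```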

Key Considerations


Deepgram | Nova-3 | Speech to Text excels in scenarios requiring immediate response times, where latency below 300ms is critical for natural conversation flow. The model is optimized for streaming audio rather than batch processing of pre-recorded files. Cost-effectiveness is a major advantage: at $0.00435 per minute ($4.35 per 1,000 minutes of transcription), it suits high-volume applications. However, if maximum accuracy is the only priority and latency is not a constraint, alternatives like ElevenLabs Scribe v2 (2.3% WER) offer lower error rates. For voice agents and IVR systems, Nova-3's latency advantage typically outweighs the roughly 3-point WER gap versus higher-accuracy competitors.
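Since billing is per second at the listed Nova-3 Monolingual rate of $0.00435 per minute, cost for a clip follows from its duration. A small helper for estimating spend:

```python
RATE_PER_MINUTE = 0.00435  # USD, Nova-3 Monolingual rate listed on this page

def cost_usd(duration_seconds: float) -> float:
    """Estimated transcription cost for an audio clip, billed per second."""
    return duration_seconds / 60.0 * RATE_PER_MINUTE

# 1,000 minutes of audio at this rate comes to $4.35
print(round(cost_usd(1_000 * 60), 2))
```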

Tips & Tricks


To maximize Deepgram | Nova-3 | Speech to Text performance, leverage its streaming capability by sending audio in consistent 20ms chunks; this allows transcription to begin before the user finishes speaking, reducing perceived latency. Use partial transcripts (is_final=False) for real-time UI feedback, and reserve final transcripts (is_final=True) for LLM processing. Implement Voice Activity Detection (VAD) alongside Nova-3 to determine when users stop speaking, enabling efficient turn-taking in conversational applications. For example, structure your pipeline to stream audio continuously, surface partial results as UI updates, and forward only final confirmed transcripts to the language model. Pair Nova-3 with a low-latency TTS system such as Cartesia Sonic 3 (40ms TTFA) to achieve round-trip latency under 500ms for truly conversational voice agents.
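The partial/final split above can be sketched as a small routing function. The message shape here (channel.alternatives[0].transcript, is_final, speech_final) follows Deepgram's streaming result format as described on this page; treat it as a sketch rather than a complete client.

```python
def route_transcript(msg: dict, partials: list, finals: list) -> None:
    """Send partial transcripts to UI state, accumulate finals for the LLM."""
    text = msg["channel"]["alternatives"][0]["transcript"]
    if not text:
        return  # empty interim results are common; ignore them
    if msg.get("is_final"):
        finals.append(text)    # confirmed text, safe to hand to the LLM
    else:
        partials.append(text)  # live feedback only; superseded by later messages
```

In a real client, a `speech_final: true` flag on the last message of an utterance would be the signal to flush the accumulated finals to the language model.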

Capabilities

  • Real-time streaming speech-to-text with sub-300ms latency
  • Automatic language detection across multiple languages
  • Word-level timestamps for precise audio-text alignment
  • Speaker diarization to identify and separate multiple speakers
  • Partial and final transcript modes for flexible application design
  • Speech-final detection to identify when users stop speaking
  • Persistent WebSocket connections for continuous audio streaming
  • Intent recognition, sentiment analysis, and summarization through Deepgram's extended API

What Can I Use It For?


Voice Agents and Chatbots: Developers building ChatGPT-style voice assistants can pair Deepgram | Nova-3 | Speech to Text with an LLM and TTS system to create conversational AI with sub-500ms round-trip latency. The streaming capability means transcription begins before users finish speaking, creating a natural back-and-forth interaction. Example: "Build a customer service voice agent that transcribes caller speech in real-time and responds within 500ms."

Interactive Voice Response (IVR) Systems: Telecommunications and healthcare providers can replace traditional DTMF-based systems with natural speech understanding. Nova-3's accuracy and speed enable callers to speak naturally without awkward pauses. Example: "Deploy an IVR system for appointment scheduling that understands spoken dates, times, and patient names with minimal latency."

Live Transcription and Accessibility: Content creators and event organizers can use Deepgram | Nova-3 | Speech to Text to generate real-time captions for live streams, meetings, and presentations. Word-level timestamps enable precise synchronization with video. Example: "Transcribe a live webinar with speaker identification and generate searchable, timestamped transcripts."

Contact Center Analytics: Enterprise call centers can process billions of call minutes annually through Nova-3 to extract insights, improve agent training, and ensure compliance. The model's accuracy improvement (2-4x better alphanumeric transcription) directly impacts quality assurance. Example: "Analyze recorded customer calls to identify common issues and measure agent performance."

Things to Be Aware Of


Deepgram | Nova-3 | Speech to Text is optimized for streaming audio and real-time applications; batch processing of large pre-recorded files may not fully leverage its speed advantages. Audio quality significantly impacts accuracy—background noise, poor microphone quality, or heavy accents may reduce WER performance. The model requires consistent 16kHz PCM audio input; format conversion overhead can add latency if not handled efficiently. WebSocket connections must remain persistent for optimal streaming performance; connection drops require re-establishment. While Nova-3 achieves 5.26% WER, this represents average performance; specialized domains (medical terminology, technical jargon) may require fine-tuning or post-processing for production accuracy.

Limitations


Deepgram | Nova-3 | Speech to Text cannot guarantee perfect accuracy in all scenarios, particularly with heavy accents, multiple simultaneous speakers, or extreme background noise. The model's 5.26% WER, while competitive, trails specialized high-accuracy alternatives in controlled environments. Speaker diarization works best with clearly separated speakers and may struggle with overlapping speech. The model does not perform real-time speaker identification (matching speakers to known identities); it only separates different speakers. Language detection is automatic but may misidentify mixed-language audio. Processing latency, while sub-300ms for streaming, cannot be reduced below hardware and network constraints.

---