NOVA-3
Deepgram Nova-3 accurately transcribes pre-recorded and streaming audio with word-level timestamps, speaker diarization, and automatic language detection.
Model Slug: deepgram-nova-3-speech-to-text
API & SDK
Create a Prediction
Send a POST request to create a new prediction. This will return a prediction ID that you'll use to check the result. The request should include your model inputs and API key.
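A minimal sketch of the create-prediction call, using only the standard library. The endpoint URL, payload fields ("model", "input"), and header names here are assumptions for illustration; consult the provider's API reference for the exact schema.

```python
import json
import urllib.request

API_URL = "https://api.example.com/v1/predictions"  # hypothetical endpoint

def build_prediction_request(api_key: str, audio_url: str) -> urllib.request.Request:
    """Assemble a POST request that creates a new prediction.

    The payload shape and auth header are assumptions; the API key is sent
    as a Bearer token, a common but not universal convention.
    """
    payload = {
        "model": "deepgram-nova-3-speech-to-text",
        "input": {"audio_url": audio_url},
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# Sending the request returns a JSON body containing the prediction ID:
# with urllib.request.urlopen(build_prediction_request(key, url)) as resp:
#     prediction_id = json.loads(resp.read())["id"]
```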
Get Prediction Result
Poll the prediction endpoint with the prediction ID until the result is ready. The API uses long-polling, so you'll need to repeatedly check until you receive a success status.
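The polling loop can be sketched as below. The `fetch` argument stands in for whatever function GETs the prediction by ID (the exact endpoint path is an assumption); terminal status names ("succeeded", "failed") are also illustrative.

```python
import time

def poll_prediction(fetch, interval: float = 1.0, timeout: float = 60.0) -> dict:
    """Repeatedly call `fetch` until the prediction reaches a terminal status.

    `fetch` is any callable returning the prediction as a dict with a
    "status" field, e.g. a function that GETs /predictions/{id}.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = fetch()
        if result.get("status") in ("succeeded", "failed"):
            return result
        time.sleep(interval)  # back off between polls
    raise TimeoutError("prediction did not finish before the timeout")
```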
Readme
Overview
Deepgram | Nova-3 | Speech to Text is a high-performance speech recognition API that converts audio into accurate text transcriptions with word-level timestamps and speaker identification. Built by Deepgram, Nova-3 represents the latest generation of their speech-to-text technology, designed specifically for real-time and streaming applications where latency matters. The model's primary differentiator is its exceptional speed: Nova-3 achieves 441.6x real-time processing with sub-300ms streaming latency, making it the fastest commercial speech-to-text solution available. This combination of accuracy and speed makes Deepgram | Nova-3 | Speech to Text ideal for voice agents, IVR systems, and any application requiring conversational responsiveness.
Technical Specifications
- Audio input: PCM int16 format at 16kHz sample rate, streamed in 20ms chunks (640 bytes)
- Processing speed: 441.6x real-time performance
- Streaming latency: sub-300ms end-to-end pipeline latency
- Connection: Persistent WebSocket for continuous audio streaming
- Output: Partial transcripts (real-time feedback) and final transcripts (confirmed results) with speech-final detection
- Word Error Rate (WER): 5.26% on standard benchmarks
- Supported features: Automatic language detection, word-level timestamps, speaker diarization
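The 640-byte chunk size above follows directly from the audio format: 16,000 samples/s × 2 bytes per int16 sample × 0.020 s = 640 bytes. A minimal sketch of splitting a raw PCM buffer into streaming chunks:

```python
SAMPLE_RATE = 16_000       # Hz, per the spec above
BYTES_PER_SAMPLE = 2       # PCM int16
CHUNK_MS = 20              # chunk duration in milliseconds

# 16,000 * 2 * 20 / 1000 = 640 bytes per chunk
CHUNK_BYTES = SAMPLE_RATE * BYTES_PER_SAMPLE * CHUNK_MS // 1000

def chunk_pcm(audio: bytes, chunk_bytes: int = CHUNK_BYTES):
    """Yield fixed-size chunks of a raw PCM int16 buffer for streaming."""
    for i in range(0, len(audio), chunk_bytes):
        yield audio[i : i + chunk_bytes]
```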
Key Considerations
Deepgram | Nova-3 | Speech to Text excels in scenarios requiring immediate response times, where latency below 300ms is critical for natural conversation flow. The model is optimized for streaming audio rather than batch processing of pre-recorded files. Cost-effectiveness is a major advantage at $4.30 per 1,000 minutes of transcription, making it suitable for high-volume applications. However, if maximum accuracy is the sole priority and latency is not a constraint, higher-accuracy alternatives such as ElevenLabs Scribe v2 (2.3% WER) may be a better fit. For voice agents and IVR systems, Nova-3's latency advantage typically outweighs the roughly 3-point WER gap versus such competitors.
Tips & Tricks
To maximize Deepgram | Nova-3 | Speech to Text performance, leverage its streaming capability by sending audio in consistent 20ms chunks; this allows transcription to begin before the user finishes speaking, reducing perceived latency. Use partial transcripts (is_final=False) for real-time UI feedback, and reserve final transcripts (is_final=True) for LLM processing. Implement Voice Activity Detection (VAD) alongside Nova-3 to determine when users stop speaking, enabling efficient turn-taking in conversational applications. For example, design the pipeline to stream audio continuously, surface partial results as UI updates, and forward only final, confirmed transcripts to the language model. Pair Nova-3 with a low-latency TTS system such as Cartesia Sonic 3 (40ms TTFA) to achieve round-trip latency under 500ms for truly conversational voice agents.
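The partial/final routing described above can be sketched as a small dispatcher. Each event is assumed to carry a "transcript" string and an "is_final" flag, mirroring Deepgram's partial/final result distinction; the exact event schema is an assumption.

```python
def route_transcripts(events):
    """Split streaming transcript events into UI updates and LLM input.

    Partials (is_final=False) drive live display; only finals
    (is_final=True) are accumulated and handed to the language model.
    """
    ui_updates = []   # every partial, for real-time UI feedback
    confirmed = []    # finals only, for downstream LLM processing
    for event in events:
        text = event["transcript"]
        if event["is_final"]:
            confirmed.append(text)
        else:
            ui_updates.append(text)
    return ui_updates, " ".join(confirmed)
```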
Capabilities
- Real-time streaming speech-to-text with sub-300ms latency
- Automatic language detection across multiple languages
- Word-level timestamps for precise audio-text alignment
- Speaker diarization to identify and separate multiple speakers
- Partial and final transcript modes for flexible application design
- Speech-final detection to identify when users stop speaking
- Persistent WebSocket connections for continuous audio streaming
- Intent recognition, sentiment analysis, and summarization through Deepgram's extended API
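Streaming features like those above are typically enabled via query parameters on the WebSocket URL. A minimal sketch of building such a URL; the base URL and parameter names (detect_language, diarize) are illustrative assumptions, so check the provider's streaming reference for the real ones.

```python
from urllib.parse import urlencode

def build_stream_url(base: str = "wss://api.example.com/v1/listen", **params) -> str:
    """Append feature flags to a streaming WebSocket URL as query params.

    Boolean values are lowercased ("true"/"false"), a common convention
    for query-string flags.
    """
    query = urlencode({k: str(v).lower() for k, v in params.items()})
    return f"{base}?{query}" if query else base

# A WebSocket client would then connect to the resulting URL and stream
# 20ms PCM chunks over the persistent connection.
```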
What Can I Use It For?
Voice Agents and Chatbots: Developers building ChatGPT-style voice assistants can pair Deepgram | Nova-3 | Speech to Text with an LLM and TTS system to create conversational AI with sub-500ms round-trip latency. The streaming capability means transcription begins before users finish speaking, creating a natural back-and-forth interaction. Example: "Build a customer service voice agent that transcribes caller speech in real-time and responds within 500ms."
Interactive Voice Response (IVR) Systems: Telecommunications and healthcare providers can replace traditional DTMF-based systems with natural speech understanding. Nova-3's accuracy and speed enable callers to speak naturally without awkward pauses. Example: "Deploy an IVR system for appointment scheduling that understands spoken dates, times, and patient names with minimal latency."
Live Transcription and Accessibility: Content creators and event organizers can use Deepgram | Nova-3 | Speech to Text to generate real-time captions for live streams, meetings, and presentations. Word-level timestamps enable precise synchronization with video. Example: "Transcribe a live webinar with speaker identification and generate searchable, timestamped transcripts."
Contact Center Analytics: Enterprise call centers can process billions of call minutes annually through Nova-3 to extract insights, improve agent training, and ensure compliance. The model's accuracy improvement (2-4x better alphanumeric transcription) directly impacts quality assurance. Example: "Analyze recorded customer calls to identify common issues and measure agent performance."
Things to Be Aware Of
Deepgram | Nova-3 | Speech to Text is optimized for streaming audio and real-time applications; batch processing of large pre-recorded files may not fully leverage its speed advantages. Audio quality significantly impacts accuracy—background noise, poor microphone quality, or heavy accents may reduce WER performance. The model requires consistent 16kHz PCM audio input; format conversion overhead can add latency if not handled efficiently. WebSocket connections must remain persistent for optimal streaming performance; connection drops require re-establishment. While Nova-3 achieves 5.26% WER, this represents average performance; specialized domains (medical terminology, technical jargon) may require fine-tuning or post-processing for production accuracy.
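One common piece of the format-conversion work mentioned above is downmixing stereo int16 audio to mono before streaming. A minimal pure-Python sketch using only the standard library; production pipelines would typically use ffmpeg or a DSP library, and must also resample to 16kHz when the source rate differs.

```python
import array

def stereo_to_mono_int16(raw: bytes) -> bytes:
    """Downmix interleaved stereo PCM int16 to mono by averaging channels.

    `raw` is assumed to hold native-endian int16 samples interleaved as
    L, R, L, R, ...; the output holds one averaged sample per frame.
    """
    samples = array.array("h")       # 'h' = signed 16-bit
    samples.frombytes(raw)
    mono = array.array("h", (
        (samples[i] + samples[i + 1]) // 2
        for i in range(0, len(samples), 2)
    ))
    return mono.tobytes()
```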
Limitations
Deepgram | Nova-3 | Speech to Text cannot guarantee perfect accuracy in all scenarios, particularly with heavy accents, multiple simultaneous speakers, or extreme background noise. The model's 5.26% WER, while competitive, trails specialized high-accuracy alternatives in controlled environments. Speaker diarization works best with clearly separated speakers and may struggle with overlapping speech. The model does not perform real-time speaker identification (matching speakers to known identities); it only separates different speakers. Language detection is automatic but may misidentify mixed-language audio. Processing latency, while sub-300ms for streaming, cannot be reduced below hardware and network constraints.