What can I do with xAI Speech-to-Text v1 on each::labs?

On each::labs, xAI Speech-to-Text v1 fits meeting notes, podcast editing, captioning, call analytics, voice-of-customer research, and any workflow where spoken content has to become searchable text. Diarization and timestamps make it especially useful for multi-speaker audio like interviews and team calls.

What kind of output does xAI Speech-to-Text v1 produce?

xAI Speech-to-Text v1 returns transcripts with speaker labels, word-level timestamps, and normalized formatting for numbers and dates. That structure makes the output easier to drop into editing tools, captioning workflows, or analytics pipelines than a single block of plain text. Multichannel audio is also supported.
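
As a rough illustration of why word-level timestamps make a transcript searchable, the sketch below looks up where a phrase is spoken. It relies only on the `words` fields visible in the example output on this page (`start`, `end`, `text`). The helper name `find_phrase` and the punctuation handling are illustrative assumptions, not part of the API.

```python
import string


def _norm(token: str) -> str:
    """Lowercase a token and strip surrounding punctuation ("live!" -> "live")."""
    return token.strip(string.punctuation).lower()


def find_phrase(words, phrase):
    """Return a (start, end) pair in seconds for each occurrence of `phrase`."""
    target = [_norm(t) for t in phrase.split()]
    tokens = [_norm(w["text"]) for w in words]
    hits = []
    if not target:
        return hits
    for i in range(len(tokens) - len(target) + 1):
        if tokens[i:i + len(target)] == target:
            hits.append((words[i]["start"], words[i + len(target) - 1]["end"]))
    return hits


# First five entries of the example output shown on this page.
sample = [
    {"start": 0.24, "end": 0.72, "text": "XAI"},
    {"start": 0.84, "end": 1.64, "text": "Speech-to-Text"},
    {"start": 1.7, "end": 1.78, "text": "is"},
    {"start": 1.88, "end": 2.08, "text": "now"},
    {"start": 2.2, "end": 2.64, "text": "live!"},
]

print(find_phrase(sample, "now live"))  # [(1.88, 2.64)]
```

The returned offsets can be used to jump an audio player or editor to the matching passage.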

Example output

{
  "duration": 21.46,
  "language": "English",
  "text": "XAI Speech-to-Text is now live! Transcribe your audio into accurate text across multiple languages with speaker detection, word-level timestamps, and clean formatting. Whether you are working with interviews, podcasts, or video content, this model handles it all with speed and precision. Try it yourself today",
  "words": [
    {
      "end": 0.72,
      "start": 0.24,
      "text": "XAI"
    },
    {
      "end": 1.64,
      "start": 0.84,
      "text": "Speech-to-Text"
    },
    {
      "end": 1.78,
      "start": 1.7,
      "text": "is"
    },
    {
      "end": 2.08,
      "start": 1.88,
      "text": "now"
    },
    {
      "end": 2.64,
      "start": 2.2,
      "text": "live!"
    },
    {
      "end": 3.72,
      "start": 3.1,
      "text": "Transcribe"
    },
    {
      "end": 3.88,
      "start": 3.74,
      "text": "your"
    },
    {
      "end": 4.3,
      "start": 4.02,
      "text": "audio"
    },
    {
      "end": 4.68,
      "start": 4.52,
      "text": "into"
    },
    {
      "end": 5.29,
      "start": 4.86,
      "text": "accurate"
    },
    {
      "end": 5.75,
      "start": 5.35,
      "text": "text"
    },
    {
      "end": 6.15,
      "start": 5.89,
      "text": "across"
    },
    {
      "end": 6.67,
      "start": 6.29,
      "text": "multiple"
    },
    {
      "end": 7.21,
      "start": 6.69,
      "text": "languages"
    },
    {
      "end": 7.73,
      "start": 7.61,
      "text": "with"
    },
    {
      "end": 8.09,
      "start": 7.79,
      "text": "speaker"
    },
    {
      "end": 8.63,
      "start": 8.13,
      "text": "detection,"
    },
    {
      "end": 9.57,
      "start": 9.07,
      "text": "word-level"
    },
    {
      "end": 10.29,
      "start": 9.63,
      "text": "timestamps,"
    },
    {
      "end": 10.77,
      "start": 10.65,
      "text": "and"
    },
    {
      "end": 11.21,
      "start": 10.85,
      "text": "clean"
    },
    {
      "end": 11.89,
      "start": 11.31,
      "text": "formatting."
    },
    {
      "end": 12.63,
      "start": 12.45,
      "text": "Whether"
    },
    {
      "end": 12.75,
      "start": 12.65,
      "text": "you"
    },
    {
      "end": 12.89,
      "start": 12.79,
      "text": "are"
    },
    {
      "end": 13.21,
      "start": 12.89,
      "text": "working"
    },
    {
      "end": 13.35,
      "start": 13.23,
      "text": "with"
    },
    {
      "end": 13.99,
      "start": 13.43,
      "text": "interviews,"
    },
    {
      "end": 14.89,
      "start": 14.21,
      "text": "podcasts,"
    },
    {
      "end": 15.07,
      "start": 14.99,
      "text": "or"
    },
    {
      "end": 15.4,
      "start": 15.14,
      "text": "video"
    },
    {
      "end": 15.98,
      "start": 15.5,
      "text": "content,"
    },
    {
      "end": 16.72,
      "start": 16.54,
      "text": "this"
    },
    {
      "end": 17.08,
      "start": 16.8,
      "text": "model"
    },
    {
      "end": 17.52,
      "start": 17.16,
      "text": "handles"
    },
    {
      "end": 17.64,
      "start": 17.56,
      "text": "it"
    },
    {
      "end": 18.14,
      "start": 17.92,
      "text": "all"
    },
    {
      "end": 18.4,
      "start": 18.24,
      "text": "with"
    },
    {
      "end": 18.94,
      "start": 18.48,
      "text": "speed"
    },
    {
      "end": 19.24,
      "start": 19.12,
      "text": "and"
    },
    {
      "end": 19.76,
      "start": 19.28,
      "text": "precision."
    },
    {
      "end": 20.59,
      "start": 20.4,
      "text": "Try"
    },
    {
      "end": 20.65,
      "start": 20.59,
      "text": "it"
    },
    {
      "end": 20.95,
      "start": 20.65,
      "text": "yourself"
    },
    {
      "end": 21.29,
      "start": 20.99,
      "text": "today"
    }
  ]
}

xAI · Speech to Text

Object · xai-stt · by xAI

xAI Speech-to-Text v1 transcribes audio into text across 25 languages with speaker diarization, word-level timestamps, and clean formatting.

API reference

Runtime (p50): 10s
Estimated price: $0.000028 / sec
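
To get a feel for the listed rate, the arithmetic can be sketched as below. This assumes the price scales linearly with audio duration and ignores any rounding or minimum charge, neither of which is stated on this page; `estimate_cost_usd` is a hypothetical helper.

```python
PRICE_PER_SECOND_USD = 0.000028  # estimated price listed on this page


def estimate_cost_usd(duration_seconds: float) -> float:
    """Estimated cost, assuming price is linear in audio duration."""
    if duration_seconds < 0:
        raise ValueError("duration must be non-negative")
    return duration_seconds * PRICE_PER_SECOND_USD


for label, seconds in [("21.46 s sample clip", 21.46), ("1 hour of audio", 3600)]:
    print(f"{label}: ${estimate_cost_usd(seconds):.6f}")
```

Under these assumptions the 21.46-second sample clip costs about $0.0006 and an hour of audio about $0.10.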

Call the API

prediction.sh

curl -X POST \
  -H "X-API-Key: $EACHLABS_API_KEY" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "xai-speech-to-text",
    "version": "0.0.1",
    "input": {
        "format": true,
        "language": "auto",
        "audio_url": "https://cdn-us.eachlabs.ai/uploads/50778f39-1b84-4bac-b39e-f0743e581066.mp3"
    },
    "webhook_url": ""
}' \
  https://api.eachlabs.ai/v1/prediction/
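
For callers who prefer Python, the same request can be built with the standard library. This is a sketch that mirrors the curl call field for field. The function names are illustrative, and the response format is not documented on this page, so `send` only assumes the body is JSON.

```python
import json
import os
import urllib.request

API_URL = "https://api.eachlabs.ai/v1/prediction/"


def build_prediction_request(audio_url, api_key, language="auto"):
    """Build the same POST request as the curl example, without sending it."""
    payload = {
        "model": "xai-speech-to-text",
        "version": "0.0.1",
        "input": {"format": True, "language": language, "audio_url": audio_url},
        "webhook_url": "",
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=30):
    """Send the request and parse the body as JSON (assumed, not documented here)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


req = build_prediction_request(
    "https://cdn-us.eachlabs.ai/uploads/50778f39-1b84-4bac-b39e-f0743e581066.mp3",
    os.environ.get("EACHLABS_API_KEY", ""),
)
print(req.get_method(), req.full_url)
```

Calling `send(req)` performs the actual network call. Keeping construction and sending separate makes the payload easy to inspect before any audio is billed.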

Documentation

xAI | Speech to Text Overview

The xAI | Speech to Text model from xAI transforms spoken audio into accurate text transcripts, solving the challenge of efficient voice-to-text conversion for real-time applications. Part of the xai-stt family, this voice-to-text tool excels in handling diverse accents and noisy environments, setting it apart with xAI's advanced neural architectures trained on vast multilingual datasets. Developers and creators on each::labs (eachlabs.ai) can integrate the xAI | Speech to Text API seamlessly for transcription tasks. Whether capturing meetings, podcasts, or voice commands, it delivers reliable output with minimal latency. This model stands out for its precision in technical jargon and conversational speech, making it ideal for AI-driven workflows.
Capabilities
- Real-time streaming transcription for live audio feeds.
- Multilingual support with automatic language detection.
- Timestamped output for subtitle generation and analysis.
- Robust handling of accents, dialects, and overlapping speakers.
- Custom vocabulary integration for technical terms.
- Noise-robust processing in environments up to 30dB SNR.
- JSON export with confidence scores per segment.
- Integration-ready xAI | Speech to Text API for apps.
Use Cases for xAI | Speech to Text

Content Creators: Podcasters transcribe episodes quickly, e.g., upload an MP3 and turn the timestamped output into SRT subtitles for YouTube.

Marketers: Analyze customer calls by converting sales audio to text, leveraging accent handling for global teams: "Transcribe Zoom recording with French accents."

Developers: Build voice assistants with real-time streaming, integrating the xAI | Speech to Text API for low-latency command recognition.

Researchers: Process interviews in noisy settings, extracting quotes via confidence scores: "Transcribe field recordings from multiple speakers."

These scenarios highlight its precision across user profiles on each::labs.
Tips and Tricks

Optimize xAI voice-to-text results by preprocessing audio to reduce noise using tools like FFmpeg. Set the language and formatting options in the request input, using the field names from the API call on this page: {"audio_url": "https://example.com/file.mp3", "language": "auto", "format": true}. Word-level timestamps appear in the example output without an extra flag. For domain-specific accuracy, include context in prompts if supported, like "transcribe medical lecture."

Example prompts:
- "Transcribe this podcast episode on AI ethics."
- "Convert sales call audio to timed subtitles in Spanish."
- "Extract key quotes from noisy conference recording."
Batch multiple files to cut processing time, and test with short clips first to refine parameters on each::labs.
Technical Specifications
- Input Formats: Supports WAV, MP3, FLAC, and M4A audio files up to 60 minutes in duration.
- Output Format: Plain text, JSON with timestamps, or SRT subtitles.
- Sample Rates: 8kHz to 48kHz, with automatic resampling.
- Language Support: 25 languages, including English, Spanish, Mandarin, and Hindi.
- Processing Time: Real-time for live streaming; faster than real time for batch files (e.g., 1-minute audio in ~1 second).
- Architecture: Transformer-based encoder-decoder with connectionist temporal classification (CTC) for end-to-end transcription.
- Max File Size: 100MB per request via the xAI | Speech to Text API.
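
The limits in this list can be checked before uploading. The sketch below is a hypothetical pre-flight helper: the constants are taken from the list above, "100MB" is read as 100,000,000 bytes (an assumption), and the duration must be supplied by the caller because no audio parsing is done here.

```python
from pathlib import PurePath

MAX_DURATION_S = 60 * 60        # 60 minutes
MAX_SIZE_BYTES = 100_000_000    # "100MB", read as decimal megabytes (assumption)
ALLOWED_EXTENSIONS = {".wav", ".mp3", ".flac", ".m4a"}


def preflight_problems(filename, size_bytes, duration_seconds):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    ext = PurePath(filename).suffix.lower()
    if ext not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported format: {ext or '(none)'}")
    if size_bytes > MAX_SIZE_BYTES:
        problems.append(f"file is {size_bytes} bytes, limit is {MAX_SIZE_BYTES}")
    if duration_seconds > MAX_DURATION_S:
        problems.append(f"audio is {duration_seconds:.0f} s, limit is {MAX_DURATION_S} s")
    return problems


print(preflight_problems("interview.mp3", 48_000_000, 55 * 60))
print(preflight_problems("lecture.ogg", 150_000_000, 75 * 60))
```

Returning every problem at once, rather than raising on the first, lets a batch uploader report all rejected files in a single pass.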
Things to Be Aware Of

xAI | Speech to Text may struggle with extreme overlaps in multi-speaker audio without diarization enabled. A common mistake is uploading low-quality, heavily compressed files; prefer lossless formats such as WAV or FLAC when the source allows. Resource needs on the client are minimal, since transcription runs behind the API. Edge cases like heavy reverb or rare dialects reduce accuracy to ~85%. Test in production-like conditions and monitor confidence scores to flag issues early.
Key Considerations

Before using xAI | Speech to Text, ensure audio inputs are clear with minimal distortion, as heavy background noise can impact accuracy. There are no specific prerequisites beyond an each::labs API key, sent in the X-API-Key header of the API call. This model shines in batch processing for long-form content over live streaming alternatives, offering better value for high-volume transcription at scale. Cost scales with audio duration, balancing performance with affordability for developers. Opt for it when precision in multilingual or accented speech matters more than ultra-low latency.
Limitations

The xAI | Speech to Text model caps at 60-minute files. It performs below 90% accuracy on very noisy audio (>40dB) or uncommon languages without fine-tuning. Video input is not supported, so audio must be extracted first. Output lacks punctuation in casual speech modes. For specialized medical/legal needs, custom vocab helps but isn't foolproof.

Related models

4 models

Rvc Dataset, by RVC Project

Whisper, by OpenAI

Kling · Voice Create, by Kling

Alibaba Qwen3 ASR Flash Filetrans · Speech to Text, by Alibaba

FAQ

About xAI · Speech to Text

What is xAI Speech-to-Text v1?

xAI Speech-to-Text v1 is an audio transcription model from xAI that converts spoken audio into structured text. It supports 25 languages, identifies different speakers through diarization, attaches word-level timestamps, and normalizes spoken numbers and dates into proper written form, so the output is closer to a clean transcript than raw text.
