Example output

[
  {
    "channel_id": 0,
    "sentences": [
      {
        "begin_time": 80,
        "emotion": "neutral",
        "end_time": 7440,
        "language": "es",
        "sentence_id": 0,
        "text": "Hola a todos, estoy aquí para presentar Eatch Labs, IA que hace que las ideas sean reales y la creatividad sea simple.",
        "words": [
          {
            "begin_time": 80,
            "end_time": 320,
            "punctuation": "",
            "text": "Hola "
          },
          {
            "begin_time": 320,
            "end_time": 320,
            "punctuation": "",
            "text": "a "
          },
          {
            "begin_time": 320,
            "end_time": 480,
            "punctuation": ",",
            "text": "todos"
          },
          {
            "begin_time": 480,
            "end_time": 640,
            "punctuation": "",
            "text": " estoy "
          },
          {
            "begin_time": 1440,
            "end_time": 1840,
            "punctuation": "",
            "text": "aquí "
          },
          {
            "begin_time": 2960,
            "end_time": 3360,
            "punctuation": "",
            "text": "para "
          },
          {
            "begin_time": 3360,
            "end_time": 3360,
            "punctuation": "",
            "text": "presentar "
          },
          {
            "begin_time": 3840,
            "end_time": 4480,
            "punctuation": "",
            "text": "Eatch "
          },
          {
            "begin_time": 4480,
            "end_time": 4480,
            "punctuation": ",",
            "text": "Labs"
          },
          {
            "begin_time": 4480,
            "end_time": 4480,
            "punctuation": "",
            "text": " IA "
          },
          {
            "begin_time": 4480,
            "end_time": 4480,
            "punctuation": "",
            "text": "que "
          },
          {
            "begin_time": 4480,
            "end_time": 4480,
            "punctuation": "",
            "text": "hace "
          },
          {
            "begin_time": 5120,
            "end_time": 5360,
            "punctuation": "",
            "text": "que "
          },
          {
            "begin_time": 5360,
            "end_time": 5440,
            "punctuation": "",
            "text": "las "
          },
          {
            "begin_time": 5440,
            "end_time": 5440,
            "punctuation": "",
            "text": "ideas "
          },
          {
            "begin_time": 5440,
            "end_time": 5440,
            "punctuation": "",
            "text": "sean "
          },
          {
            "begin_time": 5440,
            "end_time": 5440,
            "punctuation": "",
            "text": "reales "
          },
          {
            "begin_time": 6240,
            "end_time": 6720,
            "punctuation": "",
            "text": "y "
          },
          {
            "begin_time": 7200,
            "end_time": 7360,
            "punctuation": "",
            "text": "la "
          },
          {
            "begin_time": 7360,
            "end_time": 7360,
            "punctuation": "",
            "text": "creatividad "
          },
          {
            "begin_time": 7380,
            "end_time": 7400,
            "punctuation": "",
            "text": "sea "
          },
          {
            "begin_time": 7420,
            "end_time": 7440,
            "punctuation": ".",
            "text": "simple"
          }
        ]
      }
    ],
    "text": "Hola a todos, estoy aquí para presentar Eatch Labs, IA que hace que las ideas sean reales y la creatividad sea simple."
  }
]
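The structure above can be walked with a few lines of Python. This is a minimal sketch, not an official client: the `example` value is a trimmed copy of the output shown above (three words only), and the helper names `rebuild_sentence` and `iter_words` are hypothetical. In the example shown, concatenating each word's `text` and `punctuation` reproduces the sentence `text`; whether that holds for every response is an assumption.

```python
# Trimmed copy of the example output above (first three words only).
example = [
    {
        "channel_id": 0,
        "sentences": [
            {
                "begin_time": 80,
                "end_time": 480,
                "emotion": "neutral",
                "language": "es",
                "sentence_id": 0,
                "text": "Hola a todos,",
                "words": [
                    {"begin_time": 80, "end_time": 320, "punctuation": "", "text": "Hola "},
                    {"begin_time": 320, "end_time": 320, "punctuation": "", "text": "a "},
                    {"begin_time": 320, "end_time": 480, "punctuation": ",", "text": "todos"},
                ],
            }
        ],
        "text": "Hola a todos,",
    }
]


def rebuild_sentence(sentence):
    """Join each word's text and trailing punctuation back into one string."""
    return "".join(w["text"] + w["punctuation"] for w in sentence["words"])


def iter_words(result):
    """Yield (channel_id, begin_time, end_time, token) for every word."""
    for channel in result:
        for sentence in channel["sentences"]:
            for word in sentence["words"]:
                yield (
                    channel["channel_id"],
                    word["begin_time"],
                    word["end_time"],
                    word["text"].strip(),  # word text may carry surrounding spaces
                )


for row in iter_words(example):
    print(row)
```

Stripping the token is a design choice for display only; keep the raw `text` if you need to rebuild the sentence exactly.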

Alibaba Qwen3 ASR Flash Filetrans · Speech to Text

Object·qwen3-asr·by Alibaba

Qwen3-ASR-Flash-Filetrans transcribes audio files into text with support for 26 languages, emotion detection, and word-level timestamps. It is optimized for long audio files (up to 2GB, 12 hours) using asynchronous batch processing. The model supports formats including aac, amr, flac, m4a, mp3, ogg, opus, wav, webm, wma, wmv, as well as video containers. Additional features include inverse text normalization, multi-channel audio transcription, and context biasing for domain-specific vocabulary.


Runtime (p50): 15s
Estimated price: $0.000035 / unit

Call the API

prediction.sh

curl -X POST \
  -H "X-API-Key: $EACHLABS_API_KEY" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "alibaba-qwen3-asr-flash-filetrans-speech-to-text",
    "version": "0.0.1",
    "input": {
        "audio_url": "https://storage.googleapis.com/magicpoint/inputs/video-translate-input.mp4",
        "enable_itn": true,
        "enable_words": true,
        "language": "es"
    },
    "webhook_url": ""
}' \
  https://api.eachlabs.ai/v1/prediction/
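The same request can be sketched in Python with only the standard library. The URL, headers, and body fields are copied from the curl call above; the function names `build_request` and `send` are hypothetical, and because this page does not show the response format, `send` simply returns the raw response body.

```python
import json
import urllib.request

API_URL = "https://api.eachlabs.ai/v1/prediction/"


def build_request(api_key, audio_url, language="es", enable_itn=True, enable_words=True):
    """Build the POST request shown in the curl example, without sending it."""
    payload = {
        "model": "alibaba-qwen3-asr-flash-filetrans-speech-to-text",
        "version": "0.0.1",
        "input": {
            "audio_url": audio_url,
            "enable_itn": enable_itn,
            "enable_words": enable_words,
            "language": language,
        },
        "webhook_url": "",
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and return the raw body (response schema not shown here)."""
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


# Usage (requires a real key in the EACHLABS_API_KEY environment variable):
#   import os
#   req = build_request(os.environ["EACHLABS_API_KEY"], "https://example.com/audio.mp3")
#   print(send(req))
```

Separating request construction from sending keeps the payload easy to inspect before any network call is made.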

Documentation (8 sections)

Alibaba | Qwen3 ASR Flash Filetrans | Speech to Text Overview

The Alibaba | Qwen3 ASR Flash Filetrans | Speech to Text model, part of Alibaba's Qwen3-ASR series, converts audio files into text transcripts and is built for long-form speech across multiple languages. It handles files of up to 2GB or 12 hours through asynchronous batch processing, which suits high-volume transcription. It supports 26 languages and adds emotion detection and word-level timestamps, making it a fit on each::labs for developers and creators who need precisely timed output.

Optimized for real-world applications, it processes diverse formats including aac, mp3, wav, and video containers, while offering inverse text normalization and multi-channel support. Available through the Alibaba | Qwen3 ASR Flash Filetrans | Speech to Text API on each::labs, it streamlines workflows for podcasting, meetings, and content analysis without real-time constraints.
Capabilities
- Transcribes audio in 26 languages with high accuracy, including dialects like Cantonese.
- Provides word-level timestamps for precise timing in edits or subtitles.
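- Detects emotion in speech and tags each sentence with an emotion label (for example, "neutral") for sentiment analysis.
- Handles multi-channel audio by transcribing each channel separately, which separates speakers when each one is recorded on their own channel.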
- Supports inverse text normalization for natural, readable text output.
- Enables context biasing to boost recognition of custom or domain-specific vocabulary.
- Processes large files up to 2GB/12 hours asynchronously for batch efficiency.
- Compatible with video containers, extracting speech from multimedia sources.
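As one illustration of using the timestamps for subtitles, the sketch below turns sentence entries into SubRip (SRT) cues. It assumes `begin_time` and `end_time` are in milliseconds, which the example output suggests but this page does not state; the function names are hypothetical.

```python
def ms_to_srt(ms):
    """Format a millisecond offset as an SRT timestamp (HH:MM:SS,mmm)."""
    hours, rest = divmod(ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def sentences_to_srt(sentences):
    """Build one SRT cue per sentence; assumes times are in milliseconds."""
    cues = []
    for index, sentence in enumerate(sentences, start=1):
        start = ms_to_srt(sentence["begin_time"])
        end = ms_to_srt(sentence["end_time"])
        cues.append(f"{index}\n{start} --> {end}\n{sentence['text']}\n")
    # Joining with a newline leaves the blank line SRT expects between cues.
    return "\n".join(cues)


print(sentences_to_srt([{"begin_time": 80, "end_time": 7440, "text": "Hola a todos"}]))
```

Sentence-level cues keep subtitles readable; for tighter timing, the same formatter can be applied to groups of entries from `words`.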
Use Cases for Alibaba | Qwen3 ASR Flash Filetrans | Speech to Text

For content creators: Transcribe long podcast episodes with word-level timestamps and emotion detection. Example: "Generate timed subtitles for a 4-hour interview video, tag excitement in key moments." This leverages multi-channel support for host-guest separation.

For marketers: Analyze customer call recordings across languages using context biasing for brand terms. Example: "Transcribe sales calls in English and Spanish, normalize numbers and product names." Emotion tags reveal sentiment trends.

For developers: Build apps with the Alibaba | Qwen3 ASR Flash Filetrans | Speech to Text API on each::labs for batch-processing meeting archives. Example: "Process 10-hour webinar audio, output JSON with timestamps and speaker channels."

For researchers: Handle academic lectures in 26 languages with inverse text normalization. Example: "Transcribe conference panel, bias toward scientific terminology, include emotion for engagement analysis."
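For the sentiment-oriented use cases above, one simple approach is to total speaking time per emotion label. The sketch below is hypothetical: it assumes each sentence carries an `emotion` field as in the example output and that times are in milliseconds. Only the label "neutral" appears on this page, so the full label set is unknown.

```python
from collections import Counter


def emotion_durations(result):
    """Sum sentence durations (assumed milliseconds) per emotion label."""
    totals = Counter()
    for channel in result:
        for sentence in channel["sentences"]:
            totals[sentence["emotion"]] += sentence["end_time"] - sentence["begin_time"]
    return totals


sample = [
    {
        "channel_id": 0,
        "sentences": [
            {"begin_time": 80, "end_time": 7440, "emotion": "neutral"},
            {"begin_time": 8000, "end_time": 9000, "emotion": "neutral"},
        ],
    }
]
print(emotion_durations(sample))
```

Weighting by duration rather than counting sentences avoids letting many short utterances dominate the summary.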
Tips and Tricks

Optimize Alibaba | Qwen3 ASR Flash Filetrans | Speech to Text by using context biasing parameters to prioritize domain-specific vocabulary, such as medical or legal terms, improving accuracy in niche transcripts. For multi-channel audio, specify channels in API calls to separate speakers effectively. Leverage word-level timestamps for post-editing by aligning with video timelines.

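Example tasks for the Alibaba | Qwen3 ASR Flash Filetrans | Speech to Text API (the request takes parameters such as audio_url, language, enable_itn, and enable_words rather than a free-text prompt):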
"Transcribe this podcast episode with emotion tags and timestamps for speaker changes."
"Process meeting audio, bias toward technical jargon in software development, output normalized text."
"Extract dialogue from video file, detect sentiments, handle Cantonese dialect."

Workflow tip: Pre-segment ultra-long files into 1-hour chunks for faster iteration, then merge the outputs, shifting each chunk's timestamps by the chunk's start position in the original file. Combine with Qwen3 alignment tools for refined timestamps on each::labs.
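When chunk outputs are merged, each chunk's timestamps restart at zero and need to be shifted by the chunk's start offset. The sketch below shows one way to do that for a single channel; the helper names are hypothetical, and it assumes timestamps are in milliseconds and that sentences follow the shape of the example output.

```python
import copy


def shift_sentence(sentence, offset_ms, sentence_id):
    """Return a copy of a sentence with all timestamps moved by offset_ms."""
    shifted = copy.deepcopy(sentence)  # leave the chunk's original data untouched
    shifted["sentence_id"] = sentence_id
    shifted["begin_time"] += offset_ms
    shifted["end_time"] += offset_ms
    for word in shifted.get("words", []):
        word["begin_time"] += offset_ms
        word["end_time"] += offset_ms
    return shifted


def merge_chunks(chunks):
    """Merge (offset_ms, sentences) pairs, given in playback order, into one list."""
    merged = []
    for offset_ms, sentences in chunks:
        for sentence in sentences:
            merged.append(shift_sentence(sentence, offset_ms, len(merged)))
    return merged


hour_ms = 3_600_000
first = [{"sentence_id": 0, "begin_time": 80, "end_time": 500, "text": "uno", "words": []}]
second = [{"sentence_id": 0, "begin_time": 80, "end_time": 500, "text": "dos", "words": []}]
for sentence in merge_chunks([(0, first), (hour_ms, second)]):
    print(sentence["sentence_id"], sentence["begin_time"], sentence["text"])
```

Cutting chunks at silences rather than at exact hour marks reduces the chance of splitting a word across two chunks.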
Technical Specifications
- Input Formats: Supports aac, amr, flac, m4a, mp3, ogg, opus, wav, webm, wma, wmv, and video containers.
- Max File Size/Duration: Up to 2GB or 12 hours of audio, ideal for extended recordings.
- Language Support: 26 languages with multilingual transcription capabilities.
- Output Features: Text transcripts with word-level timestamps, emotion detection, inverse text normalization, multi-channel audio handling, and context biasing for domain-specific terms.
- Processing Mode: Asynchronous batch processing suited to large files; processing time varies with file length, and the mode favors throughput over real-time response.
- Architecture: Built on Qwen3-ASR family, leveraging non-autoregressive elements for alignment and transcription accuracy.
Things to Be Aware Of

Alibaba | Qwen3 ASR Flash Filetrans | Speech to Text may struggle with heavy background noise or overlapping speech in multi-channel inputs, reducing accuracy without clean audio. Common mistake: Uploading unsegmented ultra-long files without stable bandwidth, causing delays. Edge cases like rare dialects or rapid speech benefit from pre-testing. Resource needs are low for API use on each::labs, but monitor queue times during peak loads. Always verify timestamps against originals for video sync, as alignment can shift in complex acoustics.
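A quick automated pass can flag word timestamps worth checking by hand before syncing to video. The sketch below is a hypothetical helper, not part of the API: it reports zero-duration words (several appear in the example output, such as "a " at 320 to 320) and words that start before the previous one ends.

```python
def timestamp_issues(words):
    """Return (index, description) pairs for suspicious word timings."""
    issues = []
    previous_end = None
    for index, word in enumerate(words):
        if word["end_time"] < word["begin_time"]:
            issues.append((index, "ends before it begins"))
        elif word["end_time"] == word["begin_time"]:
            issues.append((index, "zero duration"))
        if previous_end is not None and word["begin_time"] < previous_end:
            issues.append((index, "starts before previous word ends"))
        previous_end = word["end_time"]
    return issues


words = [
    {"begin_time": 80, "end_time": 320, "text": "Hola "},
    {"begin_time": 320, "end_time": 320, "text": "a "},
    {"begin_time": 320, "end_time": 480, "text": "todos"},
]
print(timestamp_issues(words))
```

A flagged word is not necessarily wrong; the list is only a starting point for manual review.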
Key Considerations

Before using Alibaba | Qwen3 ASR Flash Filetrans | Speech to Text, ensure audio files meet format and size limits, as it thrives on batch uploads rather than live streams. Ideal for long-duration tasks like lectures or interviews where precision outweighs speed; for real-time needs, consider Qwen3 family alternatives. No specific hardware prerequisites beyond API access via each::labs, but larger files benefit from stable connections. Cost scales with processing volume, offering strong value for high-accuracy, multilingual outputs versus faster but less detailed competitors. Test with sample files to gauge performance on noisy or accented speech.
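A pre-flight check along these lines can catch files that exceed the documented limits before a job is submitted. This is a sketch under stated assumptions: the extension list is taken from this page, "mp4" is added only because the curl example uses an .mp4 input, "2GB" is interpreted as 2 GiB, and the function name `preflight` is hypothetical. The service may apply different or additional checks.

```python
import os

# Extensions listed on this page; "mp4" is included because the curl example uses it.
SUPPORTED_EXTENSIONS = {
    "aac", "amr", "flac", "m4a", "mp3", "ogg", "opus", "wav", "webm", "wma", "wmv", "mp4",
}
MAX_BYTES = 2 * 1024**3      # assumption: "2GB" means 2 GiB
MAX_DURATION_S = 12 * 3600   # 12 hours


def preflight(filename, size_bytes, duration_s=None):
    """Return a list of problems; an empty list means the file passes these checks."""
    problems = []
    extension = os.path.splitext(filename)[1].lower().lstrip(".")
    if extension not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported extension: {extension or '(none)'}")
    if size_bytes > MAX_BYTES:
        problems.append("file exceeds 2GB")
    if duration_s is not None and duration_s > MAX_DURATION_S:
        problems.append("audio exceeds 12 hours")
    return problems


print(preflight("lecture.mp3", 150_000_000, duration_s=5400))
print(preflight("lecture.xyz", 150_000_000))
```

Duration is optional here because measuring it requires an audio library or tool that this sketch does not assume.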
Limitations

The model focuses on batch file transcription, not real-time streaming, so live applications require alternatives. Input is limited to the supported formats; unsupported codecs need conversion first. Accuracy drops with extreme noise, very low-quality recordings, and heavy accents or languages outside the 26 supported. There is no native video analysis beyond audio extraction, so visual context is unavailable. The 2GB file size cap means larger content must be split or compressed before upload. There is no speaker diarization; speakers are separated only when they are recorded on separate channels.
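When each speaker is recorded on their own channel, the per-channel output can be interleaved into a single dialogue by sorting sentences on their start time. The sketch below assumes the output shape from the example above; the function name `dialogue` and the speaker names are hypothetical.

```python
def dialogue(result, speaker_names=None):
    """Interleave sentences from all channels in order of begin_time."""
    speaker_names = speaker_names or {}
    turns = []
    for channel in result:
        channel_id = channel["channel_id"]
        label = speaker_names.get(channel_id, f"channel {channel_id}")
        for sentence in channel["sentences"]:
            turns.append((sentence["begin_time"], label, sentence["text"]))
    turns.sort(key=lambda turn: turn[0])
    return [f"[{label}] {text}" for _, label, text in turns]


two_channel = [
    {"channel_id": 0, "sentences": [
        {"begin_time": 80, "text": "Hola a todos."},
        {"begin_time": 5000, "text": "Gracias."},
    ]},
    {"channel_id": 1, "sentences": [
        {"begin_time": 2500, "text": "Bienvenidos."},
    ]},
]
for line in dialogue(two_channel, {0: "Host", 1: "Guest"}):
    print(line)
```

This labels channels, not voices: two people sharing one channel will still appear under a single label.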