Video Transcribe

whisperx-video-transcribe

Transform videos into accurate, text-based transcripts effortlessly with Video Transcribe

T4 16GB
Fast Inference
REST API

Model Information

Response Time: ~35 sec
Status: Active
Version: 0.0.1
Updated: about 1 month ago

Prerequisites

  • Create an API Key from the Eachlabs Console
  • Install the required dependencies for your chosen language (e.g., requests for Python)
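
For Python, the examples below need only the requests library (time is part of the standard library):

pip install requests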

API Integration Steps

1. Create a Prediction

Send a POST request to create a new prediction. This will return a prediction ID that you'll use to check the result. The request should include your model inputs and API key.

import requests
import time

API_KEY = "YOUR_API_KEY"  # Replace with your API key

HEADERS = {
    "X-API-Key": API_KEY,
    "Content-Type": "application/json"
}

def create_prediction():
    response = requests.post(
        "https://api.eachlabs.ai/v1/prediction/",
        headers=HEADERS,
        json={
            "model": "whisperx-video-transcribe",
            "version": "0.0.1",
            "input": {
                "url": "your url here",  # URL of the video to transcribe
                "debug": False,          # Python boolean, serialized to JSON false
                "batch_size": "16"
            }
        }
    )
    prediction = response.json()
    if prediction["status"] != "success":
        raise Exception(f"Prediction failed: {prediction}")
    return prediction["predictionID"]

2. Get Prediction Result

Poll the prediction endpoint with the prediction ID until the result is ready. Results are not returned immediately, so you'll need to check repeatedly until you receive a success status.

def get_prediction(prediction_id):
    while True:
        result = requests.get(
            f"https://api.eachlabs.ai/v1/prediction/{prediction_id}",
            headers=HEADERS
        ).json()
        if result["status"] == "success":
            return result
        elif result["status"] == "error":
            raise Exception(f"Prediction failed: {result}")
        time.sleep(1)  # Wait before polling again
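
The loop above polls forever. Since the typical response time is ~35 seconds, a bounded variant is often safer in production. A minimal sketch; the 120-second cap is an arbitrary choice, not an API requirement:

def get_prediction_with_timeout(prediction_id, timeout_s=120):
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        result = requests.get(
            f"https://api.eachlabs.ai/v1/prediction/{prediction_id}",
            headers=HEADERS
        ).json()
        if result["status"] == "success":
            return result
        elif result["status"] == "error":
            raise Exception(f"Prediction failed: {result}")
        time.sleep(1)
    # Give up once the deadline passes rather than hanging indefinitely
    raise TimeoutError(f"Prediction {prediction_id} not ready after {timeout_s}s")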

3. Complete Example

Here's a complete example that puts it all together, including error handling and result processing. This shows how to create a prediction and wait for the result in a production environment.

try:
    # Create prediction
    prediction_id = create_prediction()
    print(f"Prediction created: {prediction_id}")

    # Get result
    result = get_prediction(prediction_id)
    print(f"Output URL: {result['output']}")
    print(f"Processing time: {result['metrics']['predict_time']}s")
except Exception as e:
    print(f"Error: {e}")

Additional Information

  • The API uses a two-step process: create a prediction, then poll for the result
  • Response time: ~35 seconds
  • Rate limit: 60 requests/minute
  • Concurrent requests: 10 maximum (a client-side throttling sketch follows this list)
  • Poll the prediction status until it completes
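
If you batch many videos, enforce those limits client-side. A minimal sketch: a thread pool capped at 10 workers for the concurrency limit, and calls spaced about one second apart to stay under 60 requests/minute. Here transcribe_one (a wrapper around create_prediction and get_prediction) and video_urls are hypothetical placeholders:

import time
import threading
from concurrent.futures import ThreadPoolExecutor

_rate_lock = threading.Lock()
_last_request = [0.0]

def throttled(fn, *args):
    # Space calls ~1 second apart: 60 requests/minute.
    # Note: polling GETs inside the job also count toward the limit.
    with _rate_lock:
        wait = 1.0 - (time.time() - _last_request[0])
        if wait > 0:
            time.sleep(wait)
        _last_request[0] = time.time()
    return fn(*args)

# At most 10 jobs in flight: the documented concurrency maximum
with ThreadPoolExecutor(max_workers=10) as pool:
    futures = [pool.submit(throttled, transcribe_one, url) for url in video_urls]
    results = [f.result() for f in futures]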

Overview

WhisperX is an advanced transcription model for video and audio processing. Developed by adidoes, it leverages cutting-edge AI to provide accurate, efficient, and context-aware transcriptions. Its capabilities extend beyond basic transcription to speaker identification and timestamping, making it a powerful tool for multimedia content creators and analysts.

Technical Specifications

  • Architecture: Built on the Whisper architecture with enhancements for real-time processing and multi-language support.
  • Speaker Diarization: Identifies and labels multiple speakers in audio, aiding in meeting transcription and interview analysis.
  • Timestamping: Generates precise timestamps for each segment of the transcription, enabling easy navigation and editing.
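
The exact response schema is not documented on this page, but WhisperX-style transcripts are typically lists of segments, each carrying start/end times, the recognized text, and a speaker label when diarization runs. An illustrative segment, with field names following common WhisperX conventions rather than a confirmed Eachlabs schema:

# Illustrative only: not a documented Eachlabs response format
segment = {
    "start": 12.34,           # segment start time, in seconds
    "end": 15.01,             # segment end time, in seconds
    "text": "Welcome back to the show.",
    "speaker": "SPEAKER_00",  # present when diarization is enabled
}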

Key Considerations

  • Audio Quality: Background noise, overlapping speech, or low-quality recordings can affect transcription accuracy.
  • Language Accuracy: While the model is proficient in multiple languages, certain dialects or rare languages may yield less accurate results.

Tips & Tricks

  • Audio Quality:
    • For best results, use clear, high-quality audio files with minimal background noise.
    • Pre-process noisy or distorted audio to improve transcription accuracy.
  • Timestamps:
    • Ensure the audio has consistent pacing for accurate timestamp alignment.
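
A common pre-processing step is a band-pass filter that trims low-frequency rumble and high-frequency hiss before uploading. A minimal sketch driving ffmpeg from Python; it assumes ffmpeg is installed, and the 200 Hz / 3 kHz cutoffs are illustrative starting points for speech, not tuned values:

import subprocess

# Keep roughly the speech band: high-pass at 200 Hz, low-pass at 3 kHz
subprocess.run([
    "ffmpeg", "-y", "-i", "input.mp4",
    "-af", "highpass=f=200,lowpass=f=3000",
    "clean_audio.wav",
], check=True)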

Capabilities

  • Accurate Transcription: Provides high-quality transcriptions with minimal errors for clear audio sources.
  • Speaker Labeling: Identifies and tags individual speakers, aiding in multi-speaker content analysis.

What can I use it for?

  • Meeting Transcriptions: Record and transcribe meetings or interviews for easy documentation and review.
  • Podcast Summaries: Convert podcast audio into text for blog posts, summaries, or SEO optimization.

More Use Cases

  • Podcast and Interview Transcription:
    • Convert audio content into searchable, editable text for archiving or publication.
  • Academic and Market Research:
    • Transcribe focus groups, interviews, or lectures for data analysis and reporting.
  • Language Practice and Learning:
    • Use transcriptions to study pronunciation, grammar, and vocabulary in real-world contexts.

Limitations

  • Background Noise Sensitivity: The model may struggle with heavily distorted or noisy audio sources.
  • Complex Speaker Overlap: In scenarios with multiple speakers talking simultaneously, diarization may not be fully accurate.

Output Format: Text

Related AI Models

incredibly-fast-whisper (Incredibly Fast Whisper): Voice to Text