OmniHuman

OmniHuman is an image-to-video generation model that creates realistic videos or animations from an image and performs lip sync with audio.

Partner Model · Fast Inference · REST API

Model Information

Response Time: ~200 sec
Status: Active
Version: 0.0.1
Updated: 3 days ago

Prerequisites

  • Create an API Key from the Eachlabs Console
  • Install the required dependencies for your chosen language (e.g., requests for Python)

API Integration Steps

1. Create a Prediction

Send a POST request to create a new prediction. This will return a prediction ID that you'll use to check the result. The request should include your model inputs and API key.

import requests
import time

API_KEY = "YOUR_API_KEY"  # Replace with your API key

HEADERS = {
    "X-API-Key": API_KEY,
    "Content-Type": "application/json"
}

def create_prediction():
    response = requests.post(
        "https://api.eachlabs.ai/v1/prediction/",
        headers=HEADERS,
        json={
            "model": "omnihuman",
            "version": "0.0.1",
            "input": {
                "mode": "normal",
                "audio_url": "https://storage.googleapis.com/magicpoint/inputs/omnihuman_audio.mp3",
                "image_url": "https://storage.googleapis.com/magicpoint/models/women.png"
            }
        }
    )
    prediction = response.json()
    if prediction["status"] != "success":
        raise Exception(f"Prediction failed: {prediction}")
    return prediction["predictionID"]

2. Get Prediction Result

Poll the prediction endpoint with the prediction ID until the result is ready. The API uses long-polling, so you'll need to repeatedly check until you receive a success status.

def get_prediction(prediction_id):
    while True:
        result = requests.get(
            f"https://api.eachlabs.ai/v1/prediction/{prediction_id}",
            headers=HEADERS
        ).json()
        if result["status"] == "success":
            return result
        elif result["status"] == "error":
            raise Exception(f"Prediction failed: {result}")
        time.sleep(1)  # Wait before polling again

3. Complete Example

Here's a complete example that puts it all together, including error handling and result processing. This shows how to create a prediction and wait for the result in a production environment.

try:
    # Create prediction
    prediction_id = create_prediction()
    print(f"Prediction created: {prediction_id}")

    # Get result
    result = get_prediction(prediction_id)
    print(f"Output URL: {result['output']}")
    print(f"Processing time: {result['metrics']['predict_time']}s")
except Exception as e:
    print(f"Error: {e}")

Additional Information

  • The API uses a two-step process: create prediction and poll for results
  • Response time: ~200 seconds
  • Rate limit: 60 requests/minute
  • Concurrent requests: 10 maximum
  • Use long-polling to check prediction status until completion (a rate-limit-friendly polling variant is sketched below)
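
Given the ~200-second response time and the 60 requests/minute rate limit above, polling every second (as in the integration example) works but generates many requests. The following is a minimal sketch of a gentler polling loop with exponential backoff; it reuses requests and HEADERS from the integration example, and the delay schedule, cap, and timeout values are illustrative assumptions rather than documented requirements.

import time

def get_prediction_with_backoff(prediction_id, max_wait=600):
    # Illustrative variant of get_prediction() above: backs off between
    # checks to stay well under the 60 requests/minute rate limit.
    # The delay values and max_wait are assumptions, not API requirements.
    delay = 2
    waited = 0
    while waited < max_wait:
        result = requests.get(
            f"https://api.eachlabs.ai/v1/prediction/{prediction_id}",
            headers=HEADERS
        ).json()
        if result["status"] == "success":
            return result
        if result["status"] == "error":
            raise Exception(f"Prediction failed: {result}")
        time.sleep(delay)
        waited += delay
        delay = min(delay * 2, 30)  # cap the polling interval at 30 seconds
    raise TimeoutError(f"Prediction {prediction_id} not finished after {max_wait}s")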

Overview

OmniHuman is an advanced technology developed by ByteDance researchers that creates highly realistic human videos from a single image and a motion signal, such as audio or video. It can animate portraits, half-body, or full-body images with natural movements and lifelike gestures. By combining different inputs, like images and sound, OmniHuman brings still images to life with remarkable detail and realism.

Technical Specifications

  • Modes:
    • Normal: Standard output generation with balanced processing speed and accuracy.
    • Dynamic: More flexible and adaptive response with a focus on contextual awareness.
  • Input Handling: Supports multiple formats and performs pre-processing for enhanced output quality.
  • Output Generation: Generates coherent and high-fidelity human-like responses based on the provided inputs.
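
The mode is selected through the mode field of the request input; everything else in the payload stays the same. A minimal sketch of the two variants, using the sample assets from the integration example above:

# Example input payloads; only the "mode" value differs between the two.
normal_input = {
    "mode": "normal",   # balanced speed and accuracy, keeps the original aspect ratio
    "audio_url": "https://storage.googleapis.com/magicpoint/inputs/omnihuman_audio.mp3",
    "image_url": "https://storage.googleapis.com/magicpoint/models/women.png"
}

dynamic_input = {
    "mode": "dynamic",  # more adaptive motion, output cropped to 512 x 512
    "audio_url": "https://storage.googleapis.com/magicpoint/inputs/omnihuman_audio.mp3",
    "image_url": "https://storage.googleapis.com/magicpoint/models/women.png"
}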

Key Considerations

  • High-resolution images yield better performance compared to low-quality images.
  • Background noise in audio files can impact accuracy.
  • Dynamic mode may require more processing time but offers better adaptability.
  • The model is optimized for human faces; images of other subjects may lead to unexpected results.
  • Ensure URLs are accessible and not restricted by security settings.

Tips & Tricks

  • Mode Selection:
    • Use normal mode for standard, structured responses.
    • Use dynamic mode for more adaptive and nuanced outputs.
  • Audio Input (audio_url):
    • Prefer lossless formats (e.g., WAV) over compressed formats (e.g., MP3) for better clarity.
    • Keep audio length within a reasonable range to avoid processing delays.
    • Ensure the speech is clear, with minimal background noise.
    • Audio Normal Mode Length Limit: In normal mode, the maximum supported audio length is 180 seconds.
    • Audio Dynamic Mode Length Limit: In dynamic mode, the maximum audio length supported for pets is 90 seconds, and for real-person images, it is 180 seconds. (A quick local length check is sketched after this list.)
  • Image Input (image_url):
    • Use high-resolution, well-lit, front-facing images.
    • Avoid extreme facial angles or obstructions (e.g., sunglasses, masks) for best results.
    • Images with neutral expressions tend to produce more reliable outputs.
    • Supported Normal Mode Input Types: Normal mode can drive all picture types, including real people, anime characters, and pets.
    • Supported Dynamic Mode Input Types: Dynamic mode can drive all picture types, including real people, anime characters, and pets.
  • Output:
    • Normal Mode Output Feature: The output keeps the original image's aspect ratio.
    • Dynamic Mode Output Feature: The original image is cropped to a fixed 1:1 aspect ratio, producing output at a resolution of 512 × 512.
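
Because requests with over-length audio run into the limits listed above, it can help to check the duration locally before submitting. The sketch below uses Python's standard wave module and therefore only handles uncompressed WAV files (the format recommended above); the file name is a placeholder.

import wave

MAX_AUDIO_SECONDS = 180  # normal mode limit from the notes above

def wav_duration_seconds(path):
    # Works for uncompressed WAV only; compressed formats such as MP3
    # would need a third-party library (not shown here).
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / float(wf.getframerate())

duration = wav_duration_seconds("speech.wav")  # placeholder file name
if duration > MAX_AUDIO_SECONDS:
    raise ValueError(f"Audio is {duration:.0f}s; the limit is {MAX_AUDIO_SECONDS}s")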

Capabilities

  • Processes both audio and image inputs to generate human-like responses.
  • Adapts to different scenarios using configurable modes.
  • Supports real-time and batch processing (a simple batch sketch follows this list).
  • Handles a variety of input formats for flexible usage.
  • Ensures coherence between audio and image-based outputs.
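
Batch processing, mentioned in the list above, can be approximated client-side by submitting several predictions first and then collecting the results. A minimal sketch under these assumptions: the input URLs are hypothetical placeholders, HEADERS and get_prediction() come from the API Integration Steps, and the 10-item cap mirrors the documented concurrency limit.

# Hypothetical batch: each entry is an (image_url, audio_url) pair.
batch_inputs = [
    ("https://example.com/portrait1.png", "https://example.com/speech1.wav"),
    ("https://example.com/portrait2.png", "https://example.com/speech2.wav"),
]

prediction_ids = []
for image_url, audio_url in batch_inputs[:10]:  # stay within the 10-concurrent cap
    response = requests.post(
        "https://api.eachlabs.ai/v1/prediction/",
        headers=HEADERS,
        json={
            "model": "omnihuman",
            "version": "0.0.1",
            "input": {"mode": "normal", "image_url": image_url, "audio_url": audio_url}
        }
    )
    prediction = response.json()
    if prediction["status"] != "success":
        raise Exception(f"Prediction failed: {prediction}")
    prediction_ids.append(prediction["predictionID"])

# Collect results after all jobs have been submitted.
outputs = [get_prediction(pid)["output"] for pid in prediction_ids]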

What can I use it for?

  • Voice and facial recognition-based response systems.
  • Interactive AI-driven conversational agents.
  • Enhanced multimedia content creation.
  • Automated dubbing and voice sync applications.
  • Contextually aware AI-based character simulation.

Things to be aware of

  • Experiment with different image angles to observe variations in output.
  • Use high-quality audio inputs to test response accuracy.
  • Compare normal and dynamic modes for different response behaviors (see the comparison sketch after this list).
  • Process multiple inputs to evaluate consistency in generated outputs.
  • Try combining varied voice tones and facial expressions to analyze adaptability.
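
To compare the two modes, as suggested above, submit the same image and audio once per mode and inspect both outputs. A minimal sketch reusing the normal_input and dynamic_input payloads from the Technical Specifications section and the helpers from the API Integration Steps:

for label, payload in (("normal", normal_input), ("dynamic", dynamic_input)):
    response = requests.post(
        "https://api.eachlabs.ai/v1/prediction/",
        headers=HEADERS,
        json={"model": "omnihuman", "version": "0.0.1", "input": payload}
    )
    prediction = response.json()
    if prediction["status"] != "success":
        raise Exception(f"Prediction failed: {prediction}")
    result = get_prediction(prediction["predictionID"])
    print(f"{label} mode output: {result['output']}")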

Limitations

  • Performance may vary based on the quality of input data.
  • Complex or noisy backgrounds in images can lead to inaccurate outputs.
  • Poor audio quality may result in misinterpretations.
  • Processing time may increase for larger files or complex scenarios.
  • The model is primarily trained on human faces; other objects may yield unexpected results.
  • Audio Normal Mode Length Limit: In normal mode, the maximum supported audio length is 180 seconds.
  • Audio Dynamic Mode Length Limit: In dynamic mode, the maximum audio length supported for pets is 90 seconds, and for real-person images, it is 180 seconds.

Output Format: MP4
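
Since the model returns an MP4, the output URL in the prediction result can be saved to disk directly. A small sketch, assuming the URL in result["output"] is publicly downloadable as in the complete example above; the local file name is a placeholder.

video = requests.get(result["output"])
video.raise_for_status()
with open("omnihuman_output.mp4", "wb") as f:  # placeholder file name
    f.write(video.content)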

Related AI Models

  • Kling v1.6 Image to Video (kling-ai-image-to-video) – Image to Video
  • Magic Animate (magic-animate) – Image to Video
  • Live Portrait (live-portrait) – Image to Video
  • SadTalker (sadtalker) – Image to Video