
OMNIHUMAN

OmniHuman creates realistic videos from an image and audio, making the character move and express emotions in sync with the sound.

Avg Run Time: 150 seconds

Model Slug: bytedance-omnihuman

Playground

Input

Image: Enter a URL or choose a file from your computer.

Audio: Enter a URL or choose a file from your computer.

Output

Example Result

Preview and download your result.

Cost is calculated based on output duration at $0.14 per second; for $1 you can generate approximately 7 seconds of output.

API & SDK

Create a Prediction

Send a POST request to create a new prediction. This will return a prediction ID that you'll use to check the result. The request should include your model inputs and API key.
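Below is a minimal sketch of this step in Python using the requests library. The endpoint URL, the API-key header name, and the payload and response field names are illustrative assumptions rather than the exact Eachlabs API schema; check the API reference for the real names before using it.

```python
# Sketch: create a prediction for bytedance-omnihuman.
# Endpoint, header, and field names below are assumptions, not confirmed API details.
import requests

API_KEY = "YOUR_API_KEY"                                # your Eachlabs API key
CREATE_URL = "https://api.eachlabs.ai/v1/prediction/"   # hypothetical endpoint

payload = {
    "model": "bytedance-omnihuman",                     # model slug from this page
    "input": {
        "image": "https://example.com/portrait.png",    # portrait image URL
        "audio": "https://example.com/speech.mp3",      # audio track URL
    },
}

resp = requests.post(
    CREATE_URL,
    json=payload,
    headers={"X-API-Key": API_KEY},                     # header name is an assumption
    timeout=30,
)
resp.raise_for_status()
prediction_id = resp.json()["predictionID"]             # response field name is an assumption
print("Created prediction:", prediction_id)
```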

Get Prediction Result

Poll the prediction endpoint with the prediction ID until the result is ready. Results are returned asynchronously, so you'll need to check repeatedly until you receive a success status.
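Continuing the sketch above, a simple polling loop might look like the following. Again, the endpoint path, header, and the status/output field names are assumptions made for illustration.

```python
# Sketch: poll for the prediction result until it succeeds or fails.
# Endpoint, header, and field names are assumptions, not confirmed API details.
import time
import requests

API_KEY = "YOUR_API_KEY"
prediction_id = "PREDICTION_ID_FROM_CREATE_STEP"
RESULT_URL = f"https://api.eachlabs.ai/v1/prediction/{prediction_id}"  # hypothetical endpoint

while True:
    resp = requests.get(RESULT_URL, headers={"X-API-Key": API_KEY}, timeout=30)
    resp.raise_for_status()
    result = resp.json()
    status = result.get("status")                        # status field name is an assumption
    if status == "success":
        print("Output video:", result.get("output"))     # output field name is an assumption
        break
    if status in ("failed", "error"):
        raise RuntimeError(f"Prediction failed: {result}")
    time.sleep(5)  # average run time is ~150 seconds, so pause between checks
```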

Readme

Table of Contents
Overview
Technical Specifications
Key Considerations
Tips & Tricks
Capabilities
What Can I Use It For?
Things to Be Aware Of
Limitations

Overview

OmniHuman is an advanced AI model developed by ByteDance that specializes in generating realistic, expressive videos from a single image and an audio track. The model is designed to animate a static human portrait, making the character move, speak, and express emotions in perfect sync with the provided audio. This capability is particularly valuable for creating talking-head videos, digital avatars, and personalized content at scale.

Key features of OmniHuman include high-fidelity lip-syncing, nuanced facial expression generation, and support for multiple languages and audio types. The model leverages a multimodal approach, combining image, audio, and optionally text inputs to produce cohesive, controllable video outputs. Its architecture is built for precision and adaptability, enabling users to steer the generated content with a high degree of control. OmniHuman stands out for its ability to maintain character consistency, deliver natural motion, and offer open-source accessibility for further development and customization.

Technical Specifications

  • Architecture: Multimodal video generation model (details not fully disclosed, but incorporates collaborative multimodal conditioning and reference-based synthesis)
  • Parameters: Not publicly specified
  • Resolution: Supports up to 720p; short side of video frame determines resolution
  • Input/Output formats:
      • Inputs: Image (jpg, jpeg, png, webp, gif, avif), Audio (mp3, ogg, wav, m4a, aac), Text (optional, for scene control)
      • Output: Video (mp4)
  • Performance metrics:
      • High-precision lip-sync
      • Maintains character consistency across frames
      • Typical video length: up to 4 seconds per generation
      • Requires high VRAM for local runs

Key Considerations

  • High-quality input images yield more realistic and expressive video outputs
  • Audio clarity and proper trimming improve lip-sync accuracy and emotional expression
  • Multilingual support enables global content creation, but some languages may perform better than others depending on training data
  • For best results, ensure the subject in the image is facing forward with a neutral background
  • Video duration is limited (typically up to 4 seconds per generation), so plan content accordingly
  • Combining multiple reference images can enhance character consistency but may increase resource requirements
  • Prompt engineering (when using text input) allows for fine control over scene elements and actions
  • Quality vs speed: Higher resolutions and longer durations require more computational resources and time

Tips & Tricks

  • Use high-resolution, well-lit portrait images for the most natural facial animation
  • Clean, noise-free audio files improve synchronization and emotional nuance
  • For multilingual projects, test short samples in each target language to ensure lip-sync quality
  • If using text prompts, be specific about desired actions or emotions to steer the animation effectively
  • Experiment with different seeds to generate varied results from the same inputs
  • For iterative refinement, adjust the image or audio slightly and re-run to fine-tune expressions or timing
  • To maintain character consistency across multiple videos, use the same reference image and similar audio characteristics
  • Advanced: Combine photo, audio, and text inputs to add objects or scene elements for more complex video outputs

Capabilities

  • Generates expressive, lip-synced talking-head videos from a single image and audio track
  • Supports multilingual audio input for global content creation
  • Maintains high character consistency, even with multiple reference images
  • Allows for fine-grained control over facial expressions and emotions
  • Capable of integrating additional scene elements via text prompts
  • Produces raw, unfiltered outputs suitable for further post-processing
  • Adaptable for both creative and professional applications

What Can I Use It For?

  • Creating personalized video avatars for customer support, marketing, or education
  • Generating localized product update videos with native-language narration
  • Producing rapid, scalable content for social media, advertising, and entertainment
  • Powering digital humans in virtual events, games, or interactive experiences
  • Enabling creative projects such as AI-generated music videos, storytelling, or animation
  • Supporting accessibility by generating sign language or expressive avatars for diverse audiences
  • Automating video content creation for news, announcements, or internal communications

Things to Be Aware Of

  • Some experimental features, such as combining multiple reference images or adding scene objects, may require additional prompt tuning and computational resources
  • Users have reported that the model performs best with clear, frontal portrait images and high-quality audio
  • Community feedback highlights strong lip-sync accuracy and natural facial expressions as major strengths
  • Known quirks include occasional artifacts or unnatural movements if the input image is low quality or the audio is unclear
  • Performance benchmarks indicate that higher resolutions and longer videos require significant VRAM and processing time
  • Positive user feedback emphasizes the model’s controllability, open-source accessibility, and adaptability for diverse use cases
  • Common concerns include the short maximum video duration and the need for powerful hardware for local runs

Limitations

  • Maximum video length is limited (typically up to 4 seconds per generation), restricting use for longer-form content
  • Requires high-quality input images and audio for optimal results; subpar inputs can lead to artifacts or reduced realism
  • High computational resource requirements may limit accessibility for users without advanced hardware

Pricing

Pricing Detail

This model runs at a cost of $0.14 per second of generated output.

The average execution time is 150 seconds, but this may vary depending on your input data and complexity.

The cost per run varies based on the generated output duration and complexity.

Pricing Type: Cost Per Second

Cost Per Second means pricing is based on the generated output duration: you pay for each second of output the model generates. Your inputs affect the final cost because they influence the length and complexity of the generated content.
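For example, a 5-second output costs 5 × $0.14 = $0.70, and $1 buys roughly 7 seconds of output (1 ÷ 0.14 ≈ 7.1).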