Ovi | Text to Video

OVI

Ovi introduces a unified paradigm for audio-video generation, seamlessly combining image, text, and sound to produce coherent, cinematic video outputs in which motion, visuals, and audio are generated together with natural synchronization and depth.

Avg Run Time: 45.000s

Model Slug: ovi-text-to-video

Release Date: October 15, 2025


Each execution costs $0.2000. With $1 you can run this model about 5 times.

API & SDK

Create a Prediction

Send a POST request to create a new prediction. This will return a prediction ID that you'll use to check the result. The request should include your model inputs and API key.
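
The following is a minimal Python sketch of this step. The base URL, header name, and request/response field names are assumptions for illustration only; take the exact values from the API reference and the model's input schema.

```python
# Minimal sketch: create a prediction for the ovi-text-to-video model.
# NOTE: the base URL, header name, and field names below are assumptions;
# replace them with the values from the API reference.
import requests

API_KEY = "YOUR_API_KEY"
BASE_URL = "https://api.eachlabs.ai/v1"  # assumed base URL

payload = {
    "model": "ovi-text-to-video",        # model slug from this page
    "input": {
        "prompt": "A woman at a rainy bus stop looks up and smiles.",
    },
}

resp = requests.post(
    f"{BASE_URL}/prediction/",
    headers={"X-API-Key": API_KEY, "Content-Type": "application/json"},
    json=payload,
    timeout=30,
)
resp.raise_for_status()
prediction_id = resp.json()["predictionID"]  # response field name is an assumption
print("Created prediction:", prediction_id)
```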

Get Prediction Result

Poll the prediction endpoint with the prediction ID until the result is ready. The API uses long-polling, so you'll need to repeatedly check until you receive a success status.
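
A matching sketch for the polling step, under the same assumptions about the endpoint, status values, and response fields:

```python
# Sketch: poll the prediction endpoint until the run finishes.
# Status strings and response fields are assumptions for illustration.
import time
import requests

API_KEY = "YOUR_API_KEY"
BASE_URL = "https://api.eachlabs.ai/v1"  # assumed base URL

def wait_for_result(prediction_id: str, interval_s: float = 2.0):
    while True:
        resp = requests.get(
            f"{BASE_URL}/prediction/{prediction_id}",
            headers={"X-API-Key": API_KEY},
            timeout=30,
        )
        resp.raise_for_status()
        data = resp.json()
        status = data.get("status")
        if status == "success":              # assumed success status value
            return data.get("output")        # e.g. a URL to the generated MP4
        if status in ("failed", "error", "canceled"):
            raise RuntimeError(f"Prediction ended with status {status!r}")
        time.sleep(interval_s)               # wait before checking again

video_url = wait_for_result("PREDICTION_ID_FROM_PREVIOUS_STEP")
print("Output:", video_url)
```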

Readme

Table of Contents
Overview
Technical Specifications
Key Considerations
Tips & Tricks
Capabilities
What Can I Use It For?
Things to Be Aware Of
Limitations

Overview

Ovi is an advanced, open-source AI model for unified audio-video generation, developed by Character.AI and introduced in 2025. It is designed to generate synchronized, cinematic video and audio content from text prompts or a combination of text and images. Unlike traditional pipelines that treat video and audio as separate outputs, Ovi models both modalities as a single generative process, ensuring natural synchronization and coherence between visuals and sound.

The core innovation of Ovi lies in its twin-backbone architecture, where identical DiT (Diffusion Transformer) modules are used for both audio and video branches. These branches are fused using blockwise cross-modal fusion, allowing for fine-grained exchange of timing and semantic information. This enables Ovi to produce high-quality, movie-grade video clips with realistic speech, emotional expression, and context-matched sound effects. Ovi is particularly notable for its ability to handle complex scenarios such as multi-person dialogues, precise lip-sync, and contextual sound generation, all within a unified framework.
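
To make the twin-backbone idea concrete, here is a deliberately simplified PyTorch sketch of one fused block: two identical transformer branches, one per modality, that exchange information through cross-attention at every block. It illustrates the fusion pattern described above, not Ovi's actual implementation, and all dimensions and module names are invented for the example.

```python
# Illustrative sketch (not Ovi's code) of blockwise cross-modal fusion between
# twin transformer branches: each block runs per-modality self-attention, then
# lets the video branch attend to audio tokens and vice versa.
import torch
import torch.nn as nn

class FusedTwinBlock(nn.Module):
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm_v = nn.LayerNorm(dim)
        self.norm_a = nn.LayerNorm(dim)
        self.video_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.audio_self = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Cross-attention in both directions is the "fusion" step.
        self.video_from_audio = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.audio_from_video = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, v: torch.Tensor, a: torch.Tensor):
        # Per-modality self-attention with residual connections.
        vn, an = self.norm_v(v), self.norm_a(a)
        v = v + self.video_self(vn, vn, vn)[0]
        a = a + self.audio_self(an, an, an)[0]
        # Blockwise cross-modal fusion: timing and semantic cues flow both ways.
        vn, an = self.norm_v(v), self.norm_a(a)
        v = v + self.video_from_audio(vn, an, an)[0]
        a = a + self.audio_from_video(an, vn, vn)[0]
        return v, a

# Toy usage: 120 video tokens and 200 audio tokens with 64-dim embeddings.
block = FusedTwinBlock(dim=64)
video_tokens = torch.randn(1, 120, 64)
audio_tokens = torch.randn(1, 200, 64)
video_tokens, audio_tokens = block(video_tokens, audio_tokens)
print(video_tokens.shape, audio_tokens.shape)
```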

Technical Specifications

  • Architecture: Twin-DiT modules with blockwise cross-modal fusion
  • Parameters: Approximately 11 billion (5B visual, 5B audio, 1B fusion)
  • Resolution: Supports multiple aspect ratios within roughly a 960×960-pixel area budget (e.g., 720×1280, 704×1344, 720×720); the default is 720×720 at 24 FPS
  • Input/Output formats: Inputs can be text-only or text plus image; outputs are 5-second video clips with synchronized audio
  • Performance metrics: Generates 5-second videos at 24 FPS; high-quality audio and video synchronization; optimized for short-form, cinematic content

Key Considerations

  • Ovi excels at human-centric scenarios due to its training data bias; best results are achieved with prompts involving people, dialogue, or emotional expression
  • For optimal output, provide detailed prompts specifying scene, characters, camera movement, and mood
  • Use embedded prompt tags to control speech and background audio (e.g., <S> ... <E> around spoken lines and <AUDCAP> ... <ENDAUDCAP> around audio descriptions); see the example prompt after this list
  • Outputs may vary between runs due to stochastic generation; try multiple random seeds for best results
  • High spatial compression and large model size balance speed and memory usage, but may limit extremely fine details or complex textures
  • Best suited for short-form content (5 seconds); longer sequences may require stitching or post-processing
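
Here is an example of a prompt built with those tags, wrapped in the kind of request body the API might expect. The tag syntax shown follows the prompt format published with the Ovi release, but verify it (and the input field names, which are assumptions here) against the current model documentation.

```python
# Example prompt using speech tags (<S>...<E>) and an audio-caption tag
# (<AUDCAP>...<ENDAUDCAP>). Field names in the payload are assumptions.
prompt = (
    "A barista behind a sunlit counter smiles at the camera and says "
    "<S>One oat-milk latte, coming right up.<E> "
    "She turns to the espresso machine as steam rises. "
    "<AUDCAP>Espresso machine hissing, soft cafe chatter, light jazz in the background.<ENDAUDCAP>"
)

payload = {
    "model": "ovi-text-to-video",
    "input": {"prompt": prompt},  # see the model's input schema for other fields
}
```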

Tips & Tricks

  • Use explicit speech tags (<S> ... <E>) to generate dialogue with accurate lip-sync and speaker differentiation
  • Describe background sounds or music inside <AUDCAP> ... <ENDAUDCAP> tags to guide contextual audio generation
  • For multi-person scenes, specify each speaker and their lines separately to achieve realistic conversations
  • Adjust aspect ratios (e.g., 9:16, 16:9, 1:1) to match the intended output format; higher resolutions (up to 960×960) can improve visual quality
  • Iterate with different random seeds to select the most coherent or visually appealing result (see the sketch after this list)
  • For best lip-sync and emotional expression, provide clear, concise speech content and specify emotions or tone in the prompt
  • Avoid overly complex scenes with many tiny objects, as fine detail may be lost due to model compression
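
As a sketch of the seed-iteration tip, the snippet below submits the same prompt several times with different seeds. The seed and aspect_ratio parameter names, endpoint, and response field are assumptions for illustration; use the parameters the model's input schema actually exposes.

```python
# Sketch: run the same prompt with several seeds, then pick the best clip.
# "seed" and "aspect_ratio" are assumed parameter names; the endpoint and
# response field follow the assumptions used in the API examples above.
import requests

API_KEY = "YOUR_API_KEY"
BASE_URL = "https://api.eachlabs.ai/v1"  # assumed base URL

def create_prediction(prompt: str, seed: int) -> str:
    resp = requests.post(
        f"{BASE_URL}/prediction/",
        headers={"X-API-Key": API_KEY},
        json={
            "model": "ovi-text-to-video",
            "input": {
                "prompt": prompt,
                "seed": seed,             # assumed parameter name
                "aspect_ratio": "16:9",   # assumed parameter name
            },
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["predictionID"]    # response field name is an assumption

prompt = (
    "A news anchor at a desk says <S>Good evening.<E> "
    "<AUDCAP>Quiet newsroom ambience, faint keyboard clicks.<ENDAUDCAP>"
)
prediction_ids = [create_prediction(prompt, seed) for seed in (1, 7, 42)]
# Poll each ID as shown in the API section, then keep the most coherent result.
```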

Capabilities

  • Generates synchronized video and audio from text or text+image prompts in a single unified process
  • Supports high-quality, context-matched speech, sound effects, and background music
  • Achieves precise lip synchronization without explicit face bounding boxes
  • Handles multi-person dialogue and multi-turn conversations naturally
  • Produces cinematic, movie-grade short video clips with diverse emotions and large motion ranges
  • Flexible aspect ratio and resolution support for various creative needs
  • Open-source release with pretrained weights and inference code for research and development

What Can I Use It For?

  • Creating talking avatars and digital humans for entertainment, marketing, or virtual assistants
  • Generating short cinematic video clips with synchronized dialogue and sound for storytelling or advertising
  • Producing educational content with animated speakers and contextual audio cues
  • Rapid prototyping of video ideas for filmmakers, animators, and content creators
  • Developing interactive media, such as conversational agents or immersive experiences
  • Showcasing creative projects involving music videos, sound effects, or emotional performances
  • Enabling research in multimodal AI, audio-visual synthesis, and human-computer interaction

Things to Be Aware Of

  • Some users report that Ovi's outputs are most consistent and high-quality in human-centric scenarios; non-human or abstract prompts may yield less coherent results
  • The model's reliance on high spatial compression can limit the rendering of intricate textures or very small details
  • Video length is currently limited to 5 seconds per generation; longer content requires manual assembly
  • Resource requirements are significant due to the 11B parameter size; high-end GPUs are recommended for local inference
  • Outputs can vary between runs; users often try multiple seeds to select the best result
  • Community feedback highlights the model's impressive lip-sync and emotional expressiveness, especially for dialogue-driven content
  • Some users note occasional artifacts or inconsistencies in complex scenes, particularly with rapid motion or crowded backgrounds
  • Positive reviews emphasize the ease of generating synchronized audio and video, as well as the open-source availability for experimentation

Limitations

  • Limited to short-form (5-second) video generation; not optimal for long-form or continuous video production (separate clips can be stitched together afterward; see the sketch after this list)
  • Visual detail and texture complexity are constrained by model compression and architecture, making it less suitable for scenes with many small or intricate elements
  • Best performance is achieved with human-centric prompts; less effective for non-human, abstract, or highly technical scenarios
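
For longer pieces, the usual workaround is to generate several 5-second clips and join them in post-production. A minimal sketch using ffmpeg's concat demuxer is shown below; it assumes ffmpeg is installed and that the clips share the same codec, resolution, and frame rate, as clips produced with the same model settings do.

```python
# Sketch: stitch several 5-second clips into one video with ffmpeg's concat
# demuxer. Stream copy (-c copy) avoids re-encoding but requires matching
# codec parameters across clips.
import subprocess
import tempfile
from pathlib import Path

def stitch_clips(clip_paths: list[str], output_path: str) -> None:
    # The concat demuxer reads a text file that lists the inputs in order.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        for clip in clip_paths:
            f.write(f"file '{Path(clip).resolve()}'\n")
        list_file = f.name
    subprocess.run(
        ["ffmpeg", "-y", "-f", "concat", "-safe", "0",
         "-i", list_file, "-c", "copy", output_path],
        check=True,
    )

stitch_clips(["clip_01.mp4", "clip_02.mp4", "clip_03.mp4"], "combined.mp4")
```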

Output Format: MP4

Pricing

Pricing Detail

This model runs at a cost of $0.20 per execution.

Pricing Type: Fixed

The cost remains the same regardless of your inputs or how long the run takes. There are no variables affecting the price; it is a set, fixed amount per execution, as the name suggests. This makes budgeting simple and predictable because you pay the same fee every time you run the model.