OVI
Ovi introduces a unified paradigm for audio-video generation, seamlessly combining image, text, and sound to produce coherent, cinematic videos in which motion, visuals, and audio are generated together with natural synchronization.
Avg Run Time: 45.000s
Model Slug: ovi-text-to-video
Release Date: October 15, 2025
API & SDK
Create a Prediction
Send a POST request to create a new prediction. This will return a prediction ID that you'll use to check the result. The request should include your model inputs and API key.
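For illustration, a minimal sketch of the create call in Python, assuming a JSON REST API: the endpoint URL, header, and request/response field names below are placeholders, not the documented schema.

```python
import requests

API_KEY = "YOUR_API_KEY"
# Hypothetical endpoint; substitute the real one from the API docs.
CREATE_URL = "https://api.example.com/v1/predictions"

payload = {
    "model": "ovi-text-to-video",  # model slug from this page
    "input": {
        "prompt": "A chef in a sunlit kitchen says <S>Dinner is ready!<E>",
    },
}

resp = requests.post(
    CREATE_URL,
    json=payload,
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=30,
)
resp.raise_for_status()
prediction_id = resp.json()["id"]  # assumed response field
print("prediction id:", prediction_id)
```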
Get Prediction Result
Poll the prediction endpoint with the prediction ID until the result is ready. The API uses long-polling, so you'll need to repeatedly check until you receive a success status.
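A matching polling loop under the same assumptions; the status values and response fields ("succeeded", "failed", "output") are likewise placeholders.

```python
import time

import requests

API_KEY = "YOUR_API_KEY"
prediction_id = "ID_RETURNED_BY_THE_CREATE_CALL"
# Hypothetical endpoint; substitute the real one from the API docs.
RESULT_URL = f"https://api.example.com/v1/predictions/{prediction_id}"

while True:
    resp = requests.get(
        RESULT_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        timeout=30,
    )
    resp.raise_for_status()
    result = resp.json()
    status = result.get("status")  # assumed field name
    if status == "succeeded":
        print("video url:", result["output"])  # assumed field name
        break
    if status == "failed":
        raise RuntimeError(result.get("error", "prediction failed"))
    time.sleep(2)  # wait briefly between checks
```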
Readme
Overview
Ovi is an advanced, open-source AI model for unified audio-video generation, developed by Character.AI and introduced in 2025. It is designed to generate synchronized, cinematic video and audio content from text prompts or a combination of text and images. Unlike traditional pipelines that treat video and audio as separate outputs, Ovi models both modalities as a single generative process, ensuring natural synchronization and coherence between visuals and sound.
The core innovation of Ovi lies in its twin-backbone architecture, where identical DiT (Diffusion Transformer) modules are used for both audio and video branches. These branches are fused using blockwise cross-modal fusion, allowing for fine-grained exchange of timing and semantic information. This enables Ovi to produce high-quality, movie-grade video clips with realistic speech, emotional expression, and context-matched sound effects. Ovi is particularly notable for its ability to handle complex scenarios such as multi-person dialogues, precise lip-sync, and contextual sound generation, all within a unified framework.
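To make the fusion idea concrete, here is a minimal PyTorch sketch of blockwise cross-modal fusion between twin backbones. It is not Ovi's implementation: the dimensions, normalization placement, and use of plain bidirectional cross-attention are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FusedBlock(nn.Module):
    """One twin-block pair: each modality runs self-attention, then
    cross-attends to the other's hidden states, so timing and semantic
    information is exchanged at every block rather than once at the end."""

    def __init__(self, dim: int, heads: int):
        super().__init__()
        self.self_v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_v1, self.norm_v2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.norm_a1, self.norm_a2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, v: torch.Tensor, a: torch.Tensor):
        nv, na = self.norm_v1(v), self.norm_a1(a)
        v = v + self.self_v(nv, nv, nv)[0]
        a = a + self.self_a(na, na, na)[0]
        # Blockwise fusion: video queries audio and vice versa.
        nv, na = self.norm_v2(v), self.norm_a2(a)
        v_out = v + self.cross_v(nv, na, na)[0]
        a_out = a + self.cross_a(na, nv, nv)[0]
        return v_out, a_out

# Toy usage: 16 video tokens and 32 audio tokens, fused over two blocks.
blocks = nn.ModuleList(FusedBlock(dim=64, heads=4) for _ in range(2))
v = torch.randn(1, 16, 64)
a = torch.randn(1, 32, 64)
for blk in blocks:
    v, a = blk(v, a)
print(v.shape, a.shape)  # torch.Size([1, 16, 64]) torch.Size([1, 32, 64])
```

The point the sketch captures is that the exchange happens inside every block instead of in a single late-fusion step, which is what lets timing cues (e.g., for lip-sync) propagate through both stacks.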
Technical Specifications
- Architecture: Twin-DiT modules with blockwise cross-modal fusion
- Parameters: Approximately 11 billion (5B visual, 5B audio, 1B fusion)
- Resolution: Supports up to 960×960 (e.g., 720×1280, 704×1344, 720×720); default is 720×720 at 24 FPS
- Input/Output formats: Inputs can be text-only or text plus image; outputs are 5-second video clips with synchronized audio
- Performance metrics: Generates 5-second videos at 24 FPS; high-quality audio and video synchronization; optimized for short-form, cinematic content
Key Considerations
- Ovi excels at human-centric scenarios due to its training data bias; best results are achieved with prompts involving people, dialogue, or emotional expression
- For optimal output, provide detailed prompts specifying scene, characters, camera movement, and mood
- Use embedded tags to control speech and background audio: wrap spoken lines in <S>…<E> and describe ambient sound or music in <AUDCAP>…<ENDAUDCAP> (see the example after this list)
- Outputs may vary between runs due to stochastic generation; try multiple random seeds for best results
- High spatial compression keeps speed and memory usage manageable despite the large model size, but it may limit extremely fine details or complex textures
- Best suited for short-form content (5 seconds); longer sequences may require stitching or post-processing
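For illustration, a single-speaker prompt in this tag format might look like the following; the tag syntax matches the controls described above, while the scene itself is invented.

```python
# Illustrative prompt only; <S>...<E> marks speech and
# <AUDCAP>...<ENDAUDCAP> marks the audio description.
prompt = (
    "A woman stands on a rainy rooftop at dusk, camera slowly pushing in. "
    "She smiles and says <S>We made it. We actually made it.<E> "
    "<AUDCAP>Soft rain, distant traffic, a gentle piano swell.<ENDAUDCAP>"
)
```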
Tips & Tricks
- Use explicit speech tags (<S>…<E>) to generate dialogue with accurate lip-sync and speaker differentiation
- Describe background sounds or music using <AUDCAP>…<ENDAUDCAP> to guide contextual audio generation
- For multi-person scenes, specify each speaker and their lines separately to achieve realistic conversations (see the multi-speaker example after this list)
- Adjust aspect ratios (e.g., 9:16, 16:9, 1:1) to match the intended output format; higher resolutions (up to 960×960) can improve visual quality
- Iterate with different random seeds to select the most coherent or visually appealing result
- For best lip-sync and emotional expression, provide clear, concise speech content and specify emotions or tone in the prompt
- Avoid overly complex scenes with many tiny objects, as fine detail may be lost due to model compression
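And a hypothetical multi-person prompt, giving each speaker a separately tagged line as recommended above.

```python
# Illustrative multi-speaker prompt; each speaker's line gets its own
# <S>...<E> block so voices and lip-sync can be differentiated.
prompt = (
    "Two friends argue playfully across a diner booth, handheld camera. "
    "The man leans forward and says <S>You ordered the last slice again?<E> "
    "The woman laughs and replies <S>You snooze, you lose.<E> "
    "<AUDCAP>Diner chatter, clinking cutlery, upbeat jukebox music.<ENDAUDCAP>"
)
```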
Capabilities
- Generates synchronized video and audio from text or text+image prompts in a single unified process
- Supports high-quality, context-matched speech, sound effects, and background music
- Achieves precise lip synchronization without explicit face bounding boxes
- Handles multi-person dialogue and multi-turn conversations naturally
- Produces cinematic, movie-grade short video clips with diverse emotions and large motion ranges
- Flexible aspect ratio and resolution support for various creative needs
- Open-source release with pretrained weights and inference code for research and development
What Can I Use It For?
- Creating talking avatars and digital humans for entertainment, marketing, or virtual assistants
- Generating short cinematic video clips with synchronized dialogue and sound for storytelling or advertising
- Producing educational content with animated speakers and contextual audio cues
- Rapid prototyping of video ideas for filmmakers, animators, and content creators
- Developing interactive media, such as conversational agents or immersive experiences
- Showcasing creative projects involving music videos, sound effects, or emotional performances
- Enabling research in multimodal AI, audio-visual synthesis, and human-computer interaction
Things to Be Aware Of
- Some users report that Ovi's outputs are most consistent and high-quality in human-centric scenarios; non-human or abstract prompts may yield less coherent results
- The model's reliance on high spatial compression can limit the rendering of intricate textures or very small details
- Video length is currently limited to 5 seconds per generation; longer content requires manual assembly (see the stitching sketch after this list)
- Resource requirements are significant due to the 11B parameter size; high-end GPUs are recommended for local inference
- Outputs can vary between runs; users often try multiple seeds to select the best result
- Community feedback highlights the model's impressive lip-sync and emotional expressiveness, especially for dialogue-driven content
- Some users note occasional artifacts or inconsistencies in complex scenes, particularly with rapid motion or crowded backgrounds
- Positive reviews emphasize the ease of generating synchronized audio and video, as well as the open-source availability for experimentation
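For assembling longer sequences from 5-second outputs, one option is ffmpeg's concat demuxer, sketched below. It assumes ffmpeg is installed and that stream copy works because same-model outputs share codec, resolution, and frame rate; the clip file names are placeholders.

```python
import subprocess
import tempfile
from pathlib import Path

clips = ["clip1.mp4", "clip2.mp4", "clip3.mp4"]  # placeholder file names

# Write the list file that ffmpeg's concat demuxer expects.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    for clip in clips:
        f.write(f"file '{Path(clip).resolve()}'\n")
    list_path = f.name

# -c copy concatenates the streams without re-encoding.
subprocess.run(
    ["ffmpeg", "-f", "concat", "-safe", "0", "-i", list_path,
     "-c", "copy", "stitched.mp4"],
    check=True,
)
```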
Limitations
- Limited to short-form (5-second) video generation; not optimal for long-form or continuous video production
- Visual detail and texture complexity are constrained by model compression and architecture, making it less suitable for scenes with many small or intricate elements
- Best performance is achieved with human-centric prompts; less effective for non-human, abstract, or highly technical scenarios
Output Format: MP4
Pricing
Pricing Detail
This model runs at a cost of $0.20 per execution.
Pricing Type: Fixed
The cost is a set, fixed amount per run: it does not vary with prompt length, resolution, or how long the generation takes. This makes budgeting simple and predictable, because you pay the same fee every time you execute the model.