
VIDU-2.0

Vidu 2.0 Reference to Video generates realistic motion by combining multiple reference photos into a seamless video.

Avg Run Time: 40.000s

Model Slug: vidu-2-0-reference-to-video

Each execution costs $0.005000. With $1 you can run this model about 200 times.

API & SDK

Create a Prediction

Send a POST request to create a new prediction. This will return a prediction ID that you'll use to check the result. The request should include your model inputs and API key.
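Below is a minimal sketch of this request in Python. The endpoint URL, auth header, and input field names are assumptions (only the model slug comes from this page), so check the Eachlabs API reference for the exact schema.

```python
import requests

# Hypothetical endpoint and auth header -- confirm both against the
# Eachlabs API reference before using this in production.
API_URL = "https://api.eachlabs.ai/v1/prediction"
API_KEY = "YOUR_API_KEY"

payload = {
    "model": "vidu-2-0-reference-to-video",  # slug from this page
    "input": {
        # Field names below are illustrative placeholders.
        "image_urls": [
            "https://example.com/reference-1.png",
            "https://example.com/reference-2.png",
        ],
        "prompt": "The character walks through a rainy cityscape at night",
        "resolution": "1080p",
        "duration": 8,
    },
}

resp = requests.post(
    API_URL,
    json=payload,
    headers={"X-API-Key": API_KEY},  # auth scheme is an assumption
    timeout=30,
)
resp.raise_for_status()
prediction_id = resp.json()["predictionID"]  # response key may differ
print("Prediction created:", prediction_id)
```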

Get Prediction Result

Poll the prediction endpoint with the prediction ID until the result is ready. The endpoint returns the current status on each check, so poll repeatedly at a short interval until you receive a success status.
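A polling loop along these lines would work. The result path, status strings, and response keys here are assumptions, so map them to the actual API responses.

```python
import time
import requests

API_KEY = "YOUR_API_KEY"
# Hypothetical result endpoint -- confirm the real path in the API docs.
RESULT_URL = "https://api.eachlabs.ai/v1/prediction/{id}"

def wait_for_result(prediction_id: str,
                    interval: float = 5.0,
                    timeout: float = 300.0) -> str:
    """Poll until the prediction succeeds, fails, or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        resp = requests.get(
            RESULT_URL.format(id=prediction_id),
            headers={"X-API-Key": API_KEY},  # same assumed auth header
            timeout=30,
        )
        resp.raise_for_status()
        data = resp.json()
        status = data.get("status")  # status strings are assumed
        if status == "success":
            return data["output"]    # e.g. the URL of the generated MP4
        if status in ("failed", "error"):
            raise RuntimeError(f"Prediction failed: {data}")
        time.sleep(interval)         # wait before checking again
    raise TimeoutError("Prediction did not finish before the timeout")

# Example: video_url = wait_for_result(prediction_id)
```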

Readme

Table of Contents
Overview
Technical Specifications
Key Considerations
Tips & Tricks
Capabilities
What Can I Use It For?
Things to Be Aware Of
Limitations
Pricing

Overview

vidu-2-0-reference-to-video — Image-to-Video AI Model

Transform static reference photos into seamless, realistic motion videos with vidu-2-0-reference-to-video, Vidu's advanced image-to-video AI model from the vidu-2.0 family. This model excels by fusing multiple reference images—up to 4 images alongside 2 videos—into coherent short-form videos with exceptional consistency in characters, actions, and styles, solving the challenge of uncontrollable AI video outputs.

Developed by Vidu, vidu-2-0-reference-to-video powers production-grade workflows for creators seeking precise control over video generation from images. It supports native 1080p resolution and durations up to 16 seconds, making it ideal for Vidu image-to-video applications like short films and commercials where multi-reference fusion ensures pixel-level accuracy without post-production guesswork.

Technical Specifications

What Sets vidu-2-0-reference-to-video Apart

vidu-2-0-reference-to-video stands out in the image-to-video AI model landscape through its multimodal reference fusion, accepting 2 reference videos and 4 images simultaneously to cover six dimensions: special effects, expressions, textures, actions, characters, and scenes. This enables precise replication and transfer of elements like character identities or dynamic motions, turning random generations into controllable edits, much like a format brush for video.

Unlike basic image-to-video tools, it delivers production-level coherence with 1080p resolution, up to 16-second clips, and smooth temporal consistency, reducing flicker in multi-subject scenes. Users gain storyboard-level control, importing references for fused outputs that maintain logical flow across shots.

  • Multi-reference orchestration: Handles 2 videos + 4 images for deep fusion, locking in character consistency and style transfer—unique for high-frequency creative pipelines.
  • Six reference dimensions: Pixel-level control over effects, actions, and expressions, enabling "addition, deletion, and modification" without external tools like AE or C4D.
  • Fast, stable rendering: 3x faster generation with ultra-consistent characters and camera control, optimized for vidu-2-0-reference-to-video API integrations.

Key Considerations

  • Reference image quality and diversity significantly impact output realism and consistency
  • Best results are achieved with high-resolution, well-lit reference photos that clearly depict the subject’s features and intended motion cues
  • Prompt engineering is crucial: detailed scene descriptions and camera instructions yield more predictable and cinematic results
  • There is a trade-off between output quality and generation speed; higher fidelity settings may increase processing time
  • Consistency across frames is a key strength, but extreme pose or lighting changes between reference images can introduce artifacts
  • Iterative refinement (adjusting prompts or reference sets) is often necessary for optimal results

Tips & Tricks

How to Use vidu-2-0-reference-to-video on Eachlabs

Access vidu-2-0-reference-to-video on Eachlabs via the Playground for instant testing: upload up to 4 reference images and 2 reference videos, add a text prompt specifying the motion or style you want, then select 1080p resolution and a duration of up to 16 seconds. For apps, integrate through the API or SDK to receive high-fidelity MP4 outputs with fused consistency in seconds; a sketch of the inputs appears below.
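As a concrete illustration, the input set for a run like the one described above might look as follows. Every field name is a placeholder; the authoritative schema is what the Playground and API docs show for this model.

```python
# Illustrative inputs for a reference-to-video run. All field names are
# placeholders -- map them to the real schema before sending a request.
inputs = {
    "image_urls": [  # up to 4 reference images
        "https://example.com/face.png",
        "https://example.com/outfit.png",
    ],
    "video_urls": [  # up to 2 reference videos
        "https://example.com/pose.mp4",
    ],
    "prompt": (
        "The character from image 1, wearing the outfit from image 2, "
        "repeats the walking motion from the video in a sunlit plaza"
    ),
    "resolution": "1080p",  # native 1080p output
    "duration": 16,         # seconds, up to 16
}
```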


Capabilities

  • Generates realistic, high-fidelity motion from static reference photos
  • Maintains consistent character identity and style across all video frames
  • Supports advanced camera grammar, including smooth push-ins, pull-backs, and tracking shots
  • Faithfully adheres to detailed prompts, minimizing semantic drift
  • Produces polished, cinematic short videos suitable for professional and creative applications
  • Handles both character-driven and product-focused scenes with high detail retention

What Can I Use It For?

Use Cases for vidu-2-0-reference-to-video

For filmmakers and animators, vidu-2-0-reference-to-video streamlines short drama production by fusing character reference images with action videos, ensuring identical appearances across scenes. Upload a face photo, pose video, and style image to generate a 10-second clip of the character walking through a cityscape, maintaining expressions and textures seamlessly—perfect for "AI video from multiple images" workflows.

Marketers building e-commerce visuals use it to animate product shots with reference textures and effects, creating dynamic demos like a watch rotating on a velvet surface with sparkling light reflections. This image-to-video AI model eliminates studio needs, delivering 1080p videos ready for ads.

Developers integrating Vidu image-to-video APIs craft interactive apps for designers, inputting 3-6 angle references for consistent object animations in prototypes. Example prompt: "Animate this product from front and side references with smooth 360-degree rotation, glossy metallic texture from image 3, in a modern showroom setting" to yield coherent 16-second outputs for UI testing.

Content creators prototype narrative shorts by combining scene images and motion clips, achieving multi-shot structures with lip-synced elements for social media reels.

Things to Be Aware Of

  • Some experimental features, such as advanced camera motion or complex multi-character scenes, may yield variable results based on user feedback
  • Users have reported that reference consistency is generally strong, but extreme changes in input images can cause identity drift or motion artifacts
  • Performance is optimized for short clips; generating longer videos may require more memory and can introduce temporal inconsistencies
  • High-quality outputs may demand significant computational resources, especially at maximum resolution settings
  • Community feedback highlights the model’s strength in micro-expression fidelity and cinematic motion, with positive reviews for creative control and ease of use
  • Common concerns include occasional prompt drift in highly complex scenes and the need for iterative prompt refinement to achieve desired results

Limitations

  • Limited to short video clips (up to 16 seconds); not suitable for long-form video generation
  • May struggle with highly dynamic scenes, extreme pose changes, or inconsistent reference images
  • Requires careful prompt engineering and high-quality input images for optimal results; generic or low-quality inputs can reduce output fidelity

Pricing

Pricing Detail

This model runs at a cost of $0.005000 per execution.

Pricing Type: Fixed

The cost remains the same regardless of your input settings or how long the run takes. There are no variables affecting the price; it is a set, fixed amount per run, as the name suggests. This makes budgeting simple and predictable because you pay the same fee every time you execute the model.