
VIDU-2.0

Vidu 2.0 Reference to Video generates realistic motion by combining multiple reference photos into a seamless video.

Avg Run Time: 40.000s

Model Slug: vidu-2-0-reference-to-video

Playground


Each execution costs $0.005000. With $1 you can run this model about 200 times.

API & SDK

Create a Prediction

Send a POST request to create a new prediction. This will return a prediction ID that you'll use to check the result. The request should include your model inputs and API key.
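A minimal sketch in Python of what that request could look like. The endpoint URL, the X-API-Key header, the payload structure, and the predictionID response field are all assumptions for illustration; only the model slug comes from this page, so check the Eachlabs API reference for the exact schema.

```python
import requests

# Hypothetical endpoint and auth header; verify against the Eachlabs API docs.
API_URL = "https://api.eachlabs.ai/v1/prediction/"
API_KEY = "YOUR_API_KEY"

payload = {
    "model": "vidu-2-0-reference-to-video",  # model slug from this page
    "input": {
        # Reference images and a text prompt; field names are assumptions.
        "reference_images": [
            "https://example.com/ref-1.jpg",
            "https://example.com/ref-2.jpg",
        ],
        "prompt": "smooth push-in on the subject, subtle smile, warm lighting",
    },
}

response = requests.post(
    API_URL,
    json=payload,
    headers={"X-API-Key": API_KEY},  # header name is an assumption
    timeout=30,
)
response.raise_for_status()
prediction_id = response.json()["predictionID"]  # response field name is an assumption
print("Prediction created:", prediction_id)
```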

Get Prediction Result

Poll the prediction endpoint with the prediction ID until the result is ready. Predictions run asynchronously, so you'll need to check the status repeatedly until it reports success.
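A companion polling sketch under the same assumptions as above; the status values ("success", "error", "failed") and the output field are also guesses, so verify them against the actual API responses.

```python
import time
import requests

API_URL = "https://api.eachlabs.ai/v1/prediction/"  # hypothetical endpoint
API_KEY = "YOUR_API_KEY"

def wait_for_result(prediction_id: str, interval: float = 2.0, timeout: float = 180.0) -> dict:
    """Poll the prediction endpoint until it reports success or fails."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        resp = requests.get(
            f"{API_URL}{prediction_id}",
            headers={"X-API-Key": API_KEY},  # header name is an assumption
            timeout=30,
        )
        resp.raise_for_status()
        result = resp.json()
        status = result.get("status")  # status values are assumptions
        if status == "success":
            return result  # should include the generated video's URL
        if status in ("error", "failed"):
            raise RuntimeError(f"Prediction failed: {result}")
        time.sleep(interval)  # avg run time is ~40 s, so poll patiently
    raise TimeoutError("Prediction did not finish within the timeout")

# Usage, given a prediction_id from the create step:
# result = wait_for_result(prediction_id)
# print(result["output"])  # output field name is an assumption
```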

Readme

Table of Contents
Overview
Technical Specifications
Key Considerations
Tips & Tricks
Capabilities
What Can I Use It For?
Things to Be Aware Of
Limitations

Overview

Vidu 2.0 Reference to Video is an advanced AI image-to-video generation model designed to synthesize realistic motion by combining multiple reference photos into seamless, cinematic video clips. Developed as a next-generation solution for creators and professionals, the model focuses on high-fidelity motion, consistent identity preservation, and smooth camera dynamics. It is engineered to address the growing demand for controllable, high-quality short video generation from static images or photo sets.

Key features include faithful micro-expressions, stable character identity, and physically plausible motion across frames. The model leverages a sophisticated architecture that integrates reference image analysis, motion synthesis, and advanced video rendering pipelines. Its particular strength is producing polished, short-form videos (typically 2–8 seconds) with refined camera grammar, minimal prompt drift, and faithful adherence to creative intent. Vidu 2.0 stands out for its focus on cinematic quality, making it particularly suitable for character-driven and product-centric video content.

The underlying technology combines state-of-the-art generative models for motion transfer and video synthesis, with specialized modules for reference consistency and camera movement. This results in outputs that are not only visually compelling but also technically robust, offering creators a reliable tool for rapid ideation and high-quality video production.

Technical Specifications

  • Architecture: Advanced generative video synthesis model with reference image analysis and motion transfer modules (specific architecture details not publicly disclosed)
  • Parameters: Not publicly specified
  • Resolution: Supports up to 1080p-class video outputs
  • Input/Output formats: Input via reference images (JPG/PNG), output as short video clips (MP4 or similar standard video formats)
  • Performance metrics: High fidelity in micro-expressions, stable identity across frames, smooth camera motion, typical clip length 2–8 seconds, optimized for short polished outputs

Key Considerations

  • Reference image quality and diversity significantly impact output realism and consistency
  • Best results are achieved with high-resolution, well-lit reference photos that clearly depict the subject’s features and intended motion cues
  • Prompt engineering is crucial: detailed scene descriptions and camera instructions yield more predictable and cinematic results
  • There is a trade-off between output quality and generation speed; higher fidelity settings may increase processing time
  • Consistency across frames is a key strength, but extreme pose or lighting changes between reference images can introduce artifacts
  • Iterative refinement (adjusting prompts or reference sets) is often necessary for optimal results

Tips & Tricks

  • Use multiple reference images from similar angles and lighting conditions to maximize identity consistency
  • Structure prompts to specify desired camera movements (e.g., “smooth push-in,” “tracking shot”) and emotional cues (e.g., “subtle smile,” “gentle blink”); the sketch after this list shows one way to combine these
  • For cinematic effects, include environmental details and lighting instructions in the prompt
  • Start with short clip durations (2–4 seconds) to test settings before generating longer sequences
  • Adjust prompt specificity to control for motion intensity and scene complexity; overly vague prompts may result in generic outputs
  • If artifacts appear, try refining the reference image set or simplifying the scene description
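To illustrate, here is a hypothetical input payload that puts the tips above together. The field names (reference_images, prompt, duration) are assumptions for the sketch, not the model's documented schema.

```python
# A sketch of a well-structured input, following the tips above.
example_input = {
    "reference_images": [
        # Similar angles and lighting across references maximize identity consistency
        "https://example.com/subject-front-softlight.jpg",
        "https://example.com/subject-three-quarter-softlight.jpg",
    ],
    "prompt": (
        "smooth push-in on the subject, "  # camera movement
        "subtle smile and gentle blink, "  # emotional cues
        "warm golden-hour lighting in a quiet studio"  # environment and lighting
    ),
    "duration": 4,  # start short (2-4 s) to test settings before longer runs
}
```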

Capabilities

  • Generates realistic, high-fidelity motion from static reference photos
  • Maintains consistent character identity and style across all video frames
  • Supports advanced camera grammar, including smooth push-ins, pull-backs, and tracking shots
  • Faithfully adheres to detailed prompts, minimizing semantic drift
  • Produces polished, cinematic short videos suitable for professional and creative applications
  • Handles both character-driven and product-focused scenes with high detail retention

What Can I Use It For?

  • Professional product showcase videos where consistent branding and motion are critical
  • Character animation for marketing, entertainment, or social media content
  • Rapid prototyping of cinematic scenes for storyboarding and pre-visualization
  • Creative projects such as animated portraits, music video snippets, or digital art exhibitions
  • Industry applications in advertising, e-commerce, and virtual influencer content
  • Personal projects including animated family photos, cosplay showcases, and fan art videos

Things to Be Aware Of

  • Experimental features, such as advanced camera motion or complex multi-character scenes, can yield variable results, according to user feedback
  • Users have reported that reference consistency is generally strong, but extreme changes in input images can cause identity drift or motion artifacts
  • Performance is optimized for short clips; generating longer videos may require more memory and can introduce temporal inconsistencies
  • High-quality outputs may demand significant computational resources, especially at maximum resolution settings
  • Community feedback highlights the model’s strength in micro-expression fidelity and cinematic motion, with positive reviews for creative control and ease of use
  • Common concerns include occasional prompt drift in highly complex scenes and the need for iterative prompt refinement to achieve desired results

Limitations

  • Limited to short video clips (typically up to 8–10 seconds); not suitable for long-form video generation
  • May struggle with highly dynamic scenes, extreme pose changes, or inconsistent reference images
  • Requires careful prompt engineering and high-quality input images for optimal results; generic or low-quality inputs can reduce output fidelity

Pricing

Pricing Detail

This model runs at a cost of $0.005000 per execution.

Pricing Type: Fixed

The cost remains the same regardless of input size or how long the run takes. There are no variables affecting the price; it is a set, fixed amount per execution, as the name suggests. This makes budgeting simple and predictable, because you pay the same fee every time you run the model.
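Because the price is fixed, budgeting reduces to simple division, as this tiny sketch shows:

```python
COST_PER_RUN = 0.005  # USD, fixed per execution

def runs_for_budget(budget_usd: float) -> int:
    """How many executions a given budget covers at the fixed per-run price."""
    return int(budget_usd / COST_PER_RUN)

print(runs_for_budget(1.0))   # 200 runs for $1, matching the figure above
print(runs_for_budget(10.0))  # 2000 runs for $10
```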