VIDU-Q2
Vidu Reference-to-Image creates new images by combining a reference image with a prompt, preserving core identity while generating fresh, high-quality visual results.
Avg Run Time: 0.000s
Model Slug: vidu-q2-reference-to-image
Release Date: December 3, 2025
API & SDK
Create a Prediction
Send a POST request to create a new prediction. This will return a prediction ID that you'll use to check the result. The request should include your model inputs and API key.
Get Prediction Result
Poll the prediction endpoint with the prediction ID until the result is ready. Repeat the request at a short interval until the response reports a success status (or an error).
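The create-and-poll flow above can be sketched in Python. The endpoint paths, header name, and response fields below are illustrative assumptions, not the documented Eachlabs API; the HTTP calls are passed in as callables so the sketch stays self-contained and easy to adapt to any HTTP client.

```python
import time

API_BASE = "https://api.example.com"  # placeholder; substitute the real Eachlabs base URL


def create_prediction(post, model_slug, inputs, api_key):
    """POST the model inputs; returns the prediction ID (field names assumed)."""
    resp = post(
        f"{API_BASE}/predictions",
        json={"model": model_slug, "input": inputs},
        headers={"X-API-Key": api_key},
    )
    return resp["id"]


def wait_for_result(get, prediction_id, interval=2.0, timeout=120.0):
    """Poll until the prediction reports success (status values assumed)."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        resp = get(f"{API_BASE}/predictions/{prediction_id}")
        if resp.get("status") == "success":
            return resp["output"]
        if resp.get("status") == "error":
            raise RuntimeError(resp.get("error", "prediction failed"))
        time.sleep(interval)
    raise TimeoutError("prediction did not finish in time")
```

Injecting `post` and `get` (e.g. thin wrappers around `requests.post`/`requests.get` that return parsed JSON) keeps the polling logic independent of any one HTTP library.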
Readme
Overview
vidu-q2-reference-to-image — Reference-to-Video AI Model
Vidu Q2 Reference Pro, officially launched globally on January 27, 2026, is a reference-to-video AI model that transforms how creators control video generation. Rather than relying on random outputs, this model lets you anchor video creation to reference images and videos, preserving identity and style while generating fresh, high-quality cinematic content. It solves the core pain point of traditional AI video generation: poor consistency and uncontrollable details. By combining reference images with text prompts, vidu-q2-reference-to-image enables pixel-level control over character appearance, scene composition, and visual effects—eliminating the need for manual post-production adjustments or multiple generation attempts.
The model represents a fundamental shift from "inspiration demonstration" to production-grade video creation, designed specifically for anime studios, short-form drama production, and film professionals who demand consistency across shot sequences.
Technical Specifications
What Sets vidu-q2-reference-to-image Apart
Multi-Reference Fusion Architecture: Unlike standard reference-to-video models, vidu-q2-reference-to-image supports simultaneous input of 2 reference videos and 4 reference images (up to 7 total references in ComfyUI workflows), with deep multimodal fusion that maintains consistency across all inputs. This enables complex multi-subject and multi-element video generation without requiring manual composition or separate generation passes.
Six Controllable Reference Dimensions: The model operates across six distinct reference categories—special effects, expressions, textures, actions, characters, and scenes—allowing creators to replicate and transfer styles at pixel-level precision. This eliminates the need for complex post-production tools like Cinema 4D or After Effects, making professional-grade video editing accessible through simple reference anchoring.
Production-Grade Control System: Designed for high-frequency creative workflows, vidu-q2-reference-to-image implements "controllable addition, deletion, and modification" engineering—meaning you actively intervene in creative work using reference materials as anchors, rather than passively waiting for generated results. This positions it as a true production engine rather than a generation toy.
Technical Specifications:
- Supports up to 7 reference images in single workflow
- Accepts multimodal input: reference videos, reference images, and text prompts
- Delivers high-fidelity dynamic rendering with smoother large motions and believable physical feedback
- Enables fine facial expressions, eye movement, and subtle gestures for expressive characters
- Supports camera language control: push, pull, orbit, follow, and close-up strategies
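Given the multimodal input spec above, a request builder can enforce the reference budget before anything is sent. The field names here are assumptions for illustration; only the 7-reference ceiling comes from the specification:

```python
MAX_REFERENCES = 7  # ceiling stated for single workflows


def build_reference_payload(prompt, images=(), videos=()):
    """Assemble a multimodal request; raises if the reference budget is exceeded."""
    if not prompt:
        raise ValueError("a text prompt is required")
    total = len(images) + len(videos)
    if total > MAX_REFERENCES:
        raise ValueError(f"at most {MAX_REFERENCES} references, got {total}")
    return {
        "prompt": prompt,
        "reference_images": list(images),  # field names are assumptions
        "reference_videos": list(videos),
    }
```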
Key Considerations
- The reference image plays a critical role in determining composition, style, and identity; choose references that clearly represent the desired subject and pose
- Prompt quality significantly affects results; use clear, descriptive language that complements rather than contradicts the reference
- Overly complex or conflicting prompts can lead to artifacts or reduced fidelity; keep prompts focused on the desired changes rather than redefining the entire scene
- For best identity preservation, use high-quality, well-lit reference images with clear subject separation
- There is a trade-off between creative freedom and consistency; more detailed prompts can increase variation but may reduce reference fidelity
- The model works best when the prompt and reference are semantically aligned (e.g., a prompt about a character in a specific outfit used with a reference of that character)
- Iterative refinement—generating multiple variations and selecting the best—often yields better results than expecting perfection in a single run
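The iterative-refinement advice in the last point amounts to a best-of-N loop. A minimal sketch, where `generate` and `score` are caller-supplied callables (hypothetical stand-ins for a generation call and whatever quality metric you prefer):

```python
def best_of_n(generate, score, prompt, n=4):
    """Generate n variations (one per seed) and keep the highest-scoring one."""
    best, best_score = None, float("-inf")
    for seed in range(n):
        candidate = generate(prompt, seed=seed)
        s = score(candidate)
        if s > best_score:
            best, best_score = candidate, s
    return best, best_score
```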
Tips & Tricks
How to Use vidu-q2-reference-to-image on Eachlabs
Access vidu-q2-reference-to-image through Eachlabs' Playground or API. Provide your reference images (up to 7), optional reference videos, and a text prompt describing your desired output. The model processes multimodal inputs and returns high-fidelity video with strong subject identity preservation and temporal coherence. Use Eachlabs' SDK for programmatic integration into production workflows, or the Playground for interactive experimentation with reference weights and composition adjustments.
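Because a reference set can be reused across prompts, rapid concept iteration reduces to a simple sweep. The `generate` callable below is a hypothetical stand-in for whichever SDK or API call performs one generation:

```python
def sweep_prompts(generate, references, prompts):
    """Reuse one fixed reference set across many prompts; returns (prompt, result) pairs."""
    return [(p, generate(references, p)) for p in prompts]
```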
Capabilities
- Generates high-quality images by combining a reference image with a text prompt
- Preserves core identity, pose, and structure from the reference while allowing creative variation
- Supports detailed style transfer and aesthetic changes while maintaining subject consistency
- Produces outputs suitable for professional visual content creation, including character design, product visualization, and concept art
- Handles complex prompts involving multiple objects, environments, and lighting conditions when aligned with the reference
- Delivers strong performance in preserving facial features, clothing, and object details from the reference
- Works effectively for both realistic and stylized outputs, including anime, illustration, and photorealistic styles
- Enables rapid iteration of visual concepts by reusing references with different prompts
What Can I Use It For?
Use Cases for vidu-q2-reference-to-image
Anime and Short-Form Drama Production: Studios can feed character reference images plus a prompt like "character walks through a neon-lit cyberpunk street, nervous expression, rain falling" to generate consistent character shots across multiple scenes. The expression reference dimension ensures emotional continuity, while the action reference locks movement quality—reducing the need for manual frame-by-frame correction.
E-Commerce Product Video Generation: Marketing teams building an AI video generator for product showcases can use scene and texture references to place products in photorealistic environments. A prompt like "luxury watch on marble countertop, morning sunlight, shallow depth of field" combined with product and lighting references produces studio-quality output without expensive location shoots or professional photography.
Film and Commercial Production: Directors can establish visual consistency across shots by setting a main reference video (establishing the tone and lighting), then extracting independent references for special effects, character actions, and scene composition. This "reference-first" workflow minimizes visual drift between shots and ensures that multi-reference video generation maintains narrative coherence when multiple subjects appear together.
Digital Avatar and Talking Head Creation: Content creators can lock character identity using character references while feeding expression and action references to drive natural facial movements and gestures. This enables consistent, expressive digital avatars for streaming, education, or customer service applications without the stiffness typical of AI-generated performances.
Things to Be Aware Of
- The model is still evolving, and some behaviors (especially around extreme pose changes or complex interactions) may be experimental
- Users report that very dramatic changes in pose or perspective relative to the reference can lead to distortions or loss of identity
- Fine details like small text, intricate patterns, or subtle facial expressions may not always be perfectly preserved
- Performance can vary depending on reference quality; low-resolution, blurry, or heavily compressed references reduce output fidelity
- Some users note that the model performs best when the prompt does not contradict the reference (e.g., asking for a “cat” when the reference shows a person)
- Consistency across multiple generations is generally strong for the same reference and similar prompts, but minor variations in facial features or proportions can occur
- Recent user feedback highlights strong satisfaction with identity preservation and prompt adherence, especially for character and product use cases
- Common concerns include occasional over-smoothing of textures and challenges with highly detailed or cluttered reference images
- Resource requirements are moderate to high for high-resolution outputs, and generation speed depends on available compute
Limitations
- Struggles with extreme pose or viewpoint changes that are very different from the reference image
- May not perfectly preserve very fine details (e.g., small text, intricate logos) when the prompt introduces significant style or environment changes
Pricing
Pricing Detail
This model runs at a cost of $0.10 per execution.
Pricing Type: Fixed
The cost remains the same regardless of input size or how long the run takes. There are no variables affecting the price: each execution costs a set, fixed amount, as the name suggests. This makes budgeting simple and predictable because you pay the same fee every time you execute the model.
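With fixed per-execution pricing, budgeting is a single multiplication:

```python
COST_PER_RUN = 0.10  # USD per execution, fixed


def total_cost(runs):
    """Total spend for a given number of executions."""
    return runs * COST_PER_RUN
```

For example, 500 executions in a month budget out to $50.00.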
Related AI Models
You can seamlessly integrate advanced AI capabilities into your applications without the hassle of managing complex infrastructure.
