Vidu Q2 | Reference to Image

Vidu Q2 Reference-to-Image creates new images by combining a reference image with a text prompt, preserving core identity while generating fresh, high-quality visual results.

Model Slug: vidu-q2-reference-to-image

Release Date: December 3, 2025

API & SDK

Create a Prediction

Send a POST request to create a new prediction. This will return a prediction ID that you'll use to check the result. The request should include your model inputs and API key.
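
A minimal sketch of the create call in Python using the requests library. The endpoint path, header name, payload fields, and response field below are illustrative assumptions; check the Eachlabs API reference for the exact schema.

```python
import requests

API_KEY = "your-api-key"
# Hypothetical endpoint path -- verify against the Eachlabs API reference.
CREATE_URL = "https://api.eachlabs.ai/v1/prediction/"

payload = {
    "model": "vidu-q2-reference-to-image",
    "input": {  # assumed input field names
        "reference_image": "https://example.com/reference.png",
        "prompt": "same character, now in a forest at sunset",
    },
}

resp = requests.post(CREATE_URL, json=payload, headers={"X-API-Key": API_KEY})
resp.raise_for_status()
prediction_id = resp.json()["predictionID"]  # assumed response field name
print("Prediction created:", prediction_id)
```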

Get Prediction Result

Poll the prediction endpoint with the prediction ID until the result is ready. Predictions resolve asynchronously, so you'll need to check repeatedly until you receive a success status.
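
A matching polling sketch, continuing from the create call above. The status values and response field names are assumptions, not confirmed API behavior.

```python
import time

import requests

API_KEY = "your-api-key"
prediction_id = "pred_123"  # returned by the create call above
# Hypothetical endpoint path -- verify against the Eachlabs API reference.
RESULT_URL = f"https://api.eachlabs.ai/v1/prediction/{prediction_id}"

while True:
    result = requests.get(RESULT_URL, headers={"X-API-Key": API_KEY}).json()
    status = result.get("status")  # assumed field name
    if status == "success":
        print("Output:", result.get("output"))  # assumed field name
        break
    if status in ("error", "failed", "canceled"):
        raise RuntimeError(f"Prediction did not succeed: {result}")
    time.sleep(2)  # brief pause between checks
```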

Readme

Table of Contents
Overview
Technical Specifications
Key Considerations
Tips & Tricks
Capabilities
What Can I Use It For?
Things to Be Aware Of
Limitations

Overview

Vidu Q2 Reference-to-Image is an AI image generation model developed by Shengshu Technology in collaboration with Tsinghua University. It is designed to create new images by combining a reference image with a text prompt, preserving the core identity, structure, and key visual elements of the reference while generating fresh, high-quality visual content. The model is part of the broader Vidu Q2 family, which includes text-to-image, image-to-video, and reference-to-video capabilities, with the reference-to-image variant specifically optimized for controlled, identity-preserving image synthesis.

Key capabilities include high-fidelity style and content transfer, strong adherence to reference composition, and the ability to generate diverse outputs while maintaining consistency with the input image. The model leverages a diffusion-based architecture enhanced with reference conditioning mechanisms that allow it to interpret both visual and textual inputs effectively. What makes it unique is its strong performance in preserving character and object identity from the reference, making it particularly well-suited for applications requiring consistent character or product rendering across variations. It supports both prompt-driven creative exploration and precise visual editing, with an emphasis on natural motion and realistic detail when used in video-oriented workflows.

Technical Specifications

Architecture: Diffusion-based architecture with reference conditioning

Parameters: Not publicly disclosed (part of a larger multimodal Q2 model family)

Resolution: Supports high-resolution outputs, commonly 720p and higher for image and video applications

Input Formats: Reference image (PNG, JPG) plus a text prompt

Output Formats: Generated image (PNG, JPG)

Performance: Generation takes seconds to tens of seconds per image depending on resolution and hardware; designed for fast inference with strong prompt and reference adherence

Key Considerations

  • The reference image plays a critical role in determining composition, style, and identity; choose references that clearly represent the desired subject and pose
  • Prompt quality significantly affects results; use clear, descriptive language that complements rather than contradicts the reference
  • Overly complex or conflicting prompts can lead to artifacts or reduced fidelity; keep prompts focused on the desired changes rather than redefining the entire scene
  • For best identity preservation, use high-quality, well-lit reference images with clear subject separation
  • There is a trade-off between creative freedom and consistency; more detailed prompts can increase variation but may reduce reference fidelity
  • The model works best when the prompt and reference are semantically aligned (e.g., a prompt about a character in a specific outfit used with a reference of that character)
  • Iterative refinement—generating multiple variations and selecting the best—often yields better results than expecting perfection in a single run

Tips & Tricks

  • Use the reference image to lock in character, product, or scene identity, then use the prompt to change environment, lighting, or minor attributes (e.g., “same character, now in a forest at sunset”)
  • For subtle edits, keep the prompt concise and focused on the change (e.g., “same person, wearing a red jacket” rather than a full scene rewrite)
  • To increase creativity while maintaining identity, use prompts that describe mood or style (e.g., “cinematic lighting,” “anime style,” “cyberpunk aesthetic”) rather than completely new compositions
  • When generating multiple images of the same subject, reuse the same reference and vary only the prompt to maintain consistency across outputs (see the sketch after this list)
  • For better text rendering in generated images, combine the reference with prompts that explicitly describe typography, layout, and color
  • If the output is too similar to the reference, slightly increase prompt strength or add more specific details; if it diverges too much, simplify the prompt and rely more on the reference
  • Use low to medium guidance scales for natural-looking results; very high guidance can introduce artifacts or unnatural distortions
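
To make the reuse-one-reference workflow concrete, here is a hedged sketch that wraps the create/poll calls from the API & SDK section above. Every endpoint path, field name, and status value is an assumption, and the prompts are only examples.

```python
import time

import requests

API_KEY = "your-api-key"
BASE_URL = "https://api.eachlabs.ai/v1/prediction"  # hypothetical path

def create_prediction(reference_image: str, prompt: str) -> str:
    """Submit one generation job; payload field names are assumed."""
    payload = {
        "model": "vidu-q2-reference-to-image",
        "input": {"reference_image": reference_image, "prompt": prompt},
    }
    resp = requests.post(BASE_URL, json=payload, headers={"X-API-Key": API_KEY})
    resp.raise_for_status()
    return resp.json()["predictionID"]  # assumed response field name

def wait_for_result(prediction_id: str, interval: float = 2.0):
    """Poll until the job reports success; status value is assumed."""
    while True:
        result = requests.get(f"{BASE_URL}/{prediction_id}",
                              headers={"X-API-Key": API_KEY}).json()
        if result.get("status") == "success":
            return result.get("output")
        time.sleep(interval)

# One reference locks identity; varying only the prompt changes the scene.
reference = "https://example.com/character.png"
outputs = [
    wait_for_result(create_prediction(reference, prompt))
    for prompt in (
        "same character, now in a forest at sunset",
        "same character, cinematic lighting, rainy city street",
        "same character, anime style, cherry-blossom park",
    )
]
```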

Capabilities

  • Generates high-quality images by combining a reference image with a text prompt
  • Preserves core identity, pose, and structure from the reference while allowing creative variation
  • Supports detailed style transfer and aesthetic changes while maintaining subject consistency
  • Produces outputs suitable for professional visual content creation, including character design, product visualization, and concept art
  • Handles complex prompts involving multiple objects, environments, and lighting conditions when aligned with the reference
  • Delivers strong performance in preserving facial features, clothing, and object details from the reference
  • Works effectively for both realistic and stylized outputs, including anime, illustration, and photorealistic styles
  • Enables rapid iteration of visual concepts by reusing references with different prompts

What Can I Use It For?

  • Creating character variations for animation, games, or comics using a single reference pose
  • Generating product mockups in different environments or lighting conditions from a base product image
  • Designing marketing visuals with consistent branding and character identity across campaigns
  • Developing concept art and storyboards where scene composition is fixed but mood or setting changes
  • Producing social media content such as themed avatars, seasonal variations, or promotional images
  • Generating educational or instructional visuals with consistent characters or diagrams
  • Creating personalized gifts or merchandise designs based on a reference photo
  • Supporting fashion and apparel design by visualizing outfits on a reference model in different settings
  • Building consistent visual assets for video projects, including character sheets and background variations

Things to Be Aware Of

  • The model is still evolving, and some behaviors (especially around extreme pose changes or complex interactions) may be experimental
  • Users report that very dramatic changes in pose or perspective relative to the reference can lead to distortions or loss of identity
  • Fine details like small text, intricate patterns, or subtle facial expressions may not always be perfectly preserved
  • Performance can vary depending on reference quality; low-resolution, blurry, or heavily compressed references reduce output fidelity
  • Some users note that the model performs best when the prompt does not contradict the reference (e.g., asking for a “cat” when the reference shows a person)
  • Consistency across multiple generations is generally strong for the same reference and similar prompts, but minor variations in facial features or proportions can occur
  • Recent user feedback highlights strong satisfaction with identity preservation and prompt adherence, especially for character and product use cases
  • Common concerns include occasional over-smoothing of textures and challenges with highly detailed or cluttered reference images
  • Resource requirements are moderate to high for high-resolution outputs, and generation speed depends on available compute

Limitations

  • Struggles with extreme pose or viewpoint changes that are very different from the reference image
  • May not perfectly preserve very fine details (e.g., small text, intricate logos) when the prompt introduces significant style or environment changes

Pricing

Pricing Detail

This model runs at a cost of $0.10 per execution.

Pricing Type: Fixed

The cost is the same for every run of this model, no matter how long it takes or what inputs you provide. There are no variables affecting the price: it is a set, fixed amount per execution, as the name suggests. This makes budgeting simple and predictable, because you pay the same fee every time you execute the model.
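
A trivial budgeting helper based on the fixed rate above (the price is the only figure taken from this page):

```python
PRICE_CENTS_PER_RUN = 10  # $0.10 per execution, in cents to avoid float error

def runs_for_budget(budget_usd: float) -> int:
    """Number of executions a budget covers at the fixed rate."""
    return round(budget_usd * 100) // PRICE_CENTS_PER_RUN

print(runs_for_budget(1.0))   # -> 10
print(runs_for_budget(25.0))  # -> 250
```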