Wan 2.7 Reference-to-Video
Wan 2.7 Reference-to-Video generates videos with consistent character and object appearance from a reference image, supporting single or multi-shot scenes and optional motion guidance from video references.
Avg Run Time: 500s
Model Slug: alibaba-wan-2-7-reference-to-video
Release Date: April 3, 2026
API & SDK
Create a Prediction
Send a POST request to create a new prediction. This will return a prediction ID that you'll use to check the result. The request should include your model inputs and API key.
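A minimal sketch of this step in Python using the `requests` library. The endpoint URL, auth header, and payload field names below are illustrative assumptions, not the documented schema; consult the each::labs API reference for the exact values.

```python
import requests

# Hypothetical endpoint, header, and payload shape -- check the each::labs
# API reference for the real URL, auth scheme, and input schema.
API_KEY = "YOUR_API_KEY"
CREATE_URL = "https://api.eachlabs.ai/v1/predictions"  # illustrative URL

response = requests.post(
    CREATE_URL,
    headers={"X-API-Key": API_KEY},
    json={
        "model": "alibaba-wan-2-7-reference-to-video",
        "input": {
            "prompt": "The character from the reference image walks through rain.",
            "reference_image": "https://example.com/character.png",
        },
    },
)
response.raise_for_status()
prediction_id = response.json()["id"]  # assumed response field
print("Prediction created:", prediction_id)
```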
Get Prediction Result
Poll the prediction endpoint with the prediction ID until the result is ready. The API uses long-polling, so you'll need to repeatedly check until you receive a success status.
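Continuing from the creation sketch above, a simple polling loop. The result URL and the "status"/"output" response fields are likewise assumptions about the schema.

```python
import time

import requests

# Poll until the prediction settles. Field names are assumed.
RESULT_URL = f"https://api.eachlabs.ai/v1/predictions/{prediction_id}"

while True:
    result = requests.get(RESULT_URL, headers={"X-API-Key": API_KEY}).json()
    if result["status"] == "success":
        print("Video ready:", result["output"])
        break
    if result["status"] == "failed":
        raise RuntimeError(result.get("error", "prediction failed"))
    time.sleep(5)  # average run time is ~500s, so expect many polls
```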
Readme
Overview
Alibaba | Wan 2.7 | Reference to Video produces high-quality videos with consistent character and object appearance from a single reference image, making it ideal for maintaining subject fidelity across multi-shot scenes. Developed by Alibaba Tongyi Lab as part of the Wan 2.7 family, this reference-to-video model supports multi-reference inputs such as 9-grid scenes and first/last-frame control, enabling precise cinematic outputs of 15-30 seconds. Unlike basic generators, it incorporates instruction-based editing and optional motion guidance from video references, addressing key challenges in consistent character animation for creators and filmmakers. Available via the Alibaba | Wan 2.7 | Reference to Video API on platforms such as each::labs, it lets users craft professional videos efficiently.
Technical Specifications
- Resolution Support: Native 1080p HD; up to 4K cinematic fidelity in advanced modes (e.g., Wan 2.7 Pro variants).
- Max Duration: 15-30 seconds per generation, extending beyond the 5-10 second limits of Wan 2.6.
- Aspect Ratios: Flexible, including standard 16:9 output (e.g., 1920x1080) and custom dimensions.
- Input/Output Formats: Accepts reference images (up to 9 in multi-reference grids), an optional video for motion guidance, and text prompts; outputs MP4 video with native audio.
- Processing Time: Efficient rendering suitable for cloud deployment without excessive GPU demands.
- Architecture: Multimodal Diffusion Transformer with a T5 encoder and MoE routing, supporting contextual command processing and synchronous audio-visual flow matching.
Key Considerations
Before using Alibaba | Wan 2.7 | Reference to Video, ensure access to high-quality reference images for optimal subject consistency, as multi-shot scenes rely on clear inputs such as 9-grid references. The model excels in scenarios that require character persistence, such as short films or ads, and offers first/last-frame control that many alternatives lack. The Alibaba | Wan 2.7 | Reference to Video API balances speed and quality, with Pro variants offering 4K at higher compute cost. Cloud platforms such as each::labs provide seamless integration, with credit-based pricing starting around $10 for 100 credits. Best for teams handling instruction-based edits rather than raw physics simulations.
Tips & Tricks
Optimize prompts for Alibaba | Wan 2.7 | Reference to Video by specifying "maintain subject consistency from reference image" to leverage its core strength in character fidelity. Use multi-reference grids (up to 9 images) for complex scenes, combined with first/last-frame controls: "Generate a 15-second clip where the character from the reference image walks from frame A to frame B under dramatic lighting." Enable thinking mode for text-to-video elements to improve reasoning and output quality, especially with long prompts of up to 5,000 characters. For motion guidance, pair a short video reference with instructions like "Apply walking motion from the video reference to the static character image, add lip-sync dialogue." Workflow tip: start with a base image-to-video generation, then iterate via instruction editing for refinements. Fix seeds for reproducibility in professional pipelines on each::labs; a sketch of an input payload illustrating these patterns follows.
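To make these tips concrete, here is a minimal sketch of an input payload combining a 9-image reference grid, first/last-frame control, motion guidance, and a fixed seed. Every field name is an illustrative assumption, not the documented schema; map them to the actual parameters on the each::labs model page.

```python
# Hypothetical input payload; field names are assumptions, not the
# documented schema.
inputs = {
    "prompt": (
        "Maintain subject consistency from reference image. "
        "Generate a 15-second clip where the character walks from "
        "frame A to frame B under dramatic lighting."
    ),
    "reference_images": [f"https://example.com/grid_{i}.png" for i in range(9)],
    "first_frame": "https://example.com/frame_a.png",   # frame A
    "last_frame": "https://example.com/frame_b.png",    # frame B
    "motion_reference_video": "https://example.com/walk_cycle.mp4",
    "duration": 15,          # seconds, within the 15-30s cap
    "resolution": "1080P",
    "seed": 42,              # fix the seed for reproducible iterations
}
```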
Capabilities
- Generates videos with consistent characters/objects from a single reference image across multi-shot scenes.
- Supports 9-grid multi-reference inputs for complex spatial arrangements and scene building.
- Provides first/last frame control for precise narrative structuring.
- Includes native lip-sync and audio generation synchronized with visuals.
- Enables instruction-based video editing, such as style transfer or element swaps using text prompts.
- Handles image-to-video generation with optional motion guidance from reference videos, producing clips of 15-30 seconds.
- Offers subject and voice cloning for personalized avatar animations.
- Supports flexible resolutions from 1080p to 4K with thinking mode for enhanced quality.
What Can I Use It For?
For content creators: Produce YouTube intros with a consistent host avatar using 9-grid references: "Animate the reference character delivering a script across three shots, from close-up to wide angle, with lip-sync."
For marketers: Create product demo videos maintaining brand mascot fidelity: "From reference image, show mascot interacting with products in a 20-second sequence, guided by demo motion video."
For designers: Develop animated storyboards with first/last frame control: "Transition character from static pose A to action pose B in a multi-shot scene under studio lighting."
For developers: Integrate via Alibaba | Wan 2.7 | Reference to Video API for app prototypes, cloning user-uploaded faces with custom voices: "Generate personalized tutorial video from user photo and script." These leverage the model's multimodal editing for efficient, high-fidelity outputs on each::labs.
Things to Be Aware Of
Alibaba | Wan 2.7 | Reference to Video has a steeper learning curve due to advanced features like instruction editing and multi-grid inputs, and optimal prompting takes practice. Edge cases include complex physics simulations, where motion can appear less refined than in specialized models. A common mistake is overloading prompts without a clear reference hierarchy, which leads to inconsistent outputs; always prioritize subject-consistency directives. Resource needs scale with duration and resolution, and 4K Pro modes demand more credits on each::labs. Test short clips first to avoid wasted generations in multi-shot workflows.
Limitations
Alibaba | Wan 2.7 | Reference to Video caps at 15-30 seconds per generation, making it unsuitable for full-length videos. Physics handling lags top competitors in dynamic scenes, with occasional motion artifacts. There are no open weights yet; the model is cloud-only via APIs such as each::labs, with a weights release expected in Q2 2026. Inputs are limited to 9 reference images, and complex edits may require multiple iterations. Audio sync excels at lip-sync but falters with heavy accents or non-frontal faces.
Pricing
Pricing Type: Dynamic (billed per second of generated video)

Pricing Rules

| Condition | Pricing |
|---|---|
| resolution matches "720P" | $0.10/sec |
| default (1080P) | $0.15/sec |
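Because billing is per second and the rate depends on resolution, a quick estimate helps budget credits before generating. A minimal sketch, assuming only the two published rates above:

```python
# Illustrative cost estimator based on the published per-second rates.
RATES = {"720P": 0.10, "1080P": 0.15}  # USD per second of output video

def estimate_cost(duration_s: float, resolution: str = "1080P") -> float:
    """Estimated charge for one generation at the given resolution."""
    return duration_s * RATES[resolution]

print(estimate_cost(15))           # 15s at 1080P -> $2.25
print(estimate_cost(30, "720P"))   # 30s at 720P  -> $3.00
```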
Related AI Models
You can seamlessly integrate advanced AI capabilities into your applications without the hassle of managing complex infrastructure.
