Alibaba Wan 2.7 · Reference to Video
Wan 2.7 Reference-to-Video generates videos with consistent character and object appearance from a reference image, supporting single or multi-shot scenes and optional motion guidance from video references.
- Runtime (p50): 8m
- Estimated price: From $0.1
Overview
Alibaba | Wan 2.7 | Reference to Video produces videos with consistent character and object appearance from a single reference image, which helps maintain subject fidelity across multi-shot scenes. Developed by Alibaba Tongyi Lab as part of the Wan 2.7 family, this reference-to-video model supports multi-reference inputs such as 9-grid scenes and first/last frame control, with a maximum clip length quoted as 15-30 seconds. Unlike basic generators, it incorporates instruction-based editing and optional motion guidance from video references, addressing the problem of consistent character animation for creators and filmmakers. It is available via the Alibaba | Wan 2.7 | Reference to Video API on platforms like each::labs.
Capabilities
- Generates videos with consistent characters/objects from a single reference image across multi-shot scenes.
- Supports 9-grid multi-reference inputs for complex spatial arrangements and scene building.
- Provides first/last frame control for precise narrative structuring.
- Includes native lip-sync and audio generation synchronized with visuals.
- Enables instruction-based video editing, such as style transfer or element swaps using text prompts.
- Handles image-to-video with optional motion guidance from reference videos, for generations of up to 15-30 seconds.
- Offers subject and voice cloning for personalized avatar animations.
- Supports resolutions from 1080p to 4K, plus a thinking mode for enhanced quality.
Use cases
For content creators: Produce YouTube intros with a consistent host avatar using 9-grid references: "Animate the reference character delivering a script across three shots, from close-up to wide angle, with lip-sync."
For marketers: Create product demo videos maintaining brand mascot fidelity: "From reference image, show mascot interacting with products in a 20-second sequence, guided by demo motion video."
For designers: Develop animated storyboards with first/last frame control: "Transition character from static pose A to action pose B in a multi-shot scene under studio lighting."
For developers: Integrate via Alibaba | Wan 2.7 | Reference to Video API for app prototypes, cloning user-uploaded faces with custom voices: "Generate personalized tutorial video from user photo and script." These leverage the model's multimodal editing for efficient, high-fidelity outputs on each::labs.
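For the developer use case, a request body could be assembled along these lines. This is only a minimal sketch: the field names (`prompt`, `reference_images`, `motion_video`, `seed`) and the example URL are assumptions made for illustration, not the documented each::labs schema, so the real parameter names should be taken from the API reference. No network call is made here.

```python
import json
from typing import Optional


def build_request(
    prompt: str,
    reference_images: list[str],
    motion_video: Optional[str] = None,
    seed: Optional[int] = None,
) -> dict:
    """Assemble a request body. Every key here is a hypothetical placeholder."""
    if not reference_images:
        raise ValueError("at least one reference image is required")
    body = {"prompt": prompt, "reference_images": list(reference_images)}
    # Optional inputs are only included when the caller provides them.
    if motion_video is not None:
        body["motion_video"] = motion_video
    if seed is not None:
        body["seed"] = seed
    return body


payload = build_request(
    prompt="Generate personalized tutorial video from user photo and script.",
    reference_images=["https://example.com/user-photo.png"],  # placeholder URL
    seed=42,
)
print(json.dumps(payload, indent=2))
```

Keeping the builder a pure function makes it easy to unit-test before wiring it to whichever HTTP client and endpoint the platform actually requires.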
Tips & tricks
Optimize prompts for Alibaba | Wan 2.7 | Reference to Video by specifying "maintain subject consistency from reference image" to leverage its core strength in character fidelity. Use multi-reference grids (up to 9 images) for complex scenes, and combine them with first/last frame controls: "Generate a 15-second clip where the character from reference image walks from frame A to frame B under dramatic lighting." Enable thinking mode for text-to-video elements to improve reasoning and output quality, especially with long prompts (up to 5,000 characters). For motion guidance, pair a short video reference with instructions like "Apply walking motion from video ref to static character image, add lip-sync dialogue." Workflow tip: start with an image-to-video base, then iterate via instruction editing for refinements. Fix and record seeds so results are reproducible in professional pipelines on each::labs.
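The prompt tips above can be sketched as a small helper that always leads with the consistency directive, checks the 5,000-character length mentioned above, and pairs the prompt with a fixed list of seeds. The "Shot N:" layout, the seed values, and the job dictionary keys are illustrative assumptions, not a format the model is known to require.

```python
MAX_PROMPT_CHARS = 5000  # prompt length mentioned in the tips above
CONSISTENCY_DIRECTIVE = "Maintain subject consistency from reference image."


def compose_prompt(shots: list[str], extra: str = "") -> str:
    """Join the consistency directive, numbered shots, and optional extras."""
    lines = [CONSISTENCY_DIRECTIVE]
    # "Shot N:" is an assumed convention for separating shots in one prompt.
    lines += [f"Shot {i}: {text.strip()}" for i, text in enumerate(shots, start=1)]
    if extra.strip():
        lines.append(extra.strip())
    prompt = "\n".join(lines)
    if len(prompt) > MAX_PROMPT_CHARS:
        raise ValueError(f"prompt is {len(prompt)} chars; limit is {MAX_PROMPT_CHARS}")
    return prompt


prompt = compose_prompt(
    [
        "Close-up, character greets the camera.",
        "Wide angle, character walks from frame A to frame B under dramatic lighting.",
    ],
    extra="Add lip-sync dialogue.",
)

# Reusing a recorded list of seeds lets a good result be regenerated later.
seeds = [101, 102, 103]
jobs = [{"prompt": prompt, "seed": s} for s in seeds]

print(prompt)
print(len(jobs), "jobs prepared")
```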
Technical spec
- Resolution Support: Native 1080p HD, with capabilities up to 4K cinematic fidelity in advanced modes (e.g., Wan 2.7 Pro variants).
- Max Duration: 15-30 seconds per generation, extending beyond previous Wan 2.6 limits of 5-10 seconds.
- Aspect Ratios: Flexible, including standard video ratios such as 16:9 (e.g., 1920x1080) and custom dimensions.
- Input/Output Formats: Accepts reference images (up to 9 in multi-reference grids), optional video for motion guidance, text prompts; outputs MP4 videos with native audio.
- Processing Time: About 8 minutes at p50 (per the runtime listed at the top of this page). Rendering uses a Diffusion Transformer architecture with a T5 encoder and MoE routing, and is described as suitable for cloud deployment without excessive GPU demands.
- Architecture: Multimodal Diffusion Transformer for contextual command processing and synchronous audio-visual flow matching.
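The limits in this spec can be checked locally before a job is submitted. The sketch below reduces pixel dimensions to an aspect ratio and flags inputs that exceed the stated limits. The function names are hypothetical, and the default duration cap of 15 seconds is an assumption: the spec quotes 15-30 seconds without saying which mode gets which limit, so the cap is left as a parameter.

```python
from math import gcd

MAX_REFERENCE_IMAGES = 9  # from the Input/Output Formats entry above


def aspect_ratio(width: int, height: int) -> str:
    """Reduce pixel dimensions to a ratio string, e.g. 1920x1080 -> '16:9'."""
    if width <= 0 or height <= 0:
        raise ValueError("dimensions must be positive")
    d = gcd(width, height)
    return f"{width // d}:{height // d}"


def check_inputs(num_images: int, duration_s: float, max_duration_s: float = 15.0) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if not 1 <= num_images <= MAX_REFERENCE_IMAGES:
        problems.append(f"expected 1-{MAX_REFERENCE_IMAGES} reference images, got {num_images}")
    if not 0 < duration_s <= max_duration_s:
        problems.append(f"duration {duration_s}s is outside (0, {max_duration_s}]s")
    return problems


print(aspect_ratio(1920, 1080))  # 16:9
print(aspect_ratio(3840, 2160))  # 16:9
print(check_inputs(num_images=10, duration_s=20))
```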
Things to be aware of
Alibaba | Wan 2.7 | Reference to Video has a steeper learning curve due to advanced features like instruction editing and multi-grid inputs, and writing optimal prompts takes practice. Edge cases include complex physics simulations, where trails may appear less refined than with specialized models. A common mistake is overloading prompts without a clear reference hierarchy, which leads to inconsistent outputs; always prioritize subject consistency directives. Resource needs scale with duration and resolution, and 4K pro modes demand more credits on each::labs. Test short clips first to avoid wasted generations in multi-shot workflows.
Key considerations
Before using Alibaba | Wan 2.7 | Reference to Video, ensure access to high-quality reference images for optimal subject consistency, as multi-shot scenes rely on clear inputs like 9-grid references. This model excels in scenarios requiring character persistence, such as short films or ads, over alternatives lacking first/last frame control. Processing via the Alibaba | Wan 2.7 | Reference to Video API balances speed and quality, with pro variants offering 4K at higher compute costs. Users should prioritize cloud platforms like each::labs for seamless integration, noting credit-based pricing starting around $10 for 100 credits. Best for teams handling instruction-based edits rather than raw physics simulations.
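As a worked example of the credit pricing quoted above, $10 for 100 credits comes to $0.10 per credit. The sketch below turns that into a rough budget estimate. The number of credits a single generation consumes is not stated here, so `credits_per_generation` is a hypothetical input that should be replaced with the platform's actual figure.

```python
from decimal import Decimal

PACK_PRICE_USD = Decimal("10")  # "around $10 for 100 credits" per the text
PACK_CREDITS = 100


def price_per_credit() -> Decimal:
    """Dollar cost of one credit at the quoted pack price."""
    return PACK_PRICE_USD / PACK_CREDITS


def estimate_cost(generations: int, credits_per_generation: Decimal) -> Decimal:
    """Rough spend for a batch; credits_per_generation is an assumed input."""
    if generations < 0 or credits_per_generation < 0:
        raise ValueError("inputs must be non-negative")
    return generations * credits_per_generation * price_per_credit()


print(price_per_credit())
# Hypothetical: 12 test generations at 2.5 credits each.
print(estimate_cost(12, Decimal("2.5")))
```

`Decimal` is used instead of floats so that currency arithmetic stays exact.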
Limitations
Alibaba | Wan 2.7 | Reference to Video caps generations at 15-30 seconds, so it is unsuitable for full-length videos. Physics handling lags behind top competitors in dynamic scenes, with occasional motion artifacts. No open weights are available yet: access is cloud-only via APIs such as each::labs, pending a Q2 2026 release. Inputs are limited to 9 reference images, and complex edits may require multiple iterations. Audio sync excels at lip-sync but falters with heavy accents or non-frontal faces.
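Because of the duration cap, a longer piece has to be planned as several separate generations. The sketch below splits a target length into clip durations that each fit under the cap. The helper name and the 15-second default are assumptions. Stitching the clips together happens outside the model, and subject consistency across separately generated clips is not something this page guarantees.

```python
def plan_clips(total_s: int, cap_s: int = 15) -> list[int]:
    """Split a target duration into per-generation clip lengths <= cap_s."""
    if total_s <= 0 or cap_s <= 0:
        raise ValueError("durations must be positive")
    full, rest = divmod(total_s, cap_s)
    # Full-length clips first, then one shorter clip for any remainder.
    return [cap_s] * full + ([rest] if rest else [])


print(plan_clips(70))      # [15, 15, 15, 15, 10]
print(plan_clips(60, 30))  # [30, 30]
```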

