WAN V2.6
Wan 2.6 is a reference-to-video model that generates high-quality videos while preserving visual style, motion, and scene consistency from a reference input.
Avg Run Time: 320s
Model Slug: wan-v2-6-reference-to-video
Release Date: December 16, 2025
API & SDK
Create a Prediction
Send a POST request to create a new prediction. This will return a prediction ID that you'll use to check the result. The request should include your model inputs and API key.
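A minimal sketch of this step in Python. The endpoint URL, `X-API-Key` header, field names, and the `predictionID` response field are assumptions for illustration; check the Eachlabs API reference for the exact schema.

```python
import json
import urllib.request

API_URL = "https://api.eachlabs.ai/v1/prediction"  # assumed endpoint; verify in the docs
API_KEY = "YOUR_API_KEY"

def build_prediction_payload(prompt, reference_videos, resolution="1080p", duration=5):
    """Assemble the model inputs for wan-v2-6-reference-to-video.

    Field names are illustrative; consult the API reference for the exact schema.
    """
    return {
        "model": "wan-v2-6-reference-to-video",
        "input": {
            "prompt": prompt,
            "reference_videos": reference_videos,  # up to 3 URLs, 2-30 s each
            "resolution": resolution,              # "720p" or "1080p"
            "duration": duration,                  # integer, 2-15 s
        },
    }

def create_prediction(payload):
    """POST the payload and return the prediction ID used for polling."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"X-API-Key": API_KEY, "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["predictionID"]  # assumed response field
```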
Get Prediction Result
Poll the prediction endpoint with the prediction ID until the result is ready. The API uses long-polling, so you'll need to repeatedly check until you receive a success status.
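The polling loop can be sketched like this. The `fetch_status` callable stands in for the actual GET request to the prediction endpoint, and the status strings ("success", "failed") are assumptions; check the API docs for the real values.

```python
import time

def wait_for_result(prediction_id, fetch_status, poll_interval=5.0, timeout=600.0):
    """Poll until the prediction reaches a terminal status.

    `fetch_status` is any callable returning a dict such as
    {"status": "success", "output": "...mp4"}; in production it would GET
    the prediction endpoint with your API key.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = fetch_status(prediction_id)
        status = result.get("status")
        if status == "success":
            return result["output"]           # URL of the generated MP4
        if status in ("failed", "canceled"):
            raise RuntimeError(f"prediction ended with status {status!r}")
        time.sleep(poll_interval)             # not ready yet; check again
    raise TimeoutError("prediction did not finish in time")
```

Injecting `fetch_status` keeps the retry logic testable without network access.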
Readme
Overview
wan-v2.6-reference-to-video — Reference-to-Video AI Model
Developed by Alibaba as part of the wan v2.6 family, wan-v2.6-reference-to-video is a reference-to-video model that generates high-quality videos while preserving visual style, motion patterns, and voice characteristics from uploaded reference videos. Unlike traditional image-to-video or text-to-video approaches that rely on static images or text descriptions alone, this model extracts appearance, movement, and audio features from reference material—enabling creators to maintain consistent character identity and stylistic elements across generated video sequences. This capability solves a critical problem for content creators: generating multiple video variations that feel cohesive and on-brand without manual reshooting or complex post-production work.
The model accepts up to 3 reference videos (2–30 seconds each) alongside text prompts, making it ideal for creators building AI video generators for professional workflows. It produces multi-shot narrative videos up to 15 seconds at 720p or 1080p resolution with synchronized audio, delivering cinematic quality suitable for commercial applications.
Technical Specifications
What Sets wan-v2.6-reference-to-video Apart
Multi-Reference Video Input: Unlike most image-to-video AI models that accept only static images, wan-v2.6-reference-to-video processes up to 3 reference videos simultaneously. The model intelligently extracts appearance, movement patterns, and voice characteristics from each reference, then applies these features consistently to newly generated videos. This eliminates the need for manual character or style consistency checks across multiple takes.
Native Audio-Video Synchronization: The model generates videos with automatically synchronized audio, including dialogue, ambient sound, and effects matched to scene context. This integrated approach removes the friction of separate audio generation and manual syncing—a significant advantage for developers building production-scale AI video generation APIs.
Multi-Shot Narrative with Scene Continuity: wan-v2.6-reference-to-video understands storyboard-style prompts and generates coherent multi-shot sequences with smooth transitions and natural camera movements. This capability transforms fragmented clips into cinematic narratives, making it particularly valuable for marketing teams and content creators producing professional short-form video content.
Technical Specifications:
- Resolution: 720p or 1080p
- Video Duration: 2–15 seconds (integer values)
- Reference Input: Up to 3 videos (2–30 seconds each)
- Output Format: MP4 (H.264 encoding, 30 fps)
- Audio Support: Native generation with lip-sync and scene-matched effects
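The constraints above can be checked client-side before submitting a request. A small sketch; the function and parameter names are ours, not part of any SDK.

```python
def validate_inputs(reference_durations, duration, resolution):
    """Check a request against the published constraints before submitting.

    `reference_durations` is a list of reference-clip lengths in seconds.
    Returns a list of human-readable error strings (empty if valid).
    """
    errors = []
    if not 1 <= len(reference_durations) <= 3:
        errors.append("provide 1-3 reference videos")
    for i, d in enumerate(reference_durations):
        if not 2 <= d <= 30:
            errors.append(f"reference {i} must be 2-30 s long, got {d}")
    if not (isinstance(duration, int) and 2 <= duration <= 15):
        errors.append("duration must be an integer between 2 and 15 seconds")
    if resolution not in ("720p", "1080p"):
        errors.append("resolution must be '720p' or '1080p'")
    return errors
```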
Key Considerations
- Use high-quality reference videos (at least 5 seconds long) for optimal character and motion replication and consistent results across shots
- Start with simple prompts for scene planning, then refine iteratively with specific shot types and negative prompts
- Avoid overloading the model with too many references, which can reduce stability; limit inputs to one primary video plus supplementary images
- Weigh quality against speed: longer durations (up to 15 s) increase generation time but enable fuller narratives; prioritize 1080p for production use
- Describe desired actions, lighting, and camera movements explicitly (e.g., "dance battle with cinematic lighting, dynamic camera"); use prompt_extend for automatic prompt expansion
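One way to assemble a storyboard-style prompt from per-shot descriptions, as suggested above. The numbered-shot phrasing is our convention for illustration, not a documented prompt format.

```python
def storyboard_prompt(shots, style=None):
    """Join per-shot descriptions into a single storyboard-style prompt.

    `shots` is an ordered list of shot descriptions; `style` is an
    optional overall style directive appended at the end.
    """
    lines = [f"Shot {i + 1}: {desc}" for i, desc in enumerate(shots)]
    if style:
        lines.append(f"Overall style: {style}")
    return " ".join(lines)
```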
Tips & Tricks
How to Use wan-v2.6-reference-to-video on Eachlabs
Access wan-v2.6-reference-to-video through Eachlabs' Playground or API. Provide up to 3 reference videos (2–30 seconds), a text prompt describing your desired output, and specify resolution (720p or 1080p) and duration (2–15 seconds). The model generates a synchronized video with audio, delivered as an MP4 file. Use the Eachlabs SDK or REST API to integrate reference-to-video generation directly into your application, enabling production-scale video creation workflows.
Capabilities
- Generates high-fidelity 1080p, 30 fps videos with fluid motion, sharp details, and film-style lighting from references
- Precise lip-sync and native audio generation for voiceovers, music, and effects perfectly aligned frame-by-frame
- Multi-shot storytelling with automatic scene planning, seamless transitions, and consistent characters across shots
- Reference video replication for cloning subjects (people, animals, objects) including look, voice, and motion
- Versatile modes: reference-to-video, image-to-video, text-to-video with multimodal integration for professional outputs
- Strong temporal coherence and stability, especially in motion transfer and multi-reference guidance
What Can I Use It For?
Use Cases for wan-v2.6-reference-to-video
Character-Driven Content Creation: Animators and character designers can upload reference videos of a character performing specific movements, then generate variations with different backgrounds, lighting, or scenarios while maintaining the character's appearance and motion style. For example, a creator might input a reference video of a character walking and request "the same character walking through a futuristic city at sunset"—the model preserves the character's gait and appearance while adapting the environment.
Brand-Consistent Marketing Videos: Marketing teams building an AI video generator for e-commerce can use reference videos of product demonstrations or brand spokespersons to generate multiple campaign variations. By feeding a reference video of a product unboxing plus prompts like "show this product being used in a modern home office with natural lighting," teams produce on-brand content at scale without reshooting.
Voice and Style Preservation for Creators: Content creators and YouTubers can upload reference videos capturing their speaking style, facial expressions, and voice characteristics, then generate new video content in different settings or scenarios. This enables rapid iteration on video ideas while maintaining personal brand consistency—critical for creators managing multiple content series.
API Integration for Video Editing Platforms: Developers building AI-powered video editing tools can integrate wan-v2.6-reference-to-video to offer users reference-based generation as a core feature. The model's support for multiple reference inputs and native audio synchronization makes it suitable for professional workflows requiring consistent output quality and minimal post-processing.
Things to Be Aware Of
- Experimental multi-reference support works best with complementary inputs; mixing disparate styles may cause minor inconsistencies
- Known quirks: Longer durations (15s) can occasionally show subtle motion drift in complex actions, per user tests
- Performance considerations: Stable on standard hardware via optimized pipelines, but multi-shot increases compute needs
- Resource requirements: Handles 1080p efficiently; users report quick generations for 5–10 s clips
- Consistency factors: Excels in one/two-person shots; praised for clone-level subject preservation
- Positive user feedback themes: "Game-changing for lip-sync accuracy" and "Seamless multi-shot flow" from recent discussions
- Common concerns: Rare audio desync in noisy references; mitigated by clean inputs
Limitations
- Limited to 15-second videos, requiring stitching for longer content
- Optimal for one/two-person scenes; complex crowds or rapid multi-subject interactions may lose some fidelity
- Relies heavily on reference quality; low-res or blurry inputs degrade output consistency
Pricing
Pricing Type: Dynamic
1080p resolution: $0.15 per second of output video (total = duration × $0.15)
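As a worked example of the formula above, a small cost estimator (assuming the 1080p rate; the function name is ours):

```python
def estimate_cost(duration_seconds, rate_per_second=0.15):
    """Estimated charge for a generation at $0.15 per output second (1080p)."""
    return round(duration_seconds * rate_per_second, 2)
```

A maximum-length 15 s clip at 1080p would therefore cost about $2.25.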
Related AI Models
You can seamlessly integrate advanced AI capabilities into your applications without the hassle of managing complex infrastructure.
