WAN V2.6
Wan 2.6 is a reference-to-video model that generates high-quality videos while preserving visual style, motion, and scene consistency from a reference input.
Avg Run Time: 320s
Model Slug: wan-v2-6-reference-to-video
Release Date: December 16, 2025
API & SDK
Create a Prediction
Send a POST request to create a new prediction. This will return a prediction ID that you'll use to check the result. The request should include your model inputs and API key.
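A minimal sketch of this step in Python. The endpoint URL, `X-API-Key` header, field names, and the `predictionID` response field are assumptions for illustration; check the Eachlabs API reference for the exact schema.

```python
import json
import urllib.request

API_URL = "https://api.eachlabs.ai/v1/prediction"  # assumed endpoint; verify in the docs
API_KEY = "YOUR_API_KEY"

def build_prediction_payload(prompt, reference_videos, resolution="1080p", duration=5):
    """Assemble the model inputs for wan-v2-6-reference-to-video.

    Field names are illustrative; consult the API reference for the exact schema.
    """
    return {
        "model": "wan-v2-6-reference-to-video",
        "input": {
            "prompt": prompt,
            "reference_videos": reference_videos,  # up to 3 URLs, 2-30 s each
            "resolution": resolution,              # "720p" or "1080p"
            "duration": duration,                  # integer, 2-15 s
        },
    }

def create_prediction(payload):
    """POST the payload and return the prediction ID used for polling."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"X-API-Key": API_KEY, "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["predictionID"]  # assumed response field
```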
Get Prediction Result
Poll the prediction endpoint with the prediction ID until the result is ready. The API uses long-polling, so you'll need to repeatedly check until you receive a success status.
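The polling loop can be sketched like this. The `fetch_status` callable stands in for the actual GET request to the prediction endpoint, and the status strings ("success", "failed") are assumptions; check the API docs for the real values.

```python
import time

def wait_for_result(prediction_id, fetch_status, poll_interval=5.0, timeout=600.0):
    """Poll until the prediction reaches a terminal status.

    `fetch_status` is any callable returning a dict such as
    {"status": "success", "output": "...mp4"}; in production it would GET
    the prediction endpoint with your API key.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = fetch_status(prediction_id)
        status = result.get("status")
        if status == "success":
            return result["output"]           # URL of the generated MP4
        if status in ("failed", "canceled"):
            raise RuntimeError(f"prediction ended with status {status!r}")
        time.sleep(poll_interval)             # not ready yet; check again
    raise TimeoutError("prediction did not finish in time")
```

Injecting `fetch_status` keeps the retry logic testable without network access.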
Readme
Overview
wan-v2.6-reference-to-video — Reference-to-Video AI Model
Developed by Alibaba as part of the wan v2.6 family, wan-v2.6-reference-to-video is a reference-to-video model that generates high-quality videos while preserving visual style, motion patterns, and voice characteristics from uploaded reference videos. Unlike traditional image-to-video or text-to-video approaches that rely on static images or text descriptions alone, this model extracts appearance, movement, and audio features from reference material—enabling creators to maintain consistent character identity and stylistic elements across generated video sequences. This capability solves a critical problem for content creators: generating multiple video variations that feel cohesive and on-brand without manual reshooting or complex post-production work.
The model accepts up to 3 reference videos (2–30 seconds each) alongside text prompts, making it ideal for creators building AI video generators for professional workflows. It produces multi-shot narrative videos up to 15 seconds at 720p or 1080p resolution with synchronized audio, delivering cinematic quality suitable for commercial applications.
Technical Specifications
What Sets wan-v2.6-reference-to-video Apart
Multi-Reference Video Input: Unlike most image-to-video AI models that accept only static images, wan-v2.6-reference-to-video processes up to 3 reference videos simultaneously. The model intelligently extracts appearance, movement patterns, and voice characteristics from each reference, then applies these features consistently to newly generated videos. This eliminates the need for manual character or style consistency checks across multiple takes.
Native Audio-Video Synchronization: The model generates videos with automatically synchronized audio, including dialogue, ambient sound, and effects matched to scene context. This integrated approach removes the friction of separate audio generation and manual syncing—a significant advantage for developers building production-scale AI video generation APIs.
Multi-Shot Narrative with Scene Continuity: wan-v2.6-reference-to-video understands storyboard-style prompts and generates coherent multi-shot sequences with smooth transitions and natural camera movements. This capability transforms fragmented clips into cinematic narratives, making it particularly valuable for marketing teams and content creators producing professional short-form video content.
Technical Specifications:
- Resolution: 720p or 1080p
- Video Duration: 2–15 seconds (integer values)
- Reference Input: Up to 3 videos (2–30 seconds each)
- Output Format: MP4 (H.264 encoding, 30 fps)
- Audio Support: Native generation with lip-sync and scene-matched effects
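The constraints above can be checked client-side before submitting a request. A small sketch; the function and parameter names are ours, not part of any SDK.

```python
def validate_inputs(reference_durations, duration, resolution):
    """Check a request against the published constraints before submitting.

    `reference_durations` is a list of reference-clip lengths in seconds.
    Returns a list of human-readable error strings (empty if valid).
    """
    errors = []
    if not 1 <= len(reference_durations) <= 3:
        errors.append("provide 1-3 reference videos")
    for i, d in enumerate(reference_durations):
        if not 2 <= d <= 30:
            errors.append(f"reference {i} must be 2-30 s long, got {d}")
    if not (isinstance(duration, int) and 2 <= duration <= 15):
        errors.append("duration must be an integer between 2 and 15 seconds")
    if resolution not in ("720p", "1080p"):
        errors.append("resolution must be '720p' or '1080p'")
    return errors
```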
Key Considerations
- Use high-quality reference videos (at least 5 seconds long) for optimal character and motion replication and consistent results across shots
- Start with simple prompts for scene planning, then refine iteratively with specific shot types and negative prompts
- Avoid overloading the model with too many references, which can reduce stability; limit inputs to one primary video plus supplementary images
- Weigh quality against speed: longer durations (up to 15 s) increase generation time but enable fuller narratives; prioritize 1080p for production use
- Describe desired actions, lighting, and camera movements explicitly (e.g., "dance battle with cinematic lighting, dynamic camera"); use prompt_extend for automatic prompt expansion
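One way to assemble a storyboard-style prompt from per-shot descriptions, as suggested above. The numbered-shot phrasing is our convention for illustration, not a documented prompt format.

```python
def storyboard_prompt(shots, style=None):
    """Join per-shot descriptions into a single storyboard-style prompt.

    `shots` is an ordered list of shot descriptions; `style` is an
    optional overall style directive appended at the end.
    """
    lines = [f"Shot {i + 1}: {desc}" for i, desc in enumerate(shots)]
    if style:
        lines.append(f"Overall style: {style}")
    return " ".join(lines)
```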
Tips & Tricks
How to Use wan-v2.6-reference-to-video on Eachlabs
Access wan-v2.6-reference-to-video through Eachlabs' Playground or API. Provide up to 3 reference videos (2–30 seconds), a text prompt describing your desired output, and specify resolution (720p or 1080p) and duration (2–15 seconds). The model generates a synchronized video with audio, delivered as an MP4 file. Use the Eachlabs SDK or REST API to integrate reference-to-video generation directly into your application, enabling production-scale video creation workflows.
Capabilities
- Generates high-fidelity 1080p, 30 fps videos with fluid motion, sharp details, and film-style lighting from references
- Precise lip-sync and native audio generation for voiceovers, music, and effects perfectly aligned frame-by-frame
- Multi-shot storytelling with automatic scene planning, seamless transitions, and consistent characters across shots
- Reference video replication for cloning subjects (people, animals, objects) including look, voice, and motion
- Versatile modes: reference-to-video, image-to-video, text-to-video with multimodal integration for professional outputs
- Strong temporal coherence and stability, especially in motion transfer and multi-reference guidance
What Can I Use It For?
Use Cases for wan-v2.6-reference-to-video
Character-Driven Content Creation: Animators and character designers can upload reference videos of a character performing specific movements, then generate variations with different backgrounds, lighting, or scenarios while maintaining the character's appearance and motion style. For example, a creator might input a reference video of a character walking and request "the same character walking through a futuristic city at sunset"—the model preserves the character's gait and appearance while adapting the environment.
Brand-Consistent Marketing Videos: Marketing teams building an AI video generator for e-commerce can use reference videos of product demonstrations or brand spokespersons to generate multiple campaign variations. By feeding a reference video of a product unboxing plus prompts like "show this product being used in a modern home office with natural lighting," teams produce on-brand content at scale without reshooting.
Voice and Style Preservation for Creators: Content creators and YouTubers can upload reference videos capturing their speaking style, facial expressions, and voice characteristics, then generate new video content in different settings or scenarios. This enables rapid iteration on video ideas while maintaining personal brand consistency—critical for creators managing multiple content series.
API Integration for Video Editing Platforms: Developers building AI-powered video editing tools can integrate wan-v2.6-reference-to-video to offer users reference-based generation as a core feature. The model's support for multiple reference inputs and native audio synchronization makes it suitable for professional workflows requiring consistent output quality and minimal post-processing.
Things to Be Aware Of
- Experimental multi-reference support works best with complementary inputs; mixing disparate styles may cause minor inconsistencies
- Known quirks: Longer durations (15s) can occasionally show subtle motion drift in complex actions, per user tests
- Performance considerations: Stable on standard hardware via optimized pipelines, but multi-shot increases compute needs
- Resource requirements: Handles 1080p efficiently; users report quick generations for 5–10 s clips
- Consistency factors: Excels in one/two-person shots; praised for clone-level subject preservation
- Positive user feedback themes: "Game-changing for lip-sync accuracy" and "Seamless multi-shot flow" from recent discussions
- Common concerns: Rare audio desync in noisy references; mitigated by clean inputs
Limitations
- Limited to 15-second videos, requiring stitching for longer content
- Optimal for one/two-person scenes; complex crowds or rapid multi-subject interactions may lose some fidelity
- Relies heavily on reference quality; low-res or blurry inputs degrade output consistency
Pricing
Pricing Type: Dynamic
1080p resolution: $0.15 per second of output video (total = duration × $0.15)
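As a worked example of the formula above, a small cost estimator (assuming the 1080p rate; the function name is ours):

```python
def estimate_cost(duration_seconds, rate_per_second=0.15):
    """Estimated charge for a generation at $0.15 per output second (1080p)."""
    return round(duration_seconds * rate_per_second, 2)
```

A maximum-length 15 s clip at 1080p would therefore cost about $2.25.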
Related AI Models
You can seamlessly integrate advanced AI capabilities into your applications without the hassle of managing complex infrastructure.
