Wan | v2.6 | Reference to Video


WAN V2.6

Wan 2.6 is a reference-to-video model that generates high-quality videos while preserving visual style, motion, and scene consistency from a reference input.

Avg Run Time: 320s

Model Slug: wan-v2-6-reference-to-video

Release Date: December 16, 2025


API & SDK

Create a Prediction

Send a POST request to create a new prediction. This will return a prediction ID that you'll use to check the result. The request should include your model inputs and API key.
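A minimal sketch of this step in Python, assuming a REST endpoint of the form https://api.eachlabs.ai/v1/prediction/, an X-API-Key header, and illustrative input field names (prompt, reference_video, aspect_ratio, duration); confirm the exact paths and field names against the Eachlabs API reference.

```python
import requests

API_KEY = "YOUR_EACHLABS_API_KEY"

# Assumed endpoint and payload shape -- confirm against the Eachlabs API reference.
CREATE_URL = "https://api.eachlabs.ai/v1/prediction/"

payload = {
    "model": "wan-v2-6-reference-to-video",  # model slug from this page
    "input": {
        # Hypothetical input field names, for illustration only
        "prompt": "Dance battle between Character1 and Character2, cinematic lighting",
        "reference_video": "https://example.com/reference.mp4",
        "aspect_ratio": "16:9",
        "duration": 10,
    },
}

resp = requests.post(CREATE_URL, headers={"X-API-Key": API_KEY}, json=payload)
resp.raise_for_status()
prediction_id = resp.json()["predictionID"]  # response field name may differ (e.g. "id")
print("Created prediction:", prediction_id)
```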

Get Prediction Result

Poll the prediction endpoint with the prediction ID until the result is ready. The API uses long-polling, so you'll need to repeatedly check until you receive a success status.
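Continuing the sketch above, a simple polling loop; the GET path, status values, and output field name are again assumptions to verify against the API reference.

```python
import time

import requests

API_KEY = "YOUR_EACHLABS_API_KEY"
prediction_id = "PREDICTION_ID_FROM_THE_CREATE_STEP"

# Assumed result endpoint -- confirm the exact path in the Eachlabs API reference.
RESULT_URL = f"https://api.eachlabs.ai/v1/prediction/{prediction_id}"

while True:
    result = requests.get(RESULT_URL, headers={"X-API-Key": API_KEY}).json()
    status = result.get("status")
    if status == "success":
        print("Video URL:", result.get("output"))  # output field name may differ
        break
    if status in ("failed", "error"):
        raise RuntimeError(f"Prediction failed: {result}")
    time.sleep(5)  # avg run time is ~320s, so expect to poll for several minutes
```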

Readme

Table of Contents
Overview
Technical Specifications
Key Considerations
Tips & Tricks
Capabilities
What Can I Use It For?
Things to Be Aware Of
Limitations

Overview

Wan 2.6 is a state-of-the-art AI video generation model developed by Alibaba and released in December 2025. It specializes in reference-to-video generation, transforming reference videos, images, text prompts, or audio into high-fidelity 1080p videos at 24fps, up to 15 seconds long. The wan-v2.6-reference-to-video variant excels at preserving visual style, motion, character consistency, and scene elements from reference inputs while enabling new content creation through prompts, making it well suited for production-ready videos with native lip-sync and multi-shot storytelling. Key capabilities include video reference replication for cloning appearances and voices, cinematic multi-shot narratives with seamless transitions, and synchronized audio-visual output for dialogue, music, and effects without manual editing.

This model builds on previous versions such as Wan 2.5 with upgrades including support for multiple reference inputs, enhanced motion stability, stronger temporal coherence, improved lip-sync accuracy, and extended video duration for fuller narratives. Its multimodal architecture integrates text, image, video, and audio processing in a unified workflow, allowing precise control over aspects like aspect ratios (16:9, 9:16, 1:1), shot types, and prompt-driven scene planning. What sets wan-v2.6-reference-to-video apart is its ability to generate complex, reference-guided videos in a single pass, streamlining workflows for creators by automating the multi-shot expansion, motion transfer, and audio sync that typically require separate tools.

Technical Specifications

  • Architecture: Multimodal video generation model with reference-guided conditioning, ControlNet-like support, and IP-Adapter for stability
  • Parameters: Not publicly specified
  • Resolution: 1080p at 24fps, up to 15 seconds duration
  • Input/Output formats: Reference videos/images (single or multiple), text prompts, audio URLs; outputs MP4 videos with native audio
  • Performance metrics: Enhanced motion stability and temporal coherence over Wan 2.5; commercial-grade consistency and lip-sync accuracy

Key Considerations

  • Use high-quality reference videos (at least 5 seconds) for optimal character and motion replication to maintain consistency across shots
  • Best practices include starting with simple prompts for scene planning, then refining iteratively with specific shot types and negative prompts
  • Common pitfalls: Overloading with too many references can reduce stability; limit to one primary video and supplementary images
  • Quality vs speed trade-offs: Higher durations (up to 15s) increase generation time but enable fuller narratives; prioritize 1080p for production use
  • Prompt engineering tips: Describe desired actions, lighting, and camera movements explicitly (e.g., "dance battle with cinematic lighting, dynamic camera"); use prompt_extend for auto-expansion

Tips & Tricks

  • Optimal parameter settings: Set duration to 10-15s for multi-shot stories, enable audio=true for lip-sync, choose seed for reproducibility, and select aspect ratio matching output needs
  • Prompt structuring advice: Combine reference description with action prompt (e.g., "Use reference for character look: Dance battle between Character1 and Character2, cinematic lighting")
  • How to achieve specific results: For voice cloning, supply an audio_url alongside the reference video; for multi-person shots, specify the interactions in the prompt
  • Iterative refinement strategies: Generate an initial output, analyze it for inconsistencies, tweak the prompt or negative_prompt, and regenerate with the same seed
  • Advanced techniques: Feed multiple static images for style fusion in complex scenes; use the shot_type parameter for transitions such as close-up to wide (a combined example of these settings follows this list)
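Putting these tips together, a hedged sketch of an input object for a multi-shot, lip-synced generation. The field names mirror the parameters mentioned above (prompt, negative_prompt, audio_url, duration, seed, aspect ratio, shot_type), but the exact keys accepted by the API should be checked against the model's input schema.

```python
# Hypothetical input object illustrating the settings discussed above;
# verify the exact field names against the model's input schema.
generation_input = {
    "prompt": (
        "Use reference for character look: dance battle between Character1 "
        "and Character2, cinematic lighting, dynamic camera"
    ),
    "negative_prompt": "blurry, low resolution, distorted faces",
    "reference_video": "https://example.com/reference.mp4",  # one primary reference video
    "reference_images": ["https://example.com/style.png"],   # supplementary style images
    "audio_url": "https://example.com/voiceover.mp3",        # for voice cloning / lip-sync
    "audio": True,            # enable native audio and lip-sync
    "duration": 15,           # 10-15s for multi-shot stories
    "aspect_ratio": "16:9",
    "shot_type": "close-up",  # transition target, e.g. close-up to wide
    "seed": 42,               # reuse the same seed when iterating on the prompt
}
```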

Capabilities

  • Generates high-fidelity 1080p, 24fps videos with fluid motion, sharp details, and film-style lighting from references
  • Precise lip-sync and native audio generation for voiceovers, music, and effects perfectly aligned frame-by-frame
  • Multi-shot storytelling with automatic scene planning, seamless transitions, and consistent characters across shots
  • Reference video replication for cloning subjects (people, animals, objects) including look, voice, and motion
  • Versatile modes: reference-to-video, image-to-video, text-to-video with multimodal integration for professional outputs
  • Strong temporal coherence and stability, especially in motion transfer and multi-reference guidance

What Can I Use It For?

  • Marketing agencies creating ads with cloned spokespersons and lip-synced pitches from reference clips
  • E-commerce product demos using reference videos for consistent branding and motion in promotional sequences
  • Filmmakers generating storyboards-turned-videos with multi-shot narratives and custom audio sync
  • Educators producing short explanatory videos with character-consistent animations from image references
  • Social media content like TikTok/Reels with dynamic dances or battles derived from user-submitted references
  • Corporate communications for internal training modules featuring narrated multi-scene stories

Things to Be Aware Of

  • Experimental multi-reference support works best with complementary inputs; mixing disparate styles may cause minor inconsistencies
  • Known quirks: Longer durations (15s) can occasionally show subtle motion drift in complex actions, per user tests
  • Performance considerations: Stable on standard hardware via optimized pipelines, but multi-shot increases compute needs
  • Resource requirements: Handles 1080p efficiently; users report quick generations for 5-10s clips
  • Consistency factors: Excels in one/two-person shots; praised for clone-level subject preservation
  • Positive user feedback themes: "Game-changing for lip-sync accuracy" and "Seamless multi-shot flow" from recent discussions
  • Common concerns: Rare audio desync in noisy references; mitigated by clean inputs

Limitations

  • Limited to 15-second videos, requiring stitching for longer content
  • Optimal for one/two-person scenes; complex crowds or rapid multi-subject interactions may lose some fidelity
  • Relies heavily on reference quality; low-res or blurry inputs degrade output consistency

Pricing

Pricing Type: Dynamic

1080p resolution: $0.15 per second of output video (total cost = output duration × $0.15)
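For example, under this formula a 10-second 1080p clip costs 10 × $0.15 = $1.50. A quick sketch of the calculation:

```python
PRICE_PER_SECOND_1080P = 0.15  # USD per second of 1080p output, per the pricing note above

def estimate_cost(duration_seconds: float) -> float:
    """Estimated charge for a 1080p output video of the given duration."""
    return duration_seconds * PRICE_PER_SECOND_1080P

print(estimate_cost(10))  # 1.5  -> $1.50 for a 10-second clip
print(estimate_cost(15))  # 2.25 -> $2.25 for the maximum 15-second clip
```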