Wan v2.6 · Text to Video

Video · wan-v2.6 · by Alibaba

Wan 2.6 is a text-to-video model that generates high-quality videos with smooth motion and cinematic detail.

Runtime (p50): 5m
Estimated price: from $0.1
Call the API
prediction.sh
curl -X POST \
  -H "X-API-Key: $EACHLABS_API_KEY" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "wan-v2-6-text-to-video",
    "version": "0.0.1",
    "input": {
        "prompt": "Humorous but Premium Mini-Trailer\n\nConcept: A tiny porcelain robot curator controls reality like a film set.\n\nVisual style (global):\nExtreme photoreal 4K, cinematic lighting, shallow depth of field, subtle film grain, smooth stabilized camera, premium VFX realism.\n\nShot 1 — [0–3s]\n\nScene:\nMacro close-up of a small porcelain robot curator.\nIt gently taps a miniature tuning fork engraved with “eachlabs”.\nSoft reverb in the air.\n\nDialogue (natural, calm):\n\n“Begin.”\n\nShot 2 — [3–6s]\n\nHard cut:\nA vast Arctic ice plain under pale blue sky. Wind blows snow across the ground.\nWide cinematic shot.\nThe robot stands in the foreground, tiny against the scale.\n\nIt slightly turns its head.\n\nDialogue:\n\n“More space.”\n\n(The horizon stretches wider, camera pulls back.)\n\nShot 3 — [6–10s]\n\nHard cut:\nA glowing underwater coral canyon. Sun rays penetrate the water, particles floating.\nThe robot calmly walks along the seabed, unaffected by water.\nCamera slowly tracks forward between corals and fish.\n\nDialogue (soft, curious):\n\n“Let it breathe.”\n\nShot 4 — [10–15s]\n\nHard cut:\nA silent lunar surface at dawn. Earth rising in the background.\nSlow orbital camera move around the robot as it looks at the horizon.\n\nIt nods once.\n\nDialogue:\n\n“Ready for the next reality.”",
        "aspect_ratio": "16:9",
        "resolution": "1080p",
        "duration": "15",
        "negative_prompt": "low resolution, error, worst quality, low quality, defects",
        "enable_prompt_expansion": true,
        "multi_shots": true,
        "enable_safety_checker": true
    },
    "webhook_url": ""
}' \
  https://api.eachlabs.ai/v1/prediction/
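
For longer storyboard prompts, keeping the request body in a file can avoid shell-quoting problems. The sketch below reuses the endpoint, headers, and field names from the call above; the file name `payload.json`, the helper name `submit_prediction`, and the shortened prompt are illustrative assumptions, not part of the Eachlabs documentation.

```sh
#!/bin/sh
# Write the request body to a file (payload.json is a hypothetical name).
# The quoted 'EOF' keeps the text literal, so \n stays a JSON newline escape.
cat > payload.json <<'EOF'
{
  "model": "wan-v2-6-text-to-video",
  "version": "0.0.1",
  "input": {
    "prompt": "Shot 1 — [0–3s]\n\nScene:\nMacro close-up of a small porcelain robot curator.",
    "aspect_ratio": "16:9",
    "resolution": "1080p",
    "duration": "15",
    "negative_prompt": "low resolution, error, worst quality, low quality, defects",
    "enable_prompt_expansion": true,
    "multi_shots": true,
    "enable_safety_checker": true
  },
  "webhook_url": ""
}
EOF

# Hypothetical helper: POST the JSON file given as $1 to the prediction endpoint.
# curl reads the body from the file because of the @ prefix.
submit_prediction() {
  curl -X POST \
    -H "X-API-Key: $EACHLABS_API_KEY" \
    -H "Content-Type: application/json" \
    --data @"$1" \
    https://api.eachlabs.ai/v1/prediction/
}

# Uncomment to send the request (requires EACHLABS_API_KEY to be set):
# submit_prediction payload.json
```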
Documentation (8 sections)
  • Overview

    wan-v2.6-text-to-video — Text to Video AI Model

    Developed by Alibaba as part of the wan-v2.6 family, wan-v2.6-text-to-video is a text-to-video model that turns text prompts into cinematic multi-shot videos up to 15 seconds long with synchronized audio. It generates coherent narratives with smooth transitions, stable characters, and professional camera control, which reduces the editing needed to produce high-quality short-form video. Suited to developers who need multi-shot generation, it supports 720p and 1080p resolutions at 30 fps in MP4 format and is aimed at commercial use.

  • Capabilities
    • Generates smooth, high-quality 1080p videos with cinematic detail, reduced jitter, and graceful depth/perspective transitions
    • Native audio integration with phoneme-level lip-sync, including emotional micro-gestures for realistic talking animations
    • Strong prompt adherence for complex instructions, multi-character scenes, and action sequences
    • Video-to-video motion transfer for stable character consistency and multi-shot storytelling
    • Multilingual support for text prompts and audio generation, enabling localized content
    • Efficient rendering for batch production of short-form videos like social media or educational clips
    • Versatile inputs: text, images, reference videos; aspect ratios for various formats
  • Use cases

    Use Cases for wan-v2.6-text-to-video

    Content creators producing social media reels can input a prompt like "A bustling city street at dusk transitioning to a cozy cafe interior with soft jazz audio syncing to barista movements" to generate a 10-second multi-shot video with seamless camera pans and ambient sound, ready for platforms like TikTok or Instagram.

    Marketers crafting product demos can use wan-v2.6-text-to-video to generate explainers, for example turning "Slow-motion reveal of a smartphone on a rotating pedestal with sparkling reflections and upbeat music sync" into a 1080p clip that highlights features with realistic physics and lighting, without a costly shoot.

    Developers integrating the model into their apps can use its API to automate video assets, feeding prompts with optional audio to create personalized user content such as "Avatar character walking through a futuristic city, narrating in a calm voice with matching lip sync," with consistent results for interactive experiences.

    Filmmakers prototyping scenes input detailed storyboards to produce 15-second test footage with professional rhythm and transitions, accelerating pre-production for narrative shorts or ads.

  • Tips & tricks

    How to Use wan-v2.6-text-to-video on Eachlabs

    Access wan-v2.6-text-to-video on Eachlabs through the Playground for quick testing, the API for production workloads, or the SDK for custom apps. Provide a text prompt, an optional audio file, a duration (2-15s), and a resolution (720p/1080p); the model returns an MP4 video at 30 fps with a multi-shot narrative and synchronized audio. Eachlabs delivers fast, high-fidelity results optimized for your workflows.
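
A client-side check of these inputs before sending a request might look like the sketch below. It assumes the limits stated above (whole-second durations from 2 to 15, and 720p or 1080p); `valid_request` is a hypothetical helper name, and whether the API enforces the same limits server-side is not stated here.

```sh
#!/bin/sh
# Hypothetical helper: succeed (exit status 0) only if
#   $1 is a whole number of seconds between 2 and 15, and
#   $2 is either 720p or 1080p.
valid_request() {
  duration=$1
  resolution=$2

  # Reject empty strings and anything containing a non-digit (e.g. 2.5, -3).
  case $duration in
    ''|*[!0-9]*) return 1 ;;
  esac

  # Range check on the duration.
  [ "$duration" -ge 2 ] && [ "$duration" -le 15 ] || return 1

  # Only the two documented resolutions are accepted.
  case $resolution in
    720p|1080p) return 0 ;;
    *) return 1 ;;
  esac
}

# Example usage:
if valid_request 15 1080p; then
  echo "request parameters look valid"
fi
```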

  • Technical spec

    What Sets wan-v2.6-text-to-video Apart

    wan-v2.6-text-to-video stands out among text-to-video models for its rebuilt narrative engine, which interprets storyboard-style prompts as multi-shot sequences with natural camera movements and rhythm control, unlike single-clip generators. Users can produce a complete cinematic story from a single text description, which streamlines workflows for promotional clips and explainers.
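
As an illustration of a storyboard-style prompt, the sketch below prints shot headers in the "Shot N — [start–end s]" style used by the example prompt in the API call. The helper name `shot_headers` is hypothetical, and the header format is copied from that example rather than from any documented prompt syntax.

```sh
#!/bin/sh
# Hypothetical helper: given per-shot lengths in whole seconds, print one
# header per shot with cumulative start and end times.
# Note: POSIX sh has no local variables, so start/end/n are global.
shot_headers() {
  start=0
  n=1
  for len in "$@"; do
    end=$((start + len))
    printf 'Shot %d — [%d–%ds]\n' "$n" "$start" "$end"
    start=$end
    n=$((n + 1))
  done
}

# Four shots of 3, 3, 4 and 5 seconds add up to a 15-second clip.
shot_headers 3 3 4 5
```

With these lengths the headers run from `Shot 1 — [0–3s]` to `Shot 4 — [10–15s]`, the same timing as the example prompt; the scene and dialogue text for each shot would be written under its header.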

    It supports integer durations from 2 to 15 seconds in 720p or 1080p at 30 fps, with optional audio input for lip-sync and ambient sound synchronization, maintaining temporal stability over extended lengths. Developers integrating the wan-v2.6-text-to-video API benefit from fast inference and high subject fidelity, reducing post-production needs.

    • Multi-shot narrative engine: Handles complex scene sequences and transitions for professional-grade storytelling.
    • Audio-video sync: Generates or syncs audio to match lip movements and scene context, perfect for talking-head or dynamic videos.
    • Extended 15s HD support: Delivers 1080p videos with consistent lighting, motion, and character identity across shots.
  • Things to be aware of
    • Users report dramatic improvements in audio sync and motion smoothness over Wan 2.5, with fewer artifacts and more human-like gestures
    • Early adopters highlight faster processing and accessibility, ideal for iterative workflows
    • Benchmarks show efficiency gains with sparse attention, reducing generation time significantly
    • Resource needs scale with model size; cloud-optimized but larger 14B variant demands more for fidelity
    • Community notes strong character consistency across shots and stable video-to-video pipelines
    • Positive feedback on prompt accuracy for precise executions, rivaling higher-end models in specific categories
    • Some discussions mention optimization for 5-15s clips, with chaining for longer content
  • Key considerations
    • Use detailed, procedural prompts for best literal accuracy in multi-character scenes or complex actions to leverage the model's strength in precise execution
    • Optimal for short clips (5-15s); chain multiple generations for longer narratives to maintain consistency
    • Balance model size: 5B for speed, 14B for higher fidelity in demanding scenes
    • Prioritize reference videos or images for video-to-video mode to enhance motion transfer and character stability
    • Avoid overly abstract or highly interpretive prompts, as the model favors cinematic clarity over loose creativity
    • Test lip-sync with clear audio inputs for natural emotional cues like gestures and expressions
  • Limitations
    • Limited to short durations (5-15s per generation), requiring chaining for extended videos which may introduce minor inconsistencies
    • Best for structured prompts; struggles with highly abstract or overly interpretive cinematic styles compared to specialized models
    • Higher resolutions and longer clips increase render times, though mitigated by optimizations like sparse attention
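
The chaining mentioned above can be planned ahead by splitting a target length into clips that each fit one generation. This is a minimal sketch assuming the 5-15s per-generation range given in this section; `plan_chain` is a hypothetical helper, and joining the resulting clips is outside its scope.

```sh
#!/bin/sh
# Hypothetical helper: split a total length in whole seconds into the fewest
# clips of at most 15s each, spreading the seconds as evenly as possible.
# Prints one clip duration per line; fails for totals under 5s.
plan_chain() {
  total=$1
  max=15
  min=5
  [ "$total" -ge "$min" ] || return 1

  n=$(( (total + max - 1) / max ))   # number of clips (ceiling division)
  base=$(( total / n ))              # seconds every clip gets
  extra=$(( total % n ))             # leftover seconds, one per leading clip

  i=0
  while [ "$i" -lt "$n" ]; do
    if [ "$i" -lt "$extra" ]; then
      echo $((base + 1))
    else
      echo "$base"
    fi
    i=$((i + 1))
  done
}

# Example: a 40-second narrative becomes three generations.
plan_chain 40
```

For a 40-second target this prints 14, 13 and 13. Even splitting avoids a very short final clip, for example 11, 10 and 10 for 31 seconds rather than 15, 15 and 1.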

* FAQ

About Wan v2.6 · Text to Video

What is Wan v2.6 text-to-video and how does it generate video from text?

Wan v2.6 text-to-video is Alibaba's latest-generation text-to-video model. It generates high-quality video clips directly from natural-language descriptions, using a diffusion-based architecture with improved motion modeling to produce temporally coherent, visually detailed videos across diverse scenes, styles, and subject matter.