WAN-2.7
Wan 2.7 Text-to-Video generates high-quality videos from text prompts with optional audio synchronization, auto-generated background music, and intelligent prompt enhancement.
Avg Run Time: 200s
Model Slug: alibaba-wan-2-7-text-to-video
Release Date: April 3, 2026
API & SDK
Create a Prediction
Send a POST request to create a new prediction. This will return a prediction ID that you'll use to check the result. The request should include your model inputs and API key.
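A minimal sketch of the create step in Python, assuming a generic REST shape: the base URL, header name, input fields, and response key below are illustrative assumptions rather than documented values (only the model slug comes from this page).

```python
import requests

API_KEY = "YOUR_API_KEY"

# Hypothetical endpoint and payload shape -- check the each::labs API
# reference for the exact request format.
url = "https://api.eachlabs.ai/v1/predictions"
payload = {
    "model": "alibaba-wan-2-7-text-to-video",
    "input": {
        "prompt": "A serene mountain landscape at sunset with a flowing river, "
                  "drone shot ascending, orchestral background music",
        "duration": 10,          # seconds; T2V supports 2-15
        "resolution": "1080p",
        "aspect_ratio": "16:9",
    },
}

resp = requests.post(url, json=payload, headers={"X-API-Key": API_KEY})
resp.raise_for_status()
prediction_id = resp.json()["id"]  # assumed response field; keep it for polling
print("Prediction ID:", prediction_id)
```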
Get Prediction Result
Poll the prediction endpoint with the prediction ID until the result is ready. The API is asynchronous, so you'll need to check repeatedly until you receive a success status.
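A matching polling sketch, continuing from the create step above (`prediction_id` and `API_KEY` carry over); the status values and output field are likewise assumptions:

```python
import time

import requests

result_url = f"https://api.eachlabs.ai/v1/predictions/{prediction_id}"

while True:
    data = requests.get(result_url, headers={"X-API-Key": API_KEY}).json()
    status = data.get("status")
    if status == "success":
        print("Video URL:", data["output"])  # assumed output field
        break
    if status in ("failed", "canceled"):
        raise RuntimeError(f"Prediction ended with status: {status}")
    time.sleep(5)  # avg run time is ~200s, so expect dozens of polls
```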
Readme
Overview
Alibaba | Wan 2.7 | Text to Video generates high-quality 1080p videos from text prompts, supporting durations up to 15 seconds with native audio synchronization and multi-reference capabilities. Developed by Alibaba as part of the Wan AI family, this model excels in text-to-video (T2V), image-to-video (I2V), and reference-to-video (R2V) workflows, distinguishing itself through support for up to 5 simultaneous references for complex multi-subject scenes and temporal feature transfer from source videos.
It addresses key challenges in video generation by enabling precise control over first and last frames, joint image-video-audio inputs for subject and voice cloning, and native 1080p output without upscaling artifacts. Ideal for creators needing professional-grade videos with consistent identity preservation and motion dynamics, Alibaba | Wan 2.7 | Text to Video powers efficient production on platforms like each::labs, streamlining workflows from concept to final clip.
Technical Specifications
- Resolution: Native 1080p across all generation and editing modes
- Max Duration: 2-15 seconds for T2V and I2V; 2-10 seconds for R2V
- Aspect Ratios: Flexible, including 16:9, 9:16, 1:1, 4:3, 3:4 (auto-matches input where applicable)
- Input Modalities: Text prompts, images (up to 5 references), videos, audio for synchronized control; supports real human inputs as first frames or references
- Output Formats: Video with native audio; 720p or 1080p options in editing modes
- Processing: Serverless deployment; T2V/I2V optimized for flexible duration control and multi-subject composition
- Architecture: Built on Wan family with temporal feature transfer for motion, camera, and effects preservation
These specs enable high-fidelity outputs suitable for professional use via the Alibaba | Wan 2.7 | Text to Video API.
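To make those limits concrete, here is a hedged reference-to-video input sketch; the field names are assumptions, while the reference count, duration, aspect-ratio, and resolution constraints come from the list above.

```python
# Hypothetical R2V input illustrating the spec limits above (field names assumed).
r2v_input = {
    "prompt": "Subject 1 from ref1 dances with Subject 2 from ref2 on a rooftop",
    "references": [  # up to 5 simultaneous image/video/audio references
        {"type": "image", "url": "https://example.com/ref1.png"},    # appearance
        {"type": "image", "url": "https://example.com/ref2.png"},    # appearance
        {"type": "video", "url": "https://example.com/motion.mp4"},  # motion source
        {"type": "audio", "url": "https://example.com/voice.wav"},   # voice cloning
    ],
    "duration": 8,            # R2V supports 2-10 s (T2V/I2V: 2-15 s)
    "aspect_ratio": "9:16",   # 16:9, 9:16, 1:1, 4:3, 3:4 supported
    "resolution": "1080p",    # native 1080p across modes
}
```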
Key Considerations
Before using Alibaba | Wan 2.7 | Text to Video, ensure prompts are detailed for optimal subject consistency, as multi-reference inputs (up to 5) demand clear descriptions to avoid blending issues. It shines in scenarios requiring precise motion transfer or voice synchronization, outperforming basic T2V models, but may require experimentation for complex physics simulations.
Access via each::labs provides serverless scaling without local setup, with cost-effective pricing around $1.60-$3.00 per million tokens. Best for short-form content like social media clips; for longer videos, chain generations. No open weights yet, so the cloud API is the primary access path; expect local deployment after Q2 2026.
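One way to chain generations for longer pieces is to seed each clip with the final frame of the previous one. A rough sketch under the same assumptions as the API examples above, plus an assumed I2V `first_frame_image` field and ffmpeg available locally:

```python
import subprocess
import time

import requests

API_KEY = "YOUR_API_KEY"
BASE = "https://api.eachlabs.ai/v1/predictions"  # hypothetical endpoint

def generate_clip(prompt, first_frame_url=None):
    """Create one prediction and poll until its video URL is ready."""
    inputs = {"prompt": prompt, "duration": 10, "resolution": "1080p"}
    if first_frame_url:
        inputs["first_frame_image"] = first_frame_url  # assumed I2V field name
    body = {"model": "alibaba-wan-2-7-text-to-video", "input": inputs}
    rid = requests.post(BASE, json=body, headers={"X-API-Key": API_KEY}).json()["id"]
    while True:
        data = requests.get(f"{BASE}/{rid}", headers={"X-API-Key": API_KEY}).json()
        if data["status"] == "success":
            return data["output"]  # assumed video URL field
        time.sleep(5)

def last_frame(video_file, out_png):
    """Extract the final frame with ffmpeg for use as the next first frame."""
    subprocess.run(
        ["ffmpeg", "-y", "-sseof", "-0.1", "-i", video_file,
         "-frames:v", "1", out_png],
        check=True,
    )
```

Download each finished clip, extract its last frame, host the frame somewhere the API can fetch it, and pass its URL as `first_frame_url` for the next segment; concatenate the clips afterwards.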
Tips & Tricks
For best results with Alibaba | Wan 2.7 | Text to Video, use structured prompts specifying subject actions, camera movement, and style: "A professional chef slicing vegetables in a modern kitchen, smooth panning shot from left to right, cinematic lighting, 1080p." Include references for identity lock—combine image for appearance and short video clip for motion.
Optimize parameters by setting first-frame images for I2V and enabling joint audio refs for voice cloning. For multi-subject scenes, number references in prompts like "Subject 1 from ref1 dances with Subject 2 from ref2." Test seeds for reproducibility and iterate with negative prompts to refine, e.g., "avoid blurry motion, distortion." Workflow: Generate keyframes via T2V, then extend with R2V for seamless sequences on each::labs.
Example: "Serene mountain landscape at sunset with flowing river, drone shot ascending, orchestral background music synced naturally."
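Putting these tips together, a hedged input sketch for a reproducible multi-subject run; `negative_prompt` and `seed` are assumed parameter names, not documented ones.

```python
# Hypothetical input combining the tips above (parameter names assumed).
tips_input = {
    "prompt": ("Subject 1 from ref1 dances with Subject 2 from ref2, "
               "smooth tracking shot from left to right, cinematic lighting"),
    "negative_prompt": "blurry motion, distortion",  # iterate to refine
    "seed": 42,                                      # fix for reproducible runs
    "references": [
        {"type": "image", "url": "https://example.com/ref1.png"},    # identity lock
        {"type": "video", "url": "https://example.com/motion.mp4"},  # motion transfer
    ],
}
```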
Capabilities
- Text-to-video generation up to 15s at 1080p with native audio
- Image-to-video with first/last frame control and 3x3 grid-to-video for multi-scene inputs
- Reference-to-video supporting up to 5 simultaneous image/video/audio refs for multi-subject compositions
- Joint subject+voice control via mixed inputs, preserving identity and speech patterns
- Temporal feature transfer: Copies motion, camera work, and effects from source videos
- Instruction-based video editing: Swap elements, backgrounds, or styles via text descriptions
- Real human inputs as references or first frames for natural appearance and motion
- Flexible aspect ratios and duration control across T2V, I2V, R2V modes
What Can I Use It For?
Content Creators: Produce YouTube intros with multi-subject action; e.g., "Two dancers performing synchronized routine from ref images, dynamic camera zoom, upbeat music." Leverages 5-ref support for precise choreography.
Marketers: Generate product demos via I2V: "Smartphone rotating on reflective surface from product photo ref, smooth 360 spin, professional voiceover synced." Uses temporal transfer for realistic motion.
Filmmakers: Storyboard extensions with R2V: "Extend actor scene from video ref, add fantasy background swap per instructions, maintain lip-sync." Ideal for first/last frame control in pre-vis.
Designers: Social media reels: "Fashion model walking runway from image refs, 9:16 vertical, trendy music auto-generated." Excels in grid-to-video for batch concepts on each::labs.
Things to Be Aware Of
Alibaba | Wan 2.7 | Text to Video has a steeper learning curve for multi-reference setups: mismatched references can cause identity drift or inconsistent motion. Physics simulation lags behind more advanced models, occasionally showing motion trails in fast-action scenes.
Common mistakes include vague prompts that lead to generic outputs; always specify timing and style. Resource needs are low via the API, but complex 5-reference jobs take longer to run. Edge cases such as extreme deformations or rapid cuts may produce artifacts; run short preview tests first.
Limitations
The 15-second maximum duration restricts long-form content; chain outputs for extensions. There is no 4K option yet; output is capped at 1080p. Physics and complex interactions underperform, with occasional motion trails. Open weights are pending, limiting local use. The model is not ideal for photorealistic humans without strong references, and text rendering in videos is unconfirmed.
Pricing
Pricing Type: Dynamic
Current Pricing: 1080P at $0.15/sec (default)
Pricing Rules
| Condition | Pricing |
|---|---|
| resolution matches "720P" | 720P pricing: $0.10/sec |
| default (active) | 1080P pricing: $0.15/sec |
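For example, a 10-second clip at 1080P costs 10 × $0.15 = $1.50, while the same clip rendered at 720P costs 10 × $0.10 = $1.00.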