WAN-2.7
Wan 2.7 Text-to-Video generates high-quality videos from text prompts with optional audio synchronization, auto-generated background music, and intelligent prompt enhancement.
Avg Run Time: 200s
Model Slug: alibaba-wan-2-7-text-to-video
Release Date: April 3, 2026
API & SDK
Create a Prediction
Send a POST request to create a new prediction. This will return a prediction ID that you'll use to check the result. The request should include your model inputs and API key.
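A minimal sketch of the create step in Python, assuming a generic REST shape: the base URL, header name, input fields, and response key below are illustrative assumptions rather than documented values (only the model slug comes from this page).

```python
import requests

API_KEY = "YOUR_API_KEY"

# Hypothetical endpoint and payload shape -- check the each::labs API
# reference for the exact request format.
url = "https://api.eachlabs.ai/v1/predictions"
payload = {
    "model": "alibaba-wan-2-7-text-to-video",
    "input": {
        "prompt": "A serene mountain landscape at sunset with a flowing river, "
                  "drone shot ascending, orchestral background music",
        "duration": 10,          # seconds; T2V supports 2-15
        "resolution": "1080p",
        "aspect_ratio": "16:9",
    },
}

resp = requests.post(url, json=payload, headers={"X-API-Key": API_KEY})
resp.raise_for_status()
prediction_id = resp.json()["id"]  # assumed response field; keep it for polling
print("Prediction ID:", prediction_id)
```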
Get Prediction Result
Poll the prediction endpoint with the prediction ID until the result is ready. The API is asynchronous, so you'll need to check repeatedly until you receive a success status.
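A matching polling sketch, continuing from the create step above (`prediction_id` and `API_KEY` carry over); the status values and output field are likewise assumptions:

```python
import time

import requests

result_url = f"https://api.eachlabs.ai/v1/predictions/{prediction_id}"

while True:
    data = requests.get(result_url, headers={"X-API-Key": API_KEY}).json()
    status = data.get("status")
    if status == "success":
        print("Video URL:", data["output"])  # assumed output field
        break
    if status in ("failed", "canceled"):
        raise RuntimeError(f"Prediction ended with status: {status}")
    time.sleep(5)  # avg run time is ~200s, so expect dozens of polls
```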
Readme
Overview
Alibaba | Wan 2.7 | Text to Video generates high-quality 1080p videos from text prompts, supporting durations up to 15 seconds with native audio synchronization and multi-reference capabilities. Developed by Alibaba as part of the Wan AI family, this model excels in text-to-video (T2V), image-to-video (I2V), and reference-to-video (R2V) workflows, distinguishing itself through support for up to 5 simultaneous references for complex multi-subject scenes and temporal feature transfer from source videos.
It addresses key challenges in video generation by enabling precise control over first and last frames, joint image-video-audio inputs for subject and voice cloning, and native 1080p output without upscaling artifacts. Ideal for creators needing professional-grade videos with consistent identity preservation and motion dynamics, Alibaba | Wan 2.7 | Text to Video powers efficient production on platforms like each::labs, streamlining workflows from concept to final clip.
Technical Specifications
- Resolution: Native 1080p across all generation and editing modes
- Max Duration: 2-15 seconds for T2V and I2V; 2-10 seconds for R2V
- Aspect Ratios: Flexible, including 16:9, 9:16, 1:1, 4:3, 3:4 (auto-matches input where applicable)
- Input Modalities: Text prompts, images (up to 5 references), videos, audio for synchronized control; supports real human inputs as first frames or references
- Output Formats: Video with native audio; 720p or 1080p options in editing modes
- Processing: Serverless deployment; T2V/I2V optimized for flexible duration control and multi-subject composition
- Architecture: Built on Wan family with temporal feature transfer for motion, camera, and effects preservation
These specs enable high-fidelity outputs suitable for professional use via the Alibaba | Wan 2.7 | Text to Video API.
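To make those limits concrete, here is a hedged reference-to-video input sketch; the field names are assumptions, while the reference count, duration, aspect-ratio, and resolution constraints come from the list above.

```python
# Hypothetical R2V input illustrating the spec limits above (field names assumed).
r2v_input = {
    "prompt": "Subject 1 from ref1 dances with Subject 2 from ref2 on a rooftop",
    "references": [  # up to 5 simultaneous image/video/audio references
        {"type": "image", "url": "https://example.com/ref1.png"},    # appearance
        {"type": "image", "url": "https://example.com/ref2.png"},    # appearance
        {"type": "video", "url": "https://example.com/motion.mp4"},  # motion source
        {"type": "audio", "url": "https://example.com/voice.wav"},   # voice cloning
    ],
    "duration": 8,            # R2V supports 2-10 s (T2V/I2V: 2-15 s)
    "aspect_ratio": "9:16",   # 16:9, 9:16, 1:1, 4:3, 3:4 supported
    "resolution": "1080p",    # native 1080p across modes
}
```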
Key Considerations
Before using Alibaba | Wan 2.7 | Text to Video, ensure prompts are detailed for optimal subject consistency, as multi-reference inputs (up to 5) demand clear descriptions to avoid blending issues. It shines in scenarios requiring precise motion transfer or voice synchronization, outperforming basic T2V models, but may require experimentation for complex physics simulations.
Access via each::labs provides serverless scaling without local setup, with cost-effective pricing around $1.60-$3.00 per million tokens. Best for short-form content like social media clips; for longer videos, chain generations. No open weights yet, so the cloud API is the primary access path; expect local deployment after Q2 2026.
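One way to chain generations for longer pieces is to seed each clip with the final frame of the previous one. A rough sketch under the same assumptions as the API examples above, plus an assumed I2V `first_frame_image` field and ffmpeg available locally:

```python
import subprocess
import time

import requests

API_KEY = "YOUR_API_KEY"
BASE = "https://api.eachlabs.ai/v1/predictions"  # hypothetical endpoint

def generate_clip(prompt, first_frame_url=None):
    """Create one prediction and poll until its video URL is ready."""
    inputs = {"prompt": prompt, "duration": 10, "resolution": "1080p"}
    if first_frame_url:
        inputs["first_frame_image"] = first_frame_url  # assumed I2V field name
    body = {"model": "alibaba-wan-2-7-text-to-video", "input": inputs}
    rid = requests.post(BASE, json=body, headers={"X-API-Key": API_KEY}).json()["id"]
    while True:
        data = requests.get(f"{BASE}/{rid}", headers={"X-API-Key": API_KEY}).json()
        if data["status"] == "success":
            return data["output"]  # assumed video URL field
        time.sleep(5)

def last_frame(video_file, out_png):
    """Extract the final frame with ffmpeg for use as the next first frame."""
    subprocess.run(
        ["ffmpeg", "-y", "-sseof", "-0.1", "-i", video_file,
         "-frames:v", "1", out_png],
        check=True,
    )
```

Download each finished clip, extract its last frame, host the frame somewhere the API can fetch it, and pass its URL as `first_frame_url` for the next segment; concatenate the clips afterwards.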
Tips & Tricks
For best results with Alibaba | Wan 2.7 | Text to Video, use structured prompts specifying subject actions, camera movement, and style: "A professional chef slicing vegetables in a modern kitchen, smooth panning shot from left to right, cinematic lighting, 1080p." Include references for identity lock—combine image for appearance and short video clip for motion.
Optimize parameters by setting first-frame images for I2V and enabling joint audio refs for voice cloning. For multi-subject scenes, number references in prompts like "Subject 1 from ref1 dances with Subject 2 from ref2." Test seeds for reproducibility and iterate with negative prompts to refine, e.g., "avoid blurry motion, distortion." Workflow: Generate keyframes via T2V, then extend with R2V for seamless sequences on each::labs.
Example: "Serene mountain landscape at sunset with flowing river, drone shot ascending, orchestral background music synced naturally."
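Putting these tips together, a hedged input sketch for a reproducible multi-subject run; `negative_prompt` and `seed` are assumed parameter names, not documented ones.

```python
# Hypothetical input combining the tips above (parameter names assumed).
tips_input = {
    "prompt": ("Subject 1 from ref1 dances with Subject 2 from ref2, "
               "smooth tracking shot from left to right, cinematic lighting"),
    "negative_prompt": "blurry motion, distortion",  # iterate to refine
    "seed": 42,                                      # fix for reproducible runs
    "references": [
        {"type": "image", "url": "https://example.com/ref1.png"},    # identity lock
        {"type": "video", "url": "https://example.com/motion.mp4"},  # motion transfer
    ],
}
```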
Capabilities
- Text-to-video generation up to 15s at 1080p with native audio
- Image-to-video with first/last frame control and 3x3 grid-to-video for multi-scene inputs
- Reference-to-video supporting up to 5 simultaneous image/video/audio refs for multi-subject compositions
- Joint subject+voice control via mixed inputs, preserving identity and speech patterns
- Temporal feature transfer: Copies motion, camera work, and effects from source videos
- Instruction-based video editing: Swap elements, backgrounds, or styles via text descriptions
- Real human inputs as references or first frames for natural appearance and motion
- Flexible aspect ratios and duration control across T2V, I2V, R2V modes
What Can I Use It For?
Content Creators: Produce YouTube intros with multi-subject action; e.g., "Two dancers performing synchronized routine from ref images, dynamic camera zoom, upbeat music." Leverages 5-ref support for precise choreography.
Marketers: Generate product demos via I2V: "Smartphone rotating on reflective surface from product photo ref, smooth 360 spin, professional voiceover synced." Uses temporal transfer for realistic motion.
Filmmakers: Storyboard extensions with R2V: "Extend actor scene from video ref, add fantasy background swap per instructions, maintain lip-sync." Ideal for first/last frame control in pre-vis.
Designers: Social media reels: "Fashion model walking runway from image refs, 9:16 vertical, trendy music auto-generated." Excels in grid-to-video for batch concepts on each::labs.
Things to Be Aware Of
Alibaba | Wan 2.7 | Text to Video has a steeper learning curve for multi-reference setups: mismatched references can cause identity drift or inconsistent motion. Physics simulation lags behind more advanced models, occasionally showing motion trails in fast-action scenes.
Common mistakes include vague prompts that lead to generic outputs; always specify timing and style. Resource needs are low via the API, but complex 5-reference jobs take longer to run. Edge cases such as extreme deformations or rapid cuts may produce artifacts; run short preview tests first.
Limitations
The 15-second maximum duration restricts long-form content; chain outputs for extensions. There is no 4K option yet; output is capped at 1080p. Physics and complex interactions underperform, with occasional motion trails. Open weights are pending, limiting local use. The model is not ideal for photorealistic humans without strong references, and text rendering in videos is unconfirmed.
Pricing
Pricing Type: Dynamic
Current Pricing: 1080P at $0.15/sec (default)
Pricing Rules
| Condition | Pricing |
|---|---|
| resolution matches "720P" | 720P pricing: $0.10/sec |
| default (active) | 1080P pricing: $0.15/sec |
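For example, a 10-second clip at 1080P costs 10 × $0.15 = $1.50, while the same clip rendered at 720P costs 10 × $0.10 = $1.00.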