WAN-2.7

Wan 2.7 Text-to-Video generates high-quality videos from text prompts with optional audio synchronization, auto-generated background music, and intelligent prompt enhancement.

Avg Run Time: 200.000s

Model Slug: alibaba-wan-2-7-text-to-video

Release Date: April 3, 2026


API & SDK

Create a Prediction

Send a POST request to create a new prediction. This will return a prediction ID that you'll use to check the result. The request should include your model inputs and API key.
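The request described above can be sketched in Python using only the standard library. This is a minimal illustration, not the official SDK: the endpoint URL, the `X-API-Key` header name, the input field names, and the `predictionID` response key are assumptions to verify against the Eachlabs API reference; the model slug is taken from this page.

```python
import json
import urllib.request

# NOTE: endpoint URL, header name, and payload/response fields below are
# illustrative assumptions -- check the Eachlabs API reference for the
# exact contract. Only the model slug is taken from this page.
API_BASE = "https://api.eachlabs.ai/v1"


def build_payload(prompt: str, duration: int = 5, resolution: str = "1080P") -> dict:
    """Assemble the model inputs described on this page."""
    return {
        "model": "alibaba-wan-2-7-text-to-video",  # slug from this page
        "input": {
            "prompt": prompt,
            "duration": duration,      # 2-15 seconds for T2V/I2V
            "resolution": resolution,  # "720P" or "1080P"
        },
    }


def create_prediction(api_key: str, payload: dict) -> str:
    """POST the payload and return the prediction ID used for polling."""
    req = urllib.request.Request(
        f"{API_BASE}/prediction",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-API-Key": api_key},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)["predictionID"]
```

Keep the returned prediction ID; the result endpoint in the next step is keyed on it.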

Get Prediction Result

Poll the prediction endpoint with the prediction ID until the result is ready. Generation is asynchronous, so you'll need to check repeatedly until you receive a success status.
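The polling loop might look like the sketch below. The "success" status comes from the description above; the other terminal status strings are assumptions, and `fetch` stands in for whatever HTTP client call GETs the prediction endpoint and returns the decoded JSON body.

```python
import time


def wait_for_result(fetch, prediction_id: str,
                    interval: float = 5.0, timeout: float = 600.0) -> dict:
    """Repeatedly check a prediction until it reaches a terminal status.

    `fetch(prediction_id)` must return the decoded JSON body of the
    prediction endpoint. Status strings other than "success" are
    illustrative assumptions about the API's error states.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = fetch(prediction_id)
        status = result.get("status")
        if status == "success":
            return result
        if status in ("error", "failed", "canceled"):
            raise RuntimeError(f"prediction {prediction_id} ended with {status!r}")
        time.sleep(interval)  # back off between checks
    raise TimeoutError(f"prediction {prediction_id} not ready after {timeout}s")
```

With the average run time around 200 seconds, a 5-second interval and a generous timeout are reasonable defaults.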

Readme

Table of Contents
Overview
Technical Specifications
Key Considerations
Tips & Tricks
Capabilities
What Can I Use It For?
Things to Be Aware Of
Limitations

Overview

Alibaba | Wan 2.7 | Text to Video generates high-quality 1080p videos from text prompts, supporting durations up to 15 seconds with native audio synchronization and multi-reference capabilities. Developed by Alibaba as part of the Wan AI family, this model excels in text-to-video (T2V), image-to-video (I2V), and reference-to-video (R2V) workflows, distinguishing itself through support for up to 5 simultaneous references for complex multi-subject scenes and temporal feature transfer from source videos.

It addresses key challenges in video generation by enabling precise control over first and last frames, joint image-video-audio inputs for subject and voice cloning, and native 1080p output without upscaling artifacts. Ideal for creators needing professional-grade videos with consistent identity preservation and motion dynamics, Alibaba | Wan 2.7 | Text to Video powers efficient production on platforms like each::labs, streamlining workflows from concept to final clip.

Technical Specifications

  • Resolution: Native 1080p across all generation and editing modes
  • Max Duration: 2-15 seconds for T2V and I2V; 2-10 seconds for R2V
  • Aspect Ratios: Flexible, including 16:9, 9:16, 1:1, 4:3, 3:4 (auto-matches input where applicable)
  • Input Modalities: Text prompts, images (up to 5 references), videos, audio for synchronized control; supports real human inputs as first frames or references
  • Output Formats: Video with native audio; 720p or 1080p options in editing modes
  • Processing: Serverless deployment; T2V/I2V optimized for flexible duration control and multi-subject composition
  • Architecture: Built on Wan family with temporal feature transfer for motion, camera, and effects preservation

These specs enable high-fidelity outputs suitable for professional use via the Alibaba | Wan 2.7 | Text to Video API.

Key Considerations

Before using Alibaba | Wan 2.7 | Text to Video, ensure prompts are detailed for optimal subject consistency, as multi-reference inputs (up to 5) demand clear descriptions to avoid blending issues. It shines in scenarios requiring precise motion transfer or voice synchronization, outperforming basic T2V models, but may require experimentation for complex physics simulations.

Access via each::labs provides serverless scaling without local setup, with per-second pricing ($0.10/sec at 720p, $0.15/sec at 1080p). Best for short-form content like social media clips; for longer videos, chain generations. No open weights yet, so the cloud API is the primary access path; local deployment is expected after Q2 2026.

Tips & Tricks

For best results with Alibaba | Wan 2.7 | Text to Video, use structured prompts specifying subject actions, camera movement, and style: "A professional chef slicing vegetables in a modern kitchen, smooth panning shot from left to right, cinematic lighting, 1080p." Include references for identity lock—combine image for appearance and short video clip for motion.

Optimize parameters by setting first-frame images for I2V and enabling joint audio refs for voice cloning. For multi-subject scenes, number references in prompts like "Subject 1 from ref1 dances with Subject 2 from ref2." Test seeds for reproducibility and iterate with negative prompts to refine, e.g., "avoid blurry motion, distortion." Workflow: Generate keyframes via T2V, then extend with R2V for seamless sequences on each::labs.

Example: "Serene mountain landscape at sunset with flowing river, drone shot ascending, orchestral background music synced naturally."
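Putting these tips together, a request input might look like the sketch below. Every field name other than the prompt text is an assumption to check against the model's input schema on each::labs before relying on it.

```python
# Illustrative input combining the tips above: structured prompt,
# negative prompt, fixed seed for reproducibility. Field names other
# than the prompt text are assumptions -- verify against the schema.
example_input = {
    "prompt": (
        "Serene mountain landscape at sunset with flowing river, "
        "drone shot ascending, orchestral background music synced naturally"
    ),
    "negative_prompt": "blurry motion, distortion",
    "duration": 10,           # seconds, within the 2-15s T2V range
    "resolution": "1080P",
    "aspect_ratio": "16:9",
    "seed": 42,               # fix the seed to reproduce a result while iterating
}
```

Changing only the seed while holding the rest constant is a cheap way to explore variations of a prompt you like.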

Capabilities

  • Text-to-video generation up to 15s at 1080p with native audio
  • Image-to-video with first/last frame control and 3x3 grid-to-video for multi-scene inputs
  • Reference-to-video supporting up to 5 simultaneous image/video/audio refs for multi-subject compositions
  • Joint subject+voice control via mixed inputs, preserving identity and speech patterns
  • Temporal feature transfer: Copies motion, camera work, and effects from source videos
  • Instruction-based video editing: Swap elements, backgrounds, or styles via text descriptions
  • Real human inputs as references or first frames for natural appearance and motion
  • Flexible aspect ratios and duration control across T2V, I2V, R2V modes

What Can I Use It For?

Content Creators: Produce YouTube intros with multi-subject action; e.g., "Two dancers performing synchronized routine from ref images, dynamic camera zoom, upbeat music." Leverages 5-ref support for precise choreography.

Marketers: Generate product demos via I2V: "Smartphone rotating on reflective surface from product photo ref, smooth 360 spin, professional voiceover synced." Uses temporal transfer for realistic motion.

Filmmakers: Storyboard extensions with R2V: "Extend actor scene from video ref, add fantasy background swap per instructions, maintain lip-sync." Ideal for first/last frame control in pre-vis.

Designers: Social media reels: "Fashion model walking runway from image refs, 9:16 vertical, trendy music auto-generated." Excels in grid-to-video for batch concepts on each::labs.

Things to Be Aware Of

Alibaba | Wan 2.7 | Text to Video has a steeper learning curve for multi-ref setups—mismatched references can cause identity drift or inconsistent motion. Physics simulation lags behind more advanced models, occasionally producing motion trails in fast-action scenes.

Common mistakes include vague prompts leading to generic outputs; always specify timing and style. Resource needs are low via API, but complex 5-ref jobs take longer. Edge cases like extreme deformations or rapid cuts may artifact; preview short tests first.

Limitations

Max 15s duration restricts long-form content; chain outputs for extensions. No 4K yet—capped at 1080p. Physics and complex interactions underperform, with occasional motion trails. Open weights pending, limiting local use. Not ideal for photorealistic humans without strong refs; text rendering in videos unconfirmed.

Pricing

Pricing Type: Dynamic

1080P pricing: $0.15/sec (default)


Pricing Rules

| Condition | Pricing |
| --- | --- |
| resolution matches "720P" | 720P pricing: $0.10/sec |
| Rule 2 (active) | 1080P pricing: $0.15/sec (default) |
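These rules reduce to simple per-second arithmetic: rate times duration. A small helper (the rates come from the rules above; the function name is our own) makes estimates explicit:

```python
# Per-second rates from the pricing rules on this page.
RATES = {"720P": 0.10, "1080P": 0.15}


def estimate_cost(duration_s: float, resolution: str = "1080P") -> float:
    """Estimated charge in USD for one generation at the given resolution."""
    return round(RATES[resolution] * duration_s, 2)
```

For example, a 7-second 1080p clip comes to 7 × $0.15 = $1.05, and a 10-second 720p clip to 10 × $0.10 = $1.00.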