WAN-2.7
Wan 2.7 Image-to-Video generates high-quality videos from a single image with optional last-frame control, offering guided motion, audio synchronization, and intelligent prompt enhancement.
Avg Run Time: 200s
Model Slug: alibaba-wan-2-7-image-to-video
Release Date: April 3, 2026
API & SDK
Create a Prediction
Send a POST request to create a new prediction. This will return a prediction ID that you'll use to check the result. The request should include your model inputs and API key.
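Below is a minimal sketch of the create step in Python. The base URL, endpoint path, X-API-Key header, and payload field names are all assumptions for illustration; confirm the exact request shape in the each::labs API reference before use.

```python
import requests

API_KEY = "YOUR_EACHLABS_API_KEY"        # placeholder; use your real key
BASE_URL = "https://api.eachlabs.ai/v1"  # assumed base URL

# Assumed payload shape: model slug plus the model's inputs.
payload = {
    "model": "alibaba-wan-2-7-image-to-video",
    "input": {
        "image_url": "https://example.com/keyframe.jpg",  # hypothetical input field
        "prompt": "slow camera push-in, ambient city sounds",
    },
}

resp = requests.post(
    f"{BASE_URL}/prediction/",      # assumed endpoint path
    json=payload,
    headers={"X-API-Key": API_KEY},  # assumed auth header
    timeout=30,
)
resp.raise_for_status()
prediction_id = resp.json()["predictionID"]  # assumed response field
print("Prediction ID:", prediction_id)
```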
Get Prediction Result
Poll the prediction endpoint with the prediction ID until the result is ready. The API uses long-polling, so you'll need to repeatedly check until you receive a success status.
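A matching polling sketch follows, under the same assumptions as above; the terminal status names ("success" and "error") are also guesses, so verify them against the actual API responses.

```python
import time
import requests

API_KEY = "YOUR_EACHLABS_API_KEY"        # placeholder; same key as above
BASE_URL = "https://api.eachlabs.ai/v1"  # assumed base URL

def wait_for_result(prediction_id: str, interval: float = 5.0, timeout: float = 600.0) -> dict:
    """Poll the prediction endpoint until a terminal status or timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        resp = requests.get(
            f"{BASE_URL}/prediction/{prediction_id}",  # assumed endpoint path
            headers={"X-API-Key": API_KEY},            # assumed auth header
            timeout=30,
        )
        resp.raise_for_status()
        data = resp.json()
        status = data.get("status")
        if status == "success":   # assumed terminal status name
            return data           # expected to contain the output video URL
        if status == "error":     # assumed failure status name
            raise RuntimeError(f"Prediction failed: {data}")
        time.sleep(interval)
    raise TimeoutError(f"Prediction {prediction_id} did not finish within {timeout}s")

result = wait_for_result(prediction_id="YOUR_PREDICTION_ID")
```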
Readme
Overview
Alibaba | Wan 2.7 | Image to Video transforms a single static image into dynamic, high-quality video clips, solving the challenge of adding realistic motion and audio to visuals without complex editing tools. Developed by Alibaba Tongyi Lab as part of the advanced Wan 2.7 family, this model stands out with its support for guided motion control, including first/last frame options, and embedded audio synchronization for cinematic results. Users provide an input image and optional prompts to generate videos of 15-30 seconds at 1080p or higher resolutions, making it ideal for creators who want professional-grade output directly on each::labs (eachlabs.ai). The Alibaba | Wan 2.7 | Image to Video API enables seamless integration for developers building multimodal applications.
Technical Specifications
- Resolution Support: Native 1080p HD, up to 4K cinematic fidelity in advanced modes (e.g., 2048×2048 or 4096×4096 for related image tasks).
- Max Duration: 15-30 seconds per generation, extending beyond previous Wan models' 5-10 second limits.
- Aspect Ratios: Flexible, including custom dimensions like 1920×1080; supports widescreen cinematic formats.
- Input Formats: Single image input with text prompt (up to 5,000 characters); optional multi-reference images (up to 9) for control.
- Output Formats: Video with native audio; MP4 or similar standard video files.
- Processing Time: Efficient rendering via Diffusion Transformer architecture with T5 encoder and MoE routing; near-instant scaling on cloud infrastructure.
- Architecture: Video diffusion model with synchronous audio-visual Flow Matching for enhanced speed and quality.
Key Considerations
Before using Alibaba | Wan 2.7 | Image to Video on each::labs (eachlabs.ai), ensure your input image is high-resolution for optimal motion transfer. This model excels in scenarios requiring precise frame control, such as extending static shots into narrated scenes, where basic text-to-video alternatives fall short. Processing favors cloud deployment due to high compute needs, balancing cost with output quality; expect credits-based pricing starting around $10 for substantial usage. Developers integrating the Alibaba | Wan 2.7 | Image to Video API should account for prompt length limits and enable thinking mode for complex edits. It is best suited to professional workflows where audio sync and duration matter more than ultra-short clips.
Tips & Tricks
Optimize prompts for Alibaba | Wan 2.7 | Image to Video by specifying motion direction, speed, and audio cues explicitly, leveraging its contextual command processing. Use "first frame: [describe input image], last frame: [target pose], smooth camera pan right with ambient forest sounds" to guide transitions precisely. Enable thinking mode for better reasoning on intricate scenes, and experiment with multi-image references (up to 9) for style-consistent animations. Set seed values for reproducible results during iteration. For longer videos, break prompts into sequential generations with endpoint anchors.
Example prompts:
- "Animate this portrait with gentle head turn left, smiling expression, soft orchestral background music rising to crescendo."
- "Convert landscape photo to flying drone shot over mountains at sunset, wind sounds and eagle calls synchronized."
- "Image to video: character walks forward from static pose, rain falling, thunder audio effects building tension."
Combine with each::labs (eachlabs.ai) workflows for rapid prototyping.
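To make these tips concrete, here is a hypothetical input dictionary combining first/last frame cues, an audio description, and a fixed seed. Every field name is an assumption for illustration, not the confirmed schema; check the model's input reference for the real parameters.

```python
# Hypothetical input illustrating the tips above; all field names are assumptions.
video_input = {
    "image_url": "https://example.com/forest-keyframe.jpg",
    "prompt": (
        "first frame: hiker pauses at the treeline, "
        "last frame: hiker steps into sunlight, "
        "smooth camera pan right with ambient forest sounds"
    ),
    "resolution": "1080P",  # 720P also supported (see Pricing)
    "duration": 15,         # seconds, within the model's 15-30s range
    "seed": 42,             # fixed seed for reproducible iteration
}
```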
Capabilities
- Generates high-quality videos from a single input image with realistic motion dynamics up to 15-30 seconds.
- Supports first/last frame control for precise guided motion and endpoint anchoring.
- Includes native audio synchronization with embedded scene acoustics like ambient sounds or music.
- Handles multi-reference inputs (up to 9 images) for 9-grid multi-scene video composition.
- Offers instruction-based editing via Diffusion Transformer for text-driven adjustments.
- Delivers 1080p to 4K resolutions with flexible aspect ratios and custom dimensions.
- Features subject and voice cloning integration for consistent character animation.
- Supports contextual prompt enhancement with T5 encoder for complex commands.
What Can I Use It For?
Content Creators: Filmmakers can animate storyboards by inputting a keyframe image with prompts like "first frame: hero stands ready, last frame: draws sword dramatically, epic orchestral score swells," producing 15-second clips with synced audio for quick edits.
Marketers: Agencies generate product demo videos from a static photo, using multi-reference grids: "Pan around smartphone from top view, highlight features with voiceover narration," ideal for social media ads with native sound.
Developers: Build interactive apps via the Alibaba | Wan 2.7 | Image to Video API on each::labs (eachlabs.ai), feeding user-uploaded images into prompts like "animate avatar with custom gesture sequence and speech audio" to power personalized virtual assistants.
Designers: Animate UI mockups with "transition static wireframe to interactive prototype, subtle click sounds and hover effects," leveraging instruction editing for precise motion control in presentation reels.
Things to Be Aware Of
Alibaba | Wan 2.7 | Image to Video performs best with clear, high-contrast input images; blurry sources lead to motion artifacts. Complex physics simulations, like rapid object interactions, may show trails compared to specialized models. Users often overlook prompt specificity—vague descriptions yield generic motion. Resource-intensive for local runs; rely on cloud via each::labs (eachlabs.ai) to avoid GPU overload. Steeper learning curve for multi-frame control, so test short clips first. Audio sync shines in ambient scenes but requires descriptive cues for dialogue-heavy outputs.
Limitations
Alibaba | Wan 2.7 | Image to Video caps at 15-30 seconds, unsuitable for full-length productions. Resolution tops out at 1080p standard, with 4K limited to pro modes or image tasks. It struggles with hyper-realistic physics in fast-action scenes, producing occasional artifacts. No open weights have been released yet; access is cloud-only via APIs such as each::labs (eachlabs.ai). Input is restricted to a maximum of 9 reference images, and prompts longer than 5,000 characters are unsupported.
Pricing
Pricing Type: Dynamic
Current Pricing: 1080P at $0.15/sec (default)
Pricing Rules

| Condition | Pricing |
|---|---|
| resolution matches "720P" | $0.10/sec (720P) |
| otherwise (default) | $0.15/sec (1080P) |
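For example, a 20-second clip rendered at 1080P costs 20 × $0.15 = $3.00, while the same clip at 720P costs 20 × $0.10 = $2.00.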