Z-IMAGE
Generates images from text and reference images using Tongyi-MAI's ultra-fast 6B-parameter Z-Image Turbo model, delivering high-quality visual results.
Avg Run Time: 10.000s
Model Slug: z-image-turbo-image-to-image
Release Date: December 8, 2025
API & SDK
Create a Prediction
Send a POST request to create a new prediction. This will return a prediction ID that you'll use to check the result. The request should include your model inputs and API key.
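As a sketch, the create-prediction call might look like the following Python. The endpoint URL and the field names (`model`, `input`, `prompt`, `image_url`) are illustrative assumptions, not the provider's documented schema:

```python
import json

# Hypothetical endpoint -- substitute the provider's actual prediction URL.
API_URL = "https://api.example.com/v1/predictions"

def build_create_request(api_key: str, prompt: str, image_url: str) -> dict:
    """Assemble the POST request for a new z-image-turbo-image-to-image prediction."""
    return {
        "url": API_URL,
        "headers": {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        "body": json.dumps({
            "model": "z-image-turbo-image-to-image",
            "input": {"prompt": prompt, "image_url": image_url},
        }),
    }

# Sending it would use an HTTP client, e.g. the `requests` package:
#   import requests
#   req = build_create_request("YOUR_API_KEY", "a red bicycle", "https://example.com/ref.png")
#   resp = requests.post(req["url"], headers=req["headers"], data=req["body"])
#   prediction_id = resp.json()["id"]   # assumed response shape
```

The returned prediction ID is what you pass to the polling step below.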
Get Prediction Result
Poll the prediction endpoint with the prediction ID until the result is ready. The API uses long-polling, so you'll need to repeatedly check until you receive a success status.
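The polling loop can be sketched as a small helper. The status values (`success`, `failed`) and the idea of a GET-by-ID endpoint are assumptions to adapt to the actual API; `fetch` is injected so the loop itself stays transport-agnostic:

```python
import time

def poll_prediction(fetch, prediction_id: str,
                    interval_s: float = 1.0, timeout_s: float = 120.0) -> dict:
    """Repeatedly fetch the prediction until its status is terminal.

    `fetch` is any callable mapping a prediction ID to a status dict, e.g. a
    wrapper around GET <prediction endpoint>/{id} (endpoint shape assumed).
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        result = fetch(prediction_id)
        # "success"/"failed" are assumed terminal statuses -- check the API docs.
        if result.get("status") in ("success", "failed"):
            return result
        time.sleep(interval_s)  # back off between polls
    raise TimeoutError(f"prediction {prediction_id} not ready after {timeout_s}s")
```

Injecting `fetch` also makes the loop easy to exercise with a stub that returns "processing" a few times before "success".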
Readme
Overview
Z-Image-Turbo is a distilled version of the original Z-Image model developed by Alibaba's Tongyi-MAI lab. It is a lightweight 6B-parameter image generation model designed for ultra-fast text-to-image synthesis, achieving high-quality photorealistic results in as few as 9 sampling steps. Its speed and efficiency make it suitable for real-time workflows and local deployment on consumer hardware.
Key features include photorealistic image generation with refined lighting, clean textures, and strong composition, alongside accurate bilingual text rendering in English and Chinese. It incorporates advanced world knowledge and semantic reasoning for handling complex prompts and culturally grounded concepts. What sets it apart is the Scalable Single-Stream Multi-Modal Diffusion Transformer (S3-DiT) architecture, which processes text embeddings and noisy image latents as a single unified token sequence, enabling dense cross-modal interactions and superior performance at a compact scale compared to larger models.
This architecture, combined with a unique training strategy leveraging real-world data streams, allows Z-Image-Turbo to outperform previous state-of-the-art open-source models in speed and cost-effectiveness while maintaining competitive quality, as validated in benchmarks like Alibaba AI Arena.
Technical Specifications
- Architecture: Scalable Single-Stream Multi-Modal Diffusion Transformer (S3-DiT)
- Parameters: 6B
- Resolution: Not explicitly specified; supports high-fidelity photorealistic outputs
- Input/Output formats: Text prompts (plus optional reference images) to images; supports bilingual text rendering; available in FP8 and GGUF quantized variants as well as AIO and BF16 checkpoints
- Performance metrics: Generates 100 images in 279 seconds (4:39); ~9 seconds per image for 9-step inference on a 24GB GPU; sub-second latency potential on high-end GPUs; faster than competitors such as Flux.2 Dev (19:12 for 100 images) and Ovis-Image (8:28)
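For context, the quoted batch figures work out as follows. (The per-image batch time is much lower than the ~9 s single-image figure because batching amortizes per-run overhead.)

```python
# Convert the quoted batch times (100 images each) to per-image seconds and speedups.
def mmss_to_seconds(mm: int, ss: int) -> int:
    return mm * 60 + ss

z_image   = mmss_to_seconds(4, 39)    # 279 s
ovis      = mmss_to_seconds(8, 28)    # 508 s
flux2_dev = mmss_to_seconds(19, 12)   # 1152 s

per_image       = z_image / 100        # 2.79 s/image in batch mode
speedup_vs_ovis = ovis / z_image       # ~1.8x -- "nearly twice as fast"
speedup_vs_flux = flux2_dev / z_image  # ~4.1x
```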
Key Considerations
- Use minimal sampling steps (e.g., 9) for maximum speed, but increase to 20+ for higher detail in complex scenes
- Optimize VRAM usage with quantized versions like FP8 or GGUF to fit on 16-24GB consumer GPUs
- Balance quality and speed: lower steps prioritize rapidity but may reduce fine details compared to larger models
- Prompt with clear, descriptive language emphasizing style, lighting, and composition for best photorealism
- Avoid overly abstract or highly intricate prompts initially, as the model's distillation favors straightforward semantic understanding
- Test on local hardware to account for variability in inference time based on GPU and optimizations
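As a rough sanity check for the quantization advice above, the weight memory of a 6B-parameter model at common precisions can be estimated with back-of-envelope arithmetic. These figures cover weights only; real VRAM usage is higher (activations, text encoder, VAE, framework overhead), which is consistent with the near-24GB unoptimized usage reported below.

```python
# Approximate weight memory for a 6B-parameter model at common precisions.
PARAMS = 6e9
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "gguf_q4": 0.5}  # approximate

weights_gb = {fmt: PARAMS * b / 1e9 for fmt, b in BYTES_PER_PARAM.items()}
# bf16 ~ 12 GB, fp8 ~ 6 GB, 4-bit GGUF ~ 3 GB of weights alone
```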
Tips & Tricks
- Optimal parameter settings: 9-16 sampling steps, CFG scale 3-7, use BF16 or FP8 for speed on consumer GPUs
- Prompt structuring advice: Start with subject description, add style qualifiers (e.g., "photorealistic, sharp lighting"), specify bilingual text needs explicitly
- Achieve photorealism: Include terms like "clean textures, balanced composition, refined lighting" in prompts
- Iterative refinement: Generate initial low-step outputs, then upscale or refine with higher steps using the same seed
- Advanced techniques: Leverage S3-DiT for complex scenes by chaining prompts with semantic details (e.g., "culturally accurate Chinese festival scene with English signage"); experiment with GGUF workflows for macOS compatibility, resolving float8 conversion errors via custom nodes
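The parameter tips above can be collected into a single request input. The field names (`num_inference_steps`, `guidance_scale`, `seed`) are hypothetical and should be checked against the model's actual input schema:

```python
# Hypothetical input payload builder reflecting the tips above.
# Field names are assumptions, not the model's documented schema.
def make_input(prompt: str, fast: bool = True, seed=None) -> dict:
    return {
        "prompt": prompt,
        # 9-16 steps for speed; go past 20 when refining detail.
        "num_inference_steps": 9 if fast else 24,
        # CFG scale in the suggested 3-7 range.
        "guidance_scale": 4.0 if fast else 6.0,
        # Reuse the same seed across passes for iterative refinement.
        **({"seed": seed} if seed is not None else {}),
    }

draft  = make_input("photorealistic street market, refined lighting", seed=42)
refine = make_input("photorealistic street market, refined lighting", fast=False, seed=42)
```

Keeping the seed fixed while raising the step count mirrors the iterative-refinement tip: the draft and refine passes target the same composition at different detail levels.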
Capabilities
- Generates high-quality photorealistic images with excellent detail preservation at ultra-low latency
- Accurate bilingual text rendering in posters, graphics, and small fonts with proper alignment and typography
- Strong semantic reasoning and world knowledge for logical, culturally grounded outputs
- Versatile across styles, from realistic scenes to creative compositions, matching larger models in fidelity
- Efficient local inference on 16-24GB GPUs, enabling real-time generation
- Superior speed in benchmarks, nearly twice as fast as the next-fastest competitor in batch processing
What Can I Use It For?
- Rapid prototyping of visual concepts in creative workflows, as noted in reviews for quick iterations
- Generating photorealistic product visuals or marketing graphics with bilingual text support
- Local offline image creation for personal projects, highlighted in user tests on consumer GPUs
- Real-time applications like dynamic content generation, praised for subsecond latency potential
- High-volume batch processing, demonstrated in benchmarks producing 100 images in under 5 minutes
Things to Be Aware Of
- Runs efficiently on 24GB GPUs such as a mobile 5090; unoptimized VRAM usage approaches 24GB, while quantized versions bring it down to roughly 16GB
- Output quality closely matches leading models like Flux.2 Dev, but at far higher speed
- Common macOS issues include KSampler float8 conversion errors, resolvable with GGUF custom nodes
- Consistent high aesthetic quality in benchmarks, especially photorealism, but may lack ultra-fine details of massive models
- Positive feedback on speed and local runnability: "one of the fastest offline models I've seen" and "fantastic overall"
- Variability in generation time (e.g., 9 seconds to slightly longer for complex prompts) based on hardware and optimizations
Limitations
- The distilled design prioritizes speed over maximum detail, so it may underperform larger models on hyper-intricate scenes or the finest artistic detail
- Higher VRAM usage than expected without quantization (up to 24GB); may require optimizations for lower-end hardware
- Experimental quantized variants (FP8, GGUF) can encounter platform-specific errors like float8 issues on macOS