Z Image | Turbo | Image to Image


Generates images from text prompts and reference images using Tongyi-MAI’s 6B-parameter Z-Image Turbo model, delivering high-quality, photorealistic results at very low latency.

Avg Run Time: 10.0s

Model Slug: z-image-turbo-image-to-image

Release Date: December 8, 2025

Pricing: $0.005 per megapixel of output

API & SDK

Create a Prediction

Send a POST request to create a new prediction. This will return a prediction ID that you'll use to check the result. The request should include your model inputs and API key.
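Here is a minimal Python sketch of this step. The base URL, endpoint path, header name, and response field are assumptions modeled on typical prediction APIs, not confirmed values; check the Eachlabs API reference for the exact schema.

```python
# Minimal sketch of creating a prediction with the requests library.
import requests

API_KEY = "your-api-key"                 # placeholder; keep your real key secret
BASE_URL = "https://api.eachlabs.ai/v1"  # assumed base URL; verify in the docs

def create_prediction(inputs: dict) -> str:
    """POST the model inputs and return the prediction ID."""
    resp = requests.post(
        f"{BASE_URL}/prediction/",
        json={"model": "z-image-turbo-image-to-image", "input": inputs},
        headers={"X-API-Key": API_KEY},  # header name is an assumption
        timeout=30,
    )
    resp.raise_for_status()
    # The response field holding the ID is an assumption; inspect the
    # actual JSON the API returns.
    return resp.json()["predictionID"]
```

A call might look like `create_prediction({"prompt": "a red bicycle, photorealistic", "image_url": "https://example.com/ref.jpg"})`, which returns the ID to poll in the next step.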

Get Prediction Result

Poll the prediction endpoint with the prediction ID until the result is ready. The API is poll-based, so you'll need to check repeatedly until the response reports a success status.
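A polling loop might look like the sketch below; the endpoint path, header name, and status strings are assumptions to adjust against the actual Eachlabs response schema.

```python
# Sketch of polling a prediction until it succeeds, fails, or times out.
import time
import requests

def wait_for_result(prediction_id: str, api_key: str,
                    base_url: str = "https://api.eachlabs.ai/v1",
                    interval: float = 2.0, timeout: float = 120.0):
    """Poll the prediction endpoint until it reports success or failure."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        resp = requests.get(f"{base_url}/prediction/{prediction_id}",
                            headers={"X-API-Key": api_key}, timeout=30)
        resp.raise_for_status()
        data = resp.json()
        status = data.get("status")           # status values are assumptions
        if status == "success":
            return data.get("output")         # e.g. URL(s) of generated image(s)
        if status in ("error", "failed"):
            raise RuntimeError(f"Prediction failed: {data}")
        time.sleep(interval)                  # wait before the next check
    raise TimeoutError("Prediction did not finish within the timeout")
```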

Readme

Table of Contents
Overview
Technical Specifications
Key Considerations
Tips & Tricks
Capabilities
What Can I Use It For?
Things to Be Aware Of
Limitations

Overview

Z-Image-Turbo is a distilled version of the original Z-Image model developed by Alibaba's Tongyi-MAI lab. It is a lightweight 6B parameter image generation model designed for ultra-fast text-to-image synthesis, achieving high-quality photorealistic results in as few as 9 sampling steps. The model excels in speed and efficiency, making it suitable for real-time workflows and local deployment on consumer hardware.

Key features include photorealistic image generation with refined lighting, clean textures, and strong composition, alongside accurate bilingual text rendering in English and Chinese. It incorporates advanced world knowledge and semantic reasoning for handling complex prompts and culturally grounded concepts. What sets it apart is the Scalable Single-Stream Multi-Modal Diffusion Transformer (S3-DiT) architecture, which processes text embeddings and noisy image latents in a single unified sequence, enabling dense cross-modal interactions and strong performance at a compact scale compared to larger models.
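To make the single-stream idea concrete, here is a toy PyTorch sketch. It is not the actual Z-Image architecture; every dimension, layer count, and token count below is made up. It only illustrates how concatenating text and latent tokens into one sequence lets a single transformer attend across both modalities.

```python
# Toy illustration of a single-stream multi-modal transformer.
import torch
import torch.nn as nn

d_model = 256
text_tokens = torch.randn(1, 77, d_model)      # stand-in for encoded text
latent_tokens = torch.randn(1, 1024, d_model)  # stand-in for noisy latent patches

# One unified sequence: every text token attends to every latent token (and
# vice versa) inside the same attention layers, i.e. dense cross-modal
# interaction instead of separate streams bridged by cross-attention.
stream = torch.cat([text_tokens, latent_tokens], dim=1)

layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
backbone = nn.TransformerEncoder(layer, num_layers=2)

out = backbone(stream)
denoised = out[:, text_tokens.shape[1]:, :]    # keep only the latent positions
print(denoised.shape)                          # torch.Size([1, 1024, 256])
```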

This architecture, combined with a unique training strategy leveraging real-world data streams, allows Z-Image-Turbo to outperform previous state-of-the-art open-source models in speed and cost-effectiveness while maintaining competitive quality, as validated in benchmarks like Alibaba AI Arena.

Technical Specifications

  • Architecture: Scalable Single-Stream Multi-Modal Diffusion Transformer (S3-DiT)
  • Parameters: 6B
  • Resolution: Not explicitly specified; supports high-fidelity photorealistic outputs
  • Input/Output formats: Text prompts to images; supports bilingual text rendering; compatible with FP8, AIO, GGUF, BF16 quantized variants
  • Performance metrics: Generates 100 images in 279 seconds (4:39); 9-step inference at ~9 seconds per image on a 24GB GPU; subsecond latency on high-end GPUs; outperforms competitors like Flux.2 Dev (19:12 for 100 images) and Ovis-Image (8:28) in speed (the per-image arithmetic is worked below)
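Working out per-image throughput from those 100-image benchmark figures:

```python
# Per-image throughput from the batch timings quoted above.
runs_s = {
    "Z-Image-Turbo": 279,        # 4:39
    "Ovis-Image": 8 * 60 + 28,   # 508 s
    "Flux.2 Dev": 19 * 60 + 12,  # 1152 s
}
for model, total in runs_s.items():
    print(f"{model}: {total / 100:.2f} s/image")
# Z-Image-Turbo: 2.79 s/image, roughly 1.8x faster than the next model (5.08)
```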

Key Considerations

  • Use minimal sampling steps (e.g., 9) for maximum speed, but increase to 20+ for higher detail in complex scenes
  • Optimize VRAM usage with quantized versions like FP8 or GGUF to fit on 16-24GB consumer GPUs
  • Balance quality and speed: lower steps prioritize rapidity but may reduce fine details compared to larger models
  • Prompt with clear, descriptive language emphasizing style, lighting, and composition for best photorealism
  • Avoid overly abstract or highly intricate prompts initially, as the model's distillation favors straightforward semantic understanding
  • Test on local hardware to account for variability in inference time based on GPU and optimizations

Tips & Tricks

  • Optimal parameter settings: 9-16 sampling steps, CFG scale 3-7, use BF16 or FP8 for speed on consumer GPUs (a full example payload follows this list)
  • Prompt structuring advice: Start with subject description, add style qualifiers (e.g., "photorealistic, sharp lighting"), specify bilingual text needs explicitly
  • Achieve photorealism: Include terms like "clean textures, balanced composition, refined lighting" in prompts
  • Iterative refinement: Generate initial low-step outputs, then upscale or refine with higher steps using the same seed
  • Advanced techniques: Leverage S3-DiT for complex scenes by chaining prompts with semantic details (e.g., "culturally accurate Chinese festival scene with English signage"); experiment with GGUF workflows for macOS compatibility, resolving float8 conversion errors via custom nodes
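As referenced above, a hypothetical input payload wiring these settings together might look like this. The field names (`steps`, `guidance_scale`, `seed`, `image_url`) are assumptions to check against the model's actual input schema in the playground.

```python
# Hypothetical prediction payload combining the suggested settings.
payload = {
    "model": "z-image-turbo-image-to-image",
    "input": {
        "prompt": ("a ceramic teapot on a wooden table, photorealistic, "
                   "sharp lighting, clean textures, balanced composition"),
        "image_url": "https://example.com/reference.jpg",  # reference image
        "steps": 9,             # 9-16 recommended; raise for complex scenes
        "guidance_scale": 4.5,  # CFG in the suggested 3-7 range
        "seed": 42,             # fix the seed for iterative refinement
    },
}
```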

Capabilities

  • Generates high-quality photorealistic images with excellent detail preservation at ultra-low latency
  • Accurate bilingual text rendering in posters, graphics, and small fonts with proper alignment and typography
  • Strong semantic reasoning and world knowledge for logical, culturally grounded outputs
  • Versatile across styles, from realistic scenes to creative compositions, matching larger models in fidelity
  • Efficient local inference on 16-24GB GPUs, enabling real-time generation
  • Superior speed in benchmarks, nearly twice as fast as the next-fastest competitor for batch processing

What Can I Use It For?

  • Rapid prototyping of visual concepts in creative workflows, as noted in reviews for quick iterations
  • Generating photorealistic product visuals or marketing graphics with bilingual text support
  • Local offline image creation for personal projects, highlighted in user tests on consumer GPUs
  • Real-time applications like dynamic content generation, praised for subsecond latency potential
  • High-volume batch processing, demonstrated in benchmarks producing 100 images in under 5 minutes (see the sketch after this list)
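For the batch-processing case, a sketch building on the hypothetical `create_prediction` and `wait_for_result` helpers from the API section could submit several jobs concurrently:

```python
# Concurrent batch generation, reusing the assumed create_prediction and
# wait_for_result helpers (and API_KEY) sketched in the API section.
from concurrent.futures import ThreadPoolExecutor

prompts = [f"product photo, variant {i}, photorealistic" for i in range(10)]

def generate(prompt: str):
    pid = create_prediction({"prompt": prompt})  # submit one job
    return wait_for_result(pid, API_KEY)         # block until it finishes

# A small worker pool keeps several predictions in flight at once.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(generate, prompts))
```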

Things to Be Aware Of

  • Runs on 24GB GPUs such as a mobile RTX 5090; unoptimized VRAM usage approaches the full 24GB, while quantized versions bring it down to roughly 16GB
  • Outputs closely match leading models like Flux.2 Dev in quality while generating in a fraction of the time
  • Common macOS issues include KSampler float8 conversion errors, which are resolvable with GGUF custom nodes
  • Consistent high aesthetic quality in benchmarks, especially photorealism, but may lack ultra-fine details of massive models
  • Positive feedback on speed and local runnability: "one of the fastest offline models I've seen" and "fantastic overall"
  • Variability in generation time (e.g., 9 seconds to slightly longer for complex prompts) based on hardware and optimizations

Limitations

  • Distilled design prioritizes speed over maximum detail, potentially underperforming larger models on hyper-intricate scenes or the finest artistic detail
  • Higher VRAM usage than expected without quantization (up to 24GB); may require optimizations for lower-end hardware
  • Experimental quantized variants (FP8, GGUF) can encounter platform-specific errors like float8 issues on MacOS