Z-Image Turbo LoRA


A text-to-image endpoint with LoRA support, powered by Tongyi-MAI’s ultra-fast 6B Z-Image Turbo model for efficient, high-quality image generation.

Avg Run Time: 10.0s

Model Slug: z-image-turbo-lora

Release Date: December 8, 2025

Pricing: $0.009 per megapixel of output.
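At that rate, a single 1024×1024 output (1,048,576 pixels, about 1.05 megapixels) costs roughly $0.0094.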

API & SDK

Create a Prediction

Send a POST request to create a new prediction. This will return a prediction ID that you'll use to check the result. The request should include your model inputs and API key.
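A minimal Python sketch of the create call follows. The endpoint URL, auth header, and input schema below are assumptions based on common prediction APIs, not the documented contract; consult the official API reference for the exact format.

```python
# Hypothetical create-prediction call. Endpoint, header, and field names
# are assumptions; check the official API reference for the real schema.
import requests

API_KEY = "YOUR_API_KEY"

resp = requests.post(
    "https://api.eachlabs.ai/v1/prediction/",   # assumed endpoint
    headers={"X-API-Key": API_KEY},              # assumed auth header
    json={
        "model": "z-image-turbo-lora",
        "input": {
            "prompt": "a portrait of an elderly woman, natural light",
            "negative_prompt": "blurry, extra limbs, distorted hands",
        },
    },
    timeout=30,
)
resp.raise_for_status()
prediction_id = resp.json()["predictionID"]      # assumed response field
```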

Get Prediction Result

Poll the prediction endpoint with the prediction ID until the result is ready. Each request returns the prediction's current status, so you'll need to check repeatedly until you receive a success status.
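Continuing the sketch above, a simple polling loop might look like the following; the status values and response fields are again assumptions rather than the documented schema.

```python
# Hypothetical polling loop; reuses API_KEY and prediction_id from the
# create-prediction sketch. Status values are assumptions.
import time
import requests

while True:
    resp = requests.get(
        f"https://api.eachlabs.ai/v1/prediction/{prediction_id}",  # assumed
        headers={"X-API-Key": API_KEY},
        timeout=30,
    )
    resp.raise_for_status()
    result = resp.json()
    if result.get("status") == "success":
        print(result.get("output"))   # e.g., a URL to the generated image
        break
    if result.get("status") == "error":
        raise RuntimeError(f"prediction failed: {result}")
    time.sleep(1)                     # brief pause before re-polling
```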

Readme

Table of Contents
Overview
Technical Specifications
Key Considerations
Tips & Tricks
Capabilities
What Can I Use It For?
Things to Be Aware Of
Limitations

Overview

Z-Image-Turbo is a 6B-parameter text-to-image diffusion model from Alibaba's Tongyi-MAI team, part of the Z-Image family and designed specifically for ultra-fast, high-quality image generation on commodity GPUs. It is a distilled “turbo” variant of the base Z-Image model, optimized to reach competitive visual quality in as few as 8–9 denoising steps, which dramatically reduces latency compared with traditional diffusion models. Community benchmarks and reviews consistently highlight it as one of the fastest open image generators currently available, while remaining lightweight and easy to run locally on 16–24 GB consumer GPUs.

Technically, Z-Image-Turbo is built on a Scalable Single-Stream Multi-Modal Diffusion Transformer (often referred to as S3-DiT), which integrates text and image features in a single transformer stream to maximize cross-modal interaction at every layer. This architecture, combined with a data pipeline tuned for real-world images and multilingual prompts (notably English and Chinese), allows the model to deliver strong photorealism, solid text rendering, and flexible style control at a fraction of the size of many current state-of-the-art models. The “z-image-turbo-lora” configuration layers LoRA support on top, enabling efficient fine-tuning and style adaptation without retraining the full 6B model, a setup that practitioners use extensively for custom aesthetics and domain-specific outputs.

Technical Specifications

  • Architecture: Scalable Single-Stream Multi-Modal Diffusion Transformer (S3-DiT)
  • Model family: distilled “Turbo” variant of the original Z-Image model
  • Parameters: approximately 6 billion
  • Diffusion steps: typical high-quality workflows use 8–12 steps; the model is advertised to work well at around 8–9 steps for turbo use cases
  • Supported resolutions (community usage):
      • Commonly used at 512×512, 768×768, and 1024×1024, with users reporting stable performance at 1024×1024 on 16–24 GB GPUs
      • Higher resolutions (e.g., 1536×1536 and tiled outputs) are possible with careful VRAM management and/or multi-pass workflows, as shown in community guides
  • Input formats:
      • Text prompts in English and Chinese are explicitly tested and reported as well supported
      • Negative prompts are commonly used to control artifacts and style in standard diffusion-style workflows
      • For LoRA, low-rank adaptation weights are typically loaded as separate files on top of the base Z-Image(-Turbo) weights
  • Output formats:
      • RGB images, most often saved as PNG or JPEG in community tooling; the format is generally dictated by the client application rather than the core model
  • Latency/performance (from public benchmarks and user tests):
      • A community benchmark generating 100 images found Z-Image-Turbo completed in about 279 seconds versus 507.9 seconds for Ovis-Image and 1,152 seconds for Flux.2 Dev, making it nearly twice as fast as the second-fastest model in that test and much faster than heavier competitors.
      • A separate reviewer running locally on a mobile RTX 5090 (24 GB) reports single-image generations in roughly 9 seconds at typical settings, calling it “one of the fastest offline image generation models” they have tested.
      • Another user on an RTX A4500 (~20 GB) reports run times of 13–88 seconds depending on prompt complexity and batch size.
  • VRAM requirements (community-reported):
      • Runs comfortably on 16 GB VRAM for 1–2 images at mainstream resolutions; 20–24 GB VRAM gives more headroom for higher resolutions and batches.
  • LoRA support (see the local-inference sketch after this list):
      • LoRA training and inference are commonly demonstrated using the base Z-Image model for training and Z-Image-Turbo for fast inference, with tutorials explicitly covering Z-Image-Turbo LoRA workflows.
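For local experimentation, the kind of setup these numbers come from can be sketched with a diffusers-style pipeline. The snippet below is a minimal sketch, not an official recipe: the Hugging Face repo id, diffusers support for Z-Image-Turbo, and the pass-through call arguments are all assumptions.

```python
# Minimal local-inference sketch. The repo id and diffusers support for
# Z-Image-Turbo are assumptions; adjust to the actual published weights.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "Tongyi-MAI/Z-Image-Turbo",   # hypothetical repo id
    torch_dtype=torch.bfloat16,   # half precision fits 16-24 GB VRAM
)
pipe.to("cuda")

image = pipe(
    prompt="a portrait of an elderly woman, natural light, 85mm lens",
    negative_prompt="blurry, extra limbs, distorted hands, low detail",
    num_inference_steps=9,        # the 8-9 step turbo sweet spot
    guidance_scale=5.0,           # within the community-reported 4-7 range
    height=1024,
    width=1024,
).images[0]
image.save("z_image_turbo_sample.png")
```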

Key Considerations

  • Be aware that Z-Image-Turbo is a distilled speed-optimized variant; for LoRA training, community experts recommend using the base Z-Image model and then applying the resulting LoRA to Z-Image-Turbo for fastest inference.
  • The model responds strongly to detailed prompts; vague prompts still work reasonably well, but precise descriptions (subjects, lighting, composition, style) consistently improve output quality.
  • Very low step counts (e.g., 4–6) can be used for ultra-low-latency previewing but may introduce more noise, artifacts, or weaker fine details; most users settle on 8–12 steps as the quality–speed sweet spot.
  • Text rendering and signage are generally strong compared with many older open models, but still benefit from explicit formatting and short, clear wording in the prompt.
  • The model handles both photorealistic and stylized outputs, but communities note that photorealism is its strongest area; for highly stylized or painterly outputs, additional style LoRAs or style-heavy prompts can be helpful.
  • Negative prompts like “blurry, extra limbs, distorted hands, low contrast, low detail” are widely used to reduce common diffusion artifacts and increase consistency.
  • VRAM usage grows quickly with resolution and batch size; users on 8–12 GB GPUs often reduce resolution, batch size, or use more aggressive optimization modes, whereas 16–24 GB cards can handle higher resolutions more comfortably.
  • For LoRA workflows, training with too small or too homogeneous a dataset can lead to overfitting and style “overpowering” the base model; tutorials emphasize balanced datasets and conservative LoRA ranks and learning rates.
  • Because it is relatively new, tooling, configs, and community best practices are evolving; tracking recent benchmarks and configuration guides can significantly improve results.

Tips & Tricks

  • General parameter guidance (community patterns):
      • Steps: 8–9 steps for “turbo” production workflows where speed matters most; 12–20 steps for highest quality or very complex compositions, at the cost of slower inference.
      • CFG scale: many users report best results in the 4–7 range; higher values can cause oversaturation or prompt overfitting.
      • Sampler: fast samplers tuned for few-step diffusion (e.g., DPM variants) are commonly used; community benchmarks often pair Z-Image-Turbo with such samplers for the best speed–quality trade-offs.
  • Prompt structuring:
      • Start with a clear subject + action + setting, then add style and technical modifiers (e.g., “a portrait of an elderly woman, natural light, shallow depth of field, 85mm lens, ultra-detailed, high dynamic range”); users show this structure yields consistent, well-framed outputs.
      • Put the most important elements early in the prompt and avoid overlong lists of modifiers that can dilute the model’s focus.
      • For multilingual or bilingual prompts, users report best results by keeping the critical description in a single language (English or Chinese) rather than mixing mid-sentence, to avoid ambiguity.
  • Achieving specific looks:
      • Photoreal portraits: emphasize camera/lens terms, lighting (e.g., studio light, softbox, rim light), skin texture, and age/gender descriptors. Community examples show sharp, realistic portraits with 8–12 steps and moderate CFG.
      • Cinematic scenes: add film-related modifiers like “cinematic lighting, 35mm film, anamorphic, cinematic color grading, volumetric light” and use higher resolution (e.g., 1024×576 or 1024×1024) for more environmental detail.
      • Game/anime styles: users often combine Z-Image-Turbo with style LoRAs or strong style keywords (e.g., “anime key visual, cel shading, thick line art”) to push it into non-photoreal domains.
  • Iterative refinement:
      • Generate an initial batch at low steps and resolution to explore composition and pose, then upscale or re-render the best candidates at higher resolution and slightly more steps for final output.
      • Use prompt editing: take the best seed, then iteratively adjust small parts of the prompt (lighting, background, clothing details) rather than rewriting everything; users report significantly better convergence this way.
  • LoRA-specific tips (see the sketch after this list):
      • Train LoRAs on the base Z-Image model rather than Turbo for higher fidelity, then load them onto Z-Image-Turbo for production inference.
      • Keep LoRA rank and learning rate conservative to avoid overfitting; tutorials show that lower-rank LoRAs can still capture strong styles while remaining stable across prompts.
      • When applying LoRAs, start with a modest strength (e.g., 0.6–0.8) and adjust upward only if the style is not visible enough; too high a strength can overpower the base model’s realism.
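As a concrete illustration of the seed-reuse and LoRA-strength tips, here is a minimal sketch in the same assumed diffusers-style setup as the earlier inference example; the LoRA file path is hypothetical and `pipe` is the pipeline loaded above.

```python
# Illustrative sketch: modest-strength LoRA plus fixed-seed refinement.
# Assumes the hypothetical diffusers pipeline from the earlier example.
import torch

# Load a custom style LoRA (path is hypothetical) and fuse it at a
# conservative strength, per the 0.6-0.8 starting-point guidance above.
pipe.load_lora_weights("./loras/my_style_lora.safetensors")
pipe.fuse_lora(lora_scale=0.7)

# Fix the seed so small prompt edits refine the same composition
# instead of reshuffling it entirely.
generator = torch.Generator(device="cuda").manual_seed(42)

image = pipe(
    prompt="a portrait of an elderly woman, softbox lighting, 85mm lens",
    num_inference_steps=12,   # a few extra steps for the final render
    guidance_scale=5.0,
    generator=generator,
).images[0]
image.save("refined.png")
```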

Capabilities

  • Strong performance in photorealistic image generation across portraits, objects, and complex scenes, even at relatively low step counts.
  • Efficient multilingual prompt handling, with explicit support and good results reported for both English and Chinese text instructions.
  • Very fast inference relative to other open models of similar or larger size, with both synthetic benchmarks and real user tests consistently confirming its speed advantage.
  • Good text rendering capabilities (e.g., logos, signs, UI elements) compared with earlier diffusion models, especially when prompts are concise and clear.
  • Flexible style range, from photography to illustrative and cinematic looks, which can be extended further using LoRAs for specific art styles or domains.
  • LoRA-friendly design: supports low-rank adaptation for quick specialization, and community workflows show successful training of custom character and style LoRAs with modest hardware (often under 12–16 GB VRAM).
  • Scales well with hardware: runs on mid-range consumer GPUs (16 GB) and benefits strongly from 20–24 GB VRAM for higher resolutions and batch sizes.
  • Competitive cost-effectiveness: benchmarks describe it as ahead of peers in speed and resource efficiency for large-batch generation, making it attractive for production workloads.

What Can I Use It For?

  • Professional and production workflows:
      • High-volume asset generation for games and media, where benchmarks show it can generate large batches significantly faster than competing models, reducing turnaround time and compute costs.
      • Rapid prototyping of visual concepts in design studios and creative agencies, with reviewers explicitly highlighting its ability to run quickly and locally on standard GPUs for iterative work.
      • Internal tooling for content teams needing fast, on-demand imagery (product mockups, mood boards, storyboards) powered by a relatively compact 6B model.
  • Creative community projects:
      • Character design and illustration: community tutorials demonstrate training character LoRAs on Z-Image/Z-Image-Turbo and using them to create consistent characters across multiple scenes.
      • Stylized artwork and fan art leveraging LoRA fine-tuning, where users share workflows for teaching the model new aesthetics while keeping inference fast.
      • Cinematic stills and concept art shared in image-generation forums, often citing Z-Image-Turbo’s strong composition and lighting from short prompts.
  • Business and industry use cases:
      • Marketing visuals and social media content where low-latency generation and controllable aesthetics are important; blogs point to its speed and cost-effectiveness as major advantages in these scenarios.
      • E-commerce and product visualization, using detailed prompts and potentially LoRAs trained on specific product lines to create catalog-like images or variations.
      • UX/UI concept imagery and mockups for interfaces, dashboards, and iconography, leveraging the model’s text and layout handling.
  • Personal and open-source projects:
      • Local image-generation setups on consumer GPUs (e.g., 16–24 GB cards), with reviewers explicitly documenting home-lab style deployments and settings for desktops and laptops.
      • Open-source pipelines and automation scripts on GitHub that integrate Z-Image-Turbo for batch rendering, dataset synthesis, or experimental pipelines.
      • Educational experiments in diffusion, LoRA training, and benchmarking, as seen in community videos that stress-test the model with many prompts and configurations.
  • Domain-specific applications:
      • Technical and scientific illustration (e.g., diagrams, conceptual schematics) where fast iteration matters more than hyper-realistic detail.
      • Niche aesthetic generators where users train domain-specific LoRAs (e.g., particular art movements, cultural motifs, or brand-specific styles) on top of Z-Image.

Things to Be Aware Of

  • As a distilled turbo model, Z-Image-Turbo slightly trades maximum possible fidelity for speed; some reviewers note that at very high scrutiny, the finest micro-details can lag behind the heaviest state-of-the-art models, especially at extremely low step counts.
  • Community guides emphasize that for LoRA training, the base Z-Image model is preferable; training directly on Turbo may work but is less commonly recommended, and may not generalize as well.
  • Earlier versions of community configs showed some instability or inconsistency at very low steps (e.g., 4–5), with more noise and structural artifacts; most users stabilize results by moving to 8–12 steps.
  • Hands, small objects, and intricate geometry can still exhibit typical diffusion artifacts (extra fingers, fused objects) if prompts are underspecified; targeted negative prompts and slightly more steps help mitigate these issues.
  • VRAM usage can spike when using high resolutions or multi-image batches; benchmarks on multiple machines show that while it can run on 8–12 GB setups with aggressive optimizations, the best experience is reported on 16–24 GB GPUs.
  • Positive user feedback themes:
      • Consistently praised for its speed and responsiveness, repeatedly described as among the fastest locally runnable image models users have tried.
      • Appreciated for strong photorealism and good text rendering with relatively simple prompts, reducing the need for extremely elaborate prompt engineering.
      • Viewed as highly practical for real production workloads because of its efficiency and open nature.
  • Common concerns or negative patterns:
      • Some users note that out-of-the-box style variety is more limited than in large, heavily trained generalist models; they rely on LoRAs or stronger style prompting for more exotic or niche aesthetics.
      • A few benchmarks highlight that while it is exceptionally fast, its absolute peak quality in highly demanding artistic scenarios may be slightly behind the largest contemporary closed models, especially at high resolutions.
      • Since the ecosystem is still maturing, configuration defaults, best samplers, and recommended parameters are evolving, and early tutorials sometimes conflict; users often need to test several configs before settling on optimal settings.

Limitations

  • Being a 6B distilled turbo model, Z-Image-Turbo is optimized for speed and efficiency rather than absolute peak fidelity; in extremely detailed or high-resolution artistic tasks, heavier models can sometimes surpass its fine-grain quality.
  • At very low step counts or on low-VRAM hardware with aggressive optimization, output quality and structural coherence can degrade, leading to more artifacts and inconsistencies.
  • Without LoRAs or very deliberate style prompting, its default style range, while competent, is less diverse than that of larger, heavily specialized models, making it a weaker fit for some highly niche or experimental visual aesthetics.