Z-Image Turbo | ControlNet | LoRA

Generates images from text prompts combined with edge, depth, or pose inputs, pairing a custom LoRA with Tongyi-MAI's ultra-fast 6B Z-Image Turbo model for high-quality, controllable image creation.

Avg Run Time: 13.000s

Model Slug: z-image-turbo-controlnet-lora

Pricing: each request costs $0.010 per megapixel of input and $0.010 per megapixel of output.
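
As a quick worked example (assuming a megapixel here means width × height ÷ 1,000,000), a single 1024x1024 generation comes to roughly two cents across input and output:

```python
# Rough cost estimate at the listed per-megapixel rates.
# Assumption: "megapixel" is computed as width * height / 1_000_000.
INPUT_RATE = 0.010   # USD per megapixel of input
OUTPUT_RATE = 0.010  # USD per megapixel of output

def estimate_cost(width: int, height: int) -> float:
    megapixels = width * height / 1_000_000
    return megapixels * (INPUT_RATE + OUTPUT_RATE)

print(f"1024x1024: ~${estimate_cost(1024, 1024):.3f}")  # ~$0.021
print(f"2048x2048: ~${estimate_cost(2048, 2048):.3f}")  # ~$0.084
```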

API & SDK

Create a Prediction

Send a POST request to create a new prediction. This will return a prediction ID that you'll use to check the result. The request should include your model inputs and API key.
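
A minimal sketch in Python with `requests` is shown below. The base URL, auth header, endpoint path, and payload field names are assumptions for illustration rather than the exact Eachlabs schema, so check the API reference for the real field names:

```python
import requests

API_KEY = "YOUR_API_KEY"               # your Eachlabs API key
BASE_URL = "https://api.eachlabs.ai"   # assumed base URL for illustration

# Hypothetical payload: field names are illustrative, not the exact schema.
payload = {
    "model": "z-image-turbo-controlnet-lora",
    "input": {
        "prompt": "photorealistic portrait, soft studio lighting",
        "control_image": "https://example.com/pose.png",  # edge/depth/pose reference
        "controlnet_strength": 0.8,
        "steps": 8,
        "cfg": 2.5,
    },
}

resp = requests.post(
    f"{BASE_URL}/v1/prediction/",    # assumed endpoint path
    json=payload,
    headers={"X-API-Key": API_KEY},  # assumed auth header
    timeout=30,
)
resp.raise_for_status()
prediction_id = resp.json()["predictionID"]  # assumed response field
```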

Get Prediction Result

Poll the prediction endpoint with the prediction ID until the result is ready. The API expects client-side polling, so keep checking until you receive a success status.
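
Continuing from the snippet above, a simple polling loop might look like this (endpoint path, status values, and response fields are again assumptions):

```python
import time

def wait_for_result(prediction_id: str, poll_interval: float = 2.0, timeout: float = 300.0):
    """Repeatedly check the prediction until it succeeds, fails, or times out."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        r = requests.get(
            f"{BASE_URL}/v1/prediction/{prediction_id}",  # assumed endpoint path
            headers={"X-API-Key": API_KEY},
            timeout=30,
        )
        r.raise_for_status()
        result = r.json()
        status = result.get("status")       # assumed response field
        if status == "success":
            return result.get("output")     # e.g. URL(s) of the generated image(s)
        if status in ("failed", "error"):
            raise RuntimeError(f"Prediction failed: {result}")
        time.sleep(poll_interval)
    raise TimeoutError("Prediction did not finish in time")

image_url = wait_for_result(prediction_id)
print(image_url)
```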

Readme

Table of Contents
Overview
Technical Specifications
Key Considerations
Tips & Tricks
Capabilities
What Can I Use It For?
Things to Be Aware Of
Limitations

Overview

Z-Image-Turbo-ControlNet-LoRA is a specialized image generation model from Alibaba's Tongyi Lab (Aliyun), built on the ultra-fast Z-Image Turbo base model, a 6-billion-parameter distilled variant optimized for high-speed inference. It integrates a custom LoRA for fine-tuning and a Union ControlNet that fuses multiple control conditions such as Canny edges, depth maps, and pose inputs, enabling precise, controllable text-to-image generation with photorealistic quality. Released in late 2025, it supports rapid-creation workflows while keeping hardware demands low, making it suitable for real-time applications.

The model's key strength is its single-stream diffusion architecture, which achieves "zero-distortion" manipulation by jointly processing pose, edge, and depth information alongside text prompts. This allows seamless control over elements such as character expressions, scene objects, and composition, with support for mixed Chinese-English prompts and high-fidelity detail in skin texture, hair, and lighting. Users highlight its efficiency, generating 1024x1024 images in as few as 8 steps (around 9 seconds on an RTX 4080), positioning it as a competitive open-source option for fast, high-quality output.

What sets it apart is the lightweight Union ControlNet design, compatible with just 6GB VRAM, and its open-source nature fostering community experiments like celebrity face generation and pose-guided rendering. Recent benchmarks praise its stable performance at low CFG scales (2-3) and natural recognizability, outperforming larger models in speed while matching quality in controlled scenarios.

Technical Specifications

  • Architecture: Single-stream diffusion with Union ControlNet (Canny, MLSD, HED, Pose, Depth) and custom LoRA integration on the Z-Image Turbo base
  • Parameters: 6 billion (base model)
  • Resolution: 1024x1024 native; supports up to 2048x2048 with performance scaling
  • Input/Output formats: Text prompts plus control images (edge/depth/pose) in; photorealistic images out; model weights distributed as safetensors (BF16/FP8) and GGUF
  • Performance metrics: 9 seconds for 1024x1024 at 8 steps (RTX 4080); 4s base vs 16s with ControlNet at 1024px; 43s base vs 179s with ControlNet at 2048px; 250s per 5 steps on low-end GPUs

Key Considerations

  • Requires updated workflows (e.g., latest nightly versions) for full ControlNet node support to avoid compatibility issues
  • Best practices: Use all-in-one auxiliary preprocessors for streamlined edge/depth/pose handling (a simple Canny example follows this list); start with strength 0.8-1.0 for balanced control
  • Common pitfalls: Relying only on the default Euler sampler limits diversity; high resolutions (2048px) significantly increase generation time with ControlNet (up to 3x the base time)
  • Quality vs speed trade-offs: ControlNet adds creative precision but can triple or quadruple inference time (e.g., ~40s base to ~190s total); prioritize low step counts (5-10) for speed
  • Prompt engineering tips: Short, simple prompts (e.g., "Face", "Person") work best; reduce denoising strength to 0.7 for variation; low CFG (2-3) ensures stability
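
If you prepare control images yourself rather than through an all-in-one preprocessor node, a plain Canny edge map is the simplest starting point; depth and pose maps come from dedicated preprocessors such as Zoe Depth or DWPose. A minimal OpenCV sketch (thresholds are illustrative):

```python
import cv2

# Build a simple Canny edge map to use as the ControlNet condition image.
# The thresholds below are illustrative starting values; tune them per image.
img = cv2.imread("reference.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(gray, 100, 200)
cv2.imwrite("control_canny.png", edges)  # pass this as the control image input
```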

Tips & Tricks

  • Optimal parameter settings: 8-10 steps, denoising strength 0.7-1.0, ControlNet strength 0.8 for flexibility; FP8-E4M3FN for a quality/speed balance (see the configuration sketch after this list)
  • Prompt structuring advice: Use mixed Chinese-English for better rendering; keep concise to maximize diversity (e.g., "Avenger Movie Scene")
  • How to achieve specific results: For pose control, input reference poses via DWPose/Zoe Depth preprocessors; blend with text for expression tweaks
  • Iterative refinement strategies: use a two-stage workflow, a low-res pass (denoising <1.0) for variation followed by an img2img upscale for detail
  • Advanced techniques: Resolution staging (low for speed/variation, high for refinement); custom samplers over Euler for unique outputs; LoRA for style customization
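
Pulling the recommended values together, a hedged starting configuration might look like the sketch below. Field names follow the hypothetical payload from the API examples above and should be mapped to the model's actual inputs; the two-stage refinement idea is shown in the comments.

```python
# Illustrative starting points drawn from the tips above; all field names
# are hypothetical and should be mapped to the actual model inputs.
base_settings = {
    "steps": 8,                  # 8-10 steps is usually enough for Turbo
    "cfg": 2.5,                  # low CFG (2-3) keeps outputs stable
    "denoising_strength": 0.7,   # below 1.0 to increase variation
    "controlnet_strength": 0.8,  # 0.8-1.0 balances control and flexibility
    "sampler": "euler",          # try non-default samplers for more diversity
}

# Two-stage refinement: a quick low-resolution pass for variation,
# then an img2img upscale pass to recover detail.
stage_one = {**base_settings, "width": 768, "height": 768}
stage_two = {
    **base_settings,
    "width": 1536,
    "height": 1536,
    "denoising_strength": 0.4,
    "init_image": "stage_one_output.png",  # output of the first pass
}
```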

Capabilities

  • Excels in photorealistic rendering with fine details in skin, hair, lighting, and textures using just 6B parameters
  • Multi-condition fusion for precise control over poses, edges, depth without distortion, enabling sketch-to-product pipelines
  • Ultra-fast inference: Sub-second on high-end GPUs, viable on 6GB VRAM consumer cards for real-time generation
  • High versatility: Supports celebrity/K-pop face recognition, scene manipulation, mixed-language prompts with natural outputs
  • Strong adaptability: Stable at low CFG (2-3), high diversity via denoising tweaks, matches/exceeds larger models like FLUX in speed/quality

What Can I Use It For?

  • E-commerce visual design: Automated pipelines from sketches/edges to product renders with depth/pose control
  • Film/TV special effects and game prototypes: Pose-guided character generation and scene object manipulation
  • Creative projects: Pose-controlled photorealistic portraits, celebrity recreations shared in community benchmarks
  • Personal projects: Fast img2img with ControlNet for custom scenes, as tested in user workflows for diversity experiments
  • Industry applications: Low-VRAM pose generation for animation prototypes and batch processing in design workflows

Things to Be Aware Of

  • The experimental Union ControlNet is the first released for Z-Image Turbo; enthusiastic community benchmarks on Reddit/X report strong recognizability
  • Known quirks: GGUF variants need specific loader nodes; high-res ControlNet demands more time (179s at 2048px)
  • Performance from benchmarks: the 4s base time at 1024px jumps to 16s with controls enabled; it excels on an RTX 4080 but also runs on low-end GPUs at roughly 250s per 5 steps
  • Resource requirements: Runs on 6GB VRAM, ideal for consumer hardware; BF16/FP8 for optimization
  • Consistency: high with low CFG, but standard settings yield similar outputs; tweak samplers and denoising for variation
  • Positive feedback: "Insane photorealism", "efficient advantage", "better than larger models" in speed/quality from recent tests
  • Common concerns: initial attempts may lack polish but improve with iteration; move beyond the basic Euler sampler to reach the model's full potential

Limitations

  • ControlNet significantly slows generation (3-4x base time, e.g., 190s total workflow), limiting real-time use at high resolutions
  • Relies on preprocessors to produce control inputs, which adds workflow setup; less suited to pure text-to-image without a reference, where diversity needs extra tuning
  • As a first ControlNet release, it may hit edge cases with non-standard poses or depth maps; community refinements are ongoing