Z-IMAGE
Generates images from text combined with edge, depth, or pose inputs using Tongyi-MAI’s ultra-fast 6B Z-Image Turbo model for precise and high-quality results.
Avg Run Time: 12.000s
Model Slug: z-image-turbo-controlnet
Playground
Input
Enter a URL or choose a file from your computer.
(Max 50MB)
Output
Example Result
Preview and download your result.

API & SDK
Create a Prediction
Send a POST request to create a new prediction. This will return a prediction ID that you'll use to check the result. The request should include your model inputs and API key.
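Below is a minimal sketch of this request in Python. The endpoint path, header, and payload field names are illustrative assumptions rather than the documented schema; substitute the values from your API reference.

```python
import os
import requests

# Illustrative sketch only: the endpoint path, header, and payload fields below
# are assumptions, not this API's documented schema.
API_KEY = os.environ["API_KEY"]

response = requests.post(
    "https://api.example.com/v1/predictions",  # hypothetical endpoint
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "z-image-turbo-controlnet",   # model slug from this page
        "input": {
            "prompt": "full-body portrait of a young woman in a red dress, cinematic lighting",
            "control_image": "https://example.com/pose.png",  # preprocessed edge/depth/pose map
        },
    },
    timeout=30,
)
response.raise_for_status()
prediction_id = response.json()["id"]          # assumed response field
print("Prediction ID:", prediction_id)
```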
Get Prediction Result
Poll the prediction endpoint with the prediction ID until the result is ready. The API uses long-polling, so you'll need to repeatedly check until you receive a success status.
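A simple polling loop might look like the following; the status values and result fields are assumptions about a typical long-polling prediction API.

```python
import time
import requests

# Continuation of the sketch above; status values and result fields are
# assumptions about a typical long-polling prediction API.
def wait_for_result(prediction_id: str, api_key: str, interval: float = 2.0) -> dict:
    url = f"https://api.example.com/v1/predictions/{prediction_id}"  # hypothetical endpoint
    headers = {"Authorization": f"Bearer {api_key}"}
    while True:
        data = requests.get(url, headers=headers, timeout=30).json()
        status = data.get("status")
        if status == "succeeded":
            return data                        # e.g., data["output"] would hold the image URL(s)
        if status in ("failed", "canceled"):
            raise RuntimeError(f"Prediction ended with status: {status}")
        time.sleep(interval)                   # keep polling until the result is ready
```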
Readme
Overview
Z-Image-Turbo-ControlNet (often referred to in community content as Z-Image-Turbo Fun ControlNet or Z-Image-Turbo-Fun-Controlnet-Union) is a ControlNet-augmented variant of Alibaba Cloud Tongyi Lab’s Z-Image-Turbo 6B image generation model. It extends the base ultra-fast text-to-image model with explicit conditional controls such as edges, depth, and human pose, enabling users to guide layout and structure while preserving the fast inference and photorealistic quality of Z-Image-Turbo. It is targeted at developers and technical artists who need controllable generation for design, e‑commerce visuals, game art, and pose-driven image synthesis.
The model integrates a Union/“Fun” ControlNet design that can ingest multiple control modalities (Canny, HED, depth, pose, MLSD/lines, etc.) in a single architecture, while remaining relatively lightweight in VRAM use compared to many ControlNet stacks. Compared to the base Z‑Image‑Turbo, it trades some raw speed for structured, “zero-distortion” control over composition and pose, and adds features such as inpainting and multi-condition fusion that are emphasized in community workflows (especially ComfyUI tutorials and speed tests).
Technical Specifications
- Architecture: Single-stream diffusion model with integrated ControlNet blocks (Z-Image-Turbo base + Fun/Union ControlNet structure in six core blocks)
- Parameters: Approximately 6B parameters (inherits Z-Image-Turbo 6B distilled architecture)
- Resolution:
- Native control resolution of around 1328 px on the long side, used for ControlNet-Fun 2.0 training
- Common inference resolutions:
- 1024×1024 (documented fast path and typical benchmark resolution)
- 2048×2048 supported with higher compute cost
- Input formats:
- Text prompts (Chinese, English, and mixed bilingual prompts supported)
- Conditioning images:
- Edge maps (Canny, HED)
- Depth maps (Depth Anything v2, ZoeDepth, similar depth preprocessors)
- Pose maps (e.g., DW Pose / OpenPose-like skeleton maps)
- Line/MLSD (for architecture and line drawings)
- Masks for inpainting/editing (in ControlNet 2.0 workflows)
- Control encodings are typically provided as single-channel or 3‑channel images produced by preprocessing pipelines (e.g., the auxiliary preprocessors in ComfyUI); see the Canny sketch after this list
- Output formats:
- RGB images, typically 8‑bit per channel
- Common sizes: 1024×1024; larger sizes via upscaling workflows (e.g., 2K) described in benchmarks and tutorials
- Performance metrics (from available benchmarks and user tests):
- Base Z-Image-Turbo (no ControlNet), RTX-class GPU:
- ~4 seconds per 1024×1024 image at 10 steps on one reported setup
- ~43 seconds per 2048×2048 image at 10 steps
- With ControlNet + reference image:
- ~16 seconds at 1024 resolution, 10 steps (same hardware)
- ~179 seconds at 2048 resolution, 10 steps
- Official lab benchmark: the base Turbo model generates a 1024×1024 image in 8 sampling steps in as little as 9 seconds on an RTX 4080
- VRAM: designed to run on ~6 GB of VRAM and up for the Turbo + ControlNet union, significantly less than many traditional ControlNet setups according to the lab write-up
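As a minimal illustration of the preprocessing step mentioned above (assuming OpenCV is available and that a plain Canny edge map is an acceptable conditioning image), an edge control map can be produced like this:

```python
import cv2

# Minimal Canny preprocessing sketch: the blur kernel and thresholds are
# starting values to tune, not settings documented for this model.
image = cv2.imread("reference.jpg")                      # BGR uint8
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
gray = cv2.GaussianBlur(gray, (5, 5), 0)                 # smooth noisy inputs before edge extraction
edges = cv2.Canny(gray, threshold1=100, threshold2=200)  # single-channel edge map
edges_rgb = cv2.cvtColor(edges, cv2.COLOR_GRAY2RGB)      # 3-channel version if the pipeline expects RGB
cv2.imwrite("control_canny.png", edges_rgb)
```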
Key Considerations
- When to use ControlNet:
- Use Z-Image-Turbo-ControlNet when you need structural fidelity (pose, depth, architecture lines) more than pure speed. For unconstrained creative exploration, the base Turbo model without ControlNet is faster and often sufficient.
- Quality vs speed:
- ControlNet increases inference time substantially, especially at high resolutions (e.g., the jump from 1024 to 2048 can multiply runtime several times over).
- Lower resolutions (≤1024) and moderate step counts (8–20) are a good balance for interactive workflows; 2K images or higher may be best reserved for final renders or batch jobs.
- Control strength tuning:
- Overly strong control weights can cause “over-anchored” images that look stiff or distort local appearance; too weak and the model drifts from the guide.
- The control strength range reported by practitioners (via controlcontextscale or a related parameter) is roughly 0.65–0.90, adjusted per use case; see the example payload after this list.
- Prompt design:
- Natural-language prompts should still specify style, lighting, and subject details; the control inputs primarily constrain geometry, not aesthetics. Under-specifying style often results in generic or inconsistent outputs.
- Preprocessing quality:
- Poor edge, pose, or depth maps lead to artifacts such as broken limbs, warped perspective, or jittery outlines. Use high-quality preprocessors (Depth Anything v2, ZoeDepth, DW Pose, HED) with parameters tuned for clean but informative maps.
- Mixed-language prompts:
- The underlying Z-Image-Turbo is optimized for mixed Chinese/English prompts; be explicit and avoid ambiguous phrasing, especially for niche concepts or specialized terminology, to avoid misinterpretations.
- Inpainting behaviors:
- Inpainting mode can sometimes blur unmasked regions or slightly alter content outside the mask; use tight masks and consider multi-pass refinement when precision is critical.
- CFG / guidance scale:
- Community feedback emphasizes that relatively low CFG (2–6) often yields natural, stable results; very high CFG may cause oversaturated or brittle images, especially under strong ControlNet conditioning.
- Hardware:
- While VRAM requirements are lower than many competing ControlNet stacks, multi-condition union plus high resolution (e.g., 2048) can still be demanding; plan for sufficient VRAM and consider bfloat16/half-precision to stay within limits.
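As a rough starting point, the ranges above can be collected into a single set of generation inputs. The parameter names below (control_strength, guidance_scale, num_inference_steps) are illustrative assumptions, not the model's documented schema:

```python
# Illustrative starting configuration collecting the ranges above; every
# parameter name is an assumption, not the model's documented schema.
starting_input = {
    "prompt": "white concrete building, sunset lighting, ultra wide-angle, realistic shadows",
    "control_image": "https://example.com/mlsd_lines.png",  # preprocessed control map
    "control_strength": 0.75,      # practitioners report ~0.65-0.90 working well
    "guidance_scale": 3.0,         # low CFG (2-6) tends to stay natural and stable
    "num_inference_steps": 10,     # 8-20 is a good interactive range at <=1024px
    "width": 1024,
    "height": 1024,
}
```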
Tips & Tricks
- Control and prompt configuration:
- Start with:
- Steps: 10–16 for 1024×1024, increase toward 20–30 when using multiple control maps or very strict control.
- CFG / guidance: 3–6 for photorealism and stability; lower values (2–3) are praised in community tests for reducing artifacts and improving coherence with ControlNet.
- Control strength (controlcontextscale or equivalent): 0.7–0.85 as a starting point, then tune based on rigidity vs creativity.
- Keep prompts structured:
- Subject: “full-body portrait of a young woman in a red dress”
- Style: “cinematic lighting, 35mm photography, shallow depth of field, high dynamic range”
- Quality tags: “high detail, sharp focus, 8k, photorealistic”
- Negative prompt: “no extra limbs, no distorted hands, no text, no watermark, no blur”
- Achieving strong pose-controlled portraits:
- Use a clean DW Pose or OpenPose-style skeleton extracted from a simple reference pose image.
- Keep clothing and camera angle in prompt aligned with the reference pose; mismatches (e.g., overhead pose with “close-up portrait”) cause perspective conflicts.
- Start with medium control strength (~0.75) and increase only if the model deviates from the desired pose.
- From sketch/line-art to final render:
- Generate a Canny or MLSD map from your sketch or CAD-like drawing and feed it as the primary control condition.
- Emphasize material and mood in the text prompt (e.g., “white concrete building, sunset lighting, ultra wide-angle, realistic shadows”) while letting lines define geometry.
- If lines are noisy, blur or dilate them before Canny/MLSD extraction to avoid jagged or flickering edges.
- Depth-guided composition:
- Use Depth Anything v2 or ZoeDepth to produce smooth depth maps from 3D renders or reference photos.
- For complex scenes (architecture, interiors), keep control strength relatively high to retain perspective; for portraits, moderate strength avoids flattening facial features.
- Inpainting and local edits (where supported by the ControlNet 2.0 variant):
- Use tight masks and avoid including large unaffected areas, to minimize unexpected changes.
- Work in stages: first inpaint background, then separately refine foreground details such as clothing or accessories.
- Keep prompts short and focused on the edit (“replace background with modern office interior, realistic lighting”) rather than re-describing the whole scene to reduce drift.
- Upscaling and quality optimization:
- A common user strategy is:
- Generate at 1024×1024 with ControlNet to lock in pose/structure quickly.
- Then upscale (2×) with a separate upscaling or refinement pass that may use weaker or no ControlNet, focusing on texture and micro-detail.
- Community benchmarks suggest avoiding simple Euler samplers; higher-quality samplers (e.g., DPM-like or other advanced samplers within the ecosystem) often produce cleaner detail for the same step count.
- Iterative refinement:
- Run small batches (e.g., 4–8 seeds) at low steps (8–10) to explore prompt and control combinations.
- Once a promising seed is found, lock the seed and scale up steps and resolution for the final output, optionally re-tuning control strength for better balance, as sketched in the example below.
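A minimal sketch of this explore-then-refine loop is shown below; the generate() helper is hypothetical and stands in for whichever API call or ComfyUI workflow you actually use.

```python
import random

# Hypothetical generate() helper standing in for your actual generation call
# (API request, ComfyUI workflow, etc.); shown only to illustrate the loop.
def generate(prompt, control_image, seed, steps, width, height, control_strength):
    ...

prompt = "full-body portrait, cinematic lighting, high detail"
control_image = "control_pose.png"

# 1) Explore: a small batch of seeds at low steps to find a good composition.
candidate_seeds = [random.randrange(2**31) for _ in range(6)]
previews = {
    seed: generate(prompt, control_image, seed=seed, steps=8,
                   width=1024, height=1024, control_strength=0.75)
    for seed in candidate_seeds
}

# 2) Refine: lock the chosen seed, raise steps (and optionally resolution),
#    and re-tune control strength for the final render.
best_seed = candidate_seeds[0]  # pick the seed whose preview you like best
final = generate(prompt, control_image, seed=best_seed, steps=20,
                 width=1024, height=1024, control_strength=0.8)
```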
Capabilities
- High-quality, photorealistic image generation at 1024×1024 with strong detail in skin, hair, and lighting, inherited from the Z-Image-Turbo base model.
- Multi-modal control using edges, HED, depth, pose, and line/MLSD, including union of multiple conditions in one model, enabling precise structural guidance.
- Robust pose-controlled human generation: community benchmarks report high recognizability and naturalness in celebrity- and idol-style portraits guided by reference poses.
- Strong performance on architectural and product renders when driven by line drawings or depth maps, useful for e‑commerce visuals and design mockups.
- Relatively low VRAM requirements for a 6B model with integrated ControlNet, allowing use on mid-range GPUs while still handling 1024px+ resolutions.
- Mixed Chinese/English prompt understanding, which is valuable for multilingual teams and datasets.
- Inpainting/editing mode (in ControlNet 2.0 variants) supporting local edits, background replacement, and iterative refinement workflows.
- Stable behavior at low CFG scales, which many users report as producing reliable and natural results even on challenging prompts.
- Fast inference relative to many comparable controllable diffusion models, especially at 1024px, though slower than base Turbo due to control overhead.
What Can I Use It For?
- Professional and commercial visuals:
- E‑commerce product images composed from line sketches or rough renders, ensuring exact pose/angle alignment while generating polished, photorealistic imagery.
- Architectural concept visuals generated from CAD-like line drawings or MLSD maps, retaining structural accuracy while exploring materials, lighting, and style.
- Marketing imagery and key visuals guided by compositional sketches, enabling art directors to fix layout while the model explores textures and moods.
- Media, film, and game production:
- Pre-visualization of scenes using pose and depth control to quickly generate storyboard-quality frames with consistent character poses.
- Game character and costume concept art driven by pose references, allowing designers to iterate on outfits while locking in character stance.
- Background and environment concepts built from depth or line guides for fast iteration in pre-production workflows.
- Creative and community projects:
- Pose-to-portrait transformations, such as turning photos or stick-figure poses into stylized or realistic character art, widely showcased in tutorial videos and community galleries.
- “From sketch to painting” workflows where artists feed their own line art plus prompts for color, material, and atmosphere.
- Fan art and celebrity-style images controlled by pose, where community experiments report strong subject recognizability and natural aesthetics.
- Business and industry specific uses:
- Fashion and apparel visualization by conditioning on pose skeletons plus prompts describing fabric, cut, and style, useful for catalog or lookbook ideation.
- Interior design moodboards generated from simple room layouts or depth maps, with prompts describing furniture style and lighting.
- Rapid A/B testing of visual campaigns where layout must remain fixed (from a wireframe or Canny edge map) while colors and styles are varied via prompts.
- Developer and research use:
- Automated pipelines that convert line drawings, CAD exports, or 3D renders into realistic imagery, leveraging ControlNet union for multi-modal control (a minimal end-to-end sketch follows this list).
- Benchmarking and research into controllable diffusion, as seen in community tests comparing Z-Image-Turbo-ControlNet to other open models on speed and recognizability.
- Training and evaluation of auxiliary preprocessors (depth, pose, edges) in real production-like workflows.
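As a sketch of such a pipeline, the preprocessing and prediction examples from earlier sections can be chained; the endpoint, payload fields, and parameter values remain assumptions, not a documented integration.

```python
import os
import cv2
import requests

# End-to-end sketch: line drawing -> Canny control map -> prediction request.
# Endpoint, payload fields, and parameter names are assumptions carried over
# from the earlier sketches, not a documented integration.
def render_from_line_drawing(drawing_path: str, prompt: str) -> str:
    gray = cv2.imread(drawing_path, cv2.IMREAD_GRAYSCALE)
    edges = cv2.Canny(gray, 100, 200)
    cv2.imwrite("control_map.png", edges)

    # In practice the control map would be uploaded or passed as a URL/base64;
    # a plain placeholder URL is used here.
    response = requests.post(
        "https://api.example.com/v1/predictions",
        headers={"Authorization": f"Bearer {os.environ['API_KEY']}"},
        json={
            "model": "z-image-turbo-controlnet",
            "input": {
                "prompt": prompt,
                "control_image": "https://example.com/control_map.png",
                "control_strength": 0.8,    # higher strength preserves CAD-like geometry
                "num_inference_steps": 12,
                "guidance_scale": 3.0,
            },
        },
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["id"]            # poll this ID as shown in the API & SDK section
```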
Things to Be Aware Of
- Experimental/advanced behaviors:
- The Fun/Union ControlNet design integrates multiple control modalities; it is powerful, but some community discussions note that combining many controls simultaneously can make behavior harder to predict and tune, requiring experimentation with strengths and step counts.
- Inpainting mode in 2.0 is relatively new; users report occasional blurring or subtle changes in unmasked regions, so it is best treated as an advanced feature needing careful masking and iterative testing.
- Quirks and edge cases:
- Complex hand poses or interactions (hands touching objects, overlapping limbs) may still produce artifacts or anatomically incorrect results, often requiring either lower control strength or additional manual post-processing.
- If the control image resolution or aspect ratio differs greatly from the target resolution, misalignment or stretching can occur; community workflows emphasize consistent resizing/scaling in the preprocessing pipeline (see the letterbox sketch after this list).
- Very noisy or overly detailed Canny/HED maps may cause “jittery” or over-etched edges, particularly in hair and fine textures; smoothing or threshold tuning is recommended.
- Performance considerations:
- User benchmarks indicate a noticeable slowdown when enabling ControlNet compared to the base Turbo: e.g., ~4 vs ~16 seconds at 1024px, and ~43 vs ~179 seconds at 2048px on one reported setup.
- Multi-condition control and high resolutions amplify VRAM usage; while still relatively efficient, some users with low-VRAM GPUs report needing reduced batch sizes, lower resolutions, or mixed-precision to avoid out-of-memory errors.
- Consistency factors:
- Seed sensitivity: once you find a seed and control combination that work well, lock the seed for reproducible iteration; small changes in control maps or prompts can produce significantly different outputs even with the same seed.
- Low CFG (2–3) has been highlighted in lab write-ups and user tests as a sweet spot for stability; higher CFG values may increase saturation and contrast but sometimes destabilize structure, especially under strong control.
- Positive user feedback themes:
- Many users praise the balance of control, realism, and speed, especially compared to heavier ControlNet setups that require much more VRAM or have slower inference.
- Community posts highlight strong pose-guided portraits and good architectural fidelity from line drawing inputs, as well as good multilingual prompt handling.
- Tutorials and benchmarks report that once correctly configured, Z-Image-Turbo-ControlNet is “surprisingly stable” at low CFG and medium step counts, making it practical for day-to-day creative work.
- Common concerns or negative patterns:
- Some users note the jump in render times when combining high resolution (2K), multiple control maps, and high step counts, which can feel at odds with the “Turbo” branding if expectations are not calibrated.
- A few reports mention that default sampler choices (e.g., basic Euler) underperform; higher-quality samplers are often necessary to unlock the model's full image-quality potential.
- Certain edge cases (hands, overlapping limbs, very busy backgrounds) may still require manual correction or multi-pass workflows, especially when using very rigid control settings.
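On the resolution/aspect-ratio point above, a simple way to keep a control map aligned with the target canvas is to letterbox it (scale to fit, then pad) instead of stretching it. A minimal Pillow sketch, with arbitrary target size and padding color:

```python
from PIL import Image

def letterbox(control_path: str, target=(1024, 1024), fill=0):
    """Resize a control map to the target canvas without changing its aspect ratio."""
    img = Image.open(control_path)
    scale = min(target[0] / img.width, target[1] / img.height)
    new_size = (round(img.width * scale), round(img.height * scale))
    resized = img.resize(new_size, Image.LANCZOS)
    canvas = Image.new(img.mode, target, fill)             # padded background
    offset = ((target[0] - new_size[0]) // 2, (target[1] - new_size[1]) // 2)
    canvas.paste(resized, offset)
    return canvas

letterbox("control_depth.png").save("control_depth_1024.png")
```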
Limitations
- While faster and lighter than many ControlNet-based models, Z-Image-Turbo-ControlNet still incurs a significant speed and VRAM cost compared to the base Turbo model, especially at resolutions beyond 1024×1024 or when using multiple control conditions simultaneously.
- Structural control is strong but not perfect: complex poses, hands, and intricate object interactions can still produce artifacts or require careful tuning of control strength, steps, and prompts to avoid distortions.
- Inpainting and advanced multi-condition fusion are relatively new and may exhibit inconsistent behavior or require non-trivial workflow setup and experimentation, making them less ideal for situations demanding strict, production-grade determinism without manual oversight.
