Z-IMAGE
LoRA training for Z-Image models, allowing quick style and identity fine-tuning with stable, high-quality results.
Avg Run Time: 700.000s
Model Slug: z-image-trainer
Release Date: December 8, 2025
Playground
Input
Enter a URL or choose a file from your computer.
(Max 50MB)
Output
Example Result
Preview and download your result.
API & SDK
Create a Prediction
Send a POST request to create a new prediction. This will return a prediction ID that you'll use to check the result. The request should include your model inputs and API key.
Get Prediction Result
Poll the prediction endpoint with the prediction ID until the result is ready. The API uses long-polling, so you'll need to repeatedly check until you receive a success status.
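The create-then-poll flow above can be sketched as a small helper. The response shape (a dict with a `"status"` key and `"success"`/`"failed"` terminal values) is an assumption here, not the documented API schema; in practice `get_status` would wrap an HTTP GET to the prediction endpoint with your prediction ID and API key.

```python
import time

def poll_prediction(get_status, interval=2.0, timeout=600):
    """Repeatedly check a prediction until it reaches a terminal status.

    get_status: zero-argument callable returning a dict with a "status" key
    (hypothetical shape; adapt to the actual API response).
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        result = get_status()
        # "success"/"failed" are assumed terminal statuses; adjust as needed.
        if result.get("status") in ("success", "failed"):
            return result
        time.sleep(interval)
    raise TimeoutError("prediction did not finish within the timeout")
```

Injecting the fetch function keeps the polling logic testable and independent of any particular HTTP client.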
Readme
Overview
Z-Image-Trainer (often referred to as a Z-Image LoRA trainer or Z-Image Turbo LoRA trainer in public documentation) is a LoRA fine-tuning workflow built specifically for the Z-Image family of image generation models from Tongyi-MAI, with a primary focus on the Z-Image-Turbo variant. It enables users to quickly train lightweight LoRA (Low-Rank Adaptation) adapters that encode custom styles, characters, identities, or domain-specific concepts while keeping the base Z-Image model frozen and stable. The trainer is generally used by technical creators, ML practitioners, and production teams that need consistent visual behavior tied to a specific style or subject, without paying the full cost of end-to-end model fine-tuning.
Under the hood, Z-Image-Trainer leverages the Z-Image architecture (a 6B-parameter Scalable Single-Stream Diffusion Transformer, S3-DiT) and, in most public guides, its Turbo distilled variant designed for very fast inference at 1024×1024 and beyond on consumer GPUs. The trainer exposes configuration options such as LoRA rank, learning rate, training steps, and training “mode” (content vs style vs balanced), and it includes task-specific optimizations like Turbo-aware fine-tuning (step-aware learning rate, safe ranks, and scaling) to maintain output quality at low sampling steps. Community guides and user reports emphasize its ability to reach strong style or identity capture with relatively small datasets and short runs, while preserving the base model’s speed and generalization.
Technical Specifications
- Architecture: Scalable Single-Stream Diffusion Transformer (S3-DiT) base (Z-Image 6B), typically using the Z-Image-Turbo distilled variant for LoRA training.
- Parameters: Base model approximately 6B parameters; LoRA adapters are low-rank matrices (rank commonly 8–32 or higher depending on configuration) added on top of attention and/or MLP layers.
- Resolution:
- Native training and inference commonly at 1024×1024 for Z-Image-Turbo, with support for flexible resolutions up to around 4 megapixels at inference.
- LoRA training guides generally assume 1024×1024 samples during training for best alignment with the base model.
- Input formats:
- Training: ZIP archive or structured dataset of images (PNG/JPEG) plus optional per-image text captions, or a default caption applied across images.
- Captions: Plain text; either one shared caption or individual .txt files per image, each sharing its image's filename root (the ROOT.txt naming pattern described in public guides).
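A minimal sketch of packaging a training set in the format described above: images plus one sidecar .txt caption per image, zipped together. The side-by-side layout is an assumption drawn from the per-image caption convention; verify against the trainer's actual expected structure before uploading.

```python
import zipfile
from pathlib import Path

def build_dataset_zip(image_dir, captions, zip_path):
    """Package PNG/JPEG images and per-image .txt captions into a ZIP.

    captions: dict mapping an image's filename stem to its caption string.
    Images without a caption entry are included uncaptioned (a default
    caption could then apply across them).
    """
    image_dir = Path(image_dir)
    with zipfile.ZipFile(zip_path, "w") as zf:
        for img in sorted(image_dir.glob("*")):
            if img.suffix.lower() not in (".png", ".jpg", ".jpeg"):
                continue
            zf.write(img, img.name)
            caption = captions.get(img.stem)
            if caption:
                # Sidecar caption shares the image's filename root.
                zf.writestr(img.stem + ".txt", caption)
    return zip_path
```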
- Output formats:
- LoRA weights compatible with common diffusion toolchains (e.g., Diffusers-style LoRA weight files plus JSON or equivalent config describing target modules and ranks).
- One or more LoRA adapters (e.g., base adapter, optional Turbo-aware adapter) depending on trainer configuration.
- Performance metrics (from public descriptions and user feedback, rather than formal benchmarks):
- Base model: 6B S3-DiT, Turbo variant using ~8 effective denoising steps and guidance distillation for rapid sampling while maintaining quality.
- Training speed (community reports): approximately 2–3 seconds per iteration on 12 GB GPU setups, with ~2000 steps taking about 1–2 hours.
- Dataset efficiency: decent style/identity capture often reported with 10–30 images, with diminishing returns past ~50 images unless covering a very broad stylistic range.
Key Considerations
- Z-Image-specific LoRA design:
- The trainer is tuned specifically for Z-Image / Z-Image-Turbo; LoRAs trained here are not drop-in compatible with unrelated diffusion backbones.
- Dataset design:
- Use diverse poses, lighting, and backgrounds so the LoRA learns the concept (character/style) rather than overfitting to specific scenes.
- For identity LoRAs, users report best results with 10–25 well-curated images; more is not always better if redundancy is high.
- Captioning strategy:
- Consistent and descriptive captions help separate the concept token (e.g., a unique name) from generic attributes.
- Avoid noisy or incorrect captions; community feedback notes that inaccurate captions can cause the LoRA to entangle unwanted attributes.
- Training steps vs overfitting:
- Many users report that Z-Image “learns hot,” reaching usable quality quickly; 1500–3000 steps is often enough, with smaller datasets (<10 images) favoring 1500–2200 steps to avoid overfitting.
- Rank and regularization:
- Higher LoRA ranks increase capacity but can lead to overspecialization and heavier memory usage; moderate ranks (e.g., 8–32) are common starting points.
- Quality vs speed trade-offs:
- At inference, Z-Image-Turbo uses 1–8 steps; LoRA users typically adopt 6–8 steps for final-quality outputs and fewer steps for previews.
- Very aggressive step reduction may slightly weaken fine detail from LoRA-driven styles or identities.
- Prompt engineering:
- Introduce a unique trigger token or phrase in training captions and reuse it in prompts to reliably invoke the trained style or subject.
- Combine the trigger with clear artistic, compositional, and lighting instructions; users report that Z-Image responds well to specific style descriptors (camera, lens, lighting, medium).
- Generalization vs specificity:
- To maintain generalization, mix training images across varied contexts and avoid oversaturating the dataset with near-duplicates.
- Monitor validation samples during training and stop when likeness/style looks right but before backgrounds and compositions become too “locked in.”
Tips & Tricks
- Recommended starting training configuration (from public guides and community usage):
- Model: Z-Image-Turbo with its built-in training adapter.
- Image count: 10–30 carefully curated images for a character or style; up to ~50 if covering a broad stylistic range.
- Steps:
- 1500–2200 steps for very small datasets (<10 images).
- 2000–3000 steps for typical character/style LoRAs.
- Learning rate: around 1e-4 as a default; lower if you see overfitting or detail artifacts, slightly higher if underfitting.
- Rank: start with a moderate LoRA rank (e.g., 8–16) and only increase if the style is not being captured sufficiently.
- Batch size: adjust to available VRAM; users with 12 GB report managing LoRA training at reasonable speeds with modest batch sizes.
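The starting configuration above can be expressed as a config sketch. The parameter names here are hypothetical and need to be mapped to the trainer's actual input schema; the values follow the community-reported defaults listed above.

```python
# Hypothetical parameter names; map to the trainer's actual input schema.
DEFAULT_CONFIG = {
    "model": "z-image-turbo",
    "steps": 2000,            # 1500-2200 for <10 images; 2000-3000 typical
    "learning_rate": 1e-4,    # lower if overfitting, slightly higher if underfitting
    "lora_rank": 16,          # start 8-16; increase only if style isn't captured
    "batch_size": 1,          # raise only if VRAM allows
    "trigger_token": "zimg_style",  # unique token reused in prompts
}

def adjust_for_dataset(config, num_images):
    """Pick a step count in the recommended band for the dataset size."""
    cfg = dict(config)
    # Small datasets favor fewer steps to avoid overfitting.
    cfg["steps"] = 1800 if num_images < 10 else 2500
    return cfg
```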
- Prompt structuring advice:
- Always include the unique LoRA trigger token plus the desired subject and style modifiers, e.g., “[trigger] portrait, 35mm photo, soft studio lighting, high detail, 8k.”
- Put the concept token early in the prompt, follow it with global style and composition, and place subtler modifiers (textures, accessories) toward the end.
- If generations drift away from the trained style, increase emphasis on the trigger token (e.g., repeating it or using emphasis syntax where supported) or reduce competing style terms.
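The ordering advice above (trigger first, then global style, then subtle modifiers) can be captured in a small prompt builder. This is an illustrative helper, not part of any official SDK.

```python
def build_prompt(trigger, subject, style_mods, extras=()):
    """Assemble a prompt with the trigger token early, global style next,
    and subtler modifiers last, comma-separated."""
    parts = [f"{trigger} {subject}"] + list(style_mods) + list(extras)
    return ", ".join(parts)

# build_prompt("zimg_style", "portrait",
#              ["35mm photo", "soft studio lighting"], ["high detail"])
# -> "zimg_style portrait, 35mm photo, soft studio lighting, high detail"
```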
- Achieving specific results:
- Consistent character across poses:
- Ensure training data includes multiple angles, facial expressions, and outfits.
- At inference, specify pose and camera angle clearly; community examples show strong pose control when prompts are explicit.
- Strong artistic style transfer:
- Use style-focused training mode when available and ensure captions describe the style attributes (painterly, color palette, brushwork, medium).
- During inference, combine the trigger token with generic content requests (e.g., “a city street at night in [trigger] style”) to test generalization.
- Background and composition control:
- During training, include varied backgrounds to avoid hard-coding one environment.
- At inference, explicitly request the desired environment (“in a forest,” “studio backdrop,” “cinematic wide shot”).
- Iterative refinement strategies:
- Sample every 200–300 steps during training and visually inspect outputs to find the “sweet spot” before overfitting.
- If faces or hands degrade late in training, roll back to an earlier checkpoint where likeness was good but details were cleaner.
- For difficult subjects, run a shorter first training to gauge behavior (e.g., 800–1200 steps) and then extend or restart with adjusted parameters.
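The roll-back strategy above (keep the checkpoint where likeness was good but before overfitting sets in) can be sketched as a selection rule: among sampled checkpoints, prefer the earliest one whose quality is close to the best observed. The scoring and the 2% tolerance are illustrative assumptions; in practice the "score" is usually a visual judgment.

```python
def pick_checkpoint(scores, tolerance=0.02):
    """scores: list of (step, likeness_score) from periodic sampling.

    Returns the earliest checkpoint within `tolerance` of the best score,
    favoring earlier (less overfit) checkpoints over marginally better
    late ones.
    """
    best = max(s for _, s in scores)
    for step, s in sorted(scores):  # sorted by step, earliest first
        if s >= (1.0 - tolerance) * best:
            return step
```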
- Advanced techniques:
- Mixed-mode datasets:
- Combine a core identity/stylistic set with a few “negative” or neutral images and captions to keep the LoRA from hard-coding unwanted features.
- Multi-concept LoRA:
- Some users report success training multiple closely related concepts into one LoRA (e.g., same character across outfits), but this requires very clear captioning to distinguish sub-concepts.
- Turbo-aware fine-tuning:
- Use Turbo-aware settings (step-aware LR, safe ranks) where supported so that the LoRA remains stable at 1–8 inference steps.
Capabilities
- High-quality style and identity capture:
- Can encode fine-grained artistic styles, brand looks, and character identities while leveraging the strong base Z-Image prior.
- Efficient training on consumer GPUs:
- 6B S3-DiT with Turbo distillation and LoRA adaptation allows users on ~12 GB GPUs to train in 1–2 hours for typical step counts.
- Fast, low-step inference:
- Turbo-compatible LoRAs preserve high visual quality at 6–8 steps and usable results even at fewer steps, enabling rapid iteration.
- Flexible concept control:
- Supports content-focused, style-focused, or balanced training to bias the LoRA toward identity preservation or stylistic transformation.
- Strong generalization when trained correctly:
- When datasets are diverse and captions are clean, users report that trained styles and characters transfer to new scenes, poses, and compositions well.
- Lightweight deployment:
- LoRA weights are small compared to full model fine-tunes, making them easy to store, version, and swap in workflows.
- Good prompt responsiveness:
- Z-Image’s single-stream transformer and guidance-distilled Turbo design respond well to detailed textual prompts, enabling fine control over lighting, camera, and composition layered on top of the LoRA.
What Can I Use It For?
- Professional applications (from public guides, blog-like resources, and user reports):
- Brand-consistent marketing and product imagery, where a LoRA encodes a company’s visual language, color palette, or mascot so all generated assets match a unified look.
- Editorial illustration and concept art pipelines that need rapid iterations of a consistent style across multiple scenes and stories.
- Production workflows where a recurring character or IP must appear consistently in many images without manually re-designing each shot.
- Creative projects (community examples and tutorials):
- Personal character LoRAs for comics, web novels, and RPG content, enabling consistent protagonists across hundreds of panels or scenes.
- Fine-tuned artistic styles mimicking particular aesthetics (e.g., watercolor, cyberpunk neon, ink sketch) trained from small, curated datasets.
- Fan art pipelines where users train LoRAs on their own drawings or photos to generate new compositions in their personal style.
- Business and industry use cases:
- E-commerce and catalog imagery with a consistent stylistic treatment across categories (e.g., same lighting/angle/grade for product shots), using LoRAs to encode the “house style.”
- Previsualization and mood boards for advertising, film, and game design, where teams need to align quickly on a specific visual direction.
- Lightweight domain adaptation, such as making Z-Image more reliable on certain verticals (e.g., medical diagrams, architectural interiors) when provided with small domain datasets.
- Open-source and developer projects (GitHub, community tools):
- Integrated ComfyUI and similar node-based workflows where developers wire Z-Image-Turbo with interchangeable LoRA nodes for style packs and character libraries.
- Automation scripts that batch-generate large numbers of images for datasets, social media posts, or prototyping once a LoRA has locked in the desired style.
- Industry-specific examples from discussions:
- Publishing and cover design, where a LoRA ensures that series covers maintain a recognizable shared aesthetic.
- Game asset ideation, with LoRAs encoding faction-specific looks, armor sets, or environmental styles.
- Fashion and apparel mockups where a LoRA captures a brand’s photography style and applies it to new garments or models.
Things to Be Aware Of
- Experimental and model-specific behavior:
- Z-Image’s single-stream architecture and Turbo distillation lead to faster learning but also a tendency to overfit if steps and ranks are set too high, especially on small datasets.
- Turbo-aware fine-tuning settings are specialized; using generic LoRA hyperparameters from other diffusion models without adjustment can yield suboptimal results.
- Known quirks and edge cases (from community feedback):
- Users report that very small datasets (<8–10 images) can cause the LoRA to memorize exact poses and backgrounds, reducing variety in outputs.
- If captions mix multiple concepts without clear structure, the LoRA may entangle them, producing inconsistent or “blended” results.
- In some cases, very intense or niche art styles can dominate the base model’s prior so strongly that generic prompts still carry residual style traits unless the LoRA weight is reduced at inference.
- Performance and resource considerations:
- Training speed depends heavily on GPU VRAM and memory bandwidth; community reports on 12 GB cards show ~2–3 seconds per step, but older or smaller GPUs can be slower.
- High-rank LoRAs and large batch sizes can push memory usage close to VRAM limits; careful tuning of batch size and gradient accumulation is recommended.
- Consistency and stability factors:
- To maintain consistent character identity, it is important to avoid noisy or off-model training images; a few bad samples can noticeably degrade likeness.
- Some users note that as training proceeds past an optimal point, backgrounds and composition become repetitive and the model loses diversity; early checkpoint selection mitigates this.
- Positive feedback themes:
- Many users highlight how fast Z-Image-based LoRA training converges and how few images are required for convincing character or style capture compared to older diffusion backbones.
- The combination of a 6B model and Turbo distillation is frequently praised for feeling “lightweight but powerful,” suitable for consumer hardware.
- LoRAs trained with this workflow tend to integrate smoothly into existing pipelines (e.g., node-based UIs and automated scripts) due to standard weight formats.
- Common concerns or negative feedback:
- Overfitting and loss of diversity when users push step counts or ranks too high relative to dataset size.
- Sensitivity to caption quality; incorrect or overly complex captions are a recurring source of unexpected behavior.
- Occasional artifacts in fine details (e.g., hands, small text) when very low inference step counts are used, especially if training did not use Turbo-aware settings.
Limitations
- Primary technical constraints:
- The trainer is tightly coupled to the Z-Image architecture (especially Z-Image-Turbo); LoRAs are not generally portable to other diffusion models, limiting cross-model reuse.
- LoRA-based fine-tuning cannot fundamentally change the base model’s capabilities or biases; it adjusts style and content priors but remains bounded by Z-Image’s underlying distribution.
- Main scenarios where it may not be optimal:
- Very large-scale domain adaptation or tasks requiring deep architectural changes (e.g., highly specialized scientific imaging) may be better served by full model fine-tuning rather than LoRA adapters.
- Extremely small or noisy datasets, or tasks demanding perfect photorealism in edge cases (e.g., hands with complex interactions, fine typography) may expose limitations of both the base model and LoRA training, especially at low inference step counts.
Pricing
Pricing Type: Dynamic
Charged at $0.00226 per training step.
Pricing Rules
| Parameter | Rule Type | Base Price |
|---|---|---|
| steps | Per unit (e.g., 1000 steps × $0.00226 = $2.26) | $0.00226 |
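Per the table above, the charge scales linearly with step count. A quick estimator:

```python
PRICE_PER_STEP = 0.00226  # USD, from the pricing table above

def training_cost(steps):
    """Estimated charge in USD for a run of `steps` training steps."""
    return round(steps * PRICE_PER_STEP, 2)

# training_cost(1000) -> 2.26
```

A typical 2000-3000 step run therefore lands in roughly the $4.50-$6.80 range.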
