flux-2-klein-9b-base-edit

FLUX-2

FLUX.2 [klein] 9B Base from Black Forest Labs supports precise image-to-image editing with natural-language instructions and hex-color-based control.

Avg Run Time: 10.000s

Model Slug: flux-2-klein-9b-base-edit

Playground

Your request will cost $0.002 per megapixel for output.
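As a quick sanity check on cost, a minimal sketch of the arithmetic (the $0.002-per-megapixel rate comes from the pricing note above; the resolution is just an example):

```python
# Estimate output cost at $0.002 per megapixel (rate from the pricing note above).
def estimate_cost(width: int, height: int, rate_per_megapixel: float = 0.002) -> float:
    megapixels = (width * height) / 1_000_000
    return megapixels * rate_per_megapixel

# Example: a 1024x1024 output is ~1.05 MP, so it costs roughly $0.0021.
print(f"${estimate_cost(1024, 1024):.4f}")
```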

API & SDK

Create a Prediction

Send a POST request to create a new prediction. This will return a prediction ID that you'll use to check the result. The request should include your model inputs and API key.
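A minimal sketch of the creation step in Python using the requests library; the endpoint URL, header name, and input field names below are assumptions for illustration and should be checked against the API reference:

```python
import requests

API_KEY = "YOUR_API_KEY"  # your Eachlabs API key

# Assumed endpoint and payload shape; verify against the API reference.
response = requests.post(
    "https://api.eachlabs.ai/v1/prediction",
    headers={"X-API-Key": API_KEY, "Content-Type": "application/json"},
    json={
        "model": "flux-2-klein-9b-base-edit",
        "input": {
            "image": "https://example.com/source.png",  # image to edit
            "prompt": "Change the jacket color to #1E90FF",
        },
    },
)
response.raise_for_status()
prediction_id = response.json()["predictionID"]  # field name may differ
print(prediction_id)
```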

Get Prediction Result

Poll the prediction endpoint with the prediction ID until the result is ready. The API is asynchronous, so you'll need to check repeatedly until you receive a success status.
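A matching polling sketch; again, the endpoint path and the status/output field names are assumptions to confirm against the API reference:

```python
import time
import requests

API_KEY = "YOUR_API_KEY"
prediction_id = "PREDICTION_ID_FROM_THE_CREATE_STEP"

# Assumed result endpoint and response fields; verify in the API reference.
while True:
    result = requests.get(
        f"https://api.eachlabs.ai/v1/prediction/{prediction_id}",
        headers={"X-API-Key": API_KEY},
    ).json()
    if result.get("status") == "success":
        print(result.get("output"))  # URL of the edited image
        break
    if result.get("status") == "error":
        raise RuntimeError(result)
    time.sleep(1)  # wait before checking again
```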

Readme

Table of Contents
Overview
Technical Specifications
Key Considerations
Tips & Tricks
Capabilities
What Can I Use It For?
Things to Be Aware Of
Limitations

Overview

FLUX.2 [klein] 9B Base is a compact image generation and editing model developed by Black Forest Labs, released in January 2026. It is a 9-billion parameter rectified flow transformer that unifies text-to-image generation, single-image editing, and multi-reference image editing capabilities in a single architecture. The model combines a 9B flow model with an 8B Qwen3 text embedder, positioning it as a flagship small model on the Pareto frontier for quality versus latency across multiple image generation and editing tasks. The 9B Base variant is an undistilled, full-capacity foundation model that preserves the complete training signal, making it ideal for fine-tuning, LoRA training, research, and custom pipelines where control and flexibility matter more than raw speed. Unlike distilled variants that use only 4 inference steps, the Base model employs 25-50 step sampling schedules, enabling maximum customization and output diversity.

The model represents a significant advancement in making high-quality image generation and editing accessible on consumer hardware. It delivers state-of-the-art quality with end-to-end inference while maintaining practical resource requirements for professional and research applications. The architecture supports natural-language instructions for precise image-to-image editing, allowing users to modify images with detailed prompts while maintaining coherence across multiple reference images. The 9B variants, including [klein] 9B Base, are released under the FLUX Non-Commercial License (NCL); the smaller 4B models are available under Apache 2.0.

Technical Specifications

  • Architecture: Rectified flow transformer with a unified generation and editing pipeline
  • Parameters: 9 billion (flow model) plus an 8-billion-parameter Qwen3 text embedder
  • Resolution: 1024x1024 used for benchmarks; capable of high-resolution image generation
  • Input/output formats: Text prompts for generation, text instructions for image editing, and multi-reference image inputs
  • Inference steps: 25-50 (configurable range) for the Base model
  • VRAM requirements: 24GB or more recommended for optimal performance on consumer GPUs
  • Quantization support: FP8 provides up to 1.6x faster inference with up to 40% less VRAM; NVFP4 offers up to 2.7x faster inference with up to 55% less VRAM
  • Inference latency: Approximately 0.5 to 2 seconds end-to-end on high-end consumer hardware (RTX 5090); several seconds on standard consumer GPUs
  • Performance: Matches or exceeds Qwen-based image models at a fraction of the latency and VRAM; outperforms Z-Image while supporting unified text-to-image and multi-reference editing
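As a rough, back-of-the-envelope illustration of what the quantization figures above imply for the 24GB baseline (actual savings depend on resolution, batch size, and runtime):

```python
# Estimated VRAM after applying the "up to" reductions quoted above to a 24GB baseline.
baseline_vram_gb = 24.0

for name, reduction in [("FP8 (up to 40% less)", 0.40), ("NVFP4 (up to 55% less)", 0.55)]:
    print(f"{name}: ~{baseline_vram_gb * (1 - reduction):.1f} GB")
# FP8   -> ~14.4 GB
# NVFP4 -> ~10.8 GB
```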

Key Considerations

  • The 9B Base model prioritizes maximum flexibility and output diversity over speed compared to distilled variants, making it better suited for applications requiring fine-tuning and custom control rather than real-time interactive use.
  • VRAM requirements of 24GB or more mean this variant is optimized for high-end consumer GPUs like RTX 4090 or professional hardware; users with lower VRAM should consider the 4B variants or distilled models.
  • The model demonstrates high resilience against violative inputs in complex generation and editing tasks, with safety fine-tuning and third-party evaluation completed prior to release.
  • Multi-reference image editing with the 9B Base model produces more coherent results across multiple reference images compared to smaller variants, though careful prompting and sometimes multiple renders may be necessary for optimal results.
  • The 25-50 step sampling schedule provides a configurable range, allowing users to balance between inference speed and output quality based on their specific requirements.
  • Prompt engineering for image editing should include detailed natural-language instructions specifying desired modifications; the model responds well to specific color descriptions and detailed editing directives.
  • The model requires correct text encoder configuration to avoid shape mismatch errors during inference; using the appropriate Qwen3 text embedder is critical for proper operation.

Tips & Tricks

  • For optimal quality in image editing tasks, use the full 50 steps rather than dropping toward the lower end of the 25-50 range, as higher step counts provide maximum detail and fidelity in edited outputs.
  • When performing multi-reference image editing, structure prompts to clearly specify which elements from reference images should be incorporated and how they should be blended with the base image.
  • Leverage the model's flexibility for fine-tuning by using LoRA training on domain-specific datasets to customize the model for particular artistic styles or specialized applications.
  • For color-based control in image editing, use specific hex color codes in prompts rather than vague color descriptions to achieve more precise results (see the sketch after this list).
  • Experiment with different step counts within the 25-50 range to find the optimal balance for your specific use case; higher steps generally produce more refined details but increase inference time.
  • When using the model for research or custom pipelines, take advantage of the undistilled architecture by accessing intermediate representations and model internals for advanced applications.
  • Structure multi-image editing prompts hierarchically, specifying primary edits first, then secondary modifications, to improve coherence across reference images.
  • Use the model's unified architecture to perform iterative editing workflows, generating base images and then refining them through successive editing operations within a single pipeline.
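Building on the hex-color and step-count tips above, a minimal, hypothetical input sketch; the field names (image, prompt, num_inference_steps) are assumptions and may differ from the model's actual input schema:

```python
# Hypothetical input payload illustrating hex-color control and a full 50-step schedule.
inputs = {
    "image": "https://example.com/product-photo.png",  # image to edit
    "prompt": "Recolor the sneakers to #FF5733 and keep the background unchanged",
    "num_inference_steps": 50,  # upper end of the 25-50 range for maximum detail
}
```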

Capabilities

  • Unified text-to-image generation and image editing in a single model architecture, eliminating the need for separate specialized models.
  • High-quality photorealistic image generation with exceptional output diversity, particularly in base model variants that preserve complete training signal.
  • Precise image-to-image editing with natural-language instructions, allowing detailed modifications to existing images through descriptive prompts.
  • Multi-reference image generation and editing, enabling the model to incorporate elements from multiple reference images into a single coherent output.
  • Fine-tuning and LoRA training capabilities due to the undistilled, full-capacity foundation model architecture.
  • Flexible inference configuration with adjustable step counts (25-50 range) to balance quality and speed based on application requirements.
  • Support for quantized variants (FP8 and NVFP4) that maintain quality while reducing computational requirements and VRAM usage.
  • State-of-the-art quality on the Pareto frontier for quality versus latency and VRAM efficiency compared to other compact image models.
  • Robust safety features with demonstrated high resilience against violative inputs in complex generation and editing tasks.

What Can I Use It For?

  • Professional image editing workflows where users need precise control over modifications and the ability to fine-tune the model for specific artistic styles or requirements.
  • Research applications in computer vision and generative modeling, leveraging the undistilled architecture for access to complete training signals and model internals.
  • Custom pipeline development for specialized image generation and editing tasks, with the flexibility to integrate the model into complex workflows.
  • Fine-tuning for domain-specific applications such as architectural visualization, product design mockups, or specialized artistic styles documented in technical research.
  • Multi-reference image composition projects where multiple source images need to be intelligently combined into coherent outputs.
  • Iterative creative workflows where users generate base images and progressively refine them through successive editing operations.
  • High-quality image generation for content creation where output diversity and photorealistic quality are prioritized over generation speed.
  • Educational and experimental projects exploring advanced image generation techniques and model customization.

Things to Be Aware Of

  • The 9B Base model is more VRAM-intensive than smaller variants, requiring 24GB or more of GPU memory for optimal performance; users with limited VRAM should consider the 4B variants or distilled models instead.
  • Multi-reference image editing can be inconsistent and may require multiple renders and careful prompt engineering to achieve desired coherence across reference images, particularly when attempting complex compositions.
  • The model uses 25-50 inference steps by default, resulting in longer generation times (several seconds on standard consumer hardware) compared to distilled variants that use only 4 steps; this makes it less suitable for real-time interactive applications.
  • Correct text encoder configuration is critical; using the wrong Qwen3 text embedder variant can result in shape mismatch errors that prevent inference.
  • The 9B Base model is released under the FLUX Non-Commercial License (NCL), restricting commercial use; users requiring commercial licensing should verify terms or consider alternative models.
  • Output quality and consistency can vary based on prompt specificity; vague or poorly structured prompts may result in less coherent editing results, particularly in multi-reference scenarios.
  • The model's flexibility and undistilled architecture make it more suitable for research and custom applications than for straightforward production use; users prioritizing speed and simplicity may benefit from distilled variants.
  • Fine-tuning and LoRA training require significant computational resources and expertise; these advanced capabilities are best suited for users with technical backgrounds and access to adequate hardware.

Limitations

  • The 9B Base model's 25-50 step inference requirement results in significantly longer generation times compared to distilled variants, making it less practical for real-time or interactive applications where sub-second latency is critical.
  • High VRAM requirements (24GB or more) limit accessibility to users with high-end consumer GPUs or professional hardware; this restricts deployment options compared to more efficient models.
  • Multi-reference image editing consistency remains a challenge, with the model sometimes producing incoherent results across multiple reference images even with careful prompting, requiring iterative refinement and multiple render attempts.