FLUX.2 [klein] 4B
FLUX.2 [klein] 4B from Black Forest Labs delivers text-to-image generation with enhanced realism, sharper text rendering, and integrated native editing tools.
Avg Run Time: 7.000s
Model Slug: flux-2-klein-4b-base-text-to-image

API & SDK
Create a Prediction
Send a POST request to create a new prediction. This will return a prediction ID that you'll use to check the result. The request should include your model inputs and API key.
Get Prediction Result
Poll the prediction endpoint with the prediction ID until the result is ready, checking repeatedly until the response reports a success status.
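The create-then-poll workflow described above can be sketched in Python using only the standard library. The base URL, endpoint paths, and request/response field names (`id`, `status`, `input`, `num_inference_steps`, `guidance_scale`) are assumptions for illustration; substitute your provider's actual schema.

```python
import json
import time
import urllib.request

API_BASE = "https://api.example.com/v1"  # hypothetical base URL; use your provider's endpoint
MODEL_SLUG = "flux-2-klein-4b-base-text-to-image"

def build_payload(prompt, steps=28, cfg_scale=5.0):
    # Field names are illustrative; check the provider's input schema.
    return {"model": MODEL_SLUG,
            "input": {"prompt": prompt,
                      "num_inference_steps": steps,
                      "guidance_scale": cfg_scale}}

def _request(url, api_key, data=None):
    # POST when a JSON body is given, otherwise GET.
    req = urllib.request.Request(
        url,
        data=json.dumps(data).encode() if data is not None else None,
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
        method="POST" if data is not None else "GET")
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

def create_prediction(api_key, payload):
    """POST the model inputs; returns the prediction ID to poll."""
    return _request(f"{API_BASE}/predictions", api_key, data=payload)["id"]

def poll_result(api_key, prediction_id, interval=1.0, timeout=120.0):
    """Check the prediction repeatedly until it reports success or fails."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        body = _request(f"{API_BASE}/predictions/{prediction_id}", api_key)
        if body.get("status") == "success":
            return body
        if body.get("status") == "failed":
            raise RuntimeError(body.get("error", "prediction failed"))
        time.sleep(interval)
    raise TimeoutError("prediction did not complete in time")
```

A typical call chains the two steps: `result = poll_result(key, create_prediction(key, build_payload("a red fox at dusk")))`.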
Readme
Overview
FLUX.2 [klein] 4B base is a compact text-to-image generation model developed by Black Forest Labs as part of their FLUX.2 [klein] family, designed for interactive visual intelligence with sub-second inference capabilities. This 4B parameter base variant supports unified text-to-image generation, single-reference image editing, and multi-reference generation in a single architecture, delivering photorealistic outputs, high diversity, and sharp text rendering while running efficiently on consumer GPUs.
The model employs a rectified flow transformer architecture paired with a Qwen3 text embedder for strong prompt understanding and spatial logic; the distilled variants are step-distilled, while the base form offers full flexibility with 25-50 step sampling schedules. What sets it apart is its Pareto-frontier performance in quality versus latency and VRAM: it matches or exceeds larger models such as Qwen-based systems in Elo benchmarks for text-to-image and editing tasks, with native support for editing tools and professional-grade character consistency at low resource cost.
Licensed under Apache 2.0, the 4B base model is optimized for research, fine-tuning, and custom pipelines, enabling high-quality outputs on hardware like RTX 3090/4070 with around 13GB VRAM, making advanced image generation accessible without enterprise-level setups.
Technical Specifications
- Architecture: Rectified flow transformer with Qwen3 text embedder (qwen34b.safetensors for 4B)
- Parameters: 4B
- Resolution: 1024x1024 natively, with support for photorealistic images up to 4MP
- Input/Output formats: Text prompts for generation/editing; supports single/multi-reference images; outputs images
- Performance metrics: Base uses 25-50 steps; ~13GB VRAM on RTX 3090/4070; sub-second inference potential with distillation (0.3-1.2s on RTX 5090 for distilled); Elo benchmark leader in quality vs latency/VRAM for T2I, single/multi-reference tasks
Key Considerations
- Use the correct text encoder (e.g., qwen34b.safetensors) to avoid shape mismatch errors during inference
- Base model requires more steps (25-50) than distilled (4 steps), trading speed for flexibility and fine-tuning potential
- Optimal on GPUs with 12GB+ VRAM; quantized FP8 variants reduce VRAM by up to 40% and speed up inference
- Prompt engineering benefits from detailed descriptions for anatomy and details, as base excels in customization but may over-process at high steps
- Balance CFG scale around 5.0 for base models to maintain quality without artifacts
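The step and CFG trade-offs above can be captured as simple presets. This is a minimal sketch: the parameter names follow common diffusion-pipeline conventions rather than an official API, and the distilled guidance value is an assumption (step-distilled models typically run with little or no CFG).

```python
# Illustrative sampling presets reflecting the guidance above; parameter
# names are conventional, not an official schema.
PRESETS = {
    "base": {                        # flexible base mode: 25-50 steps, CFG ~5.0
        "num_inference_steps": 28,
        "guidance_scale": 5.0,
    },
    "base_max_quality": {
        "num_inference_steps": 50,   # watch for over-processing at the high end
        "guidance_scale": 5.0,
    },
    "distilled": {                   # step-distilled variant: 4 steps for quick previews
        "num_inference_steps": 4,
        "guidance_scale": 1.0,       # assumption: distilled variants often use low/no CFG
    },
}

def pick_preset(need_speed: bool, max_quality: bool = False) -> dict:
    """Choose a preset: distilled for speed, base otherwise."""
    if need_speed:
        return PRESETS["distilled"]
    return PRESETS["base_max_quality"] if max_quality else PRESETS["base"]
```

If outputs look over-processed at 50 steps, step the base preset down toward 25-30 before touching CFG.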
Tips & Tricks
- For text-to-image, start with 25-50 steps in base mode and CFG 5.0; iteratively refine by adjusting steps downward if over-processing occurs
- Structure prompts with clear subject, style, and spatial details (e.g., "photorealistic portrait of a person in a cityscape, sharp focus on face") to leverage Qwen3 encoder strengths
- Achieve editing by providing single/multi-reference images alongside text; use distilled for quick previews, base for precise control
- Quantize to FP8 or NVFP4 for 1.6-2.7x speed gains on supported hardware while preserving quality
- Refine outputs iteratively: generate base, edit with multi-reference, upscale for 4MP results; test prompts on distilled first to save time
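The prompt structure suggested above (clear subject, then style and spatial details) can be sketched as a small helper. The function is entirely illustrative; only the comma-separated composition order reflects the tip.

```python
def build_prompt(subject, style=None, spatial=None, extras=()):
    """Compose a structured prompt: subject first, then style and spatial cues."""
    parts = [subject]
    if style:
        parts.append(style)
    if spatial:
        parts.append(spatial)
    parts.extend(extras)
    return ", ".join(parts)

# Mirrors the example from the tips above:
prompt = build_prompt("photorealistic portrait of a person in a cityscape",
                      spatial="sharp focus on face")
# -> "photorealistic portrait of a person in a cityscape, sharp focus on face"
```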
Capabilities
- Generates photorealistic images with high diversity and sharp text rendering in a unified model
- Supports native text-to-image, single-reference editing, and multi-reference generation with strong spatial logic and character consistency
- Delivers professional-grade outputs at 1024x1024+ resolutions, matching larger models in Elo quality scores
- Runs efficiently on consumer hardware for interactive workflows, with base flexibility for custom sampling
- Excels at near-real-time generation when distilled, and at high-fidelity output in base form for research applications
What Can I Use It For?
- Real-time creative workflows like rapid prototyping of visual concepts in design iterations
- Image editing tasks requiring multi-reference consistency, such as character design across poses
- Research and fine-tuning pipelines for custom text-to-image applications on low-VRAM setups
- Photorealistic content creation for experimental visuals, as shown in benchmark collages
- Interactive generation in development environments, leveraging sub-second speeds on modern GPUs
Things to Be Aware Of
- Base model provides higher customization but can produce slightly over-processed images at 50 steps compared to distilled's cleaner 4-step results
- Strong performance on RTX 5090/GB200 with 0.3-1.2s inference for distilled; base takes several seconds but fits 12-13GB VRAM comfortably
- Users report excellent speed and accessibility on 8-12GB VRAM GPUs, with professional character consistency
- Quantized variants (FP8/NVFP4) significantly reduce resource needs while maintaining frontier performance
- Community notes good single-image edits on 4B distilled, with base ideal for maximum control in fine-tuning
Limitations
- Struggles with anatomy, hands, and fine details in text-to-image compared to top production models, limiting commercial readiness for 4B variants
- Multi-reference edits can be inconsistent, often requiring multiple renders and refined prompting
- Base model's longer sampling (25-50 steps) sacrifices speed for flexibility, less ideal for real-time without distillation
