The Quick Version

FLUX.2 is Black Forest Labs’ latest image generation family, released January 2026. The Klein 4B model runs on a single consumer GPU, generates images in under a second, and ships under the Apache 2.0 license. Here’s the minimum code to get an image out of it.

pip install torch diffusers transformers accelerate sentencepiece protobuf

import torch
from diffusers import Flux2Pipeline

pipe = Flux2Pipeline.from_pretrained(
    "black-forest-labs/FLUX.2-klein-4B",
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

image = pipe(
    prompt="a weathered lighthouse on a rocky cliff at golden hour, photorealistic, 35mm film",
    num_inference_steps=4,
    guidance_scale=1.0,
    height=1024,
    width=1024,
).images[0]

image.save("lighthouse.png")

That’s it. The first run downloads about 8GB of weights from Hugging Face; subsequent runs load from the local cache. On an RTX 4090, Klein 4B produces a 1024x1024 image in roughly 0.5 seconds.
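
If you’d rather pay that download cost up front – say, in a Docker build or CI step – you can pre-fetch the weights into the Hugging Face cache with huggingface_hub before ever constructing the pipeline. A minimal sketch:

from huggingface_hub import snapshot_download

# Downloads (or verifies) the Klein 4B weights in the local Hugging Face cache
# so the first pipeline load doesn't stall on an ~8GB download.
snapshot_download(repo_id="black-forest-labs/FLUX.2-klein-4B")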

Pick the Right Model

FLUX.2 ships in several variants, and choosing the wrong one wastes time or money.

Model                     Size         License          VRAM                          Speed         Best For
FLUX.2-klein-4B           4B params    Apache 2.0       ~8GB                          Sub-second    Real-time apps, iteration
FLUX.2-klein-9B           9B params    Non-commercial   ~16GB                         Sub-second    Higher quality, personal use
FLUX.2-dev                32B params   Non-commercial   80GB+ (or ~20GB quantized)    ~10s          Maximum quality, research
FLUX.2-pro / FLUX.2-max   API only     Commercial       N/A                           API latency   Production, commercial use

Klein models are distilled – they use fixed inference steps (4) and guidance scale (1.0). You cannot change these parameters. If you need to tune those values, use the base variants: FLUX.2-klein-base-4B or FLUX.2-klein-base-9B, which support adjustable steps (typically 50) and guidance scale (typically 4.0).

Use the Base Model for More Control

The distilled Klein models are fast but rigid. The base variants trade speed for full control over the generation process.

import torch
from diffusers import Flux2Pipeline

pipe = Flux2Pipeline.from_pretrained(
    "black-forest-labs/FLUX.2-klein-base-4B",
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

image = pipe(
    prompt="macro photograph of a mechanical watch movement, extreme detail, studio lighting",
    num_inference_steps=50,      # adjustable with base models
    guidance_scale=4.0,          # adjustable with base models
    height=1024,
    width=1024,
    generator=torch.Generator("cuda").manual_seed(42),  # reproducible output
).images[0]

image.save("watch_macro.png")

Higher num_inference_steps means more denoising passes, which generally improves fine detail. guidance_scale controls prompt adherence – higher values follow the text more literally but can look overcooked past 6 or 7.
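
To see those tradeoffs on your own prompts, sweep guidance_scale with a fixed seed so the noise stays constant and only the prompt adherence changes. A short sketch that reuses the base-model pipe from above:

import torch

prompt = "macro photograph of a mechanical watch movement, extreme detail, studio lighting"

# Fixed seed: the only variable between runs is the guidance scale.
for gs in [2.0, 4.0, 6.0, 8.0]:
    image = pipe(
        prompt=prompt,
        num_inference_steps=50,
        guidance_scale=gs,
        height=1024,
        width=1024,
        generator=torch.Generator("cuda").manual_seed(42),
    ).images[0]
    image.save(f"watch_gs_{gs}.png")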

Run the Dev 32B Model on a Consumer GPU

The FLUX.2-dev model produces the best results, but at 32B parameters it needs over 80GB VRAM in full precision. You can run it on a 24GB card using 4-bit quantization and CPU offloading.

import torch
from diffusers import Flux2Pipeline

# Load the pre-quantized 4-bit model
pipe = Flux2Pipeline.from_pretrained(
    "diffusers/FLUX.2-dev-bnb-4bit",
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()

image = pipe(
    prompt="a hermit crab using a vintage camera as its shell, photorealistic, shallow depth of field",
    num_inference_steps=28,      # 28 is a good speed-quality tradeoff
    guidance_scale=4.0,
    height=1024,
    width=1024,
    generator=torch.Generator("cuda").manual_seed(42),
).images[0]

image.save("hermit_crab.png")

enable_model_cpu_offload() moves each pipeline component to the GPU only when it’s needed, then back to CPU. This cuts VRAM usage to roughly 20GB at the cost of slower generation – expect around 30-60 seconds per image instead of 10.
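
If you want to check the memory claim on your own card, PyTorch’s allocator statistics report the peak VRAM used during a generation. A minimal sketch, reusing the offloaded pipe from above:

import torch

torch.cuda.reset_peak_memory_stats()

image = pipe(
    prompt="a hermit crab using a vintage camera as its shell, photorealistic",
    num_inference_steps=28,
    guidance_scale=4.0,
).images[0]

# Peak VRAM allocated by PyTorch during this generation, in GB.
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak VRAM: {peak_gb:.1f} GB")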

If you have even less VRAM (8GB), use group offloading with the remote text encoder. This moves text encoding to Hugging Face’s servers so your GPU only handles the transformer:

import torch
from diffusers import Flux2Pipeline, Flux2Transformer2DModel

repo_id = "diffusers/FLUX.2-dev-bnb-4bit"

transformer = Flux2Transformer2DModel.from_pretrained(
    repo_id, subfolder="transformer",
    torch_dtype=torch.bfloat16, device_map="cpu",
)

pipe = Flux2Pipeline.from_pretrained(
    repo_id,
    text_encoder=None,   # skip local text encoder; prompt embeddings are supplied at call time
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)

pipe.transformer.enable_group_offload(
    onload_device="cuda",
    offload_device="cpu",
    offload_type="leaf_level",
    use_stream=True,
)
pipe.to("cuda")

This approach needs just 8GB of VRAM but requires 32GB of system RAM. Since the local text encoder is skipped, you pass the embeddings returned by the remote encoder to the pipeline as prompt_embeds instead of a plain prompt string.

Edit Images with References

One of FLUX.2’s standout features is unified image editing. The same pipeline handles text-to-image, single-reference editing, and multi-reference composition – pass images in and the model incorporates them.

import torch
from diffusers import Flux2Pipeline
from diffusers.utils import load_image

pipe = Flux2Pipeline.from_pretrained(
    "black-forest-labs/FLUX.2-klein-9B",
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

# Load reference images
photo = load_image("portrait.jpg")
style_ref = load_image("oil_painting_style.jpg")

image = pipe(
    prompt="the person from image 1 painted in the style of image 2, impressionist brushstrokes, warm palette",
    image=[photo, style_ref],
    num_inference_steps=4,
    guidance_scale=1.0,
    width=1024,
    height=1024,
).images[0]

image.save("styled_portrait.png")

FLUX.2 supports up to 10 reference images simultaneously. Reference them in the prompt as “image 1”, “image 2”, etc., in the order you pass them to the image parameter.
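
Because references are indexed by position, you can build the image list and the prompt together when you have several inputs. A small sketch with placeholder file names, reusing the pipe from above:

from diffusers.utils import load_image

# Placeholder file names – swap in your own reference images (up to 10).
ref_paths = ["product.jpg", "background.jpg", "logo.jpg"]
refs = [load_image(p) for p in ref_paths]

prompt = (
    "the product from image 1 placed in the scene from image 2, "
    "with the logo from image 3 printed on the packaging"
)

image = pipe(
    prompt=prompt,
    image=refs,
    num_inference_steps=4,
    guidance_scale=1.0,
    width=1024,
    height=1024,
).images[0]

image.save("composited.png")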

Reproducible Seeds and Batch Generation

For workflows where consistency matters, lock down the random seed and generate variations systematically.

import torch
from diffusers import Flux2Pipeline

pipe = Flux2Pipeline.from_pretrained(
    "black-forest-labs/FLUX.2-klein-4B",
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

prompt = "a moss-covered robot sitting on a park bench reading a newspaper, golden hour"

for seed in [42, 137, 256, 512, 1024]:
    generator = torch.Generator("cuda").manual_seed(seed)
    image = pipe(
        prompt=prompt,
        num_inference_steps=4,
        guidance_scale=1.0,
        generator=generator,
    ).images[0]
    image.save(f"robot_seed_{seed}.png")
    print(f"Saved robot_seed_{seed}.png")

Same prompt, same seed, same model weights – you get the exact same image every time. This is essential for A/B testing prompts or building repeatable pipelines.
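
The same pattern extends to A/B tests: render two prompt variants with identical seeds, and any difference in the results comes from the wording rather than the noise. A short sketch:

import torch

prompts = {
    "a": "a moss-covered robot sitting on a park bench reading a newspaper, golden hour",
    "b": "a moss-covered robot sitting on a park bench reading a newspaper, overcast morning",
}

# Identical seeds across variants so only the prompt wording differs.
for seed in [42, 137, 256]:
    for label, prompt in prompts.items():
        generator = torch.Generator("cuda").manual_seed(seed)
        image = pipe(
            prompt=prompt,
            num_inference_steps=4,
            guidance_scale=1.0,
            generator=generator,
        ).images[0]
        image.save(f"robot_{label}_seed_{seed}.png")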

Structured JSON Prompts

FLUX.2 handles structured JSON prompts natively. This gives you precise control over scene composition, lighting, camera settings, and color palettes without cramming everything into a single sentence.

prompt = """
{
  "scene": "Professional product photography on a marble surface",
  "subjects": [
    {
      "description": "A pair of vintage leather boots",
      "position": "Center frame, angled 30 degrees",
      "color_palette": ["worn brown", "dark brass buckles"]
    }
  ],
  "lighting": "Three-point softbox setup, warm key light from upper left",
  "camera": {
    "angle": "slightly elevated",
    "lens-mm": 85,
    "f-number": "f/2.8",
    "ISO": 100
  },
  "mood": "Clean, editorial, warm"
}
"""

image = pipe(prompt=prompt, height=1024, width=1024).images[0]

This is especially useful for product photography workflows where you need consistent framing across a catalog.
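
Since the prompt is just a JSON string, you can template it with json.dumps and vary only the fields that change between catalog items. A sketch with placeholder product names, reusing the pipe from above:

import json

# Placeholder catalog entries – only the subject changes between shots.
products = ["vintage leather boots", "canvas weekender bag", "suede loafers"]

base = {
    "scene": "Professional product photography on a marble surface",
    "lighting": "Three-point softbox setup, warm key light from upper left",
    "camera": {"angle": "slightly elevated", "lens-mm": 85, "f-number": "f/2.8", "ISO": 100},
    "mood": "Clean, editorial, warm",
}

for i, product in enumerate(products):
    spec = dict(base, subjects=[{
        "description": product,
        "position": "Center frame, angled 30 degrees",
    }])
    image = pipe(prompt=json.dumps(spec, indent=2), height=1024, width=1024).images[0]
    image.save(f"catalog_{i:02d}.png")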

Common Errors and Fixes

torch.cuda.OutOfMemoryError: CUDA out of memory

This is the most common issue. The fix depends on which model you’re running:

  • Klein 4B (needs ~8GB): Switch to torch.bfloat16, or try the FP8 variant at black-forest-labs/FLUX.2-klein-4b-fp8
  • Klein 9B (needs ~16GB): Call pipe.enable_model_cpu_offload() instead of pipe.to("cuda") – see the sketch after this list
  • Dev 32B: Use the quantized diffusers/FLUX.2-dev-bnb-4bit model with CPU offloading
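
For the Klein 9B case, the offloading fix looks like this – note that enable_model_cpu_offload() replaces the pipe.to("cuda") call rather than coming before it. A minimal sketch:

import torch
from diffusers import Flux2Pipeline

pipe = Flux2Pipeline.from_pretrained(
    "black-forest-labs/FLUX.2-klein-9B",
    torch_dtype=torch.bfloat16,
)
# Replaces pipe.to("cuda"): each component moves to the GPU only while it runs.
pipe.enable_model_cpu_offload()

image = pipe(
    prompt="a weathered lighthouse on a rocky cliff at golden hour",
    num_inference_steps=4,
    guidance_scale=1.0,
).images[0]

image.save("lighthouse_offloaded.png")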

ImportError: cannot import name 'Flux2Pipeline' from 'diffusers'

Your diffusers version is too old. FLUX.2 support landed after the initial 0.32 release. Install from main:

pip uninstall diffusers -y
pip install git+https://github.com/huggingface/diffusers -U

OSError: You are trying to access a gated repo

The Klein 9B and Dev models are gated on Hugging Face. You need to accept the license on the model page, then authenticate:

pip install huggingface_hub
huggingface-cli login

ValueError: Could not instantiate the tokenizer

Missing the sentencepiece library. FLUX.2 uses Mistral Small 3.1 as its text encoder, which needs sentencepiece:

pip install sentencepiece protobuf

The distilled Klein model ignores your num_inference_steps and guidance_scale values

This is expected behavior, not a bug. The distilled Klein variants (FLUX.2-klein-4B, FLUX.2-klein-9B) are locked to 4 steps and guidance 1.0. If you need adjustable parameters, switch to the base variants: FLUX.2-klein-base-4B or FLUX.2-klein-base-9B.

What Hardware Do You Actually Need

Here’s a realistic breakdown based on the model you want to run:

  • FLUX.2-klein-4B: RTX 3060 12GB or better. Runs at sub-second speed on an RTX 4090. This is the sweet spot for most people.
  • FLUX.2-klein-9B: RTX 4080 16GB or better. Or use CPU offloading on a 12GB card.
  • FLUX.2-dev (quantized): RTX 4090 24GB with 4-bit quantization and CPU offloading. Expect 30-60 second generation times.
  • FLUX.2-dev (full precision): H100 80GB or equivalent. Fastest at about 10 seconds per image.

If you don’t have the hardware, the BFL API starts at $0.014 per image for Klein and scales by resolution. Third-party providers like fal.ai and Replicate also host FLUX.2 models with pay-per-image pricing.