The Core Idea

InstructPix2Pix takes an image and a text instruction – “make it winter”, “add sunglasses”, “turn the building into a castle” – and produces an edited version. No masks, no sketches, no separate conditioning images. You describe the change in plain English and the model figures out what to modify and what to leave alone.

pip install diffusers transformers accelerate torch pillow
import torch
from diffusers import StableDiffusionInstructPix2PixPipeline
from diffusers.utils import load_image

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix",
    torch_dtype=torch.float16,
    safety_checker=None,  # optional: skip the NSFW checker to save memory
)
pipe.to("cuda")

image = load_image("https://huggingface.co/datasets/diffusers/diffusers-images-docs/resolve/main/mountain.png").resize((512, 512))

result = pipe(
    prompt="make it a sunset scene",
    image=image,
    num_inference_steps=20,
    image_guidance_scale=1.5,
    guidance_scale=7.0,
).images[0]

result.save("sunset_edit.png")

That is the entire workflow. Load the model, load an image, pass a text instruction, save the output. The model was trained on a dataset of image pairs generated by combining GPT-3 (for instructions) with Prompt-to-Prompt (for consistent edits), so it understands a wide range of natural language editing commands.

The Two Knobs That Matter

InstructPix2Pix has two guidance scales that control the edit, and understanding both is the difference between useful results and garbage output.

image_guidance_scale controls how much the output resembles the original image. Higher values keep the output closer to the input. Lower values give the model more freedom to change things.

guidance_scale controls how strongly the model follows your text instruction. Higher values push harder toward the edit. Lower values produce subtler changes.

Here is a practical way to think about it:

# Conservative edit: subtle changes, very faithful to original
conservative = pipe(
    prompt="add a light dusting of snow",
    image=image,
    num_inference_steps=20,
    image_guidance_scale=2.0,   # high = stick close to original
    guidance_scale=5.0,         # moderate = gentle edit
).images[0]

# Aggressive edit: dramatic transformation, allows more deviation
aggressive = pipe(
    prompt="add a light dusting of snow",
    image=image,
    num_inference_steps=20,
    image_guidance_scale=1.0,   # low = allow changes
    guidance_scale=12.0,        # high = push hard on the instruction
).images[0]

A good starting point for most edits: image_guidance_scale=1.5 and guidance_scale=7.0. From there, nudge image_guidance_scale up if the edit destroys too much of the original, or push guidance_scale higher if the edit is too faint.
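That nudging strategy can be captured in a small helper. This is a hypothetical convenience function, not part of diffusers: you inspect the result, classify what went wrong, and it returns adjusted values for the next attempt.

```python
# Hypothetical helper (not part of diffusers): nudge the two guidance
# scales based on what went wrong with the previous attempt.
def adjust_guidance(image_guidance, guidance, outcome):
    """outcome is one of: 'too_faint', 'too_destructive', 'ok'."""
    if outcome == "too_faint":
        # Edit barely visible: loosen the image constraint, push the text harder.
        return max(1.0, image_guidance - 0.25), min(15.0, guidance + 1.5)
    if outcome == "too_destructive":
        # Edit wrecked the original: tighten the image constraint, ease off the text.
        return min(2.5, image_guidance + 0.25), max(4.0, guidance - 1.5)
    return image_guidance, guidance

# Start from the recommended defaults and adjust after inspecting the result.
ig, g = 1.5, 7.0
ig, g = adjust_guidance(ig, g, "too_faint")
print(ig, g)  # 1.25 8.5
```

The step sizes (0.25 and 1.5) and the clamping bounds are arbitrary choices; the point is that the two scales move in opposite directions for each failure mode.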

Generating a Parameter Sweep

When you are not sure which settings work best for a given instruction, generate a grid. This saves you from manually tweaking values one at a time.

from PIL import Image as PILImage
import itertools

image_guidance_values = [1.0, 1.5, 2.0]
guidance_values = [5.0, 7.5, 10.0]

results = {}
for ig, g in itertools.product(image_guidance_values, guidance_values):
    edited = pipe(
        prompt="turn the dog into a cat",
        image=image,
        num_inference_steps=20,
        image_guidance_scale=ig,
        guidance_scale=g,
    ).images[0]
    results[(ig, g)] = edited

# Stitch into a 3x3 grid
cell_w, cell_h = 512, 512
grid = PILImage.new("RGB", (cell_w * 3, cell_h * 3))
for idx, (ig, g) in enumerate(itertools.product(image_guidance_values, guidance_values)):
    row, col = divmod(idx, 3)
    grid.paste(results[(ig, g)], (col * cell_w, row * cell_h))

grid.save("parameter_sweep.png")

You will quickly see which combination preserves the composition you want while applying the edit convincingly.
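To keep track of which cell used which settings, you can stamp the parameter pair onto each cell with PIL's ImageDraw before saving. A minimal sketch, assuming the 3x3 grid layout from the sweep above (rows indexed by image_guidance_scale, columns by guidance_scale):

```python
from PIL import Image as PILImage, ImageDraw

# Label each cell of the sweep grid with its parameter pair so the
# winning combination is easy to identify when reviewing the output.
def label_grid(grid, image_guidance_values, guidance_values, cell=512):
    draw = ImageDraw.Draw(grid)
    for row, ig in enumerate(image_guidance_values):
        for col, g in enumerate(guidance_values):
            # Draw the label in the top-left corner of each cell.
            draw.text((col * cell + 8, row * cell + 8),
                      f"ig={ig} g={g}", fill="white")
    return grid

labeled = label_grid(PILImage.new("RGB", (1536, 1536)),
                     [1.0, 1.5, 2.0], [5.0, 7.5, 10.0])
labeled.save("parameter_sweep_labeled.png")
```

Call label_grid on the stitched grid from the previous snippet before saving it; the cell size must match the cell_w/cell_h used when pasting.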

Batch Editing Multiple Images

If you have a folder of images that all need the same edit – say, making product photos look like they were shot at golden hour – loop through them with a shared pipeline instance.

from pathlib import Path

input_dir = Path("./input_images")
output_dir = Path("./edited_images")
output_dir.mkdir(exist_ok=True)

instruction = "make it golden hour lighting"

for img_path in sorted(input_dir.glob("*.png")):
    src = load_image(str(img_path)).resize((512, 512))
    edited = pipe(
        prompt=instruction,
        image=src,
        num_inference_steps=20,
        image_guidance_scale=1.5,
        guidance_scale=7.0,
    ).images[0]
    edited.save(output_dir / img_path.name)
    print(f"Edited: {img_path.name}")

Keep the pipeline loaded between images. Reloading the model for every file wastes 10-15 seconds per image.

Combining with Other Pipelines

InstructPix2Pix works well as one step in a multi-stage pipeline. A common pattern is to generate a base image with text-to-image, then refine it with instruction-based editing.

from diffusers import StableDiffusionPipeline

# Stage 1: Generate a base image
txt2img = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
)
txt2img.to("cuda")

base = txt2img(
    prompt="a cozy cabin in the woods, oil painting style",
    num_inference_steps=30,
    guidance_scale=7.5,
).images[0]

# Stage 2: Edit with InstructPix2Pix
# (reuse the pipe we loaded earlier)
edited = pipe(
    prompt="add heavy snowfall and ice on the roof",
    image=base.resize((512, 512)),
    num_inference_steps=20,
    image_guidance_scale=1.5,
    guidance_scale=7.5,
).images[0]

edited.save("cabin_winter.png")

This two-stage approach gives you more control than trying to cram everything into a single prompt. Generate the scene first, then refine specific aspects with targeted instructions.

Memory Management

The model needs about 5GB of VRAM in float16. If you are running on a GPU with limited memory, enable CPU offloading or attention slicing:

# Option 1: CPU offloading (slower but uses less VRAM)
pipe.enable_model_cpu_offload()

# Option 2: Sliced attention (helps on 6GB cards)
pipe.enable_attention_slicing()

# Option 3: Both
pipe.enable_model_cpu_offload()
pipe.enable_attention_slicing()

With CPU offloading, inference takes roughly 2x longer but VRAM usage drops to under 3GB. On a laptop with a 4GB GPU, this is the difference between the script running and an out-of-memory crash.

Common Errors and Fixes

RuntimeError: CUDA out of memory – The most common issue. Switch to torch.float16 when loading the model and enable pipe.enable_model_cpu_offload(). If that is still not enough, resize your input images to 512x512 before passing them in; attention memory grows quadratically with the number of latent tokens, so larger images cost far more than their pixel count suggests.
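If you do not want to hard-code 512x512, a small sizing helper (hypothetical, not part of diffusers) can cap the longer side at 512 while keeping the aspect ratio and keeping both dimensions divisible by 8, which Stable Diffusion pipelines expect because the VAE downsamples by a factor of 8:

```python
# Hypothetical pre-resize helper: cap the longer side at max_side while
# keeping the aspect ratio, and snap both dimensions down to multiples
# of 8 (the VAE downsamples by 8, so other sizes can fail).
def fit_for_sd(width, height, max_side=512):
    scale = min(1.0, max_side / max(width, height))  # never upscale
    w = max(8, int(width * scale) // 8 * 8)
    h = max(8, int(height * scale) // 8 * 8)
    return w, h

print(fit_for_sd(1920, 1080))  # (512, 288)
```

Then resize with image.resize(fit_for_sd(*image.size)) before calling the pipeline.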

Edit does nothing or barely changes the image – Your image_guidance_scale is too high. Drop it to 1.0 or 1.2 and increase guidance_scale to 10+. The model is clinging too tightly to the original.

Edit destroys the original image completely – Opposite problem. Raise image_guidance_scale to 2.0+ and reduce guidance_scale to 5.0 or lower. You are giving the model too much freedom.

Colors look washed out or oversaturated – This happens with certain combinations of guidance values. Try adding a negative prompt to steer away from artifacts:

result = pipe(
    prompt="make it a rainy day",
    image=image,
    negative_prompt="oversaturated, cartoon, low quality, blurry",
    num_inference_steps=20,
    image_guidance_scale=1.5,
    guidance_scale=7.0,
).images[0]

ValueError: Expected image to have 3 channels – Your input image has an alpha channel (RGBA). Convert it before passing to the pipeline:

image = image.convert("RGB")

Model produces inconsistent results across runs – Set a manual seed for reproducibility:

generator = torch.Generator(device="cuda").manual_seed(42)
result = pipe(
    prompt="add a rainbow in the sky",
    image=image,
    num_inference_steps=20,
    image_guidance_scale=1.5,
    guidance_scale=7.0,
    generator=generator,
).images[0]

When to Use InstructPix2Pix vs. Inpainting

InstructPix2Pix is best for global or semi-global edits: change the weather, shift the time of day, alter the style, modify lighting. It struggles with precise spatial edits like “remove the cup from the table” because it has no mask telling it where to focus.

For targeted regional edits, inpainting is still the better tool. For “make the whole scene feel different” edits, InstructPix2Pix is faster, simpler, and usually produces more coherent results because it does not have mask boundary artifacts to deal with.

The sweet spot is combining both: use InstructPix2Pix for the mood and atmosphere, then inpainting to fix specific objects that need precision work.