The Problem with Text Prompts Alone

Text-to-image models are great until you need a specific pose, a particular layout, or a consistent style. You can write a 200-word prompt describing how a character should stand, and the model will still do whatever it wants. ControlNet fixes this by feeding spatial conditioning – edges, depth maps, skeleton poses – directly into the diffusion process. IP-Adapter takes a different angle: it injects the style of a reference image so your outputs match a visual identity without you describing every aesthetic detail in words.

Here is the install command to get everything you need:

pip install diffusers transformers accelerate torch controlnet-aux pillow

ControlNet with Edge Detection (Canny)

Canny edge conditioning is the easiest starting point and arguably the most reliable ControlNet mode. You extract edges from a reference image, then the model generates new content that follows those edges closely.

import torch
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from diffusers.utils import load_image
from controlnet_aux import CannyDetector

# Load the Canny ControlNet
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny",
    torch_dtype=torch.float16,
)

pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
)
pipe.enable_model_cpu_offload()

# Extract edges from your reference image
reference = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/hf-logo.png")
canny = CannyDetector()
edge_image = canny(reference, low_threshold=100, high_threshold=200)

# Generate with edge conditioning
result = pipe(
    prompt="a glowing neon sign, cyberpunk alley, dark background",
    negative_prompt="blurry, low quality, deformed",
    image=edge_image,
    num_inference_steps=30,
    controlnet_conditioning_scale=0.8,
).images[0]

result.save("controlnet_canny_output.png")

The controlnet_conditioning_scale parameter controls how strictly the model follows the edge map. At 1.0, it follows edges rigidly. At 0.5, it treats them more like suggestions. Start at 0.8 and adjust from there – going too high makes the output look traced rather than generated.
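
The easiest way to pick a value is to sweep it and compare the outputs side by side. A minimal sketch, assuming `pipe` and `edge_image` from the block above (the helper name and output filenames are illustrative, not part of diffusers):

```python
# Illustrative helper: render the same prompt at several conditioning
# scales so you can compare how strictly each output follows the edges.
def sweep_conditioning_scales(pipe, edge_image, prompt, scales=(0.5, 0.8, 1.0)):
    paths = []
    for scale in scales:
        image = pipe(
            prompt=prompt,
            image=edge_image,
            num_inference_steps=30,
            controlnet_conditioning_scale=scale,
        ).images[0]
        # One file per scale, named after the scale that produced it
        path = f"canny_scale_{scale:.2f}.png"
        image.save(path)
        paths.append(path)
    return paths
```

Open the saved files next to each other and pick the lowest scale that still preserves the structure you care about.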

Depth and Pose Conditioning

Edge detection works well for hard outlines, but depth maps give you spatial layout control and pose skeletons handle human body positioning.

For depth conditioning, swap in the depth ControlNet and use a depth estimator:

from controlnet_aux import MidasDetector

controlnet_depth = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth",
    torch_dtype=torch.float16,
)

# Extract depth map
midas = MidasDetector.from_pretrained("lllyasviel/Annotators")
depth_image = midas(reference)

For pose conditioning with OpenPose:

from controlnet_aux import OpenposeDetector

controlnet_pose = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose",
    torch_dtype=torch.float16,
)

openpose = OpenposeDetector.from_pretrained("lllyasviel/Annotators")
pose_image = openpose(reference)

My recommendation: use Canny for architectural or product shots, depth for scene composition, and OpenPose when you need people in specific positions. Depth conditioning is the most forgiving of the three – it gives good results even with imperfect depth maps.
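
If you want that recommendation in code, a small lookup is enough. The mapping below is illustrative; the checkpoint IDs are the SD 1.5 ones used in this article's examples:

```python
# Illustrative lookup encoding the recommendation above: map a use case
# to the matching SD 1.5 ControlNet checkpoint.
CONTROLNET_FOR_TASK = {
    "architecture": "lllyasviel/sd-controlnet-canny",
    "product": "lllyasviel/sd-controlnet-canny",
    "scene_layout": "lllyasviel/sd-controlnet-depth",
    "human_pose": "lllyasviel/sd-controlnet-openpose",
}

def pick_controlnet(task):
    try:
        return CONTROLNET_FOR_TASK[task]
    except KeyError:
        raise ValueError(
            f"unknown task {task!r}; choose from {sorted(CONTROLNET_FOR_TASK)}"
        )
```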

Combining Multiple ControlNets

When one conditioning signal is not enough, you can stack them. Pass a list of ControlNets and the pipeline wraps them in a MultiControlNetModel for you. You might want both edges and depth at the same time – the edges pin down object outlines while the depth maintains the scene layout.

import torch
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel

controlnet_canny = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
controlnet_depth = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16
)

# A list of ControlNets is wrapped in a MultiControlNetModel internally
multi_controlnet = [controlnet_canny, controlnet_depth]

pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    controlnet=multi_controlnet,
    torch_dtype=torch.float16,
)
pipe.enable_model_cpu_offload()

# Pass a list of conditioning images and scales
result = pipe(
    prompt="a futuristic building in a desert landscape, golden hour lighting",
    negative_prompt="blurry, low quality",
    image=[edge_image, depth_image],  # one per ControlNet, from the blocks above
    controlnet_conditioning_scale=[0.7, 0.5],  # per-ControlNet weights
    num_inference_steps=30,
).images[0]

result.save("multi_controlnet_output.png")

Keep the combined conditioning scales moderate. If both are at 1.0, they fight each other and the output turns into a mess. A good starting point is 0.7 for your primary signal and 0.4-0.5 for the secondary one.

IP-Adapter for Style Transfer

IP-Adapter works differently from ControlNet. Instead of spatial conditioning, it encodes a reference image into the same embedding space as text prompts. The model then generates images that match the visual style – color palette, texture, artistic feel – of your reference.

import torch
from diffusers import StableDiffusionPipeline
from diffusers.utils import load_image

pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
)
pipe.enable_model_cpu_offload()

# Load IP-Adapter weights
pipe.load_ip_adapter(
    "h94/IP-Adapter",
    subfolder="models",
    weight_name="ip-adapter_sd15.bin",
)

# Set the influence of the reference image
pipe.set_ip_adapter_scale(0.6)

# Load a style reference image
style_ref = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/ip_adapter_diner.png")

result = pipe(
    prompt="a mountain landscape at sunrise",
    ip_adapter_image=style_ref,
    negative_prompt="blurry, low quality, deformed",
    num_inference_steps=30,
    guidance_scale=7.5,
).images[0]

result.save("ip_adapter_style.png")

The set_ip_adapter_scale call is critical. At 0.6, you get a noticeable style influence while the text prompt still controls the content. Push it to 0.9+ and the reference image dominates – your prompt barely matters. For most use cases, 0.5-0.7 is the sweet spot.
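
To see the effect directly, render the same prompt at a few scales and paste the results into one strip. A sketch assuming `pipe` and `style_ref` from the block above; the helper itself is illustrative:

```python
from PIL import Image

# Illustrative comparison: run the same prompt at several IP-Adapter
# scales and paste the results into one horizontal strip.
def ip_scale_strip(pipe, prompt, style_ref, scales=(0.3, 0.6, 0.9)):
    frames = []
    for scale in scales:
        pipe.set_ip_adapter_scale(scale)
        frames.append(
            pipe(prompt=prompt, ip_adapter_image=style_ref,
                 num_inference_steps=30).images[0]
        )
    width, height = frames[0].size
    strip = Image.new("RGB", (width * len(frames), height))
    for i, frame in enumerate(frames):
        strip.paste(frame, (i * width, 0))
    return strip
```

Scanning the strip left to right shows the prompt gradually losing ground to the reference image as the scale climbs.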

Combining ControlNet and IP-Adapter

This is where things get powerful. ControlNet handles the spatial structure and IP-Adapter handles the aesthetic. You get precise layout control with consistent visual style.

import torch
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from diffusers.utils import load_image
from controlnet_aux import CannyDetector

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)

pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
)
pipe.enable_model_cpu_offload()

# Load IP-Adapter on top of the ControlNet pipeline
pipe.load_ip_adapter(
    "h94/IP-Adapter",
    subfolder="models",
    weight_name="ip-adapter_sd15.bin",
)
pipe.set_ip_adapter_scale(0.5)

# Prepare conditioning
canny = CannyDetector()
structure_image = load_image("your_reference.png")
edge_image = canny(structure_image, low_threshold=100, high_threshold=200)
style_image = load_image("your_style_reference.png")

result = pipe(
    prompt="a cozy coffee shop interior, warm lighting",
    negative_prompt="blurry, low quality",
    image=edge_image,
    ip_adapter_image=style_image,
    controlnet_conditioning_scale=0.7,
    num_inference_steps=30,
    guidance_scale=7.5,
).images[0]

result.save("controlnet_plus_ip_adapter.png")

When combining both, lower each scale slightly from what you would use individually. ControlNet at 0.7 and IP-Adapter at 0.5 is a solid default. The text prompt acts as a tiebreaker when the two conditioning signals disagree.

Tuning Conditioning Scales

Getting the right balance between text prompt, ControlNet, and IP-Adapter is the real skill. Here are concrete guidelines:

  • ControlNet scale 0.3-0.5: Loose guidance. The model follows the general shape but takes creative liberties. Good for artistic outputs.
  • ControlNet scale 0.7-0.8: Strong guidance. The output closely matches the spatial conditioning. Best for architectural or product work.
  • ControlNet scale 1.0+: Rigid. The model traces the conditioning image almost exactly. Rarely what you want unless you need pixel-accurate structure.
  • IP-Adapter scale 0.3-0.4: Subtle style hints. Colors and mood shift but content is prompt-driven.
  • IP-Adapter scale 0.5-0.7: Clear style transfer. The output looks like it belongs in the same visual universe as the reference.
  • IP-Adapter scale 0.8+: The reference image takes over. The prompt becomes mostly irrelevant.

When using both, their scales should sum to roughly 1.0-1.3. Go higher and the model has no room for the text prompt. Go lower and neither conditioning signal has enough influence to matter.
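
That sum-of-scales rule is easy to turn into a quick sanity check. An illustrative helper, not a diffusers API:

```python
# Illustrative sanity check for the rule of thumb above: when combining
# ControlNet and IP-Adapter, the scales should sum to roughly 1.0-1.3.
def check_scale_budget(controlnet_scale, ip_adapter_scale):
    total = controlnet_scale + ip_adapter_scale
    if total > 1.3:
        return "over budget: little room left for the text prompt"
    if total < 1.0:
        return "under budget: neither signal has much influence"
    return "ok"
```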

Common Errors

RuntimeError: Expected all tensors to be on the same device

This happens when the ControlNet model loads on CPU while the pipeline is on GPU. Make sure you call pipe.enable_model_cpu_offload() instead of manually moving tensors with .to("cuda"). The offload method handles device placement automatically.

ValueError: Expected image to have 3 channels but got 1

Your conditioning image is grayscale but the pipeline expects RGB. Convert it before passing:

edge_image = edge_image.convert("RGB")

OutOfMemoryError: CUDA out of memory

MultiControlNet with IP-Adapter is memory-hungry. On a 12GB GPU, enable attention slicing and VAE tiling:

pipe.enable_attention_slicing()
pipe.enable_vae_tiling()

If that is still not enough, drop to a single ControlNet or use torch.float16 everywhere (which you should be doing already).

KeyError: 'image_proj' when loading IP-Adapter

You are using a weight file that does not match your base model. The ip-adapter_sd15.bin file only works with SD 1.5 pipelines. For SDXL, use ip-adapter_sdxl.bin from the sdxl_models subfolder.
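
One way to avoid the mismatch is to key the weight file off the base model family. The subfolder and file names below are the ones from the h94/IP-Adapter repo mentioned above; the helper itself is illustrative:

```python
# Illustrative mapping from base model family to the matching
# IP-Adapter weight file in the h94/IP-Adapter repository.
IP_ADAPTER_WEIGHTS = {
    "sd15": ("models", "ip-adapter_sd15.bin"),
    "sdxl": ("sdxl_models", "ip-adapter_sdxl.bin"),
}

def load_matching_ip_adapter(pipe, base_model):
    if base_model not in IP_ADAPTER_WEIGHTS:
        raise ValueError(f"no IP-Adapter mapping for {base_model!r}")
    subfolder, weight_name = IP_ADAPTER_WEIGHTS[base_model]
    pipe.load_ip_adapter("h94/IP-Adapter", subfolder=subfolder,
                         weight_name=weight_name)
    return subfolder, weight_name
```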

Conditioning image has no visible effect on output

The conditioning scale is too low, or the image resolution does not match the pipeline's expected input. ControlNet for SD 1.5 expects 512x512 images. Resize your conditioning image to match:

edge_image = edge_image.resize((512, 512))
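
Note that a bare resize stretches non-square inputs. If the aspect ratio of your conditioning image matters, center-crop to a square first. An illustrative sketch using Pillow:

```python
from PIL import Image

# Illustrative alternative to a bare resize: center-crop the conditioning
# image to a square first so a non-square input is not stretched.
def fit_conditioning_image(image, size=512):
    width, height = image.size
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    square = image.crop((left, top, left + side, top + side))
    return square.resize((size, size), Image.LANCZOS)
```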