The Problem with Text Prompts Alone
Text-to-image models are great until you need a specific pose, a particular layout, or a consistent style. You can write a 200-word prompt describing how a character should stand, and the model will still do whatever it wants. ControlNet fixes this by feeding spatial conditioning – edges, depth maps, skeleton poses – directly into the diffusion process. IP-Adapter takes a different angle: it injects the style of a reference image so your outputs match a visual identity without you describing every aesthetic detail in words.
Here is the install command to get everything you need:
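The exact package set depends on which preprocessors you use; a plausible baseline for this workflow (diffusers plus the Canny, depth, and pose annotators) is:

```shell
pip install diffusers transformers accelerate controlnet-aux opencv-python pillow
```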
ControlNet with Edge Detection (Canny)
Canny edge conditioning is the easiest starting point and arguably the most reliable ControlNet mode. You extract edges from a reference image, and the model generates new content that follows those edges – how strictly depends on the conditioning scale.
The controlnet_conditioning_scale parameter controls how strictly the model follows the edge map. At 1.0, it follows edges rigidly. At 0.5, it treats them more like suggestions. Start at 0.8 and adjust from there – going too high makes the output look traced rather than generated.
Depth and Pose Conditioning
Edge detection works well for hard outlines, but depth maps give you spatial layout control and pose skeletons handle human body positioning.
For depth conditioning, swap in the depth ControlNet and use a depth estimator:
For pose conditioning with OpenPose:
My recommendation: use Canny for architectural or product shots, depth for scene composition, and OpenPose when you need people in specific positions. Depth conditioning is the most forgiving of the three – it gives good results even with imperfect depth maps.
Combining Multiple ControlNets
When one conditioning signal is not enough, you can stack them. This is where MultiControlNetModel comes in. You might want both pose and depth at the same time – the pose keeps the character in the right position while the depth maintains the scene layout.
Keep the combined conditioning scales moderate. If both are at 1.0, they fight each other and the output turns into a mess. A good starting point is 0.7 for your primary signal and 0.4-0.5 for the secondary one.
IP-Adapter for Style Transfer
IP-Adapter works differently from ControlNet. Instead of spatial conditioning, it encodes a reference image into the same embedding space as text prompts. The model then generates images that match the visual style – color palette, texture, artistic feel – of your reference.
The set_ip_adapter_scale call is critical. At 0.6, you get a noticeable style influence while the text prompt still controls the content. Push it to 0.9+ and the reference image dominates – your prompt barely matters. For most use cases, 0.5-0.7 is the sweet spot.
Combining ControlNet and IP-Adapter
This is where things get powerful. ControlNet handles the spatial structure and IP-Adapter handles the aesthetic. You get precise layout control with consistent visual style.
When combining both, lower each scale slightly from what you would use individually. ControlNet at 0.7 and IP-Adapter at 0.5 is a solid default. The text prompt acts as a tiebreaker when the two conditioning signals disagree.
Tuning Conditioning Scales
Getting the right balance between text prompt, ControlNet, and IP-Adapter is the real skill. Here are concrete guidelines:
- ControlNet scale 0.3-0.5: Loose guidance. The model follows the general shape but takes creative liberties. Good for artistic outputs.
- ControlNet scale 0.7-0.8: Strong guidance. The output closely matches the spatial conditioning. Best for architectural or product work.
- ControlNet scale 1.0+: Rigid. The model traces the conditioning image almost exactly. Rarely what you want unless you need pixel-accurate structure.
- IP-Adapter scale 0.3-0.4: Subtle style hints. Colors and mood shift but content is prompt-driven.
- IP-Adapter scale 0.5-0.7: Clear style transfer. The output looks like it belongs in the same visual universe as the reference.
- IP-Adapter scale 0.8+: The reference image takes over. The prompt becomes mostly irrelevant.
When using both, their scales should sum to roughly 1.0-1.3. Go higher and the model has no room for the text prompt. Go lower and neither conditioning signal has enough influence to matter.
Common Errors
RuntimeError: Expected all tensors to be on the same device
This happens when the ControlNet model sits on the CPU while the rest of the pipeline is on the GPU – typically because the pipeline was moved with .to("cuda") but the separately loaded ControlNet was not. Either move both to the same device or call pipe.enable_model_cpu_offload(), which handles device placement for every submodule automatically.
ValueError: Expected image to have 3 channels but got 1
Your conditioning image is grayscale but the pipeline expects RGB. Convert it before passing:
OutOfMemoryError: CUDA out of memory
MultiControlNet with IP-Adapter is memory-hungry. On a 12GB GPU, enable attention slicing and VAE tiling:
If that is still not enough, drop to a single ControlNet or use torch.float16 everywhere (which you should be doing already).
KeyError: 'image_proj' when loading IP-Adapter
You are using a weight file that does not match your base model. The ip-adapter_sd15.bin file only works with SD 1.5 pipelines. For SDXL, use ip-adapter_sdxl.bin from the sdxl_models subfolder.
Conditioning image has no visible effect on output
The conditioning scale is too low, or the image resolution does not match the pipeline's expected input. ControlNet for SD 1.5 expects 512x512 images. Resize your conditioning image to match:
Related Guides
- How to Build AI Sketch-to-Image Generation with ControlNet Scribble
- How to Build AI Sticker and Emoji Generation with Stable Diffusion
- How to Build AI Architectural Rendering with ControlNet and Stable Diffusion
- How to Build AI Comic Strip Generation with Stable Diffusion
- How to Build AI Scene Generation with Layered Diffusion
- How to Build AI Wallpaper Generation with Stable Diffusion and Tiling
- How to Build AI Motion Graphics Generation with Deforum Stable Diffusion
- How to Build AI Seamless Pattern Generation with Stable Diffusion
- How to Remove and Replace Image Backgrounds with AI
- How to Edit Images with AI Inpainting Using Stable Diffusion