Single-shot image generation hits a wall when you need precise control over complex scenes. You prompt for “a dragon perched on a castle at sunset with fog rolling in” and the model smashes everything together, sometimes beautifully, often not. Layered generation flips the approach: generate the background, inpaint subjects into specific regions, then composite everything with alpha blending and effects. You get control over each element independently, and you can regenerate any single layer without throwing away the rest.

This pipeline uses diffusers for generation and inpainting, and PIL for compositing. You need a GPU with at least 8GB VRAM.

pip install diffusers transformers accelerate torch pillow

Generate the Background Layer

Start with a clean background. This is your base canvas – a landscape, interior, or environment with no main subjects yet. Keep the prompt focused on setting and atmosphere.

import torch
from diffusers import StableDiffusionPipeline

bg_pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",
    torch_dtype=torch.float16,
)
bg_pipe.enable_model_cpu_offload()

prompt = "a vast alien desert landscape at golden hour, dramatic sky with two suns, rocky formations in the distance, cinematic lighting, no people, no characters"
negative_prompt = "people, characters, animals, text, watermark, blurry, low quality"

background = bg_pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=40,
    guidance_scale=8.0,
    width=768,
    height=512,
    generator=torch.Generator("cuda").manual_seed(77),
).images[0]

background.save("background_layer.png")
print(f"Background size: {background.size}")

The negative prompt matters here. Explicitly exclude subjects you plan to add later – otherwise the model might place figures in the scene that clash with your inpainted subjects. The seed keeps things reproducible while you iterate on prompts.

Create Masks for Subject Placement

Masks tell the inpainting model where to paint. White pixels (255) mark regions to regenerate. Black pixels (0) stay untouched. You can define multiple masks for different subjects placed at different positions in the scene.

from PIL import Image, ImageDraw, ImageFilter

width, height = 768, 512

def create_subject_mask(size, regions):
    """Create a binary mask with white regions where subjects should appear."""
    mask = Image.new("L", size, 0)
    draw = ImageDraw.Draw(mask)
    for region in regions:
        shape = region.get("shape", "ellipse")
        bbox = region["bbox"]  # (left, top, right, bottom)
        if shape == "ellipse":
            draw.ellipse(bbox, fill=255)
        elif shape == "rectangle":
            draw.rectangle(bbox, fill=255)
    # Feather the edges so inpainted content blends smoothly
    mask = mask.filter(ImageFilter.GaussianBlur(radius=8))
    return mask

# Mask for a foreground character on the left
character_mask = create_subject_mask(
    (width, height),
    [{"shape": "ellipse", "bbox": (80, 120, 280, 480)}],
)
character_mask.save("character_mask.png")

# Mask for a vehicle/object on the right
vehicle_mask = create_subject_mask(
    (width, height),
    [{"shape": "rectangle", "bbox": (480, 200, 720, 460)}],
)
vehicle_mask.save("vehicle_mask.png")

The Gaussian blur on the mask edges is important. Hard edges create visible seams where inpainted content meets the original background. A blur radius of 5-10 pixels produces a smooth transition.
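You can sanity-check the feathering without running any model. This sketch builds a hard-edged mask, blurs it, and samples pixel values across the boundary to show the gradient the inpainting pipeline will blend along (the coordinates are arbitrary):

```python
from PIL import Image, ImageDraw, ImageFilter

# Build a hard-edged mask, then feather it, and sample values across
# the boundary to see the transition the blur creates.
mask = Image.new("L", (200, 200), 0)
ImageDraw.Draw(mask).rectangle((50, 50, 150, 150), fill=255)
feathered = mask.filter(ImageFilter.GaussianBlur(radius=8))

# Walk a horizontal line through the left edge at y=100
edge_values = [feathered.getpixel((x, 100)) for x in range(40, 61)]
print(edge_values)  # ramps gradually from near 0 up toward 255

# The hard mask jumps straight from 0 to 255 at x=50
hard_values = [mask.getpixel((x, 100)) for x in range(40, 61)]
print(hard_values)
```

If the printed ramp looks too abrupt or too wide for your scene, adjust the blur radius before generating anything.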

Inpaint Subjects into the Scene

Now paint subjects into the masked regions of your background. Each inpainting pass takes the current image plus a mask and generates new content only inside the white area.

import torch
from diffusers import AutoPipelineForInpainting
from PIL import Image

inpaint_pipe = AutoPipelineForInpainting.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting",
    torch_dtype=torch.float16,
)
inpaint_pipe.enable_model_cpu_offload()

# Load the background and character mask
background = Image.open("background_layer.png").convert("RGB")
character_mask = Image.open("character_mask.png").convert("L")

# Inpaint the character
scene_with_character = inpaint_pipe(
    prompt="a lone astronaut in a weathered spacesuit standing on rocky ground, facing away, cinematic, detailed",
    negative_prompt="blurry, deformed, extra limbs, bad anatomy, cartoon",
    image=background,
    mask_image=character_mask,
    num_inference_steps=40,
    guidance_scale=8.0,
    strength=0.95,
    generator=torch.Generator("cuda").manual_seed(42),
).images[0]

scene_with_character.save("scene_with_character.png")

# Now inpaint the vehicle onto the result
vehicle_mask = Image.open("vehicle_mask.png").convert("L")

scene_complete = inpaint_pipe(
    prompt="a rusted sci-fi hover vehicle parked on desert sand, weathered metal, cinematic lighting",
    negative_prompt="blurry, deformed, low quality, cartoon, text",
    image=scene_with_character,
    mask_image=vehicle_mask,
    num_inference_steps=40,
    guidance_scale=8.0,
    strength=0.90,
    generator=torch.Generator("cuda").manual_seed(99),
).images[0]

scene_complete.save("scene_complete.png")

The strength parameter controls how much the inpainted region deviates from the original pixels. At 1.0, the model ignores the existing content entirely. At 0.8-0.95, it uses the background as a loose guide, which helps the lighting and color palette stay consistent across layers.
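A way to build intuition for strength: in diffusers img2img and inpainting pipelines, it controls how far the input is noised, which in turn determines how many of the scheduled denoising steps actually run. This tiny helper mirrors that relationship for illustration (it is not a diffusers API, just the arithmetic):

```python
def effective_steps(num_inference_steps: int, strength: float) -> int:
    """Approximate how many denoising steps an inpainting call actually
    runs: diffusers noises the input partway through the schedule, then
    denoises only the remaining strength-fraction of the steps."""
    return min(int(num_inference_steps * strength), num_inference_steps)

print(effective_steps(40, 0.95))  # 38 steps denoise; a little structure survives
print(effective_steps(40, 1.00))  # all 40: existing pixels are ignored
print(effective_steps(40, 0.50))  # 20: output hews closely to the input
```

This is why very low strength values produce inpainted regions that barely change: most of the schedule is skipped.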

You can chain as many inpainting passes as you want. Each pass adds one subject or detail. The order matters – paint background elements first, then midground, then foreground, so overlapping regions look natural.
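The chaining pattern can be factored into a small driver that runs passes in back-to-front order. Here `pipe_fn` is a hypothetical callable you supply — for real use, wrap your `inpaint_pipe` call in it; the stub below just flat-fills the masked region so the ordering is visible without a GPU:

```python
from PIL import Image, ImageDraw

def chain_inpaint(image, passes, pipe_fn):
    """Run inpainting passes in order (background first, foreground last).
    Each pass: {"prompt": str, "mask": PIL "L" mask, "seed": int}.
    pipe_fn(image, mask, prompt, seed) -> new image."""
    for p in passes:
        image = pipe_fn(image, p["mask"], p["prompt"], p.get("seed", 0))
    return image

# Stub pipe_fn for illustration: fills the masked area with a flat color
# keyed off the prompt. For real use, swap in something like:
#   lambda img, mask, prompt, seed: inpaint_pipe(prompt=prompt, image=img,
#       mask_image=mask,
#       generator=torch.Generator("cuda").manual_seed(seed)).images[0]
def stub_fill(image, mask, prompt, seed):
    color = (200, 80, 40) if "vehicle" in prompt else (60, 120, 200)
    fill = Image.new("RGB", image.size, color)
    out = image.copy()
    out.paste(fill, (0, 0), mask)
    return out

canvas = Image.new("RGB", (768, 512), (230, 200, 150))
left = Image.new("L", (768, 512), 0)
ImageDraw.Draw(left).ellipse((80, 120, 280, 480), fill=255)
right = Image.new("L", (768, 512), 0)
ImageDraw.Draw(right).rectangle((480, 200, 720, 460), fill=255)

result = chain_inpaint(canvas, [
    {"prompt": "lone astronaut", "mask": left, "seed": 42},
    {"prompt": "hover vehicle", "mask": right, "seed": 99},
], stub_fill)
```

Keeping the passes in a list makes it cheap to reorder them or regenerate one subject by changing only its seed.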

Composite and Blend Layers

For finer control, generate layers as separate images and composite them with PIL’s alpha blending. This is useful when you want to adjust opacity per layer, add atmospheric effects, or apply depth-of-field blur to background elements.

import torch
from PIL import Image, ImageFilter, ImageEnhance

def composite_scene(layers):
    """
    Composite a list of layers onto a base canvas.
    Each layer: {"image": PIL.Image, "position": (x, y), "opacity": float, "blur": float}
    """
    base = layers[0]["image"].copy().convert("RGBA")

    for layer in layers[1:]:
        img = layer["image"].convert("RGBA")
        opacity = layer.get("opacity", 1.0)
        blur_radius = layer.get("blur", 0)
        position = layer.get("position", (0, 0))

        # Apply depth-of-field blur
        if blur_radius > 0:
            img = img.filter(ImageFilter.GaussianBlur(radius=blur_radius))

        # Adjust opacity by modifying the alpha channel
        if opacity < 1.0:
            alpha = img.getchannel("A")
            alpha = alpha.point(lambda p: int(p * opacity))
            img.putalpha(alpha)

        # Composite onto base at the specified position
        base.paste(img, position, img)

    return base

# Load the fully inpainted scene from the previous step
final_scene = Image.open("scene_complete.png")

# Generate a fog/atmosphere overlay (bg_pipe is the background pipeline from the first step)
fog_overlay = bg_pipe(
    prompt="thin atmospheric fog, haze, volumetric light, dust particles, transparent",
    negative_prompt="solid, opaque, dark, text",
    num_inference_steps=30,
    guidance_scale=5.0,
    width=768,
    height=512,
    generator=torch.Generator("cuda").manual_seed(200),
).images[0]

# Composite with atmospheric effects
result = composite_scene([
    {"image": final_scene, "position": (0, 0), "opacity": 1.0, "blur": 0},
    {"image": fog_overlay, "position": (0, 0), "opacity": 0.3, "blur": 2},
])

# Boost contrast on the final output
enhancer = ImageEnhance.Contrast(result.convert("RGB"))
result = enhancer.enhance(1.15)

result.save("final_layered_scene.png")
print("Scene composited and saved.")

The fog overlay at 30% opacity adds atmospheric depth without overwhelming the scene. You can generate multiple effect layers – fog, lens flare, color grading overlays – and stack them at different opacities. This is the same principle used in VFX compositing, just with AI-generated elements instead of rendered passes.

For depth-of-field, blur the background more heavily and keep foreground subjects sharp. A blur radius of 3-5 on distant elements creates a convincing bokeh effect without making the image look artificially processed.
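One way to sketch the depth-of-field step with pure PIL: blur a copy of the whole frame, then use a feathered focus mask to keep the subject region sharp. The synthetic image and mask coordinates below are placeholders; substitute your scene and subject mask:

```python
from PIL import Image, ImageDraw, ImageFilter

def depth_of_field(image, focus_mask, blur_radius=4):
    """Blur everything outside the focus mask; white mask pixels stay sharp."""
    blurred = image.filter(ImageFilter.GaussianBlur(radius=blur_radius))
    # Image.composite takes pixels from the first image where the mask
    # is white and from the second where it is black.
    return Image.composite(image, blurred, focus_mask)

# Synthetic example: thin vertical lines so the blur is easy to see
img = Image.new("RGB", (200, 200), (255, 255, 255))
d = ImageDraw.Draw(img)
for x in range(0, 200, 10):
    d.line((x, 0, x, 200), fill=(0, 0, 0))

# Feather the focus mask so the sharp/blurred boundary is not a hard seam
focus = Image.new("L", (200, 200), 0)
ImageDraw.Draw(focus).ellipse((20, 20, 100, 180), fill=255)
focus = focus.filter(ImageFilter.GaussianBlur(radius=6))

result = depth_of_field(img, focus, blur_radius=4)
result.save("dof_demo.png")
```

The same feathering rule from the mask section applies here: a hard focus mask produces a visible ring where sharp pixels meet blurred ones.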

Common Errors and Fixes

RuntimeError: Expected all tensors to be on the same device

This happens when the image tensor is on the CPU but the model is on the GPU, or vice versa. If you are using enable_model_cpu_offload(), make sure you are not also manually calling .to("cuda") on the pipeline. Pick one approach: pipe.to("cuda") to keep everything on the GPU, or pipe.enable_model_cpu_offload() for automatic memory management. Do not mix them.

ValueError: image and mask_image must have the same dimensions

The inpainting pipeline requires the input image and mask to have identical width and height. Resize both to the same dimensions before calling the pipeline:

target_size = (768, 512)
image = image.resize(target_size, Image.LANCZOS)
mask = mask.resize(target_size, Image.LANCZOS)

Also check the mode – the mask should be mode "L" (grayscale) and the image should be "RGB". Convert them explicitly with .convert("L") and .convert("RGB").

torch.cuda.OutOfMemoryError: CUDA out of memory

Layered generation runs the model multiple times, so memory pressure adds up. Three fixes in order of impact:

  1. Use pipe.enable_model_cpu_offload() instead of .to("cuda") – this moves layers to CPU when not in use.
  2. Add pipe.enable_attention_slicing() to reduce peak VRAM by computing attention in chunks.
  3. Clear the CUDA cache between generation passes with torch.cuda.empty_cache().

If you are still running out of memory on an 8GB card, drop the resolution to 512x512 and upscale the final composite with a separate super-resolution pass.

Visible seams between inpainted and original regions

The mask edges are too sharp. Apply a Gaussian blur to the mask before passing it to the inpainting pipeline. A radius of 5-10 pixels smooths the transition. Also try increasing strength closer to 1.0 so the model has more freedom to blend the boundary area.