Professional interior design visualization normally costs thousands of dollars per room – you model it in SketchUp or Blender, set up materials, tweak lighting for hours, then render overnight. ControlNet with Stable Diffusion flips this workflow. You take a phone photo of a room, extract its depth structure, and re-render it in any style you want: mid-century modern, minimalist Scandinavian, industrial loft, Japanese wabi-sabi. The spatial layout stays intact. Only the surfaces, furniture style, and mood change.
Here’s what you need installed:
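A minimal install line covering the stack this guide relies on (PyTorch, diffusers, transformers for Depth Anything V2, OpenCV for the filtering tricks). The exact torch wheel depends on your CUDA driver; see pytorch.org for the right index URL:

```shell
pip install torch diffusers transformers accelerate opencv-python pillow
```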
Extracting Depth Maps from Room Photos
The depth map is the backbone of this entire workflow. It captures the spatial geometry of your room – where walls are, how far the ceiling sits, where furniture breaks the plane – without any texture or color information. This gives Stable Diffusion a structural skeleton to paint over.
Depth Anything V2 is the best option for this. It runs fast, handles indoor scenes well, and produces clean edges around furniture and doorframes. MiDaS works too, but Depth Anything V2 gives noticeably sharper boundaries on interior shots.
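A minimal sketch of the extraction step, assuming the small Depth Anything V2 checkpoint on the Hugging Face Hub and a local `room.jpg` (both names are illustrative; the larger checkpoints trade speed for sharper maps):

```python
from PIL import Image
from transformers import pipeline

# Depth Anything V2 small checkpoint via the generic depth-estimation pipeline
depth_estimator = pipeline(
    "depth-estimation", model="depth-anything/Depth-Anything-V2-Small-hf"
)

# Resize before estimation so the depth map matches the render resolution
room = Image.open("room.jpg").convert("RGB").resize((512, 512))
result = depth_estimator(room)

depth_map = result["depth"]  # PIL grayscale image of relative depth
depth_map.save("room_depth.png")
```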
Always save and inspect the depth map before feeding it into ControlNet. If the depth map looks muddy – walls blending into floors, furniture edges lost – the render will inherit those issues. For rooms with reflective surfaces (mirrors, glass tables), the depth estimator can get confused. A quick fix is to apply a bilateral filter to smooth noise while preserving edges:
For best results, resize your room photo to 512x512 (or 768x768 if your GPU can handle it) before depth estimation. The ControlNet pipeline expects the conditioning image to match the output resolution, and extreme rescaling introduces artifacts.
Generating Interior Renders with ControlNet
With the depth map in hand, you wire it into a ControlNet-conditioned Stable Diffusion pipeline. The depth ControlNet model tells the diffusion process “these pixels should be at this distance” while the text prompt controls the style, materials, and mood.
A controlnet_conditioning_scale of 0.75 is my recommended starting point for interior design. Going higher (0.9+) forces the model to follow the depth map so rigidly that furniture shapes look traced rather than naturally rendered. Go below 0.5 and the room layout starts drifting: walls move, furniture floats.
The UniPC scheduler lets you drop inference steps from 30 to 25 with no visible quality loss. On an RTX 3060 or better, each render takes about 4-6 seconds.
Style Variations and Batch Rendering
The real power of this approach is generating multiple style options from the same room. One depth map, five completely different interiors. Here’s how to batch that:
Each seed gives a different “take” on the same style prompt. If you like the general direction of one render but want variations, keep the prompt and change only the seed. The depth conditioning ensures the room structure stays consistent across all outputs – walls, windows, and major furniture placement remain anchored.
Descriptive filenames matter when you’re showing options to a client or team. Naming files render_01.png through render_05.png is useless two days later.
Common Errors and Fixes
OutOfMemoryError: CUDA out of memory – Interior renders at 512x512 are manageable on 8GB GPUs, but 768x768 can push past the limit. First, make sure you’re using torch.float16 everywhere. Then enable attention slicing and VAE tiling:
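Both are one-line calls on the pipeline object (assuming the `pipe` from the setup above):

```python
# Compute attention in slices instead of one large matmul: slower, less VRAM
pipe.enable_attention_slicing()

# Decode the final latents tile by tile instead of in one VAE pass
pipe.enable_vae_tiling()
```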
If you’re still hitting OOM, switch from .to("cuda") to pipe.enable_model_cpu_offload(). It’s slower but keeps peak VRAM under 6GB.
Depth map quality is poor (muddy edges, missing furniture) – This usually happens with low-light room photos or photos with heavy shadows. Preprocess the room photo by boosting brightness and contrast before running depth estimation. The bilateral/median filter trick from earlier also helps clean up noisy depth maps.
ControlNet conditioning scale too high – output looks flat – When the scale is above 0.9, the model spends all its capacity matching the depth structure and has nothing left for realistic textures. Drop to 0.7-0.8 for interiors. You want the model to respect the room layout, not trace it pixel by pixel.
ControlNet conditioning scale too low – room layout is wrong – Below 0.4, the depth map barely influences generation: walls shift and furniture ends up in the wrong places. For interior design work, never go below 0.6.
Wrong image dimensions cause artifacts – The depth image must be resized to match the pipeline’s target output size. If you’re generating 512x512 but your depth map is 1920x1080, the automatic rescaling can smear depth information. Always explicitly resize:
Colors are washed out or oversaturated – Adjust guidance_scale. At 7.5 you get balanced results. Below 5.0 the output gets muted and generic. Above 10.0 colors start clipping and the image looks like an HDR photo from 2012. For interiors, 7.0-8.0 is the sweet spot.
Related Guides
- How to Build AI Font Generation with Diffusion Models
- How to Build AI Wireframe to UI Generation with Diffusion Models
- How to Build AI Sprite Sheet Generation with Stable Diffusion
- How to Build AI Logo Generation with Stable Diffusion and SDXL
- How to Build AI Coloring Book Generation with Line Art Diffusion
- How to Edit Images with AI Inpainting Using Stable Diffusion
- How to Generate Videos with Stable Video Diffusion
- How to Build AI Sticker and Emoji Generation with Stable Diffusion
- How to Fine-Tune Stable Diffusion with LoRA and DreamBooth
- How to Build AI Architectural Rendering with ControlNet and Stable Diffusion