Product photography is expensive. Studio time, lighting rigs, a photographer who knows what they’re doing – it adds up fast. Diffusion models can generate studio-quality product shots from a single reference image or even just a text prompt. You get full control over the background, lighting, and angle without touching a camera.

Here’s the fastest path: use Stable Diffusion inpainting to swap backgrounds, SDXL for text-to-image generation, and IP-Adapter to keep your product looking the same across multiple scenes.

Replace Product Backgrounds with Inpainting

The most practical use case is background replacement. You have a product photo on a white background and want it placed in a lifestyle scene – on a marble countertop, in a cozy kitchen, on a beach at sunset.

The workflow: segment the product, create a mask for the background, then inpaint the background with a new scene description.

import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

# Load the inpainting pipeline
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

# Load your product image and background mask
# The mask should be white where you want to generate (background)
# and black where you want to keep (product)
product_image = Image.open("product_on_white.png").resize((512, 512))
background_mask = Image.open("background_mask.png").resize((512, 512))

prompt = (
    "product on a dark marble countertop, soft studio lighting, "
    "shallow depth of field, professional product photography, 8k"
)
negative_prompt = (
    "blurry, low quality, distorted, watermark, text, "
    "oversaturated, cartoon, illustration"
)

result = pipe(
    prompt=prompt,
    image=product_image,
    mask_image=background_mask,
    num_inference_steps=50,
    guidance_scale=7.5,
    negative_prompt=negative_prompt,
).images[0]

result.save("product_marble_scene.png")

A few things matter here. The mask quality makes or breaks the result. Use a segmentation model like SAM or rembg to generate a clean product mask. If the mask bleeds into the product edges, you’ll get artifacts where the product meets the new background.
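
If you don't already have a mask, here's a minimal sketch using rembg (assuming pip install rembg and its default model): the cutout's alpha channel marks the product, so thresholding and inverting it gives the white-background mask the pipeline expects.

from PIL import Image, ImageOps
from rembg import remove

product = Image.open("product_on_white.png")

# rembg returns an RGBA cutout with a transparent background
cutout = remove(product)

# Alpha channel: white = product, black = background
alpha = cutout.split()[-1]

# Threshold to a hard binary mask, then invert so white = background (inpaint area)
binary = alpha.point(lambda p: 255 if p > 128 else 0)
background_mask = ImageOps.invert(binary)
background_mask.resize((512, 512)).save("background_mask.png")

SAM gives finer control over tricky edges, but rembg covers most clean product shots with a single function call.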

For the prompt, be specific about lighting and surface materials. “Soft studio lighting” and “shallow depth of field” produce the most realistic product shots. Avoid vague prompts like “nice background” – the model needs concrete visual details.

Generate Product Scenes from Text with SDXL

When you don’t have a reference photo at all, SDXL can generate full product scenes from scratch. This works well for concept mockups or when you need a product that doesn’t physically exist yet.

import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True,
)
pipe = pipe.to("cuda")

# Enable memory-efficient attention for lower VRAM usage
pipe.enable_xformers_memory_efficient_attention()

prompt = (
    "a minimalist glass perfume bottle on a white pedestal, "
    "soft pink gradient background, volumetric lighting from the left, "
    "studio product photography, sharp focus, photorealistic, 4k"
)
negative_prompt = (
    "text, watermark, logo, blurry, deformed, low resolution, "
    "cartoon, painting, illustration, oversaturated"
)

image = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=40,
    guidance_scale=7.0,
    width=1024,
    height=1024,
).images[0]

image.save("perfume_bottle_concept.png")

SDXL’s native 1024x1024 resolution is a big deal for product photography. Lower resolutions produce soft details that scream “AI generated.” At 1024x1024, you get crisp edges on glass, metal, and fabric textures.

Prompting Tips for Product Photos

Your prompt structure matters more than prompt length. Stack these elements in order (see the sketch below):

  • Subject: “a matte black wireless earbud case”
  • Surface/placement: “on a slate tile, surrounded by eucalyptus leaves”
  • Lighting: “rim lighting from behind, soft fill light from the right”
  • Style modifiers: “commercial product photography, editorial, sharp focus”
  • Quality tokens: “8k, photorealistic, high detail”

Avoid cramming too many objects into the scene. One product, one surface, one lighting setup. The model handles simple compositions far better than cluttered ones.
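
As a quick illustration, here is a hypothetical helper that assembles the five elements in order (the function and its argument names are mine, not from any library):

def build_product_prompt(subject, placement, lighting, style, quality):
    # Stack the five prompt elements in order, comma-separated
    return ", ".join([subject, placement, lighting, style, quality])

prompt = build_product_prompt(
    subject="a matte black wireless earbud case",
    placement="on a slate tile, surrounded by eucalyptus leaves",
    lighting="rim lighting from behind, soft fill light from the right",
    style="commercial product photography, editorial, sharp focus",
    quality="8k, photorealistic, high detail",
)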

Keep Products Consistent with IP-Adapter

The hardest problem in AI product photography is consistency. You generate a great shot, but the next one looks like a different product entirely. IP-Adapter solves this by conditioning the generation on a reference image, so the model preserves visual identity across scenes.

import torch
from PIL import Image
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True,
)
pipe = pipe.to("cuda")

# Load the IP-Adapter weights for SDXL
pipe.load_ip_adapter(
    "h94/IP-Adapter",
    subfolder="sdxl_models",
    weight_name="ip-adapter_sdxl.bin",
)

# Set the IP-Adapter scale (0.0 to 1.0)
# Higher values stick closer to the reference image
pipe.set_ip_adapter_scale(0.6)

# Load reference product image
reference_image = Image.open("my_product_reference.png").resize((1024, 1024))

# Generate the product in a new scene
prompt = (
    "product on a wooden table in a sunlit cafe, morning light, "
    "bokeh background, commercial photography, sharp focus"
)
negative_prompt = "blurry, distorted, low quality, watermark, text"

image = pipe(
    prompt=prompt,
    ip_adapter_image=reference_image,
    negative_prompt=negative_prompt,
    num_inference_steps=40,
    guidance_scale=7.0,
    width=1024,
    height=1024,
).images[0]

image.save("product_cafe_scene.png")

The IP-Adapter scale (the value passed to set_ip_adapter_scale) controls how strongly the reference image influences the output. At 0.6, the model balances your text prompt (scene description) against the reference image (product appearance). Push it to 0.8 or higher if the product shape and color aren't coming through. Drop it to 0.4 if the scene is getting ignored.
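
If you're unsure where to land, a quick sweep over a few scale values (reusing the pipeline, prompts, and reference image from the block above) makes the trade-off visible:

# Compare how strongly the reference pulls at different scales
for scale in (0.4, 0.6, 0.8):
    pipe.set_ip_adapter_scale(scale)
    image = pipe(
        prompt=prompt,
        ip_adapter_image=reference_image,
        negative_prompt=negative_prompt,
        num_inference_steps=40,
        guidance_scale=7.0,
        width=1024,
        height=1024,
    ).images[0]
    image.save(f"product_cafe_scale_{scale}.png")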

For best results, use a clean reference image with a neutral background. A product shot on pure white works better than one with a busy scene behind it. The IP-Adapter extracts visual features from the entire image, so background clutter leaks into the generation.
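
If your only reference is a lifestyle shot, you can cut the product out and composite it onto white first. A minimal sketch with rembg (the filenames are placeholders):

from PIL import Image
from rembg import remove

# Cut the product out of the busy scene
cutout = remove(Image.open("lifestyle_shot.png"))

# Composite onto a pure white canvas, using the cutout's alpha as the paste mask
white_bg = Image.new("RGB", cutout.size, (255, 255, 255))
white_bg.paste(cutout, mask=cutout.split()[-1])

# Crop tightly around the product via the alpha channel's bounding box
bbox = cutout.split()[-1].getbbox()
white_bg.crop(bbox).save("my_product_reference.png")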

Batch Generation for Catalogs

When you need multiple scenes for a product catalog, loop through prompt variations and save each result with a descriptive filename:

scenes = [
    ("marble countertop, soft overhead lighting", "marble"),
    ("wooden shelf with plants, natural window light", "shelf"),
    ("dark gradient background, dramatic side lighting", "dramatic"),
    ("beach sand with ocean blur background, golden hour", "beach"),
]

for scene_prompt, label in scenes:
    full_prompt = f"product on a {scene_prompt}, commercial photography, sharp focus, 8k"
    image = pipe(
        prompt=full_prompt,
        ip_adapter_image=reference_image,
        negative_prompt=negative_prompt,
        num_inference_steps=40,
        guidance_scale=7.0,
        width=1024,
        height=1024,
    ).images[0]
    image.save(f"product_{label}.png")

This gives you four different lifestyle shots of the same product in under two minutes on an A100.

Common Errors and Fixes

OutOfMemoryError: CUDA out of memory – SDXL at 1024x1024 needs around 12GB VRAM. Enable attention slicing with pipe.enable_attention_slicing() or use pipe.enable_model_cpu_offload() to move unused components to CPU. This trades speed for lower memory usage.
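
Both are one-line calls on the pipeline. Note that enable_model_cpu_offload() requires the accelerate package and replaces the manual pipe.to("cuda") call:

# Option 1: compute attention in slices (slower, lower peak VRAM)
pipe.enable_attention_slicing()

# Option 2: offload idle components to CPU; call this instead of pipe.to("cuda")
pipe.enable_model_cpu_offload()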

Product looks different from reference with IP-Adapter – Increase ip_adapter_scale from 0.6 to 0.8. If the product still drifts, check that your reference image has a clean background. Crop tightly around the product before passing it in.

Inpainting bleeds into the product area – Your mask isn't clean enough. Shrink the white (inpaint) region by a few pixels so it pulls back from the product edge, leaving a buffer zone. In PIL: mask = mask.filter(ImageFilter.MinFilter(5)), after from PIL import ImageFilter. MinFilter erodes the white area, pulling the regenerated region away from the product.

Generated backgrounds look flat or pasted-on – Add lighting direction to your prompt. “Soft light from the upper left” or “rim lighting from behind” gives the model enough context to render consistent shadows and reflections that match the product.

xformers not installed error – Either install xformers with pip install xformers or replace enable_xformers_memory_efficient_attention() with pipe.enable_attention_slicing() as a fallback. Both reduce memory usage, but xformers is faster.
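
One defensive pattern is to try xformers and fall back when it isn't available (the exact exception diffusers raises can vary by version, so treat this as a sketch):

try:
    pipe.enable_xformers_memory_efficient_attention()
except (ImportError, ValueError):
    # xformers missing or unusable: fall back to the slower built-in option
    pipe.enable_attention_slicing()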

Inconsistent quality across batch generations – Set a fixed seed with generator=torch.Generator("cuda").manual_seed(42) for reproducible results. Then iterate on the seed until you find one that produces clean outputs for your product category.
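
For completeness, the generator is passed into the pipeline call itself; a sketch using the SDXL pipeline from earlier:

# Fixed seed: the same prompt + seed reproduces the same image
generator = torch.Generator("cuda").manual_seed(42)

image = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    generator=generator,
    num_inference_steps=40,
    guidance_scale=7.0,
    width=1024,
    height=1024,
).images[0]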