Product photography is expensive. Studio time, lighting rigs, a photographer who knows what they’re doing – it adds up fast. Diffusion models can generate studio-quality product shots from a single reference image or even just a text prompt. You get full control over the background, lighting, and angle without touching a camera.
Here’s the fastest path: use Stable Diffusion inpainting to swap backgrounds, SDXL for text-to-image generation, and IP-Adapter to keep your product looking the same across multiple scenes.
Replace Product Backgrounds with Inpainting
The most practical use case is background replacement. You have a product photo on a white background and want it placed in a lifestyle scene – on a marble countertop, in a cozy kitchen, on a beach at sunset.
The workflow: segment the product, create a mask for the background, then inpaint the background with a new scene description.
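A minimal sketch of this workflow using the diffusers library and rembg, assuming a 512x512 product photo at product.png (the model IDs and file paths here are illustrative, not prescriptive):

```python
import torch
from PIL import Image, ImageOps
from diffusers import StableDiffusionInpaintPipeline
from rembg import remove

# Load a Stable Diffusion inpainting checkpoint (one common choice shown here)
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting",
    torch_dtype=torch.float16,
).to("cuda")

product = Image.open("product.png").convert("RGB").resize((512, 512))

# rembg returns the product cutout with a transparent background. Use its
# alpha channel as the product mask, then invert it so the BACKGROUND is
# white -- the white region is what the pipeline repaints.
cutout = remove(product)
mask = ImageOps.invert(cutout.split()[-1].convert("L"))

result = pipe(
    prompt=(
        "product on a white marble countertop, soft studio lighting, "
        "shallow depth of field, commercial product photography"
    ),
    image=product,
    mask_image=mask,
).images[0]
result.save("product_marble.png")
```

The inversion step is easy to get backwards: for Stable Diffusion inpainting, white pixels in the mask are regenerated and black pixels are preserved, so the product must be black and the background white.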
A few things matter here. The mask quality makes or breaks the result. Use a segmentation model like SAM or rembg to generate a clean product mask. If the mask bleeds into the product edges, you’ll get artifacts where the product meets the new background.
For the prompt, be specific about lighting and surface materials. “Soft studio lighting” and “shallow depth of field” produce the most realistic product shots. Avoid vague prompts like “nice background” – the model needs concrete visual details.
Generate Product Scenes from Text with SDXL
When you don’t have a reference photo at all, SDXL can generate full product scenes from scratch. This works well for concept mockups or when you need a product that doesn’t physically exist yet.
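A text-to-image sketch with the diffusers SDXL pipeline (the negative prompt and sampler settings below are reasonable defaults, not requirements):

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True,
).to("cuda")

image = pipe(
    prompt=(
        "a matte black wireless earbud case on a slate tile, "
        "surrounded by eucalyptus leaves, rim lighting from behind, "
        "commercial product photography, sharp focus, photorealistic"
    ),
    negative_prompt="blurry, low quality, watermark, text, logo",
    width=1024,   # SDXL's native resolution -- keep it here for crisp detail
    height=1024,
    num_inference_steps=30,
    guidance_scale=7.0,
).images[0]
image.save("earbud_case_slate.png")
```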
SDXL’s native 1024x1024 resolution is a big deal for product photography. Lower resolutions produce soft details that scream “AI generated.” At 1024x1024, you get crisp edges on glass, metal, and fabric textures.
Prompting Tips for Product Photos
Your prompt structure matters more than prompt length. Stack these elements in order:
- Subject: “a matte black wireless earbud case”
- Surface/placement: “on a slate tile, surrounded by eucalyptus leaves”
- Lighting: “rim lighting from behind, soft fill light from the right”
- Style modifiers: “commercial product photography, editorial, sharp focus”
- Quality tokens: “8k, photorealistic, high detail”
Avoid cramming too many objects into the scene. One product, one surface, one lighting setup. The model handles simple compositions far better than cluttered ones.
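The stacking order above can be captured in a small helper so every catalog prompt follows the same structure. The build_product_prompt function is a hypothetical convenience, not part of any library:

```python
def build_product_prompt(subject, surface, lighting, style, quality):
    # Join the five elements in the order described above:
    # subject, surface/placement, lighting, style modifiers, quality tokens.
    return ", ".join([subject, surface, lighting, style, quality])

prompt = build_product_prompt(
    subject="a matte black wireless earbud case",
    surface="on a slate tile, surrounded by eucalyptus leaves",
    lighting="rim lighting from behind, soft fill light from the right",
    style="commercial product photography, editorial, sharp focus",
    quality="8k, photorealistic, high detail",
)
```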
Keep Products Consistent with IP-Adapter
The hardest problem in AI product photography is consistency. You generate a great shot, but the next one looks like a different product entirely. IP-Adapter solves this by conditioning the generation on a reference image, so the model preserves visual identity across scenes.
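A sketch of IP-Adapter conditioning via diffusers, assuming a clean reference image at product_reference.png (the h94/IP-Adapter repo hosts the reference checkpoints; the exact weight file shown is one common option):

```python
import torch
from PIL import Image
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")

# Attach the IP-Adapter weights to the SDXL pipeline
pipe.load_ip_adapter(
    "h94/IP-Adapter",
    subfolder="sdxl_models",
    weight_name="ip-adapter_sdxl.bin",
)
pipe.set_ip_adapter_scale(0.6)  # balance reference image vs. text prompt

reference = Image.open("product_reference.png").convert("RGB")

image = pipe(
    prompt=(
        "the product on a wooden cafe table, warm morning light, "
        "commercial product photography"
    ),
    ip_adapter_image=reference,
    num_inference_steps=30,
).images[0]
image.save("product_cafe.png")
```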
The ip_adapter_scale parameter controls how strongly the reference image influences the output. At 0.6, the model balances between your text prompt (scene description) and the reference image (product appearance). Push it to 0.8 or higher if the product shape and color aren’t coming through. Drop it to 0.4 if the scene is getting ignored.
For best results, use a clean reference image with a neutral background. A product shot on pure white works better than one with a busy scene behind it. The IP-Adapter extracts visual features from the entire image, so background clutter leaks into the generation.
Batch Generation for Catalogs
When you need multiple scenes for a product catalog, loop through prompt variations and save each result with a descriptive filename:
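A sketch of that loop, assuming the IP-Adapter pipeline (pipe) and reference image from the previous section are still loaded; the scene list and filenames are illustrative:

```python
import torch

scenes = [
    ("marble", "on a white marble countertop, soft studio lighting"),
    ("kitchen", "in a cozy kitchen, warm morning light"),
    ("beach", "on driftwood at a beach at sunset, golden hour light"),
    ("desk", "on a minimalist oak desk, diffused window light"),
]

for name, scene in scenes:
    image = pipe(
        prompt=(
            f"a matte black wireless earbud case {scene}, "
            "commercial product photography, sharp focus"
        ),
        ip_adapter_image=reference,
        # Fixed seed so only the scene varies between runs
        generator=torch.Generator("cuda").manual_seed(42),
    ).images[0]
    image.save(f"earbud_case_{name}.png")
```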
This gives you four different lifestyle shots of the same product in under two minutes on an A100.
Common Errors and Fixes
OutOfMemoryError: CUDA out of memory – SDXL at 1024x1024 needs around 12GB VRAM. Enable attention slicing with pipe.enable_attention_slicing() or use pipe.enable_model_cpu_offload() to move unused components to CPU. This trades speed for lower memory usage.
Product looks different from reference with IP-Adapter – Increase ip_adapter_scale from 0.6 to 0.8. If the product still drifts, check that your reference image has a clean background. Crop tightly around the product before passing it in.
Inpainting bleeds into the product area – Your mask isn’t clean enough. Dilate the mask by a few pixels to create a buffer zone around the product edge. In PIL: mask = mask.filter(ImageFilter.MaxFilter(5)).
Generated backgrounds look flat or pasted-on – Add lighting direction to your prompt. “Soft light from the upper left” or “rim lighting from behind” gives the model enough context to render consistent shadows and reflections that match the product.
xformers not installed error – Either install xformers with pip install xformers or replace enable_xformers_memory_efficient_attention() with pipe.enable_attention_slicing() as a fallback. Both reduce memory usage, but xformers is faster.
Inconsistent quality across batch generations – Set a fixed seed with generator=torch.Generator("cuda").manual_seed(42) for reproducible results. Then iterate on the seed until you find one that produces clean outputs for your product category.
Related Guides
- How to Edit Images with AI Inpainting Using Stable Diffusion
- How to Build AI Texture Generation for Game Assets with Stable Diffusion
- How to Fine-Tune Stable Diffusion with LoRA and DreamBooth
- How to Generate Images with Stable Diffusion in Python
- How to Generate Videos with Stable Video Diffusion
- How to Build AI Clothing Try-On with Virtual Diffusion Models
- How to Control Image Generation with ControlNet and IP-Adapter
- How to Generate 3D Models from Text and Images with AI
- How to Generate Images in Real Time with Latent Consistency Models
- How to Build AI Wireframe to UI Generation with Diffusion Models