Virtual clothing try-on takes a person image and a flat garment image, then produces a realistic composite where the person appears to be wearing that garment. The pipeline has four stages: segment the body region where the garment goes, warp the garment to match the person’s pose, inpaint the warped garment onto the person with a diffusion model, and clean up the edges. We start by loading the models and inputs that every stage shares.

import torch
import numpy as np
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline
from transformers import AutoModelForSemanticSegmentation, AutoImageProcessor
import cv2

# Load models once
seg_processor = AutoImageProcessor.from_pretrained("mattmdjaga/segformer_b2_clothes")
seg_model = AutoModelForSemanticSegmentation.from_pretrained("mattmdjaga/segformer_b2_clothes")

inpaint_pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting",
    torch_dtype=torch.float16,
    safety_checker=None,
)
inpaint_pipe = inpaint_pipe.to("cuda")

# Inputs
person_img = Image.open("person.jpg").convert("RGB")
garment_img = Image.open("garment.jpg").convert("RGB")

That gives you the two core models: a clothing segmentation model to identify where the current garment sits on the body, and Stable Diffusion inpainting to blend the new garment in.

Extracting the Body Segmentation Mask

The first real step is figuring out which pixels on the person image belong to the upper-body clothing region. We use SegFormer fine-tuned on clothing categories. The model outputs per-pixel class labels, and we pull out the classes that correspond to upper garments.

def get_clothing_mask(image: Image.Image) -> Image.Image:
    """Extract upper-body clothing mask from a person image."""
    inputs = seg_processor(images=image, return_tensors="pt")

    with torch.no_grad():
        outputs = seg_model(**inputs)

    logits = outputs.logits
    upsampled = torch.nn.functional.interpolate(
        logits, size=image.size[::-1], mode="bilinear", align_corners=False
    )
    pred = upsampled.argmax(dim=1).squeeze().cpu().numpy()

    # Class 4 = upper-clothes in the mattmdjaga/segformer_b2_clothes label map
    upper_clothes_mask = (pred == 4).astype(np.uint8) * 255

    # Clean up small holes and noise
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (7, 7))
    upper_clothes_mask = cv2.morphologyEx(upper_clothes_mask, cv2.MORPH_CLOSE, kernel)
    upper_clothes_mask = cv2.morphologyEx(upper_clothes_mask, cv2.MORPH_OPEN, kernel)

    return Image.fromarray(upper_clothes_mask, mode="L")

clothing_mask = get_clothing_mask(person_img)
clothing_mask.save("clothing_mask.png")

The morphological operations matter more than you would think. Without them you get ragged mask edges that bleed into the final composite. The close operation fills small gaps, and the open operation removes stray pixels around the collar and sleeves.

Choosing the Right Segmentation Classes

The segformer_b2_clothes model uses these label indices for common regions: 0 = background, 1 = hat, 2 = hair, 3 = sunglasses, 4 = upper-clothes, 5 = skirt, 6 = pants, 7 = dress, 8 = belt. If you are building a full-body try-on, combine classes 4, 5, 6, and 7 into a single mask. For upper-body only, stick with class 4.
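Combining several classes into one binary mask is a one-liner with `np.isin`. A minimal sketch, using a tiny synthetic prediction map in place of the real `argmax` output:

```python
import numpy as np

# Per-pixel class predictions, as produced by upsampled.argmax(dim=1) above.
# Synthetic example: 0 = background, 4 = upper-clothes, 6 = pants.
pred = np.array([
    [0, 4, 4, 0],
    [0, 4, 4, 0],
    [0, 6, 6, 0],
], dtype=np.int64)

# Full-body try-on: merge upper-clothes, skirt, pants, and dress classes.
FULL_BODY_CLASSES = [4, 5, 6, 7]
full_body_mask = np.isin(pred, FULL_BODY_CLASSES).astype(np.uint8) * 255
```

In `get_clothing_mask`, this would replace the single `(pred == 4)` comparison; the morphological cleanup afterwards stays the same.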

Garment Warping with Thin Plate Spline Transforms

A flat garment photo does not match the person’s pose. The shoulders might be at different angles, the torso could be turned. TPS (Thin Plate Spline) warping deforms the garment image to align with the target body shape. You define source keypoints on the garment and destination keypoints on the person, then TPS computes a smooth deformation field.

def tps_warp_garment(
    garment: Image.Image,
    src_points: np.ndarray,
    dst_points: np.ndarray,
    output_size: tuple,
) -> Image.Image:
    """Warp garment image using Thin Plate Spline transform.

    Args:
        garment: Flat garment image (RGBA or RGB).
        src_points: Nx2 array of keypoints on the garment.
        dst_points: Nx2 array of corresponding keypoints on the person.
        output_size: (width, height) of the output canvas.
    """
    garment_np = np.array(garment)

    # OpenCV TPS
    tps = cv2.createThinPlateSplineShapeTransformer()

    src_pts = src_points.reshape(1, -1, 2).astype(np.float32)
    dst_pts = dst_points.reshape(1, -1, 2).astype(np.float32)

    matches = [cv2.DMatch(i, i, 0) for i in range(len(src_points))]
    # warpImage applies the backward mapping, so the point sets are
    # passed in (target, source) order here
    tps.estimateTransformation(dst_pts, src_pts, matches)

    # Apply the warp
    warped = tps.warpImage(garment_np)

    # Resize the warped result to the output canvas size
    h, w = output_size[1], output_size[0]
    warped = cv2.resize(warped, (w, h), interpolation=cv2.INTER_LANCZOS4)

    return Image.fromarray(warped)


# Example keypoints: shoulders, waist corners, neckline center
# In production you extract these from a pose estimator like MediaPipe or OpenPose
src_keypoints = np.array([
    [80, 20],   # left shoulder on garment
    [320, 20],  # right shoulder on garment
    [200, 10],  # neckline center
    [60, 380],  # left waist
    [340, 380], # right waist
], dtype=np.float32)

dst_keypoints = np.array([
    [145, 180],  # left shoulder on person
    [365, 195],  # right shoulder on person
    [255, 160],  # neckline center on person
    [130, 420],  # left waist on person
    [380, 430],  # right waist on person
], dtype=np.float32)

garment_resized = garment_img.resize(person_img.size)
warped_garment = tps_warp_garment(garment_resized, src_keypoints, dst_keypoints, person_img.size)
warped_garment.save("warped_garment.png")

In a production system, you would not hard-code keypoints. You would run MediaPipe Pose or OpenPose on both images, match the landmark indices, and feed those directly into the TPS transform. The hard-coded values above are for a quick sanity check.
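As a sketch of that wiring, here is how normalized pose-estimator landmarks could be converted into the pixel keypoint array `tps_warp_garment` expects. The landmark names and values are hypothetical, stand-ins for whatever your pose estimator actually returns (e.g. MediaPipe landmark indices):

```python
import numpy as np

def landmarks_to_keypoints(landmarks: dict, image_size: tuple) -> np.ndarray:
    """Convert named, normalized (0-1) pose landmarks to pixel keypoints
    in the order tps_warp_garment expects.

    Landmark names are illustrative; map them to your pose estimator's
    own indices or names.
    """
    w, h = image_size
    order = ["left_shoulder", "right_shoulder", "neck", "left_hip", "right_hip"]
    return np.array(
        [[landmarks[name][0] * w, landmarks[name][1] * h] for name in order],
        dtype=np.float32,
    )

# Hypothetical normalized landmarks for a 512x640 person image.
person_landmarks = {
    "left_shoulder":  (0.28, 0.28),
    "right_shoulder": (0.71, 0.30),
    "neck":           (0.50, 0.25),
    "left_hip":       (0.25, 0.66),
    "right_hip":      (0.74, 0.67),
}
dst_keypoints = landmarks_to_keypoints(person_landmarks, (512, 640))
```

The same helper, run on landmarks detected in the garment photo (or on annotated template points for the garment), produces the matching `src_keypoints`.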

Why TPS Over Affine

Affine transforms handle rotation, scale, and shear, but they cannot model the non-rigid deformation of fabric. TPS gives you local control. If the left shoulder needs to shift up while the right shoulder stays put, TPS handles that naturally. The tradeoff is you need at least five well-placed control points.
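To make the limitation concrete, here is a small numpy sketch: a least-squares affine fit cannot reproduce five correspondences where only one point moves, so it leaves residual error at the control points, whereas a TPS warp interpolates every control point exactly by construction:

```python
import numpy as np

src = np.array([[80, 20], [320, 20], [200, 10], [60, 380], [340, 380]], dtype=np.float64)
dst = src.copy()
dst[0] += [0, -30]  # only the left shoulder shifts up; the rest stay put

# Best affine fit in the least-squares sense: dst ~= [x, y, 1] @ A
ones = np.ones((len(src), 1))
X = np.hstack([src, ones])                 # (5, 3) design matrix
A, *_ = np.linalg.lstsq(X, dst, rcond=None)
residual = np.abs(X @ A - dst).max()

# A nonzero residual means no affine transform fits all five points;
# TPS would match each control point with zero residual.
print(f"max affine residual at control points: {residual:.1f} px")
```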

Diffusion Inpainting for Garment Blending

Now we have a warped garment and a mask showing where it should go. We composite the warped garment onto the person image in the masked region, then run Stable Diffusion inpainting to fix the seams, add realistic wrinkles, and blend lighting.

def composite_and_inpaint(
    person: Image.Image,
    warped_garment: Image.Image,
    mask: Image.Image,
    pipe: StableDiffusionInpaintPipeline,
    prompt: str = "person wearing a garment, photorealistic, natural lighting, high quality",
) -> Image.Image:
    """Blend warped garment onto person using diffusion inpainting."""
    # Rough composite: paste warped garment into the masked region
    person_np = np.array(person)
    garment_np = np.array(warped_garment.resize(person.size))
    mask_np = np.array(mask.resize(person.size)) / 255.0

    # Alpha-blend the garment into the clothing region
    composite = person_np.copy()
    for c in range(3):
        composite[:, :, c] = (
            garment_np[:, :, c] * mask_np + person_np[:, :, c] * (1 - mask_np)
        ).astype(np.uint8)

    composite_img = Image.fromarray(composite)

    # Dilate mask slightly so inpainting covers the seam edges
    mask_dilated = cv2.dilate(
        np.array(mask.resize(person.size)),
        cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (15, 15)),
        iterations=1,
    )
    inpaint_mask = Image.fromarray(mask_dilated).convert("RGB")

    # Resize to 512x512 for Stable Diffusion
    target_size = (512, 512)
    composite_resized = composite_img.resize(target_size)
    mask_resized = inpaint_mask.resize(target_size)

    result = pipe(
        prompt=prompt,
        image=composite_resized,
        mask_image=mask_resized,
        num_inference_steps=30,
        guidance_scale=7.5,
        strength=0.6,
    ).images[0]

    # Scale back to original resolution
    return result.resize(person.size, Image.LANCZOS)


final_result = composite_and_inpaint(person_img, warped_garment, clothing_mask, inpaint_pipe)
final_result.save("tryon_result.png")

The strength parameter controls how much the diffusion model can deviate from the composite. At 0.6, it keeps the garment pattern and color but smooths transitions. Push it above 0.8 and the model starts hallucinating new patterns. Below 0.4, the seams stay visible. 0.55 to 0.65 is the sweet spot for most garment types.

Prompt Engineering for Try-On

Keep the prompt simple and descriptive. “Person wearing a blue t-shirt, studio lighting, photorealistic” works better than elaborate scene descriptions. Add negative prompts to avoid common artifacts:

negative_prompt = "deformed, blurry, bad anatomy, extra limbs, watermark, text, low quality"

Pass negative_prompt=negative_prompt to the pipeline call.

Post-Processing and Refinement

The inpainting output often has slight color shifts compared to the original garment. A histogram matching step corrects that.

def match_color(source: np.ndarray, reference: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Match color statistics of source to reference within the masked region."""
    result = source.copy()
    for c in range(3):
        src_region = source[:, :, c][mask > 128]
        ref_region = reference[:, :, c][mask > 128]

        if len(src_region) == 0 or len(ref_region) == 0:
            continue

        src_mean, src_std = src_region.mean(), src_region.std() + 1e-6
        ref_mean, ref_std = ref_region.mean(), ref_region.std() + 1e-6

        adjusted = (source[:, :, c].astype(np.float32) - src_mean) * (ref_std / src_std) + ref_mean
        adjusted = np.clip(adjusted, 0, 255).astype(np.uint8)

        result[:, :, c] = np.where(mask > 128, adjusted, source[:, :, c])

    return result


result_np = np.array(final_result)
garment_ref = np.array(warped_garment.resize(final_result.size))
mask_np = np.array(clothing_mask.resize(final_result.size))

color_corrected = match_color(result_np, garment_ref, mask_np)
final_output = Image.fromarray(color_corrected)
final_output.save("tryon_final.png")

For edge blending, apply a Gaussian blur to the mask boundary before compositing:

# Feather the mask edges for smoother blending
mask_float = mask_np.astype(np.float32) / 255.0
mask_blurred = cv2.GaussianBlur(mask_float, (21, 21), 0)

blended = (
    color_corrected.astype(np.float32) * mask_blurred[:, :, None]
    + np.array(person_img.resize(final_result.size)).astype(np.float32)
    * (1 - mask_blurred[:, :, None])
)
blended = np.clip(blended, 0, 255).astype(np.uint8)
Image.fromarray(blended).save("tryon_blended.png")

Common Errors and Fixes

“CUDA out of memory” when running inpainting. The inpainting pipeline loads the full UNet and VAE. Use pipe.enable_attention_slicing() or pipe.enable_model_cpu_offload() to reduce VRAM usage. Both work on 8GB GPUs.

Warped garment looks distorted or stretched. Your TPS control points are misaligned. Verify that source and destination keypoints correspond to the same anatomical landmarks. Visualize both point sets on their respective images before warping.

Segmentation mask includes hair or background. The segformer_b2_clothes model can misclassify dark hair as clothing. Add a post-processing step that intersects the clothing mask with a rough torso bounding box from a pose estimator to filter out stray regions.
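A minimal sketch of that intersection, assuming you already have a torso bounding box (x0, y0, x1, y1) derived from shoulder and hip landmarks:

```python
import numpy as np

def restrict_mask_to_box(mask: np.ndarray, box: tuple) -> np.ndarray:
    """Zero out mask pixels outside a torso bounding box.

    mask: HxW uint8 mask (0 or 255).
    box: (x0, y0, x1, y1) in pixel coordinates, e.g. from a pose
    estimator's shoulder/hip landmarks plus some padding.
    """
    x0, y0, x1, y1 = box
    box_mask = np.zeros_like(mask)
    box_mask[y0:y1, x0:x1] = 255
    return np.where(box_mask > 0, mask, 0).astype(np.uint8)

# Synthetic example: a stray blob outside the torso box gets removed.
mask = np.zeros((8, 8), dtype=np.uint8)
mask[0, 0] = 255          # misclassified pixel (e.g. dark hair)
mask[3:6, 2:6] = 255      # actual clothing region
cleaned = restrict_mask_to_box(mask, (2, 2, 7, 7))
```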

Inpainted result changes the garment color. Lower the strength parameter (try 0.45 to 0.55) and add the garment color to your prompt, for example “person wearing a red plaid shirt.” Also run the histogram matching step described above.

Visible seam lines at mask boundaries. Increase the dilation kernel size from 15 to 25 and increase the Gaussian blur kernel for edge feathering. The wider the transition zone, the smoother the blend, but too wide and you lose garment edge definition.

TPS transform crashes with fewer than 3 points. OpenCV TPS requires at least 3 non-collinear control points. Five points covering shoulders, neckline, and waist corners give the best balance of accuracy and stability.
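A cheap pre-flight check catches this before `estimateTransformation` is ever called. This numpy sketch flags point sets that are too small or near-collinear (after centering, collinear points have a vanishing second singular value):

```python
import numpy as np

def points_are_degenerate(pts: np.ndarray, tol: float = 1e-6) -> bool:
    """Return True if there are fewer than 3 points or they are
    (near-)collinear, i.e. unusable as TPS control points."""
    pts = np.asarray(pts, dtype=np.float64)
    if len(pts) < 3:
        return True
    centered = pts - pts.mean(axis=0)
    # Rank-1 (collinear) point sets have a ~zero second singular value.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    return singular_values[1] < tol

assert points_are_degenerate(np.array([[0, 0], [1, 1]]))          # too few
assert points_are_degenerate(np.array([[0, 0], [1, 1], [2, 2]]))  # collinear
assert not points_are_degenerate(np.array([[0, 0], [1, 0], [0, 1]]))
```

Run this on both the source and destination keypoints; either set being degenerate is enough to break the transform.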