The Short Version

Fully fine-tuning Stable Diffusion retrains every parameter in the UNet. That’s ~860 million weights for SD 1.5 and ~2.6 billion for SDXL. You need multiple A100s and hours of training time. LoRA freezes the base model and trains small rank-decomposition matrices injected into the attention layers. The result: a 4-20MB adapter file instead of a multi-gigabyte checkpoint, trainable on a single 24GB GPU.
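The rank-decomposition trick is simple enough to sketch in a few lines. This toy example (my own illustration, not the diffusers internals) shows why the adapter file is tiny: instead of updating a frozen weight W, you train two small matrices whose product has W's shape:

```python
import torch

# Toy LoRA layer. Dimensions are illustrative, not taken from SD's UNet.
d_out, d_in, rank, alpha = 320, 768, 16, 16

W = torch.randn(d_out, d_in)         # frozen base weight, never updated
A = torch.randn(rank, d_in) * 0.01   # trainable "down" projection, small init
B = torch.zeros(d_out, rank)         # trainable "up" projection, zero init

def lora_forward(x, scale=1.0):
    # base path plus the scaled low-rank path; `scale` is the lora_scale knob
    return x @ W.T + scale * (alpha / rank) * (x @ A.T @ B.T)

x = torch.randn(1, d_in)
full = lora_forward(x, scale=1.0)
off = lora_forward(x, scale=0.0)
# with B initialised to zero, the LoRA path contributes nothing at the start
assert torch.allclose(full, off)
```

For this one layer, full fine-tuning would touch d_out × d_in = 245,760 weights; the LoRA pair holds rank × (d_in + d_out) = 17,408, about 7%, and real adapters only target a subset of layers.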

DreamBooth takes this further. You give it 5-20 photos of a specific subject – your dog, your face, a product – and it learns to associate that subject with a unique token. Combine DreamBooth with LoRA, and you get subject-driven generation without blowing your VRAM budget.

Here’s the full pipeline: prepare your dataset, train a LoRA with the diffusers training scripts, load the adapter at inference, and optionally merge multiple LoRAs.

Install Dependencies

pip install diffusers transformers accelerate peft bitsandbytes torch torchvision
pip install datasets wandb  # optional: dataset loading and experiment tracking

You also need accelerate configured for your hardware:

accelerate config default

This writes ~/.cache/huggingface/accelerate/default_config.yaml. For a single-GPU setup, the defaults work fine. For multi-GPU, pick the distributed training option.

Prepare Your Dataset

LoRA training needs image-caption pairs. DreamBooth needs images of your subject plus optional regularization images.

For Style LoRA

Organize your images in a folder with a metadata file:

dataset/
  image_001.png
  image_002.png
  ...
  metadata.jsonl

Each line in metadata.jsonl maps a filename to a caption:

{"file_name": "image_001.png", "text": "a landscape painting in impressionist style, vibrant colors"}
{"file_name": "image_002.png", "text": "a portrait in impressionist style, loose brushstrokes"}

Captions matter more than you’d think. Vague captions produce vague LoRAs. Be specific about what makes each image representative of the style or subject you’re training.
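Before launching a long run, it’s worth sanity-checking the dataset. A small script like this (a hypothetical helper, not part of diffusers) catches missing files and throwaway captions:

```python
import json
from pathlib import Path

def validate_metadata(dataset_dir):
    """Flag metadata.jsonl entries that point at missing files or carry
    captions too short to be useful. Returns a list of problem strings."""
    root = Path(dataset_dir)
    problems = []
    with open(root / "metadata.jsonl") as f:
        for lineno, line in enumerate(f, 1):
            entry = json.loads(line)
            if not (root / entry["file_name"]).exists():
                problems.append(f"line {lineno}: missing file {entry['file_name']}")
            if len(entry.get("text", "").split()) < 3:
                problems.append(f"line {lineno}: caption has fewer than 3 words")
    return problems

# for p in validate_metadata("./dataset"):
#     print(p)
```

Run it once before training; a single mismatched filename will otherwise surface as a confusing mid-run crash or a silently skipped image.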

For DreamBooth Subject Training

Collect 5-20 images of your subject. Consistent lighting and varied angles work best. Name your subject with a rare token to avoid collisions – something like sks or ohwx:

instance_data/
  subject_01.jpg
  subject_02.jpg
  ...

Regularization images prevent the model from forgetting what the class looks like in general. Generate 200+ images of the class (e.g., “a photo of a dog”) using the base model before training:

import os

import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")

os.makedirs("class_data", exist_ok=True)  # image.save fails if the dir is missing
for i in range(200):
    image = pipe("a photo of a dog", num_inference_steps=25).images[0]
    image.save(f"class_data/dog_{i:04d}.png")

This gives DreamBooth a reference for “what dogs normally look like” so it doesn’t overfit your specific dog onto every dog prompt.

Train a LoRA with Diffusers

The diffusers library ships training scripts you can run directly. Clone the repo to get them:

git clone https://github.com/huggingface/diffusers.git
cd diffusers/examples/text_to_image
pip install -r requirements.txt

Launch LoRA training on Stable Diffusion XL:

accelerate launch train_text_to_image_lora_sdxl.py \
  --pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0" \
  --train_data_dir="./dataset" \
  --caption_column="text" \
  --resolution=1024 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --num_train_epochs=100 \
  --learning_rate=1e-4 \
  --lr_scheduler="cosine" \
  --lr_warmup_steps=100 \
  --rank=32 \
  --mixed_precision="fp16" \
  --output_dir="./sdxl-lora-output" \
  --checkpointing_steps=500 \
  --seed=42 \
  --report_to="wandb"

Key parameters to tune:

  • --rank: LoRA rank. Higher rank = more expressive but larger adapter. 4-8 for subtle style shifts, 16-32 for complex subjects, 64+ for major style overhauls. Start at 32.
  • --learning_rate: 1e-4 is a solid default for SDXL LoRA. Go lower (5e-5) if you see artifacts early.
  • --num_train_epochs: Depends on dataset size. 100 epochs on 50 images is ~5000 steps. Watch your loss curve and stop when it plateaus.
  • --resolution: Match the base model. 512 for SD 1.5, 1024 for SDXL.
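The epochs-to-steps arithmetic is easy to get wrong once gradient accumulation enters the picture. A quick helper makes it concrete (assuming a single process and no dropped final batch):

```python
import math

def training_steps(num_images, epochs, batch_size, grad_accum):
    # forward/backward passes vs. actual optimizer updates
    batches_per_epoch = math.ceil(num_images / batch_size)
    forward_steps = batches_per_epoch * epochs
    optimizer_steps = math.ceil(forward_steps / grad_accum)
    return forward_steps, optimizer_steps

# 50 images, 100 epochs, batch size 1, gradient accumulation 4
fwd, opt = training_steps(50, 100, 1, 4)
```

So “100 epochs on 50 images” is ~5,000 forward passes but only ~1,250 optimizer updates at accumulation 4 – keep that distinction in mind when comparing runs configured with --max_train_steps.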

Training SDXL LoRA at rank 32 with batch size 1 and gradient accumulation of 4 uses about 18GB VRAM. An RTX 4090 handles it comfortably. An RTX 3090 works too with --gradient_checkpointing.

Train DreamBooth with LoRA

DreamBooth has its own training script. The key difference: you’re teaching the model a specific subject bound to a unique identifier token.

cd diffusers/examples/dreambooth
pip install -r requirements.txt
accelerate launch train_dreambooth_lora_sdxl.py \
  --pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0" \
  --instance_data_dir="./instance_data" \
  --class_data_dir="./class_data" \
  --instance_prompt="a photo of sks dog" \
  --class_prompt="a photo of dog" \
  --with_prior_preservation \
  --prior_loss_weight=1.0 \
  --num_class_images=200 \
  --resolution=1024 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --learning_rate=1e-4 \
  --lr_scheduler="constant" \
  --lr_warmup_steps=0 \
  --max_train_steps=1000 \
  --rank=16 \
  --mixed_precision="fp16" \
  --output_dir="./dreambooth-lora-output" \
  --seed=42

The --with_prior_preservation flag is what keeps the model from catastrophic forgetting. It generates class images on the fly (or uses your pre-generated ones from --class_data_dir) and mixes them into the training batch. Without this, the model starts generating your specific dog for every dog prompt after a few hundred steps.
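The mechanics of that mixing are worth seeing in outline. This is an illustrative sketch (the real diffusers script differs in details): instance and class examples share a batch, and the loss is the instance MSE plus a weighted class MSE:

```python
import torch
import torch.nn.functional as F

def dreambooth_loss(model_pred, target, prior_loss_weight=1.0):
    # instance and class examples are concatenated along the batch dimension,
    # then split back apart before computing the two MSE terms
    pred_instance, pred_class = model_pred.chunk(2, dim=0)
    target_instance, target_class = target.chunk(2, dim=0)
    instance_loss = F.mse_loss(pred_instance, target_instance)
    prior_loss = F.mse_loss(pred_class, target_class)
    return instance_loss + prior_loss_weight * prior_loss

# shapes are illustrative: [instance, class] noise predictions in one batch
pred = torch.randn(2, 4, 64, 64)
target = torch.randn(2, 4, 64, 64)
loss = dreambooth_loss(pred, target, prior_loss_weight=1.0)
```

Setting prior_loss_weight to 0 recovers plain DreamBooth without preservation, which is exactly the regime where the subject bleeds into every class prompt.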

DreamBooth LoRA typically converges faster than style LoRA. Start with 500-1000 steps and check outputs. Overfitting shows up as the model reproducing your training images verbatim instead of generalizing.

Load and Use Your Trained LoRA

After training, you get a pytorch_lora_weights.safetensors file in your output directory. Loading it is a single call:

import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")

# Load your LoRA weights
pipe.load_lora_weights("./sdxl-lora-output")

image = pipe(
    "a cyberpunk cityscape in impressionist style",
    num_inference_steps=30,
    guidance_scale=7.5,
).images[0]
image.save("lora_output.png")

Adjust LoRA Strength at Inference

You don’t have to use the LoRA at full strength. Scale it between 0 and 1 to control how much influence the adapter has:

pipe.load_lora_weights("./sdxl-lora-output")

# Full LoRA effect
pipe.fuse_lora(lora_scale=1.0)

# To change the scale, unfuse first -- fusing again would stack on top
pipe.unfuse_lora()
pipe.fuse_lora(lora_scale=0.5)

# Remove LoRA influence entirely
pipe.unfuse_lora()
pipe.unload_lora_weights()

This is useful for blending. A style LoRA at 0.3-0.5 gives a hint of the trained style without overwhelming the base model’s capabilities.
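When dialing this in, it helps to render the same prompt at several scales side by side. Here is a small sweep helper (my sketch, not a diffusers utility); it assumes `pipe` already has unfused LoRA weights loaded and relies on diffusers forwarding a per-call scale through `cross_attention_kwargs`:

```python
def sweep_lora_scales(pipe, prompt, scales=(0.3, 0.5, 0.8, 1.0), **kwargs):
    """Render `prompt` once per LoRA scale, keyed by scale, for comparison.
    Assumes `pipe` has LoRA weights loaded and left unfused."""
    images = {}
    for scale in scales:
        # per-call scaling: no fuse/unfuse bookkeeping needed between renders
        result = pipe(prompt, cross_attention_kwargs={"scale": scale}, **kwargs)
        images[scale] = result.images[0]
    return images

# images = sweep_lora_scales(pipe, "a cyberpunk cityscape in impressionist style")
# for scale, img in images.items():
#     img.save(f"scale_{scale}.png")
```

Saving the grid once per training run makes the "too strong / too weak" judgment much faster than eyeballing single images.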

Merge Multiple LoRAs

You can stack multiple LoRAs – say a style LoRA and a subject LoRA – on the same pipeline:

# Load first LoRA (style)
pipe.load_lora_weights("./style-lora", adapter_name="style")

# Load second LoRA (subject)
pipe.load_lora_weights("./dreambooth-lora-output", adapter_name="subject")

# Set weights for each
pipe.set_adapters(["style", "subject"], adapter_weights=[0.7, 1.0])

image = pipe("a photo of sks dog in impressionist style").images[0]

Order doesn’t matter for loading, but the weights do. Experiment with different ratios. Two LoRAs at full strength often conflict and produce muddy results.
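A small grid search takes the guesswork out of finding ratios. This sketch (my own helper; it assumes adapters were loaded under the names "style" and "subject" as above) renders every combination:

```python
import itertools

def grid_adapter_weights(pipe, prompt, style_weights, subject_weights, **kwargs):
    """Render `prompt` for every (style, subject) weight pair and return the
    images keyed by that pair."""
    results = {}
    for sw, uw in itertools.product(style_weights, subject_weights):
        pipe.set_adapters(["style", "subject"], adapter_weights=[sw, uw])
        results[(sw, uw)] = pipe(prompt, **kwargs).images[0]
    return results

# results = grid_adapter_weights(pipe, "a photo of sks dog in impressionist style",
#                                style_weights=[0.3, 0.5, 0.7],
#                                subject_weights=[0.8, 1.0])
```

Six renders is usually enough to see where the style starts muddying the subject.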

LoRA vs Full Fine-Tuning vs DreamBooth

Approach          | VRAM    | Training Time | Output Size | Best For
------------------|---------|---------------|-------------|-----------------------------------------------
LoRA              | 16-24GB | 1-4 hours     | 4-50MB      | Styles, concepts, general adaptation
DreamBooth + LoRA | 16-24GB | 30min-2 hours | 4-50MB      | Specific subjects (faces, objects, characters)
DreamBooth (full) | 40GB+   | 2-8 hours     | 2-6GB       | Maximum subject fidelity
Full fine-tuning  | 80GB+   | 6-24 hours    | 2-6GB       | New model variants, large dataset adaptation

LoRA is the right choice 90% of the time. Full fine-tuning only makes sense if you’re training on tens of thousands of images and need the model to fundamentally shift its output distribution. DreamBooth without LoRA produces better subject fidelity but the checkpoint is enormous and you can’t easily combine it with other adaptations.

My recommendation: start with LoRA at rank 32. If the results aren’t capturing your subject well enough, try DreamBooth + LoRA. Only go to full DreamBooth or full fine-tuning if you’ve exhausted LoRA options.

Common Errors

CUDA Out of Memory

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB

Add --gradient_checkpointing to your training command. This trades compute for memory by recomputing activations during the backward pass instead of storing them. It slows training by ~20% but cuts memory usage significantly. Also try reducing --train_batch_size to 1 and increasing --gradient_accumulation_steps to compensate.

LoRA Weights Don’t Load

ValueError: The provided state dict does not match the UNet's state dict

This happens when you train on SD 1.5 but try to load the LoRA on SDXL (or vice versa). LoRA weights are architecture-specific. Check which base model you trained against and load that same model at inference.
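If you’re not sure which architecture an adapter file targets, the weight shapes give it away. This heuristic is my assumption about common key layouts, not a diffusers utility: cross-attention to_k/to_v "down" matrices project from the text-encoder hidden size, 768 for SD 1.5 and 2048 for SDXL’s combined encoders:

```python
def guess_base_model(state_dict):
    """Guess the base architecture of a LoRA state dict from the input
    dimension of its cross-attention down-projections. Heuristic only."""
    for key, tensor in state_dict.items():
        # match both peft-style ("lora_A") and kohya-style ("lora_down") keys
        if "attn2" in key and "to_k" in key and ("lora_A" in key or "lora_down" in key):
            in_features = tensor.shape[-1]
            if in_features == 768:
                return "sd-1.5"
            if in_features == 2048:
                return "sdxl"
    return "unknown"

# from safetensors.torch import load_file
# print(guess_base_model(load_file("pytorch_lora_weights.safetensors")))
```

Running this before inference is cheaper than waiting for the state-dict mismatch error.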

DreamBooth Overfitting

If your generated images look like exact copies of training images, you’ve overfit. Reduce --max_train_steps, lower the learning rate, or add more regularization images. A good rule of thumb: if you have 10 instance images, use at least 200 class images for prior preservation.

Images Look Washed Out or Blurry After LoRA

Your LoRA scale might be too high, or training went too long. Try lora_scale=0.5 at inference. If the problem persists, retrain with a lower learning rate or fewer epochs. Also check that your training images were high-resolution – garbage in, garbage out.

Accelerate Config Issues

RuntimeError: Expected all tensors to be on the same device

Run accelerate config default again and make sure your setup matches your hardware. For a single GPU, the defaults are almost always correct. If you’re getting device placement errors, explicitly set --mixed_precision="fp16" in both your accelerate config and training command.