The Short Version
Full fine-tuning of Stable Diffusion retrains every parameter in the UNet. That’s ~860 million weights for SD 1.5 and ~2.6 billion for SDXL. You need multiple A100s and hours of training time. LoRA freezes the base model and trains small rank-decomposition matrices injected into the attention layers. The result: a 4-20MB adapter file instead of a multi-gigabyte checkpoint, trainable on a single 24GB GPU.
DreamBooth takes this further. You give it 5-20 photos of a specific subject – your dog, your face, a product – and it learns to associate that subject with a unique token. Combine DreamBooth with LoRA, and you get subject-driven generation without blowing your VRAM budget.
Here’s the full pipeline: prepare your dataset, train a LoRA with the diffusers training scripts, load the adapter at inference, and optionally merge multiple LoRAs.
Install Dependencies
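A minimal install, assuming a CUDA-capable PyTorch is already set up (this package set is typical for the diffusers training scripts, not exhaustive):

```shell
# Core libraries: training scripts, model loading, LoRA support.
pip install diffusers transformers accelerate peft safetensors
# Optional: memory-efficient attention kernels on NVIDIA GPUs.
pip install xformers
```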
You also need accelerate configured for your hardware.
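Either the interactive walkthrough or the non-interactive default works:

```shell
# Interactive: asks about GPU count, mixed precision, etc.
accelerate config
# Non-interactive: writes sensible single-GPU defaults.
accelerate config default
```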
This writes ~/.cache/huggingface/accelerate/default_config.yaml. For a single-GPU setup, the defaults work fine. For multi-GPU, pick the distributed training option.
Prepare Your Dataset
LoRA training needs image-caption pairs. DreamBooth needs images of your subject plus optional regularization images.
For Style LoRA
Organize your images in a folder with a metadata file.
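One layout the diffusers scripts understand (filenames here are placeholders):

```
train_data/
├── img_001.png
├── img_002.png
├── img_003.png
└── metadata.jsonl
```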
Each line in metadata.jsonl maps a filename to a caption.
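A sketch of writing and re-reading that file, using the file_name/text keys the diffusers image-folder loader expects (the captions are invented examples):

```python
import json
from pathlib import Path

# Invented example captions; describe what makes each image representative.
entries = [
    {"file_name": "img_001.png",
     "text": "watercolor painting of a lighthouse at dusk, soft washes, paper texture"},
    {"file_name": "img_002.png",
     "text": "watercolor painting of a rainy street, loose brushwork, muted palette"},
]

dataset_dir = Path("train_data")
dataset_dir.mkdir(exist_ok=True)

# One JSON object per line -- that is all "jsonl" means.
with open(dataset_dir / "metadata.jsonl", "w") as f:
    for entry in entries:
        f.write(json.dumps(entry) + "\n")

# Sanity-check it parses the way a training script will read it.
with open(dataset_dir / "metadata.jsonl") as f:
    rows = [json.loads(line) for line in f]
print(len(rows), rows[0]["file_name"])  # 2 img_001.png
```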
Captions matter more than you’d think. Vague captions produce vague LoRAs. Be specific about what makes each image representative of the style or subject you’re training.
For DreamBooth Subject Training
Collect 5-20 images of your subject. Consistent lighting and varied angles work best. Name your subject with a rare token to avoid collisions with words the model already knows – something like sks or ohwx.
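A minimal instance-image layout; the rare token lives in the training prompt (e.g. “a photo of sks dog”), not in the filenames:

```
dog_images/
├── photo_01.jpg
├── photo_02.jpg
└── photo_03.jpg
```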
Regularization images prevent the model from forgetting what the class looks like in general. Generate 200+ images of the class (e.g., “a photo of a dog”) using the base model before training.
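A sketch of pre-generating class images with the SDXL base model (needs a GPU; the model name and counts are the ones used elsewhere in this guide – adjust to taste). The DreamBooth script can also generate these for you at training time.

```python
import torch
from pathlib import Path
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

out_dir = Path("class_images")
out_dir.mkdir(exist_ok=True)

# 200 generic class images give the prior-preservation loss something to anchor on.
for i in range(200):
    image = pipe("a photo of a dog", num_inference_steps=25).images[0]
    image.save(out_dir / f"dog_{i:03d}.png")
```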
This gives DreamBooth a reference for “what dogs normally look like” so it doesn’t overfit your specific dog onto every dog prompt.
Train a LoRA with Diffusers
The diffusers library ships training scripts you can run directly. Clone the repo to get them.
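Assuming the standard repo layout (paths are current as of recent diffusers releases; check the examples directory if they have moved):

```shell
git clone https://github.com/huggingface/diffusers
cd diffusers
pip install -e .
# Each example ships its own requirements file.
pip install -r examples/text_to_image/requirements_sdxl.txt
```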
Launch LoRA training on Stable Diffusion XL with accelerate.
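A representative command, assuming the dataset layout from above; flag names follow the train_text_to_image_lora_sdxl.py script, so double-check against --help for your diffusers version:

```shell
accelerate launch examples/text_to_image/train_text_to_image_lora_sdxl.py \
  --pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0" \
  --train_data_dir="train_data" \
  --caption_column="text" \
  --resolution=1024 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --rank=32 \
  --learning_rate=1e-4 \
  --num_train_epochs=100 \
  --mixed_precision="fp16" \
  --output_dir="lora_out"
```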
Key parameters to tune:
- --rank: LoRA rank. Higher rank = more expressive but larger adapter. Use 4-8 for subtle style shifts, 16-32 for complex subjects, 64+ for major style overhauls. Start at 32.
- --learning_rate: 1e-4 is a solid default for SDXL LoRA. Go lower (5e-5) if you see artifacts early.
- --num_train_epochs: Depends on dataset size. 100 epochs on 50 images is ~5000 steps. Watch your loss curve and stop when it plateaus.
- --resolution: Match the base model. 512 for SD 1.5, 1024 for SDXL.
Training SDXL LoRA at rank 32 with batch size 1 and gradient accumulation of 4 uses about 18GB VRAM. An RTX 4090 handles it comfortably. An RTX 3090 works too with --gradient_checkpointing.
Train DreamBooth with LoRA
DreamBooth has its own training script. The key difference: you’re teaching the model a specific subject bound to a unique identifier token.
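A representative command for the SDXL DreamBooth LoRA script (flag names per train_dreambooth_lora_sdxl.py; verify against --help for your version):

```shell
accelerate launch examples/dreambooth/train_dreambooth_lora_sdxl.py \
  --pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0" \
  --instance_data_dir="dog_images" \
  --instance_prompt="a photo of sks dog" \
  --class_data_dir="class_images" \
  --class_prompt="a photo of a dog" \
  --with_prior_preservation \
  --prior_loss_weight=1.0 \
  --num_class_images=200 \
  --rank=32 \
  --resolution=1024 \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --learning_rate=1e-4 \
  --max_train_steps=800 \
  --mixed_precision="fp16" \
  --output_dir="dreambooth_lora_out"
```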
The --with_prior_preservation flag (paired with --prior_loss_weight) is what keeps the model from catastrophic forgetting. It generates class images on the fly (or uses your pre-generated ones from --class_data_dir) and mixes them into the training batch. Without it, the model starts generating your specific dog for every dog prompt after a few hundred steps.
DreamBooth LoRA typically converges faster than style LoRA. Start with 500-1000 steps and check outputs. Overfitting shows up as the model reproducing your training images verbatim instead of generalizing.
Load and Use Your Trained LoRA
After training, you get a pytorch_lora_weights.safetensors file in your output directory. Loading it at inference takes two lines.
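A sketch of inference with the trained adapter, assuming an SDXL-based LoRA in the lora_out/ directory from the training step (needs a GPU):

```python
import torch
from diffusers import StableDiffusionXLPipeline

# Load the same base model you trained against.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# Point at the directory (or file) containing pytorch_lora_weights.safetensors.
pipe.load_lora_weights("lora_out")

image = pipe("a watercolor painting of a castle").images[0]
image.save("castle.png")
```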
Adjust LoRA Strength at Inference
You don’t have to use the LoRA at full strength. Scale it between 0 and 1 to control how much influence the adapter has.
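One way to do this in diffusers is the scale entry in cross_attention_kwargs (a sketch; needs a GPU, and lora_out is the hypothetical output directory from training):

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights("lora_out")

# 1.0 is full strength; lower values blend toward the base model.
image = pipe(
    "a watercolor painting of a castle",
    cross_attention_kwargs={"scale": 0.4},
).images[0]
```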
This is useful for blending. A style LoRA at 0.3-0.5 gives a hint of the trained style without overwhelming the base model’s capabilities.
Merge Multiple LoRAs
You can stack multiple LoRAs – say a style LoRA and a subject LoRA – on the same pipeline.
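With the PEFT backend installed (pip install peft), adapters can be named and weighted individually (a sketch; the directory names are the hypothetical outputs from the training steps above, and it needs a GPU):

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# Load each adapter under its own name.
pipe.load_lora_weights("lora_out", adapter_name="style")
pipe.load_lora_weights("dreambooth_lora_out", adapter_name="subject")

# Per-adapter weights; full strength on both often muddies the output.
pipe.set_adapters(["style", "subject"], adapter_weights=[0.6, 0.8])

image = pipe("a watercolor painting of sks dog").images[0]
```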
Order doesn’t matter for loading, but the weights do. Experiment with different ratios. Two LoRAs at full strength often conflict and produce muddy results.
LoRA vs Full Fine-Tuning vs DreamBooth
| Approach | VRAM | Training Time | Output Size | Best For |
|---|---|---|---|---|
| LoRA | 16-24GB | 1-4 hours | 4-50MB | Styles, concepts, general adaptation |
| DreamBooth + LoRA | 16-24GB | 30min-2 hours | 4-50MB | Specific subjects (faces, objects, characters) |
| DreamBooth (full) | 40GB+ | 2-8 hours | 2-6GB | Maximum subject fidelity |
| Full fine-tuning | 80GB+ | 6-24 hours | 2-6GB | New model variants, large dataset adaptation |
LoRA is the right choice 90% of the time. Full fine-tuning only makes sense if you’re training on tens of thousands of images and need the model to fundamentally shift its output distribution. DreamBooth without LoRA produces better subject fidelity but the checkpoint is enormous and you can’t easily combine it with other adaptations.
My recommendation: start with LoRA at rank 32. If the results aren’t capturing your subject well enough, try DreamBooth + LoRA. Only go to full DreamBooth or full fine-tuning if you’ve exhausted LoRA options.
Common Errors
CUDA Out of Memory
Add --gradient_checkpointing to your training command. This trades compute for memory by recomputing activations during the backward pass instead of storing them. It slows training by ~20% but cuts memory usage significantly. Also try reducing --train_batch_size to 1 and increasing --gradient_accumulation_steps to compensate.
LoRA Weights Don’t Load
This happens when you train on SD 1.5 but try to load the LoRA on SDXL (or vice versa). LoRA weights are architecture-specific. Check which base model you trained against and load that same model at inference.
DreamBooth Overfitting
If your generated images look like exact copies of training images, you’ve overfit. Reduce --max_train_steps, lower the learning rate, or add more regularization images. A good rule of thumb: if you have 10 instance images, use at least 200 class images for prior preservation.
Images Look Washed Out or Blurry After LoRA
Your LoRA scale might be too high, or training went too long. Try lora_scale=0.5 at inference. If the problem persists, retrain with a lower learning rate or fewer epochs. Also check that your training images were high-resolution – garbage in, garbage out.
Accelerate Config Issues
Run accelerate config default again and make sure your setup matches your hardware. For a single GPU, the defaults are almost always correct. If you’re getting device placement errors, explicitly set --mixed_precision="fp16" in both your accelerate config and training command.
Related Guides
- How to Edit Images with AI Inpainting Using Stable Diffusion
- How to Generate Images with Stable Diffusion in Python
- How to Build AI Sticker and Emoji Generation with Stable Diffusion
- How to Build AI Motion Graphics Generation with Deforum Stable Diffusion
- How to Build AI Wallpaper Generation with Stable Diffusion and Tiling
- How to Build AI Seamless Pattern Generation with Stable Diffusion
- How to Build AI Architectural Rendering with ControlNet and Stable Diffusion
- How to Generate AI Product Photography with Diffusion Models
- How to Generate Videos with Stable Video Diffusion
- How to Build AI Texture Generation for Game Assets with Stable Diffusion