The Fast Path to Fine-Tuning
Fully fine-tuning a 7B-parameter model needs 4x A100 GPUs and days of training time. LoRA sidesteps this by freezing the base model weights and training small rank-decomposition matrices instead. You get 95%+ of full fine-tuning quality for a fraction of the compute.
Unsloth makes LoRA fine-tuning even faster. It patches the model’s attention and MLP layers with custom Triton kernels, cutting memory usage by 60% and doubling training speed. A Llama 3 8B QLoRA fine-tune fits on a single 16GB GPU. That means a T4 on Colab or a 4060 Ti on your desk.
Here’s the full workflow: install Unsloth, load a base model, attach LoRA adapters, train on your dataset, and save the result.
Install Unsloth
Unsloth requires a CUDA GPU. It works on Linux and WSL2 on Windows. macOS is not supported.
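A representative install sequence (the exact package list and pins are an assumption; check the Unsloth README for the current recommendation):

```shell
pip install unsloth
pip install --no-deps trl peft accelerate bitsandbytes
```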
The --no-deps flag on the second line prevents dependency conflicts between Unsloth’s pinned versions and what trl/peft want to install. If you’re on a fresh environment, you can skip --no-deps, but in practice it saves you from version hell.
For a specific CUDA version (e.g., CUDA 12.1 with PyTorch 2.4):
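The extras tag below follows Unsloth's `cuXXX-torchYYY` naming scheme; verify the exact tag for your setup against the Unsloth README:

```shell
pip install "unsloth[cu121-torch240] @ git+https://github.com/unslothai/unsloth.git"
```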
Check that it installed correctly:
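A one-liner that fails loudly if the import is broken:

```shell
python -c "from unsloth import FastLanguageModel; print('Unsloth OK')"
```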
Load a Base Model with QLoRA
Unsloth wraps the Hugging Face model loading with automatic 4-bit quantization. This is QLoRA – the model weights are quantized to 4 bits, but LoRA adapters train in 16-bit precision.
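A sketch using Unsloth's `FastLanguageModel` API; the model name is one of Unsloth's pre-quantized checkpoints (swap in any supported model):

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",  # pre-quantized 4-bit checkpoint
    max_seq_length=2048,  # context window used during training
    dtype=None,           # let Unsloth pick bfloat16 or float16 per GPU
    load_in_4bit=True,    # QLoRA: 4-bit base weights, 16-bit adapters
)
```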
The dtype=None setting is the right default. Unsloth picks bfloat16 on Ampere GPUs (A100, 3090, 4090) and float16 on older cards. You can force it with dtype=torch.float16 if you hit issues.
Setting load_in_4bit=True enables QLoRA. This drops VRAM usage from ~16GB to ~5GB for an 8B model. If you have a 24GB GPU, you can set load_in_4bit=False for standard LoRA, which trains slightly faster and produces marginally better results.
VRAM Requirements
| Model Size | QLoRA (4-bit) | LoRA (16-bit) |
|---|---|---|
| 7-8B | ~6 GB | ~16 GB |
| 13B | ~10 GB | ~28 GB |
| 70B | ~40 GB | ~140 GB |
Attach LoRA Adapters
Now add the LoRA adapter layers to the model. This is where you choose which layers to train and the LoRA rank.
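A typical adapter configuration, continuing from the loaded `model` above:

```python
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "gate_proj", "up_proj", "down_proj",     # MLP projections
    ],
    lora_alpha=16,   # scaling factor, commonly set equal to r
    lora_dropout=0,  # keep at 0 for Unsloth's fast path
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
)
```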
A few things worth noting:
- Rank (`r`): 16 is the sweet spot for most tasks. Bump it to 32 or 64 for complex reasoning tasks. Rank 8 works fine for simple classification or style transfer.
- `target_modules`: Train all attention projections plus the MLP gate/up/down projections. This is more aggressive than the original LoRA paper suggested (just `q_proj`/`v_proj`), but it gives better results.
- `use_gradient_checkpointing="unsloth"`: Unsloth's custom gradient checkpointing uses 30% less VRAM than the standard Hugging Face implementation with no speed penalty. Always use it.
- `lora_dropout=0`: Unsloth's Triton kernels are optimized for zero dropout. Setting dropout > 0 disables the fast path and slows training.
Prepare Your Dataset
Unsloth works with the Hugging Face datasets library. The SFTTrainer from trl expects a specific format. For instruction-tuning, use the chat template approach.
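One way to render a conversational dataset into training text with Unsloth's chat-template helper (the dataset name and its `conversations` column are illustrative assumptions):

```python
from datasets import load_dataset
from unsloth.chat_templates import get_chat_template

# Attach the Llama 3 chat template to the tokenizer
tokenizer = get_chat_template(tokenizer, chat_template="llama-3")

def to_text(examples):
    # Render each conversation into a single training string
    texts = [
        tokenizer.apply_chat_template(conv, tokenize=False, add_generation_prompt=False)
        for conv in examples["conversations"]
    ]
    return {"text": texts}

dataset = load_dataset("philschmid/guanaco-sharegpt-style", split="train")  # assumed dataset
dataset = dataset.map(to_text, batched=True)
```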
For a custom dataset, load it from a JSON or CSV file:
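Local files load through the same `datasets` API (use `"csv"` instead of `"json"` for CSV):

```python
from datasets import load_dataset

# load_dataset("json", ...) handles both JSON and JSON Lines files
dataset = load_dataset("json", data_files="train.jsonl", split="train")
```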
Your JSONL file should look like this:
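One JSON object per line; a common instruction-tuning shape (field names are illustrative, match them to your formatting function):

```json
{"instruction": "Summarize this paragraph.", "input": "LoRA freezes the base weights and trains small adapters...", "output": "LoRA fine-tunes by training low-rank adapter matrices."}
{"instruction": "Translate to French.", "input": "Good morning.", "output": "Bonjour."}
```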
Train the Model
Use the SFTTrainer from trl. Unsloth patches it to run faster automatically.
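A sketch of the training setup; argument names track the `trl` SFTTrainer API at the time of writing, so check your installed version if anything is rejected:

```python
import torch
from transformers import TrainingArguments
from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",  # column produced during dataset prep
    max_seq_length=2048,
    packing=False,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,  # effective batch size of 8
        num_train_epochs=1,
        learning_rate=2e-4,
        bf16=torch.cuda.is_bf16_supported(),
        fp16=not torch.cuda.is_bf16_supported(),
        logging_steps=10,
        optim="adamw_8bit",  # 8-bit optimizer states save VRAM
        output_dir="outputs",
    ),
)
trainer.train()
```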
Training hyperparameters that matter:
- `learning_rate=2e-4`: Standard for LoRA. Don't go much higher or the adapters overfit fast. For small datasets (under 1,000 examples), try `5e-5`.
- `num_train_epochs`: 1 epoch is usually enough for large datasets; 2-3 for small ones. More than 3 almost always overfits.
- `per_device_train_batch_size=2` with `gradient_accumulation_steps=4`: gives an effective batch size of 8. Increase gradient accumulation if you hit OOM.
- `packing=False`: Set to `True` if your examples are short (under 256 tokens). Packing concatenates multiple examples into one sequence, which improves GPU utilization.
Save and Merge the Model
After training, you have two options: save just the LoRA adapter (small, typically 50-200MB) or merge it into the base model for direct inference.
Save the LoRA Adapter Only
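Saving the adapter uses the standard PEFT save path:

```python
model.save_pretrained("lora_adapter")      # adapter weights only, not the base model
tokenizer.save_pretrained("lora_adapter")  # tokenizer files alongside the adapter
```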
This creates a ~50-200MB directory you can share or upload to Hugging Face. To use it later, load the base model and apply the adapter.
Merge and Save the Full Model
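Unsloth provides a merged-save helper; `save_method` selects the precision of the merged weights:

```python
model.save_pretrained_merged(
    "merged_model", tokenizer,
    save_method="merged_16bit",  # or "merged_4bit" to keep 4-bit weights
)
```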
For GGUF export (for llama.cpp, Ollama, LM Studio):
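GGUF export goes through the same model object:

```python
model.save_pretrained_gguf(
    "gguf_model", tokenizer,
    quantization_method="q4_k_m",  # good quality/size default for consumer hardware
)
```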
The q4_k_m quantization is the best default for GGUF. It produces a ~4.5GB file for an 8B model that runs well on consumer hardware.
Test Your Fine-Tuned Model
Run a quick inference to verify the model learned what you wanted.
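A minimal generation check, continuing from the trained `model` and `tokenizer` (the prompt is a placeholder for something from your task domain):

```python
FastLanguageModel.for_inference(model)  # switch to optimized inference mode

prompt = "Explain LoRA in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```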
The FastLanguageModel.for_inference(model) call switches the model from training mode to inference mode, enabling Unsloth’s optimized inference kernels.
Common Errors and Fixes
OutOfMemoryError: CUDA out of memory
The most common error. Fix it by reducing batch size or enabling more aggressive memory saving:
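A sketch of the memory-saving knobs, shown as the arguments to change in the setup above:

```python
# In TrainingArguments: halve the per-device batch, keep effective batch size via accumulation
per_device_train_batch_size=1,
gradient_accumulation_steps=8,

# In get_peft_model: make sure Unsloth's checkpointing is enabled
use_gradient_checkpointing="unsloth",
```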
If you’re still OOM, reduce max_seq_length from 2048 to 1024. Sequence length has a quadratic effect on memory in attention layers.
RuntimeError: Expected all tensors to be on the same device
This happens when the model and input tensors are on different devices. Make sure your inputs are on CUDA:
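The fix is a `.to("cuda")` on the tokenized inputs:

```python
# Tokenize and move every input tensor to the same device as the model
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=64)
```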
ValueError: Tokenizer class LlamaTokenizer does not exist or is not currently imported
You need transformers >= 4.37.0. Update it:
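```shell
pip install -U "transformers>=4.37.0"
```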
ImportError: Using bitsandbytes with quantize_config.json is not supported
Version mismatch between bitsandbytes and transformers. Reinstall both:
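```shell
pip install -U bitsandbytes transformers
```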
torch.cuda.OutOfMemoryError during model loading (before training starts)
The base model itself doesn’t fit in VRAM. Make sure load_in_4bit=True is set. If you’re loading a 70B model, you need at least 40GB VRAM even with 4-bit quantization.
Training loss stays flat or increases
Your learning rate is probably too high. Drop it from 2e-4 to 5e-5 or 1e-5. Also check that your data formatting is correct – if the model never sees the expected prompt format, it can’t learn the pattern.
Tips for Better Results
Use a higher rank for complex tasks. Rank 16 works for most things, but code generation and math reasoning benefit from rank 32 or 64. The trade-off is more VRAM and slightly slower training.
Clean your data aggressively. A fine-tune on 500 high-quality examples beats 10,000 noisy ones. Remove duplicates, fix formatting issues, and manually review a random sample before training.
Don’t train for too many epochs. Watch the training loss. If it drops below 0.5 and you’re seeing good results in spot checks, stop. Going further usually memorizes the training set rather than learning generalizable patterns.
Use the right base model. Start with an instruction-tuned variant (the -Instruct models) if you want to fine-tune for a specific task format. Start with the base model if you’re changing the model’s fundamental behavior or style.
Related Guides
- How to Fine-Tune LLMs with DPO and RLHF
- How to Fine-Tune LLMs on Custom Datasets with Axolotl
- How to Build Prefix Tuning for LLMs with PEFT and PyTorch
- How to Fine-Tune Embedding Models for Domain-Specific Search
- How to Distill Large LLMs into Smaller, Cheaper Models
- How to Build a Knowledge Graph from Text with LLMs
- How to Build Prompt Versioning and Regression Testing for LLMs
- How to Write Effective System Prompts for LLMs
- How to Build Structured Output Parsers with Pydantic and LLMs
- How to Build Multi-Turn Chatbots with Conversation Memory