The Fast Path to Fine-Tuning

Full fine-tuning of a 7B-parameter model needs 4x A100 GPUs and days of training time. LoRA sidesteps this by freezing the base model weights and training small rank-decomposition matrices instead. You get 95%+ of full fine-tuning quality at a fraction of the compute.
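The savings come from the shapes involved. A quick pure-Python count for a single 4096x4096 projection matrix at rank 16 (sizes chosen for illustration):

```python
# Parameter count for one weight matrix: full fine-tuning updates every
# entry of W, while LoRA trains only the low-rank factors B (d_out x r)
# and A (r x d_in), then adds B @ A to the frozen W at inference time.
d_out, d_in, r = 4096, 4096, 16

full_params = d_out * d_in          # every entry of W
lora_params = d_out * r + r * d_in  # just B and A

print(full_params)                                 # 16777216
print(lora_params)                                 # 131072
print(round(100 * lora_params / full_params, 2))   # 0.78 (% of full)
```

Under one percent of the parameters per matrix, which is why the optimizer state and gradients fit in so much less VRAM.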

Unsloth makes LoRA fine-tuning even faster. It patches the model’s attention and MLP layers with custom Triton kernels, cutting memory usage by 60% and doubling training speed. A Llama 3 8B QLoRA fine-tune fits on a single 16GB GPU. That means a T4 on Colab or a 4060 Ti on your desk.

Here’s the full workflow: install Unsloth, load a base model, attach LoRA adapters, train on your dataset, and save the result.

Install Unsloth

Unsloth requires a CUDA GPU. It works on Linux and WSL2 on Windows. macOS is not supported.

pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps trl peft accelerate bitsandbytes

The --no-deps flag on the second line prevents dependency conflicts between Unsloth’s pinned versions and what trl/peft want to install. If you’re on a fresh environment, you can skip --no-deps, but in practice it saves you from version hell.

For a specific CUDA version (e.g., CUDA 12.1 with PyTorch 2.4):

pip install "unsloth[cu121-torch240] @ git+https://github.com/unslothai/unsloth.git"

Check that it installed correctly:

from unsloth import FastLanguageModel
print("Unsloth loaded successfully")

Load a Base Model with QLoRA

Unsloth wraps the Hugging Face model loading with automatic 4-bit quantization. This is QLoRA – the model weights are quantized to 4 bits, but LoRA adapters train in 16-bit precision.

from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct",
    max_seq_length=2048,
    dtype=None,            # Auto-detect: float16 on older GPUs, bfloat16 on Ampere+
    load_in_4bit=True,     # QLoRA: 4-bit quantized base weights
)

The dtype=None setting is the right default. Unsloth picks bfloat16 on Ampere-or-newer GPUs (A100, 3090, 4090) and float16 on older cards. You can force it with dtype=torch.float16 if you hit issues.

Setting load_in_4bit=True enables QLoRA. This drops VRAM usage from ~16GB to ~5GB for an 8B model. If you have a 24GB GPU, you can set load_in_4bit=False for standard LoRA, which trains slightly faster and produces marginally better results.

VRAM Requirements

Model Size | QLoRA (4-bit) | LoRA (16-bit)
7-8B       | ~6 GB         | ~16 GB
13B        | ~10 GB        | ~28 GB
70B        | ~40 GB        | ~140 GB

Attach LoRA Adapters

Now add the LoRA adapter layers to the model. This is where you choose which layers to train and the LoRA rank.

model = FastLanguageModel.get_peft_model(
    model,
    r=16,                          # LoRA rank -- 8 to 64, higher = more parameters
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha=16,                 # Scaling factor, usually equal to r
    lora_dropout=0,                # Unsloth optimizes for 0 dropout
    bias="none",
    use_gradient_checkpointing="unsloth",  # 30% less VRAM
    random_state=3407,
)

A few things worth noting:

  • Rank (r): 16 is the sweet spot for most tasks. Bump it to 32 or 64 for complex reasoning tasks. Rank 8 works fine for simple classification or style transfer.
  • target_modules: Train all attention projections plus the MLP gate/up/down projections. This is more aggressive than the original LoRA paper suggested (just q_proj and v_proj), but it gives better results.
  • use_gradient_checkpointing="unsloth": Unsloth’s custom gradient checkpointing uses 30% less VRAM than the standard HuggingFace implementation with no speed penalty. Always use it.
  • lora_dropout=0: Unsloth’s Triton kernels are optimized for zero dropout. Setting dropout > 0 disables the fast path and slows training.
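With those seven target modules, the total trainable parameter count is easy to estimate. Each adapter on a d_in -> d_out projection adds r * (d_in + d_out) parameters. The shapes below are assumed from the Llama 3.1 8B config (hidden size 4096, grouped-query attention with 1024-dim K/V projections, 14336-dim MLP, 32 layers):

```python
# Trainable-parameter estimate for r=16 LoRA on Llama 3.1 8B.
# Projection shapes (d_in, d_out) assumed from the model config.
r = 16
layers = 32
projections = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),    # grouped-query attention: smaller K/V
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}

per_layer = sum(r * (d_in + d_out) for d_in, d_out in projections.values())
total = per_layer * layers
print(f"{total:,}")  # 41,943,040 -- roughly 0.5% of the 8B base model
```

About 42M trainable parameters, which is why the adapter checkpoint is so small compared to the base model.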

Prepare Your Dataset

Unsloth works with the Hugging Face datasets library. The SFTTrainer from trl expects a single text field per example. For instruction tuning, format each example with a consistent prompt template (Alpaca-style below).

from datasets import load_dataset

# Load a dataset from Hugging Face Hub
dataset = load_dataset("yahma/alpaca-cleaned", split="train")

# Format into the chat template Llama expects
def format_prompts(examples):
    texts = []
    for instruction, input_text, output in zip(
        examples["instruction"], examples["input"], examples["output"]
    ):
        if input_text:
            prompt = f"### Instruction:\n{instruction}\n\n### Input:\n{input_text}\n\n### Response:\n{output}"
        else:
            prompt = f"### Instruction:\n{instruction}\n\n### Response:\n{output}"
        texts.append(prompt + tokenizer.eos_token)
    return {"text": texts}

dataset = dataset.map(format_prompts, batched=True)

For a custom dataset, load it from a JSON or CSV file:

# From a JSONL file with "instruction", "input", and "output" fields
dataset = load_dataset("json", data_files="my_training_data.jsonl", split="train")

Your JSONL file should look like this:

{"instruction": "Summarize this article about quantum computing.", "input": "...", "output": "..."}
{"instruction": "Write a Python function that reverses a string.", "input": "", "output": "..."}

Train the Model

Use the SFTTrainer from trl. Unsloth patches it to run faster automatically.

from trl import SFTTrainer
from transformers import TrainingArguments
import torch  # used below for the bf16 capability check

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    dataset_num_proc=2,
    packing=False,           # Set True to pack short examples together
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,  # Effective batch size = 2 * 4 = 8
        warmup_steps=5,
        num_train_epochs=1,            # 1-3 epochs for most fine-tunes
        learning_rate=2e-4,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=10,
        optim="adamw_8bit",            # 8-bit Adam saves VRAM
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
    ),
)

trainer_stats = trainer.train()

Training hyperparameters that matter:

  • learning_rate=2e-4: Standard for LoRA. Don’t go much higher or the adapters overfit fast. For small datasets (< 1000 examples), try 5e-5.
  • num_train_epochs: 1 epoch is usually enough for large datasets. 2-3 for small datasets. More than 3 almost always overfits.
  • per_device_train_batch_size=2 with gradient_accumulation_steps=4: Gives an effective batch size of 8. Increase gradient accumulation if you hit OOM.
  • packing=False: Set to True if your examples are short (under 256 tokens). Packing concatenates multiple examples into one sequence, which improves GPU utilization.
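The batch-size and epoch settings above determine how many optimizer steps you will run, which is worth estimating before a long job. A back-of-envelope count, assuming the ~51,760-row alpaca-cleaned dataset from earlier:

```python
import math

# Optimizer-step estimate for the trainer settings above.
# num_examples is an assumption (alpaca-cleaned has roughly 51,760 rows).
num_examples = 51760
per_device_train_batch_size = 2
gradient_accumulation_steps = 4
num_train_epochs = 1

effective_batch = per_device_train_batch_size * gradient_accumulation_steps
total_steps = math.ceil(num_examples / effective_batch) * num_train_epochs
print(effective_batch, total_steps)  # 8 6470
```

Multiply total_steps by your observed seconds-per-step after the first few logging intervals to predict the full run time.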

Save and Merge the Model

After training, you have options. Save just the LoRA adapter (small, ~50MB) or merge it into the base model for direct inference.

Save the LoRA Adapter Only

# Save adapter weights -- small and fast
model.save_pretrained("lora-adapter")
tokenizer.save_pretrained("lora-adapter")

This creates a ~50-200MB directory you can share or upload to Hugging Face. To use it later, load the base model and apply the adapter.
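Reloading later looks like the original load, except you point model_name at the adapter directory. This sketch assumes the Unsloth environment from earlier in this guide; Unsloth resolves the base model from the adapter's config for you.

```python
from unsloth import FastLanguageModel

# Reload the saved adapter for inference. Passing the adapter directory as
# model_name makes Unsloth fetch the matching base model and attach the
# adapter automatically (requires a CUDA GPU, as before).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="lora-adapter",  # the directory saved above
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)
```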

Merge and Save the Full Model

# Merge LoRA weights into the base model
# Save in float16 for vLLM / TGI serving
model.save_pretrained_merged(
    "merged-model",
    tokenizer,
    save_method="merged_16bit",
)

For GGUF export (for llama.cpp, Ollama, LM Studio):

# Save as GGUF quantized model
model.save_pretrained_gguf(
    "model-gguf",
    tokenizer,
    quantization_method="q4_k_m",  # Good balance of size and quality
)

The q4_k_m quantization is the best default for GGUF. It produces a ~4.5GB file for an 8B model that runs well on consumer hardware.
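That file size follows directly from the quantization. Treating q4_k_m as averaging roughly 4.5 bits per weight (an approximation; the exact figure varies per tensor, and embeddings and metadata add a little):

```python
# Rough GGUF size estimate: parameter count x average bits per weight.
# 4.5 bits/weight for q4_k_m is an assumed average, not an exact spec.
params = 8.03e9           # Llama 3.1 8B parameter count
bits_per_weight = 4.5
size_gb = params * bits_per_weight / 8 / 1e9
print(round(size_gb, 1))  # 4.5
```

The same arithmetic lets you predict whether a given quantization of a larger model will fit on your machine before you export it.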

Test Your Fine-Tuned Model

Run a quick inference to verify the model learned what you wanted.

FastLanguageModel.for_inference(model)

inputs = tokenizer(
    "### Instruction:\nWrite a Python function to check if a number is prime.\n\n### Response:\n",
    return_tensors="pt",
).to("cuda")

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    temperature=0.7,
    use_cache=True,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

The FastLanguageModel.for_inference(model) call switches the model from training mode to inference mode, enabling Unsloth’s optimized inference kernels.
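For interactive spot checks, the TextStreamer class from transformers prints tokens as they are generated instead of making you wait for the full completion. A sketch, reusing the model, tokenizer, and inputs from above:

```python
from transformers import TextStreamer

# Stream tokens to stdout as they are generated.
# skip_prompt=True hides the echoed input prompt.
streamer = TextStreamer(tokenizer, skip_prompt=True)

_ = model.generate(
    **inputs,
    streamer=streamer,
    max_new_tokens=256,
    temperature=0.7,
    use_cache=True,
)
```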

Common Errors and Fixes

OutOfMemoryError: CUDA out of memory

The most common error. Fix it by reducing batch size or enabling more aggressive memory saving:

# Reduce batch size
per_device_train_batch_size=1,
gradient_accumulation_steps=8,  # Keep effective batch size the same

# Or enable gradient checkpointing if you haven't
use_gradient_checkpointing="unsloth",

If you’re still OOM, reduce max_seq_length from 2048 to 1024. Attention memory can grow quadratically with sequence length, and activation memory grows at least linearly, so halving it frees substantial headroom.

RuntimeError: Expected all tensors to be on the same device

This happens when the model and input tensors are on different devices. Make sure your inputs are on CUDA:

inputs = tokenizer("your prompt", return_tensors="pt").to("cuda")

ValueError: Tokenizer class LlamaTokenizer does not exist or is not currently imported

You need transformers >= 4.37.0. Update it:

pip install --upgrade transformers

ImportError: Using bitsandbytes with quantize_config.json is not supported

Version mismatch between bitsandbytes and transformers. Reinstall both:

pip install --upgrade bitsandbytes transformers accelerate

torch.cuda.OutOfMemoryError during model loading (before training starts)

The base model itself doesn’t fit in VRAM. Make sure load_in_4bit=True is set. If you’re loading a 70B model, you need at least 40GB VRAM even with 4-bit quantization.

Training loss stays flat or increases

Your learning rate is probably too high. Drop it from 2e-4 to 5e-5 or 1e-5. Also check that your data formatting is correct – if the model never sees the expected prompt format, it can’t learn the pattern.
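A cheap way to catch formatting bugs before a long run is to assert that a formatted example actually contains your template's markers. A pure-Python sketch, where sample stands in for dataset[0]["text"]:

```python
# Return the template markers that are MISSING from a formatted example.
# An empty list means the example matches the expected prompt format.
def check_prompt_format(text, markers=("### Instruction:", "### Response:")):
    return [m for m in markers if m not in text]

sample = "### Instruction:\nSummarize this.\n\n### Response:\nA summary."
print(check_prompt_format(sample))           # []
print(check_prompt_format("just raw text"))  # ['### Instruction:', '### Response:']
```

Run it over a handful of dataset rows before calling trainer.train(); a non-empty result on any row means the model would be trained on prompts it will never see at inference time.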

Tips for Better Results

Use a higher rank for complex tasks. Rank 16 works for most things, but code generation and math reasoning benefit from rank 32 or 64. The trade-off is more VRAM and slightly slower training.

Clean your data aggressively. A fine-tune on 500 high-quality examples beats 10,000 noisy ones. Remove duplicates, fix formatting issues, and manually review a random sample before training.
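Exact-duplicate removal is a one-pass set check; with a Hugging Face dataset you would apply the same idea to the "text" column via dataset.filter. A pure-Python sketch with toy rows:

```python
# Drop exact duplicates while preserving the original order.
rows = [
    {"text": "What is 2+2?\n4."},
    {"text": "Reverse 'abc'.\n'cba'."},
    {"text": "What is 2+2?\n4."},  # exact duplicate of the first row
]

seen = set()
deduped = []
for row in rows:
    if row["text"] not in seen:
        seen.add(row["text"])
        deduped.append(row)

print(len(rows), len(deduped))  # 3 2
```

Near-duplicates (same content, different whitespace or casing) need normalization before the set check, but even this exact pass often removes a surprising fraction of scraped datasets.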

Don’t train for too many epochs. Watch the training loss. If it drops below 0.5 and you’re seeing good results in spot checks, stop. Going further usually memorizes the training set rather than learning generalizable patterns.

Use the right base model. Start with an instruction-tuned variant (the -Instruct models) if you want to fine-tune for a specific task format. Start with the base model if you’re changing the model’s fundamental behavior or style.