Axolotl in 60 Seconds
Axolotl is the framework that serious fine-tuners reach for. One YAML file controls your entire training pipeline: model, dataset, LoRA config, optimizer, and distributed training. No Python scripting needed for standard workflows.
Install it, write a config, run axolotl train. Here’s the fastest path to a working LoRA fine-tune on Llama 3:
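The quickstart from Axolotl's own docs looks like this (the exact example path may shift between releases):

```shell
# Download Axolotl's bundled example configs into ./examples
axolotl fetch examples

# Train a LoRA adapter on a 1B Llama 3 model using a stock example config
axolotl train examples/llama-3/lora-1b.yml
```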
That trains a LoRA adapter on a 1B Llama 3 model with a preconfigured dataset. But you’re here because you have your own data. Let’s build a config from scratch.
Install Axolotl
Axolotl needs Python 3.11+ and PyTorch 2.8+. You also need an NVIDIA GPU – Ampere or newer (A100, 3090, 4090, H100) for bf16 and Flash Attention support.
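The install commands follow the Axolotl docs; the bracketed extras pull in Flash Attention and DeepSpeed support:

```shell
# Build tooling that Flash Attention needs at compile time
pip3 install -U packaging setuptools wheel ninja

# Install Axolotl with Flash Attention and DeepSpeed extras;
# --no-build-isolation lets the flash-attn build see your torch/CUDA install
pip3 install --no-build-isolation 'axolotl[flash-attn,deepspeed]'
```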
The --no-build-isolation flag is required because Flash Attention’s build step needs to see your installed PyTorch and CUDA. Without it, pip builds the wheel in an isolated environment that can’t see your PyTorch install or your CUDA headers.
Docker is also available if you prefer a reproducible environment:
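For example, pulling the maintained image (tag names change over time, so check the project's Docker Hub page):

```shell
# Interactive shell inside the Axolotl image with all GPUs visible
docker run --gpus '"all"' --rm -it axolotlai/axolotl:main-latest
```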
Verify the install:
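A quick sanity check – if the CLI resolves and prints its subcommands, the install worked:

```shell
axolotl --help
```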
Prepare Your Dataset
Axolotl supports several dataset formats. Pick the one that matches your data shape.
Alpaca Format (Instruction-Response)
Best for task-specific fine-tuning where each example is a standalone instruction-output pair. Save as a .jsonl file:
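A couple of illustrative records, one JSON object per line (the content here is made up for the example):

```json
{"instruction": "Summarize the following text in one sentence.", "input": "Axolotl wraps the Hugging Face training stack behind a single YAML config.", "output": "Axolotl drives Hugging Face fine-tuning from one YAML file."}
{"instruction": "What is the capital of France?", "input": "", "output": "Paris."}
```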
Config:
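Assuming your file is saved at data/train.jsonl (a placeholder path):

```yaml
datasets:
  - path: data/train.jsonl
    type: alpaca
```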
Chat Template Format (Multi-Turn Conversations)
This is the recommended format for chat models. It follows the OpenAI messages schema:
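One conversation per line, for example (content invented for illustration):

```json
{"messages": [{"role": "system", "content": "You are a concise assistant."}, {"role": "user", "content": "How do I check my GPU name?"}, {"role": "assistant", "content": "Run nvidia-smi and read the GPU name column."}]}
```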
Config:
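A sketch, with data/chats.jsonl as a placeholder path and llama3 as the chat template:

```yaml
chat_template: llama3
datasets:
  - path: data/chats.jsonl
    type: chat_template
    field_messages: messages
    roles_to_train: ["assistant"]
```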
The roles_to_train: ["assistant"] setting means the loss is only computed on assistant responses, not on user messages or system prompts. This is what you want for instruction-following fine-tuning.
Completion Format (Raw Text)
For continued pretraining on a domain corpus:
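Each record is just a text field:

```json
{"text": "Axolotl configs are plain YAML files describing the model, data, and optimizer for a run."}
```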
Config:
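Again with a placeholder path:

```yaml
datasets:
  - path: data/corpus.jsonl
    type: completion
```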
Axolotl automatically splits texts that exceed your configured sequence_len into multiple training examples.
You can also load datasets directly from Hugging Face Hub:
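For example, pointing at a public Alpaca-style dataset that Axolotl's own example configs use:

```yaml
datasets:
  - path: mhenrichsen/alpaca_2k_test
    type: alpaca
```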
Write the Training Config
Here is a production-ready YAML config for LoRA fine-tuning Llama 3.1 8B on a single GPU:
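One way such a config could look. The dataset path is a placeholder, and this sketch uses the QLoRA variant (4-bit base weights) so an 8B model fits on a single 24GB card; verify key names against the Axolotl config reference for your version:

```yaml
base_model: meta-llama/Llama-3.1-8B-Instruct
load_in_4bit: true
adapter: qlora

chat_template: llama3
datasets:
  - path: data/chats.jsonl          # placeholder: your own dataset
    type: chat_template
    field_messages: messages
    roles_to_train: ["assistant"]
dataset_prepared_path: last_run_prepared
val_set_size: 0.05
output_dir: ./outputs/llama3-lora

sequence_len: 4096
sample_packing: true
pad_to_sequence_len: true

lora_r: 32
lora_alpha: 64
lora_dropout: 0.05
lora_target_linear: true

micro_batch_size: 2
gradient_accumulation_steps: 8
num_epochs: 3
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 2e-4

bf16: auto
tf32: true
gradient_checkpointing: true
flash_attention: true

warmup_steps: 10
evals_per_epoch: 2
saves_per_epoch: 1
weight_decay: 0.0
```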
Key Config Decisions
lora_r: 32 – This is the rank of the LoRA matrices. Higher rank = more parameters = more capacity to learn. 16 works for simple tasks, 32 is a solid default, 64+ for complex domain adaptation. Each doubling roughly doubles LoRA parameter count.
lora_alpha: 64 – The scaling factor. A common rule of thumb is setting alpha to 2x the rank. The effective learning rate for LoRA layers scales as alpha / rank, so 64 / 32 = 2.0. Going higher makes LoRA updates more aggressive.
lora_target_linear: true – Applies LoRA to all linear layers in the model. This is simpler and generally better than manually specifying lora_target_modules: ["q_proj", "v_proj"]. More target modules means more parameters, but the training quality improvement is worth it.
gradient_accumulation_steps: 8 – With micro_batch_size: 2, your effective batch size is 2 * 8 = 16. On a single 24GB GPU with QLoRA, a micro batch of 2 at sequence length 4096 fits comfortably. If you’re running out of memory, drop to 1.
flash_attention: true – Requires Ampere or newer GPUs. Cuts attention memory from O(n^2) to O(n) and speeds up training significantly. There’s no reason to leave this off if your hardware supports it.
Run Training
With your config saved as config.yml:
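Launch the run:

```shell
axolotl train config.yml
```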
Axolotl handles accelerate configuration internally. You don’t need to run accelerate launch manually for single-GPU training.
To preprocess your dataset separately (useful for debugging data issues):
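```shell
axolotl preprocess config.yml
```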
This tokenizes everything and saves it to dataset_prepared_path. If preprocessing looks fine but training fails, you’ve isolated the problem to the training loop rather than data loading.
Watch for the first few loss values in the logs. If loss starts above 2-3 and drops quickly in the first 100 steps, your config is working. If loss stays flat or spikes, check your learning rate and dataset quality.
Multi-GPU Training
Axolotl supports three distributed strategies: DDP (default), FSDP, and DeepSpeed. For LoRA fine-tuning, FSDP2 is the recommended approach.
Add this to your config for a 2-4 GPU setup:
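A sketch of the FSDP2 block for a Llama-family model (key names follow Axolotl's FSDP2 config schema; double-check them against the docs for your version):

```yaml
fsdp_version: 2
fsdp_config:
  offload_params: false
  cpu_ram_efficient_loading: true
  auto_wrap_policy: TRANSFORMER_BASED_WRAP
  transformer_layer_cls_to_wrap: LlamaDecoderLayer
  state_dict_type: FULL_STATE_DICT
  reshard_after_forward: true
```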
Change transformer_layer_cls_to_wrap to match your model architecture. For Mistral it’s MistralDecoderLayer, for Qwen it’s Qwen3DecoderLayer.
If you need to offload parameters to CPU to fit a larger model, set offload_params: true. This is slower but lets you train models that wouldn’t fit in combined GPU memory.
For DeepSpeed, fetch the default configs and reference one:
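Fetch the stock DeepSpeed JSON configs:

```shell
# Downloads zero1/zero2/zero3 configs into ./deepspeed_configs
axolotl fetch deepspeed_configs
```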
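Then point your training config at one of them, e.g. ZeRO Stage 1:

```yaml
deepspeed: deepspeed_configs/zero1.json
```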
Start with ZeRO Stage 1 and move to Stage 2 or 3 only if you need more memory savings. Each stage adds communication overhead.
When running multi-GPU, axolotl train automatically detects available GPUs and launches the distributed process. Your effective batch size becomes micro_batch_size * gradient_accumulation_steps * num_gpus.
Merge LoRA Adapters into the Base Model
After training, you have a LoRA adapter sitting in outputs/llama3-lora/. To deploy the model, merge the adapter weights back into the base model:
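Using the same config and the adapter directory from training:

```shell
axolotl merge-lora config.yml --lora-model-dir="./outputs/llama3-lora"
```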
The merged model lands in ./outputs/llama3-lora/merged/. This is a full-size model you can load with any Hugging Face-compatible inference tool – vLLM, TGI, llama.cpp (after quantization), or plain transformers.
Before merging, test the adapter with a quick inference pass:
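```shell
axolotl inference config.yml --lora-model-dir="./outputs/llama3-lora"
```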
Or launch a Gradio UI:
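```shell
axolotl inference config.yml --lora-model-dir="./outputs/llama3-lora" --gradio
```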
If merging runs out of GPU memory on large models, force it to CPU:
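Hiding the GPUs from PyTorch forces the merge onto CPU:

```shell
CUDA_VISIBLE_DEVICES="" axolotl merge-lora config.yml --lora-model-dir="./outputs/llama3-lora"
```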
Common Errors and Fixes
CUDA out of memory during training
Drop micro_batch_size to 1. Enable gradient_checkpointing: true if it’s not already on. Switch from plain LoRA to QLoRA by setting adapter: qlora together with load_in_4bit: true, which quantizes the frozen base weights to 4-bit. If that’s still not enough, reduce sequence_len. Every halving of sequence length roughly halves activation memory.
Exit code -9 (killed)
This is the OS OOM killer, not a CUDA error. You’re out of system RAM, not GPU memory. This often happens during dataset preprocessing when Axolotl loads the entire dataset into memory. Reduce dataset size or add more system RAM. On cloud instances, pick a machine with at least 2x the model size in system RAM.
Size mismatch when merging LoRA
This happens when the tokenizer has a different vocabulary size than the model. Axolotl expands model embeddings when the tokenizer has extra tokens, but it won’t shrink them unless you set shrink_embeddings: true in your config. Always use axolotl merge-lora instead of custom merge scripts – it handles these edge cases.
Flash Attention build errors
If Flash Attention fails to compile during install, check your CUDA version. Flash Attention 2.8+ needs CUDA 12.6+. On CUDA 12.4, either upgrade CUDA or pin to flash-attn==2.7.4. You can also install Axolotl without Flash Attention and set flash_attention: false in config.
EOS token mismatch / garbage generation
Your tokenizer’s EOS token doesn’t match what the chat template expects. Explicitly set it in the config:
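For a Llama 3 Instruct model, for example, the end-of-turn token is <|eot_id|> (the right string is model-specific):

```yaml
special_tokens:
  eos_token: "<|eot_id|>"
```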
Check the model’s tokenizer_config.json on Hugging Face to find the correct token strings.
Training loss doesn’t decrease
First, verify your dataset format matches the type in your config. A mismatch means the model trains on garbled tokens. Second, try increasing learning_rate to 5e-4. Third, check that roles_to_train is set correctly for chat datasets – if you’re accidentally training on user turns, the signal-to-noise ratio tanks.
DeepSpeed errors on single GPU
Remove the deepspeed: line from your config. DeepSpeed requires at least 2 GPUs. If you see an mpi4py import error, that’s the same root cause.
Related Guides
- How to Fine-Tune LLMs with LoRA and Unsloth
- How to Fine-Tune LLMs with DPO and RLHF
- How to Fine-Tune Embedding Models for Domain-Specific Search
- How to Build Prefix Tuning for LLMs with PEFT and PyTorch
- How to Build a Knowledge Graph from Text with LLMs
- How to Distill Large LLMs into Smaller, Cheaper Models
- How to Build Prompt Versioning and Regression Testing for LLMs
- How to Build Multi-Turn Chatbots with Conversation Memory
- How to Manage Long Context Windows and Token Limits in LLM Apps
- How to Write Effective System Prompts for LLMs