Axolotl in 60 Seconds

Axolotl is the framework that serious fine-tuners reach for. One YAML file controls your entire training pipeline: model, dataset, LoRA config, optimizer, and distributed training. No Python scripting needed for standard workflows.

Install it, write a config, run axolotl train. Here’s the fastest path to a working LoRA fine-tune on Llama 3:

pip3 install -U packaging==26.0 setuptools==75.8.0 wheel ninja
pip3 install --no-build-isolation axolotl[flash-attn,deepspeed]
axolotl fetch examples
axolotl train examples/llama-3/lora-1b.yml

That trains a LoRA adapter on a 1B Llama 3 model with a preconfigured dataset. But you’re here because you have your own data. Let’s build a config from scratch.

Install Axolotl

Axolotl needs Python 3.11+ and PyTorch 2.8+. You also need an NVIDIA GPU – Ampere or newer (A100, 3090, 4090, H100) for bf16 and Flash Attention support.

pip3 install -U packaging==26.0 setuptools==75.8.0 wheel ninja
pip3 install --no-build-isolation axolotl[flash-attn,deepspeed]

The --no-build-isolation flag is required because Flash Attention’s build step needs to see your installed PyTorch and CUDA toolkit. Without it, pip compiles the package in an isolated build environment that can’t see either, and the build fails.
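
If the build fails, first confirm that your installed PyTorch can see CUDA, since that is what the flash-attn compile step builds against. This is a plain PyTorch check, nothing Axolotl-specific:

# Pre-flight check before building flash-attn: the installed torch version,
# the CUDA version it was built with, and whether a GPU is visible.
import torch

print("torch:", torch.__version__)
print("built with CUDA:", torch.version.cuda)
print("GPU available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))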

Docker is also available if you prefer a reproducible environment:

docker run --gpus '"all"' --rm -it axolotlai/axolotl:main-latest

Verify the install:

python3 -c "import axolotl; print('Axolotl ready')"

Prepare Your Dataset

Axolotl supports several dataset formats. Pick the one that matches your data shape.

Alpaca Format (Instruction-Response)

Best for task-specific fine-tuning where each example is a standalone instruction-output pair. Save as a .jsonl file:

{"instruction": "Summarize the following text in one sentence.", "input": "The Federal Reserve raised interest rates by 25 basis points on Wednesday, marking the tenth consecutive increase. Chair Powell indicated that future decisions would depend on incoming economic data.", "output": "The Fed raised rates by 25bps for the tenth straight time, with Powell signaling a data-dependent approach going forward."}
{"instruction": "Convert this SQL query to a natural language question.", "input": "SELECT COUNT(*) FROM orders WHERE status = 'shipped' AND created_at > '2025-01-01'", "output": "How many orders with a shipped status were created after January 1, 2025?"}

Config:

datasets:
  - path: data/my_instructions.jsonl
    type: alpaca
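
If Axolotl rejects the file, a few lines of plain Python will confirm every record parses and carries the expected keys. The path below is the example one from the config above; point it at your own file:

# Sanity-check an Alpaca-style JSONL file: each line must be valid JSON
# with "instruction" and "output" keys ("input" is optional).
import json

path = "data/my_instructions.jsonl"   # adjust to your file
with open(path, encoding="utf-8") as f:
    for i, line in enumerate(f, start=1):
        record = json.loads(line)     # raises on malformed JSON
        missing = {"instruction", "output"} - record.keys()
        if missing:
            print(f"line {i}: missing keys {missing}")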

Chat Template Format (Multi-Turn Conversations)

This is the recommended format for chat models. It follows the OpenAI messages schema:

{"messages": [{"role": "system", "content": "You are a Python debugging assistant."}, {"role": "user", "content": "Why does my list comprehension return None values?"}, {"role": "assistant", "content": "List comprehensions return None when the expression itself evaluates to None. This typically happens when you call a function that modifies in-place, like list.append() or list.sort(). For example: `[x.append(1) for x in lists]` returns [None, None, ...] because append() returns None. Use a regular for loop for side effects, or restructure the comprehension to return the value you want."}]}

Config:

datasets:
  - path: data/my_conversations.jsonl
    type: chat_template
    chat_template: chatml
    field_messages: messages
    message_property_mappings:
      role: role
      content: content
    roles_to_train: ["assistant"]
    train_on_eos: "turn"

The roles_to_train: ["assistant"] setting means the loss is only computed on assistant responses, not on user messages or system prompts. This is what you want for instruction-following fine-tuning.
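
Under the hood this is standard Hugging Face-style label masking: tokens outside the assistant turns get a label of -100, which the cross-entropy loss ignores. A rough illustration of the idea (not Axolotl’s actual code, and the token IDs are made up):

# Labels are a copy of the input IDs with every non-assistant token set to -100,
# so the loss is computed only on the assistant's tokens.
IGNORE_INDEX = -100

input_ids      = [1, 15, 16, 17, 2, 30, 31, 32, 2]   # system/user turns, then an assistant turn
assistant_mask = [0, 0, 0, 0, 0, 1, 1, 1, 1]          # 1 = token belongs to an assistant turn

labels = [tok if keep else IGNORE_INDEX for tok, keep in zip(input_ids, assistant_mask)]
print(labels)   # [-100, -100, -100, -100, -100, 30, 31, 32, 2]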

Completion Format (Raw Text)

For continued pretraining on a domain corpus:

{"text": "The transformer architecture uses self-attention mechanisms to process input sequences in parallel. Unlike RNNs, transformers have no recurrence..."}

Config:

datasets:
  - path: data/my_corpus.jsonl
    type: completion

Axolotl automatically splits texts that exceed your configured sequence_len into multiple training examples.
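
Conceptually, the split is just chunking the tokenized document into sequence_len-sized pieces. A simplified sketch of the idea (not Axolotl’s actual implementation):

# A long completion document becomes several training examples: tokenize once,
# then slice the token IDs into chunks of at most sequence_len.
def chunk_tokens(token_ids, sequence_len=4096):
    return [token_ids[i:i + sequence_len] for i in range(0, len(token_ids), sequence_len)]

# A 10,000-token document with sequence_len=4096 yields chunks of 4096, 4096, and 1808 tokens.
print([len(c) for c in chunk_tokens(list(range(10_000)))])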

You can also load datasets directly from Hugging Face Hub:

datasets:
  - path: teknium/GPT4-LLM-Cleaned
    type: alpaca

Write the Training Config

Here is a production-ready YAML config for LoRA fine-tuning Llama 3.1 8B on a single GPU:

base_model: meta-llama/Llama-3.1-8B-Instruct

# LoRA configuration
adapter: qlora
lora_r: 32
lora_alpha: 64
lora_dropout: 0.05
lora_target_linear: true

# Quantization (QLoRA) -- remove this line and set adapter: lora above for full-precision LoRA
load_in_4bit: true

# Dataset
datasets:
  - path: data/my_conversations.jsonl
    type: chat_template
    chat_template: chatml
    field_messages: messages
    message_property_mappings:
      role: role
      content: content
    roles_to_train: ["assistant"]
    train_on_eos: "turn"

dataset_prepared_path: last_run_prepared
val_set_size: 0.05

# Training hyperparameters
sequence_len: 4096
micro_batch_size: 2
gradient_accumulation_steps: 8
num_epochs: 3
learning_rate: 2e-4
lr_scheduler: cosine
warmup_ratio: 0.05
optimizer: adamw_torch_fused
weight_decay: 0.01
max_grad_norm: 1.0

# Performance
bf16: auto
flash_attention: true
gradient_checkpointing: true

# Output
output_dir: ./outputs/llama3-lora
logging_steps: 10
save_strategy: steps
save_steps: 100
eval_steps: 100

# Misc
special_tokens:
  pad_token: "<|end_of_text|>"

Key Config Decisions

lora_r: 32 – This is the rank of the LoRA matrices. Higher rank = more parameters = more capacity to learn. 16 works for simple tasks, 32 is a solid default, 64+ for complex domain adaptation. Each doubling roughly doubles LoRA parameter count.

lora_alpha: 64 – The scaling factor. A common rule of thumb is setting alpha to 2x the rank. The effective learning rate for LoRA layers scales as alpha / rank, so 64 / 32 = 2.0. Going higher makes LoRA updates more aggressive.
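
To make the arithmetic concrete, here is a back-of-the-envelope sketch for a single linear layer. The 4096x4096 shape is illustrative, roughly a q_proj in an 8B Llama:

# LoRA adds two matrices per targeted layer: A (r x d_in) and B (d_out x r),
# so adapter size grows linearly with the rank r.
d_in, d_out = 4096, 4096              # illustrative layer shape
for r in (16, 32, 64):
    print(f"r={r}: {r * (d_in + d_out):,} extra params for this layer")

# The LoRA update is scaled by alpha / r before being added to the layer's output.
lora_alpha, lora_r = 64, 32
print("scaling factor:", lora_alpha / lora_r)   # 2.0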

lora_target_linear: true – Applies LoRA to all linear layers in the model. This is simpler and generally better than manually specifying lora_target_modules: ["q_proj", "v_proj"]. More target modules mean more trainable parameters, but the quality improvement is usually worth the modest memory cost.

gradient_accumulation_steps: 8 – With micro_batch_size: 2, your effective batch size is 2 * 8 = 16. On a single 24GB GPU with QLoRA, a micro batch of 2 at sequence length 4096 fits comfortably. If you’re running out of memory, drop to 1.

flash_attention: true – Requires Ampere or newer GPUs. Cuts attention memory from O(n^2) to O(n) and speeds up training significantly. There’s no reason to leave this off if your hardware supports it.

Run Training

With your config saved as config.yml:

axolotl train config.yml

Axolotl handles accelerate configuration internally. You don’t need to run accelerate launch manually for single-GPU training.

To preprocess your dataset separately (useful for debugging data issues):

axolotl preprocess config.yml

This tokenizes everything and saves it to dataset_prepared_path. If preprocessing looks fine but training fails, you’ve isolated the problem to the training loop rather than data loading.
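
A related sanity check is decoding one prepared example to see exactly what the model will train on. Axolotl saves the tokenized data as a Hugging Face dataset inside dataset_prepared_path (under a hashed subdirectory), so, assuming that layout and the usual input_ids/labels columns, something like this works:

# Decode one tokenized example and count how many tokens actually contribute to the loss.
from glob import glob
from datasets import load_from_disk
from transformers import AutoTokenizer

prepared_dir = sorted(glob("last_run_prepared/*"))[0]   # assumes a single prepared run
ds = load_from_disk(prepared_dir)
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

example = ds[0]
print(tok.decode(example["input_ids"]))
trained = sum(label != -100 for label in example["labels"])
print(f"{trained} of {len(example['labels'])} tokens contribute to the loss")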

Watch the first few loss values in the logs. If loss starts somewhere around 2-3 (or higher) and drops quickly over the first 100 steps, your config is working. If loss stays flat or spikes, check your learning rate and dataset quality.

Multi-GPU Training

Axolotl supports three distributed strategies: DDP (default), FSDP, and DeepSpeed. For LoRA fine-tuning, FSDP2 is the recommended approach.

Add this to your config for a 2-4 GPU setup:

fsdp_version: 2
fsdp_config:
  offload_params: false
  cpu_ram_efficient_loading: true
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  transformer_layer_cls_to_wrap: LlamaDecoderLayer
  state_dict_type: FULL_STATE_DICT
  reshard_after_forward: true

Change transformer_layer_cls_to_wrap to match your model architecture. For Mistral it’s MistralDecoderLayer; for Qwen2 it’s Qwen2DecoderLayer and for Qwen3 it’s Qwen3DecoderLayer.

If you need to offload parameters to CPU to fit a larger model, set offload_params: true. This is slower but lets you train models that wouldn’t fit in combined GPU memory.

For DeepSpeed, fetch the default configs and reference one:

axolotl fetch deepspeed_configs

deepspeed: deepspeed_configs/zero2.json

Start with ZeRO Stage 1 or 2 and move to Stage 3 only if you need more memory savings. Each higher stage trades additional communication overhead for lower per-GPU memory.

When running multi-GPU, axolotl train automatically detects available GPUs and launches the distributed process. Your effective batch size becomes micro_batch_size * gradient_accumulation_steps * num_gpus.
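
As a quick sanity check with the example config on a hypothetical 4-GPU machine:

# Effective batch size for the example config, scaled across GPUs.
micro_batch_size = 2
gradient_accumulation_steps = 8
num_gpus = 4                                    # whatever your machine has

effective_batch = micro_batch_size * gradient_accumulation_steps * num_gpus
print(effective_batch)                          # 64 sequences per optimizer step
print(effective_batch * 4096)                   # up to 262,144 tokens per step at sequence_len 4096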

Merge LoRA Adapters into the Base Model

After training, you have a LoRA adapter sitting in outputs/llama3-lora/. To deploy the model, merge the adapter weights back into the base model:

axolotl merge-lora config.yml --lora-model-dir="./outputs/llama3-lora"

The merged model lands in ./outputs/llama3-lora/merged/. This is a full-size model you can load with any Hugging Face-compatible inference tool – vLLM, TGI, llama.cpp (after quantization), or plain transformers.

Before merging, test the adapter with a quick inference pass:

axolotl inference config.yml --lora-model-dir="./outputs/llama3-lora"

Or launch a Gradio UI:

axolotl inference config.yml --lora-model-dir="./outputs/llama3-lora" --gradio

If merging runs out of GPU memory on large models, force it to CPU:

CUDA_VISIBLE_DEVICES="" axolotl merge-lora config.yml --lora-model-dir="./outputs/llama3-lora"
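
Once the merge finishes, a quick smoke test with plain transformers confirms the merged weights load and generate. The path matches the example config’s output_dir, and the prompt is just a placeholder:

# Load the merged model and generate a short reply.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

merged_path = "./outputs/llama3-lora/merged"
tok = AutoTokenizer.from_pretrained(merged_path)
model = AutoModelForCausalLM.from_pretrained(merged_path, torch_dtype=torch.bfloat16, device_map="auto")

messages = [{"role": "user", "content": "Why does my list comprehension return None values?"}]
inputs = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
output = model.generate(inputs, max_new_tokens=200)
print(tok.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))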

Common Errors and Fixes

CUDA out of memory during training – Drop micro_batch_size to 1. Enable gradient_checkpointing: true if it’s not already on. If you’re running a full-precision LoRA, switch to QLoRA by setting adapter: qlora and load_in_4bit: true. If that’s still not enough, reduce sequence_len. Every halving of sequence length roughly halves attention memory.

Exit code -9 (killed) – This is the OS OOM killer, not a CUDA error. You’re out of system RAM, not GPU memory. This often happens during dataset preprocessing when Axolotl loads the entire dataset into memory. Reduce dataset size or add more system RAM. On cloud instances, pick a machine with at least 2x the model size in system RAM.

Size mismatch when merging LoRA – This happens when the tokenizer has a different vocabulary size than the model. Axolotl expands model embeddings when the tokenizer has extra tokens, but it won’t shrink them unless you set shrink_embeddings: true in your config. Always use axolotl merge-lora instead of custom merge scripts – it handles these edge cases.

Flash Attention build errors – If Flash Attention fails to compile during install, check your CUDA version. Flash Attention 2.8+ needs CUDA 12.6+. On CUDA 12.4, either upgrade CUDA or pin to flash-attn==2.7.4. You can also install Axolotl without Flash Attention and set flash_attention: false in config.

EOS token mismatch / garbage generation – Your tokenizer’s EOS token doesn’t match what the chat template expects. Explicitly set it in the config:

special_tokens:
  eos_token: "<|eot_id|>"
  pad_token: "<|end_of_text|>"

Check the model’s tokenizer_config.json on Hugging Face to find the correct token strings.
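
Or print them straight from the tokenizer instead of reading the JSON by hand:

# Print the special tokens the tokenizer actually uses, to copy into special_tokens.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
print("eos:", tok.eos_token)
print("pad:", tok.pad_token)
print("additional:", tok.additional_special_tokens)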

Training loss doesn’t decrease – First, verify your dataset format matches the type in your config. A mismatch means the model trains on garbled tokens. Second, try increasing learning_rate to 5e-4. Third, check that roles_to_train is set correctly for chat datasets – if you’re accidentally training on user turns, the signal-to-noise ratio tanks.

DeepSpeed errors on single GPU – Remove the deepspeed: line from your config. DeepSpeed requires at least 2 GPUs. If you see an MPI4PY error, that’s the same root cause.