DPO vs RLHF: Pick DPO Unless You Have a Good Reason Not To

RLHF (Reinforcement Learning from Human Feedback) is the classic approach. You train a reward model on human preference pairs, then use PPO (Proximal Policy Optimization) to optimize the language model against that reward signal. It works – ChatGPT was trained this way – but it’s a three-stage pipeline: supervised fine-tune, train reward model, run PPO. Each stage has its own hyperparameters, instabilities, and failure modes.

DPO (Direct Preference Optimization) collapses that into one step. Instead of training a separate reward model, DPO reformulates the preference optimization problem as a classification loss directly on the policy model. You give it pairs of (chosen, rejected) responses for the same prompt, and it learns to prefer the chosen response. No reward model, no PPO, no value function. One training loop.

In practice, DPO is more stable, easier to tune, and produces comparable results to PPO-based RLHF on most benchmarks. Use DPO unless you specifically need the flexibility of a separate reward model (e.g., for reward model ensembling or online RLHF with live generation).
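
To make the "classification loss" concrete, here is a minimal sketch of the DPO objective from the original paper (Rafailov et al., 2023), assuming you already have the summed log-probabilities of each response under the policy and under the frozen reference model. DPOTrainer computes all of this for you; the sketch only shows what the single training loop is optimizing:

import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit rewards: how much more likely the policy makes each response than the reference does
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Binary classification: maximize the margin between chosen and rejected rewards
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

The beta here is the same beta you set in DPOConfig later in this guide.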

Prepare Your Preference Dataset

DPO needs triplets: a prompt, a chosen (preferred) response, and a rejected (worse) response. The Hugging Face datasets library expects these as columns.

from datasets import Dataset

# Each example: same prompt, two completions ranked by quality
preference_data = [
    {
        "prompt": "Explain gradient descent in one paragraph.",
        "chosen": "Gradient descent is an optimization algorithm that iteratively adjusts model parameters by moving in the direction of steepest decrease of the loss function. At each step, you compute the gradient of the loss with respect to the parameters, multiply by a learning rate, and subtract from the current parameters. The learning rate controls step size -- too large and you overshoot, too small and training crawls.",
        "rejected": "Gradient descent is a technique used in machine learning. It is very important and widely used. Many models use gradient descent for training. It helps find the optimal solution by adjusting parameters. It was invented a long time ago and remains popular today.",
    },
    {
        "prompt": "What is the difference between L1 and L2 regularization?",
        "chosen": "L1 regularization adds the sum of absolute values of weights to the loss. It drives small weights to exactly zero, producing sparse models. L2 adds the sum of squared weights, which shrinks all weights toward zero but rarely makes them exactly zero. Use L1 when you want feature selection, L2 when you want to prevent any single weight from dominating.",
        "rejected": "L1 and L2 are types of regularization. L1 uses absolute values and L2 uses squared values. They both help prevent overfitting. You can choose either one depending on your needs. Both are commonly used in practice.",
    },
]

dataset = Dataset.from_list(preference_data)
train_test = dataset.train_test_split(test_size=0.1, seed=42)

For real training, you want thousands of preference pairs. Good public datasets include Anthropic/hh-rlhf, argilla/ultrafeedback-binarized-preferences, and OpenAssistant/oasst1 (the last needs its ranked replies converted into chosen/rejected pairs first). Load the ready-made ones directly:

from datasets import load_dataset

dataset = load_dataset("argilla/ultrafeedback-binarized-preferences", split="train")

The dataset needs prompt, chosen, and rejected columns. If your dataset uses different column names, rename them before passing to the trainer.

Train with DPOTrainer

The TRL library from Hugging Face provides DPOTrainer, which handles the entire DPO training loop. Install it first:

pip install trl transformers accelerate peft bitsandbytes

Here is a full DPO training script using LoRA for memory efficiency:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import DPOConfig, DPOTrainer
from peft import LoraConfig
from datasets import load_dataset

# Load base model and tokenizer
model_name = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# 4-bit quantization via BitsAndBytesConfig (passing load_in_4bit directly is deprecated)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    quantization_config=bnb_config,
    device_map="auto",
)

# LoRA config -- keeps GPU memory manageable
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# DPO training configuration
training_args = DPOConfig(
    output_dir="./dpo-llama3",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    num_train_epochs=1,
    learning_rate=5e-7,
    beta=0.1,                    # KL penalty weight -- lower = more aggressive optimization
    max_length=1024,
    max_prompt_length=512,
    logging_steps=10,
    save_steps=100,
    bf16=True,
    gradient_checkpointing=True,
    remove_unused_columns=False,
)

# Load preference dataset
dataset = load_dataset("argilla/ultrafeedback-binarized-preferences", split="train")
dataset = dataset.select(range(5000))  # subset for faster iteration

# Initialize and run
trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,
    peft_config=peft_config,
)

trainer.train()
trainer.save_model("./dpo-llama3-final")

Key Hyperparameters

beta is the most important DPO hyperparameter. It controls the KL divergence penalty against the reference model. Higher beta (0.5+) keeps the model close to the base – safer but smaller alignment gains. Lower beta (0.05-0.1) allows bigger behavioral shifts but risks degrading general capabilities. Start at 0.1 and adjust based on eval.

learning_rate should be much lower than supervised fine-tuning. DPO is sensitive to learning rate – 5e-7 to 5e-6 is the typical range. If your model collapses (starts generating repetitive or degenerate text), halve the learning rate first.

max_length and max_prompt_length control sequence truncation. The response portion is max_length - max_prompt_length. Make sure your preference data fits within these limits or you’ll silently lose training signal from truncated examples.
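
Truncation is easy to catch before training. Here is a quick, illustrative check (not part of TRL), assuming the tokenizer from the training script above and a dataset with plain-string prompt, chosen, and rejected columns like the hand-built example earlier:

def fits(example, max_prompt_length=512, max_length=1024):
    # Token counts for the prompt and the longer of the two responses
    prompt_len = len(tokenizer(example["prompt"])["input_ids"])
    response_len = max(
        len(tokenizer(example["chosen"])["input_ids"]),
        len(tokenizer(example["rejected"])["input_ids"]),
    )
    return prompt_len <= max_prompt_length and prompt_len + response_len <= max_length

num_truncated = sum(not fits(ex) for ex in dataset)
print(f"{num_truncated} of {len(dataset)} examples exceed the length limits")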

How RLHF with PPO Works (For Comparison)

If you do need the full RLHF pipeline, TRL supports that too. The workflow has three steps.

First, supervised fine-tune (SFT) a base model on instruction-following data. Second, train a reward model on preference pairs. Third, run PPO to optimize the SFT model against the reward model.

from trl import PPOConfig, PPOTrainer, AutoModelForCausalLMWithValueHead

# PPO requires a model with a value head
model = AutoModelForCausalLMWithValueHead.from_pretrained("your-sft-model")

ppo_config = PPOConfig(
    batch_size=16,
    learning_rate=1.4e-5,
    mini_batch_size=4,
    ppo_epochs=4,
)

ppo_trainer = PPOTrainer(
    config=ppo_config,
    model=model,
    tokenizer=tokenizer,
)

# PPO training loop: generate responses, score with reward model, update policy.
# `dataloader` and `reward_model` are placeholders for your own data pipeline and
# trained reward model; this sketch follows the classic TRL PPOTrainer API.
for batch in dataloader:
    query_tensors = batch["input_ids"]
    response_tensors = ppo_trainer.generate(query_tensors, max_new_tokens=256)

    # Score each (query, response) pair with your reward model
    rewards = reward_model.score(query_tensors, response_tensors)

    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)

PPO gives you more control (you can swap reward models, do online generation, filter by reward thresholds), but the complexity tax is real. You’re managing three models in GPU memory: the policy, the reference, and the reward model. That’s why DPO is the default recommendation.

Evaluate Your Aligned Model

After DPO training, compare outputs between the base model and the aligned model on a held-out set of prompts. Look for two things: the model should produce responses that match the style and quality of your chosen examples, and it shouldn’t degrade on general tasks.

from transformers import pipeline

# Loading a LoRA adapter directory here requires the peft package to be installed
pipe = pipeline("text-generation", model="./dpo-llama3-final", tokenizer=tokenizer, device_map="auto")

test_prompts = [
    "Explain backpropagation simply.",
    "What are the trade-offs of microservices?",
    "Write a Python function to find duplicates in a list.",
]

for prompt in test_prompts:
    # return_full_text=False so the response excludes the echoed prompt
    output = pipe(prompt, max_new_tokens=200, do_sample=True, temperature=0.7, return_full_text=False)
    print(f"Prompt: {prompt}")
    print(f"Response: {output[0]['generated_text']}")
    print("---")

For quantitative eval, run your aligned model on benchmarks like MT-Bench or AlpacaEval. If the win rate against the base model is above 55%, your DPO training is working. Below 50% means you’ve made the model worse – check your data quality and beta.
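
If you run your own head-to-head comparison instead of a standard benchmark, the win rate is simply the fraction of prompts where a judge (human or LLM) prefers the DPO model's response. A hypothetical sketch, assuming you have already collected one verdict per prompt; how you handle ties is a convention, just keep it consistent:

# Hypothetical verdicts from human annotators or an LLM judge
verdicts = ["dpo", "base", "dpo", "tie", "dpo"]

decisive = [v for v in verdicts if v != "tie"]
win_rate = decisive.count("dpo") / len(decisive) if decisive else 0.0
print(f"Win rate vs. base model: {win_rate:.1%}")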

Common Errors

ValueError: Could not find columns 'chosen' and 'rejected'

Your dataset doesn’t have the expected column names. DPOTrainer looks for prompt, chosen, and rejected by default. Rename your columns:

dataset = dataset.rename_columns({
    "preferred": "chosen",
    "dispreferred": "rejected",
    "instruction": "prompt",
})

CUDA OOM During DPO Training

DPO holds two copies of the model in memory (the trainable policy and a frozen reference). With a 7B model, that's ~28GB in bf16 before activations. Fix it by combining 4-bit quantization with LoRA as in the training script above; when you train LoRA adapters, DPOTrainer can reuse the frozen base weights (adapters disabled) as the reference, so you avoid a second full copy. If that still isn't enough, reduce max_length and the per-device batch size. gradient_checkpointing=True is essential.

Model Outputs Become Repetitive After Training

This usually means the learning rate is too high or beta is too low. The model over-optimizes on the preference signal and loses diversity. Try: halve the learning rate, increase beta to 0.2-0.5, or reduce the number of training epochs. One epoch is usually enough for DPO.

RuntimeError: expected scalar type BFloat16 but found Float

Mixed precision mismatch between the model and LoRA adapters. Make sure both use the same dtype:

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,  # Must match training args
)

And set bf16=True in your DPOConfig. If your GPU doesn’t support bfloat16 (pre-Ampere), use torch.float16 and fp16=True instead.
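
You can detect bfloat16 support at runtime instead of hard-coding it; PyTorch exposes a helper for this:

import torch

use_bf16 = torch.cuda.is_bf16_supported()
compute_dtype = torch.bfloat16 if use_bf16 else torch.float16
# Pass torch_dtype=compute_dtype to from_pretrained, and set bf16=use_bf16 / fp16=not use_bf16 in DPOConfig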

Reference Model Mismatch

DPOTrainer automatically creates a frozen copy of your model as the reference. If you pass a separately loaded ref_model that was trained differently, the KL penalty becomes meaningless and training diverges. Unless you have a specific reason, let DPOTrainer handle the reference model automatically.
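
To make that explicit, pass ref_model=None and let the trainer derive the reference itself. A sketch reusing the variables from the training script above; note that recent TRL versions refuse the combination of an explicit ref_model with a peft_config, because with LoRA the frozen base weights already serve as the reference:

trainer = DPOTrainer(
    model=model,
    ref_model=None,           # let DPOTrainer build/derive the frozen reference
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,
    peft_config=peft_config,  # with LoRA, the base weights double as the reference
)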