DPO vs RLHF: Pick DPO Unless You Have a Good Reason Not To
RLHF (Reinforcement Learning from Human Feedback) is the classic approach. You train a reward model on human preference pairs, then use PPO (Proximal Policy Optimization) to optimize the language model against that reward signal. It works – ChatGPT was trained this way – but it’s a three-stage pipeline: supervised fine-tune, train reward model, run PPO. Each stage has its own hyperparameters, instabilities, and failure modes.
DPO (Direct Preference Optimization) collapses that into one step. Instead of training a separate reward model, DPO reformulates the preference optimization problem as a classification loss directly on the policy model. You give it pairs of (chosen, rejected) responses for the same prompt, and it learns to prefer the chosen response. No reward model, no PPO, no value function. One training loop.
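That classification loss is compact enough to sketch directly. Here is a minimal illustration of the DPO objective (my own sketch, not TRL's implementation), where each argument is the summed log-probability of a response under the policy or the frozen reference model:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Sketch of the DPO objective on per-example summed log-probs."""
    # Implicit rewards: how far the policy has moved from the reference
    # on each response, scaled by beta.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Binary classification: push the chosen reward above the rejected one.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

The larger the gap between the chosen and rejected log-ratios, the closer the loss gets to zero; beta scales how aggressively that gap is rewarded.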
In practice, DPO is more stable, easier to tune, and produces comparable results to PPO-based RLHF on most benchmarks. Use DPO unless you specifically need the flexibility of a separate reward model (e.g., for reward model ensembling or online RLHF with live generation).
Prepare Your Preference Dataset
DPO needs triplets: a prompt, a chosen (preferred) response, and a rejected (worse) response, stored as columns in a Hugging Face datasets object.
For real training, you want thousands of preference pairs. Good public datasets include Anthropic/hh-rlhf, OpenAssistant/oasst1, and argilla/ultrafeedback-binarized-preferences. Load them directly:
The dataset needs prompt, chosen, and rejected columns. If your dataset uses different column names, rename them before passing to the trainer.
Train with DPOTrainer
The TRL library from Hugging Face provides DPOTrainer, which handles the entire DPO training loop. Install it first:
Here is a full DPO training script using LoRA for memory efficiency:
Key Hyperparameters
beta is the most important DPO hyperparameter. It controls the KL divergence penalty against the reference model. Higher beta (0.5+) keeps the model close to the base – safer, but smaller alignment gains. Lower beta (0.05-0.1) allows bigger behavioral shifts but risks degrading general capabilities. Start at 0.1 and adjust based on your evals.
learning_rate should be much lower than supervised fine-tuning. DPO is sensitive to learning rate – 5e-7 to 5e-6 is the typical range. If your model collapses (starts generating repetitive or degenerate text), halve the learning rate first.
max_length and max_prompt_length control sequence truncation. The response portion is max_length - max_prompt_length. Make sure your preference data fits within these limits or you’ll silently lose training signal from truncated examples.
How RLHF with PPO Works (For Comparison)
If you do need the full RLHF pipeline, TRL supports that too. The workflow has three steps.
First, supervised fine-tune (SFT) a base model on instruction-following data. Second, train a reward model on preference pairs. Third, run PPO to optimize the SFT model against the reward model.
PPO gives you more control (you can swap reward models, do online generation, filter by reward thresholds), but the complexity tax is real. You’re managing three models in GPU memory: the policy, the reference, and the reward model. That’s why DPO is the default recommendation.
Evaluate Your Aligned Model
After DPO training, compare outputs between the base model and the aligned model on a held-out set of prompts. Look for two things: the model should produce responses that match the style and quality of your chosen examples, and it shouldn’t degrade on general tasks.
For quantitative eval, run your aligned model on benchmarks like MT-Bench or AlpacaEval. If the win rate against the base model is above 55%, your DPO training is working. Below 50% means you’ve made the model worse – check your data quality and beta.
Common Errors
ValueError: Could not find columns 'chosen' and 'rejected'
Your dataset doesn’t have the expected column names. DPOTrainer looks for prompt, chosen, and rejected by default. Rename your columns:
CUDA OOM During DPO Training
DPO holds two copies of the model in memory (the policy and a frozen reference). With a 7B model, that’s ~28GB in bf16. Fix it by combining 4-bit quantization with LoRA (shown in the training script above), or reduce max_length and batch size. gradient_checkpointing=True is essential.
Model Outputs Become Repetitive After Training
This usually means the learning rate is too high or beta is too low. The model over-optimizes on the preference signal and loses diversity. Try: halve the learning rate, increase beta to 0.2-0.5, or reduce the number of training epochs. One epoch is usually enough for DPO.
RuntimeError: expected scalar type BFloat16 but found Float
Mixed precision mismatch between the model and LoRA adapters. Make sure both use the same dtype:
And set bf16=True in your DPOConfig. If your GPU doesn’t support bfloat16 (pre-Ampere), use torch.float16 and fp16=True instead.
Reference Model Mismatch
DPOTrainer automatically creates a frozen copy of your model as the reference. If you pass a separately loaded ref_model that was trained differently, the KL penalty becomes meaningless and training diverges. Unless you have a specific reason, let DPOTrainer handle the reference model automatically.
Related Guides
- How to Fine-Tune LLMs with LoRA and Unsloth
- How to Fine-Tune LLMs on Custom Datasets with Axolotl
- How to Fine-Tune Embedding Models for Domain-Specific Search
- How to Build Prefix Tuning for LLMs with PEFT and PyTorch
- How to Build a Knowledge Graph from Text with LLMs
- How to Distill Large LLMs into Smaller, Cheaper Models
- How to Build Prompt Versioning and Regression Testing for LLMs
- How to Build Multi-Turn Chatbots with Conversation Memory
- How to Manage Long Context Windows and Token Limits in LLM Apps
- How to Write Effective System Prompts for LLMs