Direct Preference Optimization (DPO)
A typical process for training a modern LLM involves an un/self-supervised pre-training stage, followed by an instruction tuning stage. The instruction tuning phase trains the LLM on higher quality instruction-to-completion datasets. This helps conform the model's outputs to desired behaviors or tasks, making it more reliable and effective for various applications. Despite the success of instruction tuning, relative judgements of response quality are oftentimes easier to collect than gold-standard completions, so subsequent LLM works add a so-called alignment stage, which tunes LLMs on preference datasets with reinforcement learning based algorithms, a.k.a. RLHF (Reinforcement Learning from Human Feedback).
In Reinforcement Learning from Human Feedback (RLHF), our objective can be written as:

$$
\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(y \mid x)}\big[r_\phi(x, y)\big] \;-\; \beta\, \mathbb{D}_{\text{KL}}\big[\pi_\theta(y \mid x)\,\|\,\pi_{\text{ref}}(y \mid x)\big]
$$
This objective function optimizes our LLM policy $\pi_\theta$ to maximize the expected reward $r_\phi(x, y)$ for the questions/prompts $x$ sampled from our dataset and the answers $y$ generated by the LLM policy. The reason we can't just run gradient descent on this objective function and instead resort to RL algorithms is that the outputs are sampled from the LLM using various decoding strategies (greedy, beam search, etc.), none of which are differentiable.
At the same time, it seeks to minimize the Kullback-Leibler (KL) divergence between the LLM policy $\pi_\theta$ and the original reference policy $\pi_{\text{ref}}$, weighted by a factor $\beta$. A higher $\beta$ means less deviation from the reference model. This second term is added to prevent "reward hacking," a situation where the LLM generates sequences of tokens that achieve high reward scores but may be nonsensical.
Given this objective function, RLHF methods usually involve fitting a reward model to a dataset of human preferences and then using RL to optimize a language model policy to produce responses that are assigned high reward without drifting excessively far from the original model.
Suppose we have a preference dataset at hand. We can convert these preferences into a score by modeling them via the Bradley-Terry model:

$$
p(y_1 \succ y_2 \mid x) = \frac{\exp\big(r(x, y_1)\big)}{\exp\big(r(x, y_1)\big) + \exp\big(r(x, y_2)\big)}
$$
where:
$p(y_1 \succ y_2 \mid x)$ represents the probability that the first answer ($y_1$) is better than the second answer ($y_2$) in a paired comparison.
The numerator $\exp\big(r(x, y_1)\big)$ exponentiates the (hidden) reward function $r$ that scores the first answer's quality for an input prompt $x$. Similarly, for the second answer, we have $\exp\big(r(x, y_2)\big)$.
The probability is computed by taking the ratio of the first answer's exponentiated reward over the sum of exponentiated rewards for both answers. This normalization ensures we end up with a valid probability distribution.
We can parameterize the reward function $r_\phi$ and estimate its parameters via maximum likelihood, which boils down to framing it as a binary classification problem:

$$
\mathcal{L}_R(r_\phi, \mathcal{D}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\big[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\big]
$$

where $\sigma$ is the logistic function $\sigma(z) = \frac{1}{1 + e^{-z}}$, and $y_w$, $y_l$ denote the preferred (winning) and rejected (losing) answers in each pair.
Note, we can show the two are connected. Assuming $a = r(x, y_1)$ and $b = r(x, y_2)$:

$$
\frac{e^{a}}{e^{a} + e^{b}} = \frac{1}{1 + e^{b - a}} = \sigma(a - b)
$$

i.e. the Bradley-Terry probability is exactly the logistic function applied to the reward difference.
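To make the reward modeling objective concrete, here's a minimal PyTorch sketch of the Bradley-Terry negative log-likelihood. It assumes scalar rewards have already been produced by some reward model head; the `chosen_rewards` / `rejected_rewards` tensors and the function name are hypothetical placeholders, not code from this notebook.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(chosen_rewards: torch.Tensor, rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of the Bradley-Terry model for a batch of preference pairs.

    chosen_rewards / rejected_rewards: shape [batch], scalar rewards r(x, y_w) and r(x, y_l).
    Since exp(a) / (exp(a) + exp(b)) = sigmoid(a - b), the per-pair loss is
    -log sigmoid(r_w - r_l), i.e. binary cross entropy with a "chosen wins" label.
    """
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# toy example with a batch of three preference pairs
chosen = torch.tensor([1.2, 0.3, 2.0])
rejected = torch.tensor([0.1, 0.5, -1.0])
print(bradley_terry_loss(chosen, rejected))  # smaller when chosen rewards exceed rejected ones
```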
While RLHF is capable of producing models with impressive capabilities (supposedly one of the secret sauces behind ChatGPT), its pipeline can be considerably more complex than supervised learning, involving training multiple LMs and sampling from the LLM policy as part of the training loop. One of the key insights in Direct Preference Optimization (DPO) [9] is replacing this complex process in RLHF with a supervised learning algorithm that implicitly optimizes the same objective as RLHF.

The DPO paper shows an optimal solution to this optimization problem is:

$$
\pi^*(y \mid x) = \frac{1}{Z(x)}\, \pi_{\text{ref}}(y \mid x)\, \exp\!\left(\frac{1}{\beta} r(x, y)\right)
$$

where:
$Z(x) = \sum_{y} \pi_{\text{ref}}(y \mid x)\, \exp\!\left(\frac{1}{\beta} r(x, y)\right)$ is the partition function that normalizes the distribution over completions.
While an exact solution exists, it is hard to utilize in practice. Computing the partition function $Z(x)$ entails summing over all possible answers that can be generated by our LLM, making it computationally intractable.
The trick is to re-arrange the above expression so it is written in terms of the reward function:

$$
r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)
$$

Given this reward function expression, we can now plug it back into the Bradley-Terry model expression:

$$
p^*(y_1 \succ y_2 \mid x) = \sigma\!\left( \beta \log \frac{\pi^*(y_1 \mid x)}{\pi_{\text{ref}}(y_1 \mid x)} - \beta \log \frac{\pi^*(y_2 \mid x)}{\pi_{\text{ref}}(y_2 \mid x)} \right)
$$

With the Bradley-Terry model depending only on the reward difference between the two completions, the computationally expensive $\beta \log Z(x)$ term cancels out. The final DPO loss is:

$$
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right)\right]
$$
And with all that, we now have the probability of human preference data in terms of our optimal policy rather than a reward model.
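To make the algebra concrete before the full implementation, here's a minimal PyTorch sketch of this loss. It assumes we've already summed the per-token log probabilities of each completion under both the policy and the frozen reference model; the four `*_logps` arguments and the function name are hypothetical placeholders for those quantities.

```python
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log pi_theta(y_w | x), shape [batch]
    policy_rejected_logps: torch.Tensor,  # log pi_theta(y_l | x), shape [batch]
    ref_chosen_logps: torch.Tensor,       # log pi_ref(y_w | x), shape [batch]
    ref_rejected_logps: torch.Tensor,     # log pi_ref(y_l | x), shape [batch]
    beta: float = 0.1,
):
    """DPO loss given summed per-sequence log probabilities of each completion."""
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    # argument of the sigmoid in the DPO objective
    logits = beta * (pi_logratios - ref_logratios)
    loss = -F.logsigmoid(logits).mean()

    # implicit rewards: beta-scaled log-prob differences between policy and reference,
    # the same quantities reported later as rewards/chosen and rewards/rejected
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return loss, chosen_rewards, rejected_rewards
```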
Similar to existing algorithms, Direct Preference Optimization (DPO) uses a preference model, such as the Bradley-Terry model, to evaluate how well a reward function aligns with empirical preference data. However, while RLHF methods train a reward model based on the preference model and then optimize a policy to maximize that learned reward, DPO takes a different approach. It uses a change of variables to directly define a preference loss as a function of the policy itself. This is DPO's main contribution: an algorithm that can train language models from human preferences with a binary cross-entropy objective, discarding the need for reinforcement learning methods.
Implementation
This implementation section consists of two parts: we'll roll out the DPO loss calculation ourselves, as well as showcase how to leverage the trl library's DPOTrainer [3] [5]. We'll be leveraging UltraFeedback [10] as our dataset. UltraFeedback is a synthetic preference dataset collected via LLMs. At a high level, the authors compiled 60K diverse instructions, prompted a pool of distinct models at different capability levels to generate completions, and finally leveraged GPT-4 to annotate completion pairs.
For DPO, we need two copies of the model, one serving as the model/policy we wish to optimize, and another serving as the frozen reference model, hence the memory requirement will be higher than for supervised fine-tuning.
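A rough setup sketch might look like the following; both the model id and the dataset id/split are assumptions for illustration (a Qwen 2.5 3B instruct checkpoint like the one referenced later, and a binarized UltraFeedback variant), so substitute whichever checkpoints and splits you actually use.

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

# model / dataset ids are assumptions for illustration
model_name = "Qwen/Qwen2.5-3B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# policy we wish to optimize
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
# second copy acting as the frozen reference policy pi_ref
ref_model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
for param in ref_model.parameters():
    param.requires_grad_(False)

# binarized UltraFeedback split with chosen/rejected completion pairs
dataset = load_dataset("HuggingFaceH4/ultrafeedback_binarized", split="train_prefs")
```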
DPOTrainer
DPOTrainer logs several reward-related metrics for us to monitor as part of its training process (a minimal usage sketch follows this list):
rewards/chosen: the mean difference between the policy model's and the reference model's log probabilities for the chosen responses, scaled by beta.
rewards/rejected: Same as above but for rejected responses.
rewards/accuracies: how often rewards/chosen is greater than its corresponding rewards/rejected.
rewards/margins: the mean difference between the chosen and corresponding rejected rewards.
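Putting this together with trl, a minimal training sketch might look like the following. It reuses model, ref_model, tokenizer, and dataset from the earlier setup snippet; exact keyword names (e.g. processing_class vs. tokenizer) differ across trl versions, and the hyperparameters are placeholders rather than recommendations.

```python
from trl import DPOConfig, DPOTrainer

training_args = DPOConfig(
    output_dir="qwen2.5-3b-dpo-ultrafeedback",  # hypothetical output path
    beta=0.1,                       # KL penalty weight from the DPO loss
    learning_rate=5e-7,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
    bf16=True,
    logging_steps=10,
)

trainer = DPOTrainer(
    model=model,              # policy being optimized
    ref_model=ref_model,      # frozen reference policy
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,  # older trl versions take tokenizer= instead
)
trainer.train()
```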
We can also compare the final result using an LLM as a pairwise judge. In our case, this boils down to comparing the answers generated by the original base/instruct model with those generated by the model that was aligned using DPO.
llm_judge_responses.parquet shows a pairwise comparison of the Qwen 2.5 3B instruct reference model versus the same model after being aligned with DPO on the UltraFeedback dataset.
The 3B model was trained via dpo_train.py (took approx. 1 hour on 16 A100s).
Responses were generated via generate.py.
LLM based judgements were curated via claude_judge.py.
The LLM-as-a-judge script largely follows the concept introduced as part of a separate notebook, except we call Claude through AWS Bedrock's API to obtain the judge's responses; a minimal sketch of the judging call is shown below.
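This sketch assumes boto3 access to Bedrock; the judge model id, prompt wording, and function name are hypothetical, and the actual claude_judge.py may differ.

```python
import boto3

# hypothetical judge model id; substitute the Claude model you have access to in Bedrock
JUDGE_MODEL_ID = "anthropic.claude-3-5-sonnet-20240620-v1:0"
bedrock = boto3.client("bedrock-runtime")

JUDGE_TEMPLATE = """You are judging two answers to the same prompt.

Prompt:
{prompt}

Answer A:
{answer_a}

Answer B:
{answer_b}

Reply with exactly one letter: "A" if Answer A is better, "B" if Answer B is better."""

def judge_pair(prompt: str, answer_a: str, answer_b: str) -> str:
    """Ask the judge model which of the two answers it prefers, returning 'A' or 'B'."""
    response = bedrock.converse(
        modelId=JUDGE_MODEL_ID,
        messages=[{
            "role": "user",
            "content": [{"text": JUDGE_TEMPLATE.format(
                prompt=prompt, answer_a=answer_a, answer_b=answer_b)}],
        }],
        inferenceConfig={"maxTokens": 5, "temperature": 0.0},
    )
    return response["output"]["message"]["content"][0]["text"].strip()
```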
Since DPO was introduced, subsequent works such as IPO and KTO have attempted to further improve upon it. Though thorough comparisons of these approaches thus far seem to indicate that they all offer similar performance when hyperparameter scans are properly conducted [4].
Reference
[1] Youtube: Direct Preference Optimization (DPO) explained
[2] Youtube: CS 285: Eric Mitchell: Reinforcement Learning from Human Feedback: Algorithms & Applications
[3] Huggingface Blog: Fine-tune Llama 2 with DPO
[4] Huggingface Blog: Preference Tuning LLMs with Direct Preference Optimization Methods
[5] RLHF in 2024 with DPO & Hugging Face
[6] Unveiling the Hidden Reward System in Language Models: A Dive into DPO
[7] Github: DPO - Direct Preference Optimization
[8] Deriving DPO’s Loss
[9] Rafael Rafailov, Archit Sharma, Eric Mitchell et al. - Direct Preference Optimization: Your Language Model is Secretly a Reward Model (2023)
[10] Ganqu Cui, Lifan Yuan, et al. - UltraFeedback: Boosting Language Models with Scaled AI Feedback (2023)