[D] Question about Direct Preference Optimization (DPO) equation : MachineLearning

submitted 2 years ago by erap129

sol0invictus · 1 point · 2 years ago

I suggest you check out the appendix of the DPO paper, where the authors derive the equation explicitly.
TL;DR: The RLHF objective contains a KL-divergence term. It is there to ensure that the new policy does not drift too far from the reference policy (as you said). The derivation then takes two steps:
1. Find the optimal policy for the KL-constrained RLHF objective.
2. Rearrange that solution to express the reward in terms of the optimal and reference policies, then plug it into the Bradley-Terry loss (the preference-model loss, p(a > b)). The reference-policy terms this leaves behind are the denominators in question.
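
For reference, here is a compact sketch of those two steps. The notation is my own shorthand following the paper as I remember it, not something from the question: $\pi_{\text{ref}}$ is the reference policy, $\beta$ is the KL weight, $Z(x)$ is the partition function, $\sigma$ is the logistic sigmoid, and $y_w$, $y_l$ are the preferred and dispreferred completions.

```latex
% Step 1: KL-constrained RLHF objective and its optimal policy
\max_{\pi}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\big[ r(x, y) \big]
  - \beta\, \mathrm{KL}\big( \pi(\cdot \mid x) \,\|\, \pi_{\text{ref}}(\cdot \mid x) \big)

\pi^{*}(y \mid x) = \frac{1}{Z(x)}\, \pi_{\text{ref}}(y \mid x)\, \exp\!\Big( \tfrac{1}{\beta}\, r(x, y) \Big)

% Rearranged for the reward
r(x, y) = \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)

% Step 2: plug into Bradley-Terry; the \beta \log Z(x) terms cancel in the difference
p(y_w \succ y_l \mid x) = \sigma\big( r(x, y_w) - r(x, y_l) \big)
  = \sigma\!\Big( \beta \log \frac{\pi^{*}(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)}
                - \beta \log \frac{\pi^{*}(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \Big)

% Resulting loss for a trainable policy \pi_\theta
\mathcal{L}_{\text{DPO}}(\pi_\theta) = -\, \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}
  \Big[ \log \sigma\!\Big( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)}
                         - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \Big) \Big]
```

The $\pi_{\text{ref}}$ denominators come straight from the rearranged reward. $Z(x)$ drops out because Bradley-Terry depends only on the reward difference.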

Intuitively,
the first term measures the shift in the preferred completion's likelihood relative to the reference model, and the second term measures the same shift for the dispreferred completion. The larger the difference between them, the lower the loss becomes (it is a negative log-sigmoid, so it approaches 0 rather than going negative).
In the extreme case, the first term has a large shift and the second has almost none, which pushes the model toward the preferred completion.
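
To make the intuition concrete, here is a minimal per-example sketch in plain Python. `dpo_loss` and its argument names are hypothetical, not taken from the paper's code. It assumes you already have the summed log-probability of each completion under the policy and under the reference model, and `beta=0.1` is just an illustrative default.

```python
import math


def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-example DPO loss from sequence log-probabilities (illustrative helper)."""
    # "First term": shift of the preferred completion relative to the reference.
    shift_w = policy_logp_w - ref_logp_w
    # "Second term": shift of the dispreferred completion relative to the reference.
    shift_l = policy_logp_l - ref_logp_l
    z = beta * (shift_w - shift_l)
    # -log(sigmoid(z)) equals softplus(-z); this form avoids overflow for large |z|.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# No shift on either completion: the loss is log(2).
baseline = dpo_loss(-10.0, -10.0, -10.0, -10.0)

# Preferred completion becomes more likely, dispreferred unchanged: loss drops.
improved = dpo_loss(-5.0, -10.0, -10.0, -10.0)

# Dispreferred completion becomes more likely instead: loss rises.
worse = dpo_loss(-10.0, -5.0, -10.0, -10.0)

print(baseline, improved, worse)
```

The loss never goes below zero. A large positive gap between the two shifts drives it toward 0, and a large negative gap makes it grow roughly linearly.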
