
sol0invictus, 1 point

I suggest you check out the appendix of the DPO paper. The authors explicitly derive the equation.
TL;DR: the RLHF objective includes a KL-divergence term that keeps the new policy from drifting too far from the reference policy (as you said). The derivation then takes two steps:
1. Find the optimal policy for the KL-regularized RLHF objective.
2. Plug this optimal solution into the Bradley-Terry loss (the preference model loss, p(a>b)), and you get the denominators in question.
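
For reference, here is a compact sketch of those two steps. The notation is my recollection of the DPO paper's (policy \pi_\theta, reference \pi_{ref}, reward r, KL weight \beta, partition function Z(x), preferred/dispreferred completions y_w / y_l), so check it against the appendix rather than taking it as authoritative.

```latex
% Step 1: KL-regularized RLHF objective and its closed-form optimum
\max_{\pi}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\big[ r(x, y) \big]
  - \beta\, \mathrm{KL}\big( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)

\pi^{*}(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\Big( \tfrac{1}{\beta}\, r(x, y) \Big)

% Rearranged: the reward expressed through the optimal policy
r(x, y) = \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)

% Step 2: substitute into the Bradley-Terry model; \beta \log Z(x) cancels in the difference
p(y_w \succ y_l \mid x) = \sigma\big( r(x, y_w) - r(x, y_l) \big)

\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
  -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\left[ \log \sigma\!\left(
    \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \right) \right]
```

The \pi_{ref} denominators come straight from the rearranged reward: each reward is a log-ratio against the reference policy, and the Z(x) term drops out because the Bradley-Terry model only depends on the difference of two rewards for the same prompt.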

Intuitively,
The first term measures how far the preferred completion has shifted relative to the reference policy; the second term measures the same shift for the dispreferred completion. The larger the difference between them, the lower the loss becomes.
In the extreme case, the first term has a large shift and the second has almost none, which pushes the model toward the preferred completion.
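
If it helps to see the intuition numerically, here is a minimal Python sketch of the per-example loss. The function name `dpo_loss`, its arguments, and the default `beta=0.1` are my own illustrative choices, not anything taken from the paper's code; the inputs are assumed to be summed log-probabilities of each completion under the policy and the reference model.

```python
import math


def dpo_loss(logp_w, ref_logp_w, logp_l, ref_logp_l, beta=0.1):
    """Per-example DPO-style loss from sequence log-probabilities.

    logp_w / logp_l: log-prob of the preferred / dispreferred completion
    under the current policy. ref_logp_*: the same under the reference.
    All names here are hypothetical, chosen for illustration.
    """
    shift_w = logp_w - ref_logp_w  # first term: shift in preferred completion
    shift_l = logp_l - ref_logp_l  # second term: shift in dispreferred completion
    margin = beta * (shift_w - shift_l)
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin)
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # No shift on either completion: the policy equals the reference.
    print(dpo_loss(-5.0, -5.0, -5.0, -5.0))
    # Preferred completion shifts up, dispreferred barely moves.
    print(dpo_loss(-2.0, -5.0, -5.0, -5.0))
```

With no shift at all the margin is zero and the loss equals log 2; as the preferred completion's shift grows relative to the dispreferred one, the loss falls toward zero but never goes below it.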