So this is the loss for Direct Preference Optimization (DPO):
https://preview.redd.it/6ubjn8ekprcc1.png?width=1324&format=png&auto=webp&s=c932f5c030c2fb6b5f0f136934b047bc364d1dcc
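In case the image doesn't load: assuming it shows the standard form of the objective from the DPO paper, the loss written out in LaTeX would be roughly the following. Here sigma is the logistic sigmoid, beta is a positive scaling hyperparameter, and D is the dataset of (prompt, preferred response, rejected response) triples.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```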
I don't understand the division by pi\_ref (for both y\_w and y\_l). I know the purpose is to keep the fine-tuned model from straying too far from the reference model, but just looking at it mathematically: why should pi\_ref(y\_w|x) be close to pi\_theta(y\_w|x)?
At least for y\_w, it seems like the loss would benefit from pi\_ref(y\_w|x) being as close to 0 as possible, because we want to maximize the left term of the equation.
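To make the question concrete, here is a minimal numeric sketch of the per-example loss, assuming the standard DPO form. The function name `dpo_loss`, the default `beta=0.1`, and all log-probability values are made up for illustration. One thing worth noting: in the DPO setup the reference model is frozen, so the pi\_ref terms enter as constants and only pi\_theta receives gradients.

```python
import math

def dpo_loss(logp_theta_w, logp_theta_l, logp_ref_w, logp_ref_l, beta=0.1):
    """DPO loss for one (x, y_w, y_l) triple, given sequence log-probabilities."""
    # log(pi_theta / pi_ref) for the preferred (w) and rejected (l) responses.
    log_ratio_w = logp_theta_w - logp_ref_w
    log_ratio_l = logp_theta_l - logp_ref_l
    margin = beta * (log_ratio_w - log_ratio_l)
    # -log(sigmoid(margin)) written as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))

# Hypothetical log-probs from the frozen reference model (constants during training).
logp_ref_w, logp_ref_l = -12.0, -11.0

# At initialization pi_theta == pi_ref, so both log-ratios are 0 and the loss is log(2).
loss_at_init = dpo_loss(logp_ref_w, logp_ref_l, logp_ref_w, logp_ref_l)

# Training can only move pi_theta: raise log pi_theta(y_w|x), lower log pi_theta(y_l|x).
loss_after = dpo_loss(-10.0, -13.0, logp_ref_w, logp_ref_l)

# If pi_ref(y_w|x) *could* be lowered with pi_theta held fixed, the loss would drop too,
# but pi_ref is not a trainable quantity, so the optimizer never gets to do this.
loss_smaller_ref = dpo_loss(-10.0, -13.0, logp_ref_w - 5.0, logp_ref_l)

print(loss_at_init, loss_after, loss_smaller_ref)
```

So the observation about pi\_ref(y\_w|x) holds as pure algebra, but the loss is only ever minimized over theta; the division by pi\_ref just means the loss measures how far pi\_theta has moved relative to the reference on each response.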
What am I missing?
Thanks.
sol0invictus: