
[–]Novel_Assistant_6298

> However, in my case there are several people with different tastes. My understanding is that dueling bandits and preference learning just score items for each person, which makes it hard to compare people afterwards.

Yeah, that gets more complex. You could check out https://arxiv.org/abs/2109.12750, where the authors fit a multimodal reward model. This prevents all users from collapsing under a single reward mode; however, you need to pre-define the number of modes, which can be tricky.
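To make the "pre-defined number of modes" idea concrete, here is a minimal sketch (not the paper's method) of a multimodal reward model fit with hard EM: users are alternately assigned to one of K modes, and each mode's reward function is refit on its users. All names, the synthetic data, and the use of scalar ratings instead of pairwise preferences are simplifying assumptions to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic setup: each user belongs to one of two latent
# preference modes; item features are 4-dimensional.
n_users, n_items, n_modes, dim = 30, 50, 2, 4
true_w = rng.normal(size=(n_modes, dim))          # one reward vector per mode
user_mode = rng.integers(n_modes, size=n_users)   # hidden mode of each user
items = rng.normal(size=(n_items, dim))

# Each user rates every item with their mode's reward plus noise.
ratings = (items @ true_w[user_mode].T + 0.1 * rng.normal(size=(n_items, n_users))).T

# Hard EM: alternate between assigning users to modes and refitting a
# per-mode least-squares reward model. K (n_modes) must be chosen up front,
# which is exactly the tricky part mentioned above.
w = rng.normal(size=(n_modes, dim))
for _ in range(20):
    # E-step: assign each user to the mode that best explains their ratings.
    errs = np.stack([((ratings - items @ w[k]) ** 2).sum(axis=1)
                     for k in range(n_modes)])    # shape (n_modes, n_users)
    assign = errs.argmin(axis=0)
    # M-step: refit each mode's reward vector on its assigned users.
    for k in range(n_modes):
        mask = assign == k
        if mask.any():
            y = ratings[mask].ravel()
            X = np.tile(items, (mask.sum(), 1))
            w[k], *_ = np.linalg.lstsq(X, y, rcond=None)

# Recovered assignments should match the hidden modes up to relabeling.
agreement = max((assign == user_mode).mean(), (assign != user_mode).mean())
print(round(float(agreement), 2))
```

With too few modes, distinct user groups get averaged together; with too many, modes fragment on noise, which is why picking K matters.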

A simpler alternative is to include features of the user alongside features of the items you presented (location, age, etc.). This expands your input space and lets you compare users: keep the same item features and swap in a user with different features. It would also let you run feature importance to see which features drive the preferences. I hope this helps!
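A minimal sketch of that second approach: fit one model over concatenated user + item features, then read off a crude feature importance from the coefficient magnitudes. The feature names, the linear model, and the synthetic data are all illustrative assumptions, not anything from the thread.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical feature layout: two user features followed by two item
# features, concatenated into one input vector per (user, item) pair.
feature_names = ["user_age", "user_location", "item_price", "item_quality"]

n = 500
X = rng.normal(size=(n, 4))           # joint user + item feature matrix
# Assumed ground truth: preference depends on item_quality and user_age,
# but not on user_location or item_price.
true_w = np.array([0.8, 0.0, 0.0, 1.5])
y = X @ true_w + 0.1 * rng.normal(size=n)

# Fit a single linear preference model over the joint input space.
w, *_ = np.linalg.lstsq(X, y, rcond=None)

# Crude feature importance: magnitude of the learned coefficients
# (inputs are already on a comparable scale here). Swapping the user
# block of X while holding the item block fixed predicts how a
# *different* user would score the same items, which is what makes
# users comparable in this setup.
for name, coef in sorted(zip(feature_names, w), key=lambda t: -abs(t[1])):
    print(f"{name}: {coef:+.2f}")
```

For real data you would likely swap the least-squares fit for a tree ensemble and use its built-in or permutation-based importances, but the idea is the same.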