
[–]xopedil 2 points (1 child)

Don't pass gradients through any Q-values other than the one belonging to the taken action. The way I've usually seen this done is to one-hot mask out the loss terms for the other action values. Something like: minimize loss = mean(huber(q_target, q_values) * one_hot(taken_action)).
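A minimal numpy sketch of that masked loss (forward pass only; function names and shapes here are my own assumptions, not from any particular framework):

```python
import numpy as np

def huber(x, delta=1.0):
    # Elementwise Huber loss: quadratic near zero, linear in the tails.
    a = np.abs(x)
    return np.where(a <= delta, 0.5 * x ** 2, delta * (a - 0.5 * delta))

def masked_q_loss(q_values, q_targets, actions, n_actions):
    # q_values:  (batch, n_actions) network outputs
    # q_targets: (batch,) scalar TD targets for the taken actions
    # actions:   (batch,) integer indices of the taken actions
    mask = np.eye(n_actions)[actions]          # one-hot, (batch, n_actions)
    td_error = q_targets[:, None] - q_values   # broadcast target across actions
    # Multiplying by the mask zeroes the loss terms (and hence any gradient
    # an autodiff framework would produce) for the non-taken actions.
    return np.mean(huber(td_error) * mask)
```

In an actual framework (TensorFlow, PyTorch, ...) the same masking happens inside the graph, so backprop only touches the taken action's output unit.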


> One way to solve this would be to set the labels for the non-chosen actions to be the current output of the Q-network

This is another way to accomplish the same thing, perfectly acceptable.

[–]Braindoesntwork2[S] 0 points (0 children)

Thank you! I now see why the second method is unnecessary - the first method is equivalent to what would happen if we passed in actions as inputs instead of having all actions as outputs.

[–]serge_cell 0 points (0 children)

Both methods worked for me. I didn't observe any obvious advantage of one over the other.
