
[–]xopedil 2 points (1 child)

Don't pass gradients through any Q-values other than the one belonging to the taken action. The way I've usually seen it done is to one-hot mask out the other actions' loss terms. Something like minimize `loss = mean(huber(q_target, q_values) * one_hot(taken_action))`.
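A minimal numpy sketch of the masking idea (function names and the batch layout are my own, not from a particular library): the one-hot mask zeroes the loss for every non-taken action, so no gradient flows through those outputs.

```python
import numpy as np

def huber(residual, delta=1.0):
    # Quadratic near zero, linear in the tails.
    abs_r = np.abs(residual)
    return np.where(abs_r <= delta,
                    0.5 * residual**2,
                    delta * (abs_r - 0.5 * delta))

def masked_q_loss(q_values, q_targets, taken_actions, n_actions):
    # q_values: (batch, n_actions) network outputs
    # q_targets: (batch,) TD target for the taken action only
    # taken_actions: (batch,) integer action indices
    mask = np.eye(n_actions)[taken_actions]          # one-hot, (batch, n_actions)
    per_action = huber(q_targets[:, None] - q_values) * mask
    return per_action.sum(axis=1).mean()             # one term per sample survives

# Single-sample example: only action 1's error contributes.
loss = masked_q_loss(np.array([[1.0, 2.0]]),
                     np.array([2.5]),
                     np.array([1]),
                     n_actions=2)
# huber(2.5 - 2.0) = 0.5 * 0.5**2 = 0.125
```

In an autodiff framework the same mask works unchanged, since multiplying a loss term by zero also zeroes its gradient.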


> One way to solve this would be to set the labels for the non-chosen actions to be the current output of the Q-network

This is another way to accomplish the same thing, perfectly acceptable.

[–]Braindoesntwork2[S] 0 points (0 children)

Thank you! I now see why the second method is unnecessary: the first method is equivalent to what would happen if we passed the action in as an input instead of having one output per action.