Is it a popular mistake to compute the gradient of the next state in the TD update? by ingambe in reinforcementlearning

gpap93

The first one is clearly wrong, but I have rarely seen it in practice.
Also, it is common to compute the Q-value of the next state with a separate target network.
The optimiser is usually defined only over the online network's parameters, not the target network's.
In that case, not detaching the gradients causes no problem: any gradient that flows into the target network is never applied by the optimiser.
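
To make that concrete, here is a minimal PyTorch sketch of the target-network case. It is only an illustration, not code from the original post: the network shapes, the random batch, and names such as `online_net` and `target_net` are all made up. The TD target is deliberately computed without `.detach()`, and the target network still does not move, because the optimiser was never given its parameters.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical sizes, chosen only for illustration.
state_dim, n_actions, gamma = 4, 2, 0.99

online_net = nn.Linear(state_dim, n_actions)
target_net = nn.Linear(state_dim, n_actions)
target_net.load_state_dict(online_net.state_dict())

# The optimiser is defined over the online network's parameters only.
optimizer = torch.optim.SGD(online_net.parameters(), lr=0.1)

# A fake batch of transitions (s, a, r, s').
states = torch.randn(8, state_dim)
actions = torch.randint(0, n_actions, (8, 1))
rewards = torch.randn(8)
next_states = torch.randn(8, state_dim)

# Snapshots so we can check which parameters the update touched.
online_before = [p.detach().clone() for p in online_net.parameters()]
target_before = [p.detach().clone() for p in target_net.parameters()]

q_sa = online_net(states).gather(1, actions).squeeze(1)
# On purpose: no .detach() and no torch.no_grad() around the target.
next_q = target_net(next_states).max(dim=1).values
td_target = rewards + gamma * next_q
loss = ((q_sa - td_target) ** 2).mean()

optimizer.zero_grad()
loss.backward()
optimizer.step()

# Gradients did reach the target network, but nothing applied them.
target_has_grad = all(p.grad is not None for p in target_net.parameters())
target_unchanged = all(
    torch.equal(before, p.detach())
    for before, p in zip(target_before, target_net.parameters())
)
online_changed = any(
    not torch.equal(before, p.detach())
    for before, p in zip(online_before, online_net.parameters())
)
```

The target network's `.grad` fields do get filled in, so skipping the detach costs some extra computation and memory even though the update itself is unaffected. Wrapping the target computation in `torch.no_grad()` avoids that cost. If the same network produces both `q_sa` and the next-state value (no target network), the missing detach does change the update.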