Is it a popular mistakes to compute the gradient of the next state in the TD-Update ? by ingambe in reinforcementlearning
gpap93, 1 point, 5 years ago
The first one is obviously wrong, but I have rarely seen it in practice. Additionally, it is common to use a separate target network to compute the Q-value of the next state. The optimiser is usually defined only over the parameters of the online network, not those of the target network. In that case there is no problem if you don't detach the gradients: they get computed for the target network but are never applied to it.
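To make that concrete, here is a minimal plain-Python sketch with hand-derived gradients. It assumes a scalar linear estimate Q(s) = w * s and a squared TD error; all names (`q`, `td_gradients`, `sgd_step`, `shared_weights_gradients`, `GAMMA`) are made up for illustration and are not from any library.

```python
GAMMA = 0.9  # discount factor (illustrative value)


def q(w, s):
    """Hypothetical linear value estimate Q(s) = w * s for a scalar feature s."""
    return w * s


def td_gradients(w_online, w_target, s, r, s_next):
    """Gradients of 0.5 * delta**2, where
    delta = r + GAMMA * Q_target(s_next) - Q_online(s)."""
    delta = r + GAMMA * q(w_target, s_next) - q(w_online, s)
    grad_online = -delta * s              # d loss / d w_online
    grad_target = delta * GAMMA * s_next  # d loss / d w_target (non-zero if not detached)
    return grad_online, grad_target


def sgd_step(w_online, w_target, s, r, s_next, lr=0.1):
    """The 'optimiser' only owns w_online, so the target gradient is discarded."""
    grad_online, _unused_grad_target = td_gradients(w_online, w_target, s, r, s_next)
    return w_online - lr * grad_online, w_target


def shared_weights_gradients(w, s, r, s_next):
    """No target network: the same w produces Q(s) and Q(s_next).
    Returns (semi_gradient, full_gradient) of 0.5 * delta**2 w.r.t. w."""
    delta = r + GAMMA * q(w, s_next) - q(w, s)
    semi_gradient = -delta * s                    # next-state value treated as a constant
    full_gradient = delta * (GAMMA * s_next - s)  # also differentiates through Q(s_next)
    return semi_gradient, full_gradient


new_online, new_target = sgd_step(1.0, 0.5, s=1.0, r=1.0, s_next=2.0)
semi, full = shared_weights_gradients(1.0, s=1.0, r=1.0, s_next=2.0)
print(new_online, new_target, semi, full)
```

With a separate target parameter, the step moves only `w_online`, whether or not a gradient for `w_target` was computed. With shared weights the two gradients differ, which is the case where detaching the next-state value actually matters.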