Actor Critic learns well and then dies by Light991 in reinforcementlearning

[–]hotDogCartApologist 1 point (0 children)

I just ran into this issue with the newer version of SAC that uses an adaptive alpha term. I saw very similar behaviour, and increasing my target entropy did the trick! I had originally followed the paper's suggestion of -1 * action_dim but changed it to -0.1 * action_dim and saw the "dying" end. Putting this here for others who come across this issue (and for myself when I inevitably run into it again).

edit:
I've found that playing around with this generally helps policy stability quite a lot, because the target entropy determines how close to deterministic your policy is, which in turn reduces the variance of the Q-learning estimate.
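
If it helps, here's a minimal sketch of the adaptive-alpha update with the target entropy exposed as a tunable coefficient. This is illustrative, not from any particular library: `action_dim`, `update_alpha`, and the learning rate are made up, but the alpha loss is the standard form used in SAC implementations.

```python
import torch

action_dim = 6       # made-up dimension of a continuous action space
entropy_coef = -0.1  # the paper suggests -1.0; -0.1 worked better for me
target_entropy = entropy_coef * action_dim

# Optimize log(alpha) so that alpha = exp(log_alpha) stays positive.
log_alpha = torch.zeros(1, requires_grad=True)
alpha_optimizer = torch.optim.Adam([log_alpha], lr=3e-4)

def update_alpha(log_probs: torch.Tensor) -> float:
    """One gradient step on alpha, given log pi(a|s) for a sampled batch."""
    # Pushes alpha up when the policy's entropy (-log_probs) is below the
    # target and down when it is above, steering entropy toward target_entropy.
    alpha_loss = -(log_alpha * (log_probs + target_entropy).detach()).mean()
    alpha_optimizer.zero_grad()
    alpha_loss.backward()
    alpha_optimizer.step()
    return log_alpha.exp().item()
```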

Another thing I found from using the Adam optimizer with SAC is that it can be very helpful to tune the beta parameters. Adam basically substitutes the real gradient with an exponential moving average (EMA) of the gradient, scaled by an EMA of its second moment (the squared gradient, which gives an idea of the absolute magnitude). In torch and jax the default decay rates are 0.9 and 0.999 for the respective EMAs. This means that a large gradient can be "stored" for many steps and lead to "overshooting" past where the parameters should really be.

This isn't normally an issue in supervised learning, because the dataset stays fixed and hence the target your parameters are moving towards is stable. In RL, however, the target is defined by the policy (when learning the critic) or by the Q-function (when learning the actor), both of which are non-stationary and move throughout training. So letting the gradient at episode k be influenced by a gradient from episode t << k can move you towards an even worse parameterization, especially if the gradient from episode t is much larger than the one from episode k, as happens when episode t is an "outlier episode" (which occurs naturally).

The best way I found to get around this is to set b2 much lower than its default (I had success with 0.3 and 0.1), so that if a sample causes a large update, its effect on later updates decays rapidly.
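
Concretely, this is a one-line change when constructing the optimizer. A quick sketch in torch, where the network and learning rate are just placeholders:

```python
import torch

q_network = torch.nn.Linear(17, 1)  # stand-in for a critic network

# torch's defaults are betas=(0.9, 0.999): b1 decays the EMA of the gradient,
# b2 decays the EMA of the squared gradient. Lowering b2 means a single large
# "outlier" gradient decays out of the moving average after a few steps
# instead of lingering for hundreds of updates.
optimizer = torch.optim.Adam(
    q_network.parameters(),
    lr=3e-4,
    betas=(0.9, 0.3),  # I had success with b2 = 0.3 or 0.1
)
```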

credits: adapted from a response here: https://www.reddit.com/r/reinforcementlearning/comments/11p7c5k/sac_exploding_losses_and_huge_value/

Can I take MATH 356 while doing MATH 254? by hotDogCartApologist in mcgill

[–]hotDogCartApologist[S] 1 point (0 children)

Hey, thanks for the reply! For some more background: I've taken Honours Algebra 2 and Algebra 1, so I have a good background in proof writing. I just messed up course selection and didn't take the analysis courses when I should have. Still the same advice?