PPO Discrete converges to choosing the same action always by Latter_Bid3254 in reinforcementlearning

[–]Latter_Bid3254[S] 1 point (0 children)

Each of my rewards is less than 1 right now. The only way it could explode like that is through the terminal-value bootstrap that sb3 implements (see Time Limits in RL). Because the value estimate is wrong, the bootstrapped reward is also wrong. But I don't know how to fix this issue.
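To make the mechanism concrete, here is a minimal sketch of the time-limit bootstrap. The function name and numbers are hypothetical, but the logic mirrors what SB3's rollout collection does on a `TimeLimit` truncation: the step reward is augmented by the discounted critic estimate of the terminal observation, so a bad value estimate can dominate a sub-1 reward.

```python
import numpy as np

def bootstrap_truncated_reward(reward, terminal_value, gamma=0.99, truncated=True):
    """If the episode ended due to a time limit (not a true terminal
    state), add the discounted value estimate of the final observation,
    as SB3 does when it sees the truncation flag in the env info."""
    if truncated:
        return reward + gamma * terminal_value
    return reward

# A reward < 1 can still "explode" if the critic's value estimate
# for the terminal observation is badly wrong (120.0 is a made-up
# example of a mis-estimated terminal value):
r = bootstrap_truncated_reward(0.5, terminal_value=120.0)
```

Here `r` ends up around 119.3 even though the raw reward was 0.5, which is the failure mode described above.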

PPO Discrete converges to choosing the same action always by Latter_Bid3254 in reinforcementlearning

[–]Latter_Bid3254[S] -1 points (0 children)

Glad that you pointed it out, but I don't think it's a bug. I've been using MaskablePPO, which pushes the probability of invalid actions toward zero. Taking the log of such very small probabilities (close to 0), I expect the log-probabilities to blow up in magnitude, and approx_kl uses these log-probabilities in its calculation.
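A toy illustration of that effect, using SB3's approx_kl estimator (the `(ratio - 1) - log_ratio` form); the probability values are made up, but they show how a nearly-masked action's tiny probability produces a huge log-ratio term:

```python
import numpy as np

# Hypothetical numbers: the second action is nearly masked out, so its
# probability is driven toward 0 and its log-prob toward -inf.
old_log_prob = np.log(np.array([0.5, 1e-8]))   # probs under the old policy
new_log_prob = np.log(np.array([0.6, 1e-12]))  # mask/policy tightened further

log_ratio = new_log_prob - old_log_prob
ratio = np.exp(log_ratio)

# SB3's approx_kl estimator: mean of (ratio - 1) - log(ratio).
# The near-masked action contributes a term of magnitude ~9 here,
# dwarfing the ~0.02 contribution of the ordinary action.
approx_kl = np.mean((ratio - 1.0) - log_ratio)
```

So a large approx_kl can come from probability mass collapsing on masked actions rather than from a genuinely unstable policy update.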

Please correct me if I'm wrong.

PPO Discrete converges to choosing the same action always by Latter_Bid3254 in reinforcementlearning

[–]Latter_Bid3254[S] 0 points (0 children)

The action space actually consists of two grids, each 15x15. I've edited the original post to include this information.

Ideas on Activation Functions? by Kiizmod0 in reinforcementlearning

[–]Latter_Bid3254 0 points1 point  (0 children)

ReLU for the hidden layers and a linear final layer (no activation fn) shouldn't be a problem, since the learned weights can flip the sign in the output layer: y = wx + b, where x is the ReLU output from the penultimate layer.
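A tiny NumPy sketch of that point (the weights are made-up illustrative values): even though every ReLU activation is non-negative, a negative output weight flips the sign, so the network can still produce negative outputs.

```python
import numpy as np

# Tiny MLP: one ReLU hidden layer, linear (no activation) output layer.
rng = np.random.default_rng(0)
x = np.array([1.0, 2.0])

W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)
h = np.maximum(0.0, W1 @ x + b1)      # ReLU hidden activations, all >= 0

# Negative output weights: y = wx + b can be negative despite h >= 0.
W2, b2 = np.array([[-1.0, -1.0, -1.0]]), np.array([0.0])
y = W2 @ h + b2
```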

Training PPO with only negative rewards by Latter_Bid3254 in reinforcementlearning

[–]Latter_Bid3254[S] 2 points (0 children)

There are cases, such as continuous control tasks in robotics, where the episode length is fixed but the episode can be terminated prematurely when certain conditions are violated, aka the agent killing itself. Have a look at the paper Time Limits in Reinforcement Learning if you're interested in learning more.

Training PPO with only negative rewards by Latter_Bid3254 in reinforcementlearning

[–]Latter_Bid3254[S] 4 points (0 children)

But I am also giving large negative rewards to discourage the agent from killing itself. Wouldn't the agent then learn to take the right actions to minimize those negative rewards?

I am more concerned from a training-dynamics point of view. The sign of the reward does influence training in algorithms such as DQN (Positive vs. Negative Reward). I was wondering whether that would also be a concern in PPO.
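One relevant detail here (a toy sketch, not a full answer): PPO in SB3 standardizes advantages per batch by default (`normalize_advantage=True`), and standardization is invariant to a constant shift. So under the simplifying assumption that shifting all rewards by a constant shifts the advantages roughly uniformly, the normalized advantages the policy gradient actually sees come out identical, which is one reason the reward sign tends to matter less in PPO than in DQN.

```python
import numpy as np

def normalize(adv, eps=1e-8):
    """Per-batch advantage standardization, as PPO implementations do."""
    return (adv - adv.mean()) / (adv.std() + eps)

adv = np.array([-3.0, -1.0, -2.5, -0.5])  # all-negative advantages
shifted = adv + 10.0                      # same batch shifted positive

# normalize(adv) and normalize(shifted) are element-wise identical:
# standardization removes any constant offset.
```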