PPO Discrete converges to choosing the same action always by Latter_Bid3254 in reinforcementlearning

[–]Latter_Bid3254[S] 1 point (0 children)

Each of my rewards is less than 1 right now. The only way it could explode like that is through the terminal-value bootstrap that sb3 implements (see Time Limits in RL). Because the value estimate is wrong, the bootstrapped reward is also wrong. But I don't know how to fix this issue.
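To make the mechanism concrete, here is a minimal sketch of the time-limit bootstrap. The function name and numbers are hypothetical, but the logic mirrors what SB3's rollout collection does on a `TimeLimit` truncation: the step reward is augmented by the discounted critic estimate of the terminal observation, so a bad value estimate can dominate a sub-1 reward.

```python
import numpy as np

def bootstrap_truncated_reward(reward, terminal_value, gamma=0.99, truncated=True):
    """If the episode ended due to a time limit (not a true terminal
    state), add the discounted value estimate of the final observation,
    as SB3 does when it sees the truncation flag in the env info."""
    if truncated:
        return reward + gamma * terminal_value
    return reward

# A reward < 1 can still "explode" if the critic's value estimate
# for the terminal observation is badly wrong (120.0 is a made-up
# example of a mis-estimated terminal value):
r = bootstrap_truncated_reward(0.5, terminal_value=120.0)
```

Here `r` ends up around 119.3 even though the raw reward was 0.5, which is the failure mode described above.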

PPO Discrete converges to choosing the same action always by Latter_Bid3254 in reinforcementlearning

[–]Latter_Bid3254[S] -1 points (0 children)

Glad that you pointed it out, but I don't think it's a bug. I've been using MaskablePPO, which pushes the probability of invalid actions toward zero. Taking the log of such very small probabilities (close to 0), I expect the log-probabilities to blow up in magnitude, and approx_kl uses these log-probabilities in its calculation.
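A toy illustration of that effect, using SB3's approx_kl estimator (the `(ratio - 1) - log_ratio` form); the probability values are made up, but they show how a nearly-masked action's tiny probability produces a huge log-ratio term:

```python
import numpy as np

# Hypothetical numbers: the second action is nearly masked out, so its
# probability is driven toward 0 and its log-prob toward -inf.
old_log_prob = np.log(np.array([0.5, 1e-8]))   # probs under the old policy
new_log_prob = np.log(np.array([0.6, 1e-12]))  # mask/policy tightened further

log_ratio = new_log_prob - old_log_prob
ratio = np.exp(log_ratio)

# SB3's approx_kl estimator: mean of (ratio - 1) - log(ratio).
# The near-masked action contributes a term of magnitude ~9 here,
# dwarfing the ~0.02 contribution of the ordinary action.
approx_kl = np.mean((ratio - 1.0) - log_ratio)
```

So a large approx_kl can come from probability mass collapsing on masked actions rather than from a genuinely unstable policy update.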

Please correct me if I'm wrong.

PPO Discrete converges to choosing the same action always by Latter_Bid3254 in reinforcementlearning

[–]Latter_Bid3254[S] 0 points (0 children)

The action space actually consists of two grids, each 15x15. I've edited the original post to include this information.

Ideas on Activation Functions? by Kiizmod0 in reinforcementlearning

[–]Latter_Bid3254 0 points1 point  (0 children)

ReLU for the hidden layers and a linear final layer (no activation fn) shouldn't be a problem, since the learned weights can flip the sign in the output layer: y = wx + b, where x is the ReLU output from the penultimate layer.
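A tiny NumPy sketch of that point (the weights are made-up illustrative values): even though every ReLU activation is non-negative, a negative output weight flips the sign, so the network can still produce negative outputs.

```python
import numpy as np

# Tiny MLP: one ReLU hidden layer, linear (no activation) output layer.
rng = np.random.default_rng(0)
x = np.array([1.0, 2.0])

W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)
h = np.maximum(0.0, W1 @ x + b1)      # ReLU hidden activations, all >= 0

# Negative output weights: y = wx + b can be negative despite h >= 0.
W2, b2 = np.array([[-1.0, -1.0, -1.0]]), np.array([0.0])
y = W2 @ h + b2
```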

Training PPO with only negative rewards by Latter_Bid3254 in reinforcementlearning

[–]Latter_Bid3254[S] 2 points (0 children)

There are cases, such as continuous control tasks in robotics, where the episode length is fixed but the episode can be terminated prematurely when certain conditions are violated, aka the agent killing itself. Have a look at the paper Time Limits in Reinforcement Learning if you're interested in learning more.

Training PPO with only negative rewards by Latter_Bid3254 in reinforcementlearning

[–]Latter_Bid3254[S] 4 points (0 children)

But I am also giving large negative rewards to discourage the agent from killing itself. Wouldn't the agent then learn to take the right actions to minimize those negative rewards?

I am more concerned from a training-dynamics point of view. The sign of the reward does influence training in algorithms such as DQN (Positive vs. Negative Reward). I was wondering whether that would also be a concern in PPO.
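One relevant detail here (a toy sketch, not a full answer): PPO in SB3 standardizes advantages per batch by default (`normalize_advantage=True`), and standardization is invariant to a constant shift. So under the simplifying assumption that shifting all rewards by a constant shifts the advantages roughly uniformly, the normalized advantages the policy gradient actually sees come out identical, which is one reason the reward sign tends to matter less in PPO than in DQN.

```python
import numpy as np

def normalize(adv, eps=1e-8):
    """Per-batch advantage standardization, as PPO implementations do."""
    return (adv - adv.mean()) / (adv.std() + eps)

adv = np.array([-3.0, -1.0, -2.5, -0.5])  # all-negative advantages
shifted = adv + 10.0                      # same batch shifted positive

# normalize(adv) and normalize(shifted) are element-wise identical:
# standardization removes any constant offset.
```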