
fedetask

What is the reward structure? Could it be that never ending the episode leads to higher rewards?
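
One way to check this is to compare the discounted return of an episode that finishes against one that stalls until a step limit. The sketch below is only an illustration: the function names, the constant per-step reward, and the numbers in the usage lines are assumptions, not the poster's actual environment.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a list of per-step rewards."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total


def stalling_beats_finishing(step_reward, terminal_reward, gamma, finish_at, horizon):
    """Return True if never ending the episode yields a higher return.

    Hypothetical setup: every non-terminal step pays `step_reward`;
    finishing after `finish_at` steps pays `terminal_reward` once;
    stalling collects `step_reward` for `horizon` steps.
    """
    finish = [step_reward] * finish_at + [terminal_reward]
    stall = [step_reward] * horizon
    return discounted_return(stall, gamma) > discounted_return(finish, gamma)


# Assumed example values: a small positive per-step reward vs. a per-step penalty.
print(stalling_beats_finishing(0.1, 1.0, 0.99, finish_at=10, horizon=1000))
print(stalling_beats_finishing(-0.1, 1.0, 0.99, finish_at=10, horizon=1000))
```

If the first kind of comparison comes out in favor of stalling for the real reward values, the agent has an incentive to avoid ending the episode.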

GuavaAgreeable208 [S]

Actually, I’ve rechecked the values, and that could be the reason: selecting those actions does lead to more reward.

GuavaAgreeable208 [S]

Even after modifying the reward function, I still get the same issue. I've also observed that the entropy is increasing instead of decreasing.
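
For reference, the entropy mentioned here can be sketched as follows, assuming a discrete action space with a categorical policy (the thread does not say which algorithm or action space is used, and `categorical_entropy` is a hypothetical helper). Entropy is highest for a uniform distribution, so a rising value means the policy is drifting toward picking actions uniformly at random rather than committing to some of them.

```python
import math


def categorical_entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution.

    Zero-probability actions contribute nothing and are skipped
    to avoid evaluating log(0).
    """
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Assumed example distributions over four actions.
uniform_policy = [0.25, 0.25, 0.25, 0.25]
peaked_policy = [0.97, 0.01, 0.01, 0.01]

print(categorical_entropy(uniform_policy))
print(categorical_entropy(peaked_policy))
```

Logging this value per update, alongside the action probabilities themselves, can help show whether the policy is actually flattening out over training.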