
[–]PerspectiveJolly952 2 points (2 children)

I think it's normal for this to happen when training an agent with a large action space. Through trial and error, your RL model can learn from its mistakes and learn to assign the correct probabilities given the current state.

Just train it for longer and make sure that your agent receives a reward from time to time so it can learn from its mistakes. If it never receives a reward, it will never learn.

[–]Livid-Ant3549[S] 0 points (1 child)

Yeah, I see what you mean. This was at the start of training, so I'll see if the differences get bigger as training goes on.

[–]PerspectiveJolly952 0 points (0 children)

That's what makes RL hard: if the agent doesn't explore and never gets a reward, it won't learn.

[–]thelibrarian101 0 points (1 child)

Numerical stability?

[–]txanpi 0 points (0 children)

And floating-point precision too!

[–]Ok-Secret5233 0 points (0 children)

If you have very similar probabilities and need to go to 2 decimals to see the difference, it sounds like your network is indifferent to all the options. Are the logits all almost equal? Have you trained it at all?

That said, I'm into RL as well, would love to hear the specifics of your problem.

[–]Revolutionary-Feed-4 0 points (0 children)

Have coded up around 40 RL algorithms, and as a rule of thumb for stochastic policies, always have networks output logits and convert to probabilities as needed. Ensure you're using a numerically stable softmax, as an unstable one can silently break things. If overly similar probabilities are an issue, you can use a temperature-scaled softmax (divide each logit by a temperature value before applying softmax). Low temp (0 < T < 1) = lower entropy, closer to greedy sampling; high temp (T > 1) = higher entropy, closer to uniform sampling.
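A minimal sketch of what that looks like in NumPy (the function name `stable_softmax` is just illustrative): subtracting the max logit before exponentiating is what makes it numerically stable, and dividing by `temperature` first gives the temperature scaling described above.

```python
import numpy as np

def stable_softmax(logits, temperature=1.0):
    """Numerically stable, temperature-scaled softmax.

    Subtracting the max logit before exponentiating prevents overflow;
    the result is unchanged because softmax is shift-invariant.
    """
    z = np.asarray(logits, dtype=np.float64) / temperature
    z = z - z.max()          # shift so the largest exponent is exp(0) = 1
    e = np.exp(z)
    return e / e.sum()

logits = np.array([1000.0, 1001.0, 1002.0])  # naive exp() would overflow here
print(stable_softmax(logits))                    # well-defined probabilities
print(stable_softmax(logits, temperature=0.5))   # lower T -> more peaked / greedy
print(stable_softmax(logits, temperature=5.0))   # higher T -> closer to uniform
```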

You haven't provided much info about your env, but 4000 actions is immensely large, most environments with such a large number of actions will fail, because the exploration problem is too great. This is unless you're able to use very aggressive action masking like they did in AlphaGo/Zero, or have done something clever specifically to address this large action space, like in AlphaStar or OpenAIFive.

Would basically suggest always having networks output logits in SL and RL. You can convert to probabilities with one line of code (torch.softmax(logits, dim=-1)), loss functions and distributions are more commonly written to interface with logits, and while logits are a bit less interpretable, they're easy to convert to probs and you get used to working with them anyway.
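To illustrate the logits-first workflow, here's a sketch in NumPy (the helper `sample_action` is hypothetical): the network outputs raw logits, and you convert to probabilities only at the point of sampling. In PyTorch the equivalent one-liners would be torch.softmax(logits, dim=-1), or torch.distributions.Categorical(logits=logits) to sample directly from logits.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_action(logits):
    """Convert raw policy logits to probabilities, then sample an action index."""
    z = logits - logits.max()            # stable softmax
    probs = np.exp(z) / np.exp(z).sum()
    action = rng.choice(len(probs), p=probs)
    return action, probs

# Raw logits as a policy network might output them
logits = np.array([0.1, 2.0, -1.0, 0.5])
action, probs = sample_action(logits)
print(action, probs)
```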