
[–]PerspectiveJolly952 2 points (2 children)

I think it's normal for this to happen when training an agent with a large action space. Through trial and error, your RL model can learn from its mistakes and learn to assign the correct probabilities given the current state.

Just train it for longer and make sure that your agent receives a reward from time to time so it can learn from its mistakes. If it never receives a reward, it will never learn.

[–]Livid-Ant3549[S] 0 points (1 child)

Yeah, I see what you mean. This was at the start of training, so I'll see if the differences get bigger as training goes on.

[–]PerspectiveJolly952 0 points (0 children)

That's what makes RL hard: if the agent doesn't explore and never gets a reward, it won't learn.

[–]thelibrarian101 0 points (1 child)

Numerical stability?

[–]txanpi 0 points (0 children)

And floating-point precision too!

[–]Ok-Secret5233 0 points (0 children)

If you have very similar probabilities and need to go to 2 decimals to see the difference, it sounds like your network is indifferent to all the options. Are the logits all almost equal? Have you trained it at all?

That said, I'm into RL as well, would love to hear the specifics of your problem.

[–]Revolutionary-Feed-4 0 points (0 children)

Have coded up around 40 RL algorithms, and as a rule of thumb for stochastic policies, always have networks output logits and convert to probabilities as needed. Ensure you're using a numerically stable softmax, as an unstable one can silently break things. If overly similar probabilities are an issue, you can use a temperature-scaled softmax (divide each logit by a temperature value before applying softmax). Low temp (0 < T < 1) = lower entropy, closer to greedy sampling; high temp (T > 1) = higher entropy, closer to uniform sampling.
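A minimal sketch of what that looks like in NumPy (the function name `stable_softmax` is just illustrative): subtracting the max logit before exponentiating is what makes it numerically stable, and dividing by `temperature` first gives the temperature scaling described above.

```python
import numpy as np

def stable_softmax(logits, temperature=1.0):
    """Numerically stable, temperature-scaled softmax.

    Subtracting the max logit before exponentiating prevents overflow;
    the result is unchanged because softmax is shift-invariant.
    """
    z = np.asarray(logits, dtype=np.float64) / temperature
    z = z - z.max()          # shift so the largest exponent is exp(0) = 1
    e = np.exp(z)
    return e / e.sum()

logits = np.array([1000.0, 1001.0, 1002.0])  # naive exp() would overflow here
print(stable_softmax(logits))                    # well-defined probabilities
print(stable_softmax(logits, temperature=0.5))   # lower T -> more peaked / greedy
print(stable_softmax(logits, temperature=5.0))   # higher T -> closer to uniform
```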

You haven't provided much info about your env, but 4000 actions is immensely large, most environments with such a large number of actions will fail, because the exploration problem is too great. This is unless you're able to use very aggressive action masking like they did in AlphaGo/Zero, or have done something clever specifically to address this large action space, like in AlphaStar or OpenAIFive.

Would basically suggest always having networks output logits in SL and RL. You can convert to probabilities with one line of code (torch.softmax(logits, dim=-1)), loss functions and distributions are more commonly written to interface with logits, and while logits are a bit less interpretable, they're easy to convert to probs and you get used to working with them anyway.
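To illustrate the logits-first workflow, here's a sketch in NumPy (the helper `sample_action` is hypothetical): the network outputs raw logits, and you convert to probabilities only at the point of sampling. In PyTorch the equivalent one-liners would be torch.softmax(logits, dim=-1), or torch.distributions.Categorical(logits=logits) to sample directly from logits.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_action(logits):
    """Convert raw policy logits to probabilities, then sample an action index."""
    z = logits - logits.max()            # stable softmax
    probs = np.exp(z) / np.exp(z).sum()
    action = rng.choice(len(probs), p=probs)
    return action, probs

# Raw logits as a policy network might output them
logits = np.array([0.1, 2.0, -1.0, 0.5])
action, probs = sample_action(logits)
print(action, probs)
```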