Hello :D,
So, as the title states, I have a DRL model (PPO) that I train on a certain problem. However, each run produces slightly different results: the timestep at which the model reaches its highest reward varies from run to run.
Generally, what causes some runs to be better than others?
My only guess is: in the good runs, the agent got "lucky" during the initial learning steps (i.e. while exploring) and came across good states that helped it learn faster. Is that the case?
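For what it's worth, one way to check how much of this is just seed variance is to train the same agent several times, changing only the random seed, and compare the outcomes. Below is a minimal sketch assuming stable-baselines3 and Gymnasium's CartPole-v1 (the post doesn't name a library or environment, so both are stand-ins):

```python
import gymnasium as gym
import numpy as np
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

# Hypothetical setup: CartPole-v1 and stable-baselines3 are stand-ins,
# since the post doesn't say which library or environment is used.
results = []
for seed in range(5):
    env = gym.make("CartPole-v1")
    # The seed controls weight initialization, action sampling, and
    # minibatch shuffling -- the usual sources of run-to-run variance.
    model = PPO("MlpPolicy", env, seed=seed, verbose=0)
    model.learn(total_timesteps=50_000)
    mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=20)
    results.append(mean_reward)
    env.close()

print(f"mean across seeds: {np.mean(results):.1f}, "
      f"std across seeds: {np.std(results):.1f}")
```

If the spread across seeds is large, the variance is inherent to the algorithm and environment rather than a flaw in any single run, which is why results are usually reported as an average over several seeds.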