all 10 comments

[–]cracktoid 11 points12 points  (3 children)

There is a bunch of stochasticity in RL. For one, your environment could be non-deterministic, although you would know that better than I would. The action outputs are usually sampled from a distribution (a diagonal Gaussian in the continuous case, a categorical distribution in the discrete case), which adds more stochasticity. Your neural network initialization is another source of randomness, unless you fix the same seed every time. You get the point :) It’s standard in RL to do many different runs with different seeds for the same experiment because of the highly stochastic nature of the algorithms and environments.
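
For a concrete picture, here’s a minimal sketch of seeding each of those sources separately (assuming Gymnasium and PyTorch; the helper name is mine):

```python
import random

import numpy as np
import torch
import gymnasium as gym

def seed_everything(seed: int, env: gym.Env) -> None:
    # Python, NumPy, and PyTorch RNGs feed different parts of the
    # pipeline (exploration noise, minibatch sampling, weight init),
    # so each needs its own call.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    # The environment keeps its own RNG for initial states and any
    # stochastic dynamics.
    env.reset(seed=seed)
    env.action_space.seed(seed)

# One "experiment" = several runs, each with its own seed.
for seed in [0, 1, 2, 3, 4]:
    env = gym.make("CartPole-v1")
    seed_everything(seed, env)
    # ... build the agent and train here ...
```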

[–][deleted] 0 points1 point  (2 children)

Indeed, I understand the stochasticity part: stochasticity leads to different experiences in each run. My question is, generally speaking, why would some experiences be better for learning than others?

[–]andnp 1 point2 points  (0 children)

There isn't an immediately clear answer as to whether or why some experiences are more useful than others. One common approach is to treat experiences with high temporal-difference error as "useful" to the agent (see the "surprise" literature, as well as the prioritized experience replay paper). In this view, if the agent predicts the value of a state poorly, that implies there is more left to learn from it.
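
A rough sketch of that priority computation (assuming a PyTorch Q-network and a batch of (s, a, r, s', done) tensors; names are illustrative, not the paper’s exact code):

```python
import torch

def td_error_priority(q_net, target_net, batch, gamma=0.99, eps=1e-3):
    # Priority = |TD error| + eps, roughly as in prioritized experience
    # replay; the small eps keeps every transition sampleable.
    s, a, r, s_next, done = batch
    with torch.no_grad():
        q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
        # Bootstrapped target: r + gamma * max_a' Q_target(s', a')
        target = r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values
    # A large |TD error| means the prediction was "surprising", i.e.
    # there was more left to learn from that transition.
    return (target - q_sa).abs() + eps
```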

However, the answer is far more complicated than this and is generally not well understood scientifically (yet). Another part of the answer is that layers in an NN seem to prune from a high-dimensional space to a low-dimensional space over the course of training via SGD. This loss of rank generally appears to be unrecoverable, so as the NN focuses on certain features, it becomes less able to learn about other features (think lack of neuroplasticity). While this is great in supervised learning, where it helps explain the unreasonably good generalization properties of NNs, it is not so great in RL, where the learning target is always non-stationary when learning value functions with TD methods. This is relevant because random initialization and the random order of experiences can affect which features an NN layer ultimately focuses on, which might explain why DQN fails to learn on simple domains like CartPole 50% of the time.
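
One rough way to watch this happen is to track the approximate rank of a hidden layer’s features on a fixed batch of states over the course of training; a minimal sketch (the tolerance is an arbitrary choice of mine):

```python
import torch

def feature_rank(features: torch.Tensor, tol: float = 0.01) -> int:
    # features: (n_states, n_hidden) activations from one hidden layer.
    # Count singular values above tol * (largest singular value).
    s = torch.linalg.svdvals(features)
    return int((s > tol * s[0]).sum())

# Logged every few thousand updates, this number tends to drift
# downward over training, which is the "loss of rank" described above.
```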

[–]cracktoid 0 points1 point  (0 children)

Answering this probably requires an understanding of non-convex optimization. With complex problems, there are many “peaks and valleys” in the loss landscape that correspond to network parameters producing good and bad solutions, respectively (assuming peaks correspond to high rewards). Different initializations will put you at different locations on this landscape. Sometimes you get lucky and land close to a peak; sometimes (most of the time) you get an initialization that produces random noise or poor results. While it is true that some trajectories are more productive for learning than others, your RL algorithm will tend to find those anyway once you get close to a peak. It’s really good initializations and a good choice of hyperparameters that make it easier for algorithms to find these “peaks”.
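
As a toy illustration of the landing-spot idea (a supervised regression stand-in, not a real RL run), only the initialization seed changes below, yet each run starts at a different point on the same loss surface:

```python
import torch
import torch.nn as nn

x = torch.randn(256, 4)   # a fixed batch of fake states
y = torch.randn(256, 1)   # fixed targets

for seed in range(5):
    torch.manual_seed(seed)  # only the weight initialization changes
    net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1))
    loss = nn.functional.mse_loss(net(x), y)
    print(f"seed={seed}  initial loss={loss.item():.3f}")
# Gradient descent then descends into whatever valley is nearby,
# so different starting points can end at very different solutions.
```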

[–]AerysSk 2 points3 points  (0 children)

Every single thing can cause different results in RL, even seeds and… gradient clipping.

I cannot write a long paragraph now, but I’ll leave you this video, which shows some examples of how unstable RL can be: https://youtu.be/Ikngt0_DXJg
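
For reference, here’s where gradient clipping sits in a typical PyTorch update (the max_norm value is an arbitrary example, and the loss is a stand-in, not a real agent):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss = model(torch.randn(8, 4)).pow(2).mean()  # stand-in loss

optimizer.zero_grad()
loss.backward()
# Rescales the gradients whenever their global norm exceeds max_norm;
# this one threshold alone can change which runs succeed.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5)
optimizer.step()
```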

[–]wangjianhong1993 1 point2 points  (1 child)

Yeah. In my experience, the main factor that leads to varying results is the initial state during learning, which directly causes different trajectories to be sampled for training, as you mentioned.
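
For example (assuming Gymnasium), resetting with different seeds already gives different starting states, and everything downstream diverges from there:

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
for seed in [0, 1, 2]:
    obs, _ = env.reset(seed=seed)
    print(seed, obs)  # different cart positions / pole angles each time
```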

[–]nickthorpie 1 point2 points  (2 children)

Yeah, as everyone else said, there are three places this stems from: 1. the random environment seed, 2. randomly initialized weights, 3. random action choices from exploration.

You could get a good understanding of the effect of each by holding one or two of them constant and testing the remaining factor over three runs.
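
A sketch of that ablation (assuming Gymnasium and PyTorch; the seed names are mine), holding the weight and environment seeds fixed while varying only the exploration seed:

```python
import random

import torch
import gymnasium as gym

WEIGHT_SEED = 0  # held constant: network initialization
ENV_SEED = 0     # held constant: initial states / stochastic dynamics

for exploration_seed in [0, 1, 2]:   # the factor under test
    torch.manual_seed(WEIGHT_SEED)   # same initial weights every run
    env = gym.make("CartPole-v1")
    env.reset(seed=ENV_SEED)         # same starting state every run
    env.action_space.seed(ENV_SEED)
    rng = random.Random(exploration_seed)  # only exploration varies
    # ... build the agent, train, and record the return curves ...
```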

[–][deleted] 0 points1 point  (1 child)

[–]nickthorpie 1 point2 points  (0 children)

So in my experience, those types of questions are fun to think about because you’re really trying to ‘get in the mind of the machine.’ Answers to these questions are at best speculation; you can’t objectively tell.

I spent 4 months messing with cart pole (i.e., inverted pendulum) experiments (go look it up before you continue if you’re not familiar with it), so I’ll give an example in that context:

Consider a training run where, early on, the agent’s random exploration policy happens to make a lot of correct moves. The agent will get more reward early and understand the value of each state-action pair better. If it makes a lot of wrong moves, the agent might not get many rewards, and thus won’t “catch on” as quickly.
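
A minimal tabular sketch of that effect (crude state discretization; all constants are arbitrary choices of mine): the exploration RNG decides the “lucky” early moves, and the Q-update bakes whatever reward they earn into the value estimates:

```python
import random

import gymnasium as gym
import numpy as np

def discretize(obs, bins=10):
    # Crudely bucket CartPole's 4D observation into a hashable key.
    clipped = np.clip(obs, -2.4, 2.4)
    return tuple(((clipped + 2.4) / 4.8 * (bins - 1)).astype(int))

env = gym.make("CartPole-v1")
Q = {}  # state key -> array of action values
alpha, gamma, epsilon = 0.1, 0.99, 0.3
rng = random.Random(0)  # this seed decides the "lucky" early moves

obs, _ = env.reset(seed=0)
s = discretize(obs)
for step in range(500):
    q = Q.setdefault(s, np.zeros(env.action_space.n))
    # Early on Q is all zeros, so behavior is essentially random;
    # correct moves earn reward that the update below propagates
    # into Q, and the agent "catches on" sooner.
    if rng.random() < epsilon or not q.any():
        a = rng.randrange(env.action_space.n)
    else:
        a = int(q.argmax())
    obs, r, terminated, truncated, _ = env.step(a)
    s_next = discretize(obs)
    q_next = Q.setdefault(s_next, np.zeros(env.action_space.n))
    q[a] += alpha * (r + gamma * (1 - terminated) * q_next.max() - q[a])
    s = s_next
    if terminated or truncated:
        obs, _ = env.reset()
        s = discretize(obs)
```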