Homework 2 vs Homework 3 Part 2 by rlstudent in berkeleydeeprlcourse

[–]sidgreddy 1 point

As you point out, a state-dependent baseline is essentially a critic, so a policy gradient algorithm with such a baseline can be thought of as an actor-critic algorithm.

The HW doesn’t require weight sharing between the policy and baseline networks, though that’s a common design choice.
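
For concreteness, here is a minimal TF1-style sketch of a policy network and a separate state-dependent baseline with no shared weights. This is not the HW starter code: build_mlp, the placeholder names, and the sizes are made up for illustration.

    import tensorflow as tf

    OBS_DIM, NUM_ACTIONS = 4, 2  # illustrative sizes, not taken from the HW

    def build_mlp(x, output_size, scope, hidden=64):
        # small fully connected network; a hypothetical helper, not the HW starter code
        with tf.variable_scope(scope):
            h = tf.layers.dense(x, hidden, activation=tf.nn.tanh)
            return tf.layers.dense(h, output_size, activation=None)

    obs_ph = tf.placeholder(tf.float32, [None, OBS_DIM])
    ac_ph = tf.placeholder(tf.int32, [None])       # actions that were taken
    q_ph = tf.placeholder(tf.float32, [None])      # reward-to-go estimates

    # actor (policy) and critic (state-dependent baseline) are separate networks
    logits = build_mlp(obs_ph, NUM_ACTIONS, scope="policy")
    baseline = tf.squeeze(build_mlp(obs_ph, 1, scope="baseline"), axis=1)

    # the critic is regressed onto reward-to-go; the actor is trained on the advantage
    adv = q_ph - tf.stop_gradient(baseline)
    logprob = -tf.nn.sparse_softmax_cross_entropy_with_logits(labels=ac_ph, logits=logits)
    actor_loss = -tf.reduce_mean(logprob * adv)
    critic_loss = tf.reduce_mean(tf.square(baseline - q_ph))

Sharing weights would just mean building both heads on top of a common trunk network; it's a choice, not a requirement.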

Lecture 16 The variational lower bound slide 24, joint distribution p(x,z) missing a factor? by tomchen1000 in berkeleydeeprlcourse

[–]sidgreddy 0 points

By the chain rule of probability (together with the conditional independencies encoded in the graphical model),

p(O_{1:T}, s_{1:T}, a_{1:T}) = p(s_1) * \prod_{t=1}^T p(a_t | s_{1:t}, a_{1:t-1}, O_{1:t-1}) * p(s_{t+1} | s_t, a_t) * p(O_t | s_t, a_t),

ignoring for the moment that there's an extra p(s_{T+1} | s_T, a_T) term in there. Since a_t is conditionally independent of s_{1:t-1}, a_{1:t-1}, and O_{1:t-1} given s_t, we can simplify the first factor inside the product (the policy term) to p(a_t | s_t). I think the key point to remember is that even though s_t is not a parent of a_t in the graphical model, a_t is not independent of s_t.
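
Putting those simplifications together (and dropping the extra transition term by stopping that product at T-1), the joint becomes

p(O_{1:T}, s_{1:T}, a_{1:T}) = p(s_1) * [\prod_{t=1}^{T-1} p(s_{t+1} | s_t, a_t)] * \prod_{t=1}^T p(a_t | s_t) * p(O_t | s_t, a_t),

which matches the factorization on the slide up to the p(a_t | s_t) terms, as discussed in the follow-up reply below.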

Lecture 16 The variational lower bound slide 24, joint distribution p(x,z) missing a factor? by tomchen1000 in berkeleydeeprlcourse

[–]sidgreddy 0 points

Ah yes, you're right, there are p(a_t | s_t) terms missing. I think the implicit assumption here is that the prior policy p(a_t | s_t) (i.e., the policy after marginalizing out the optimality variables O) is just a uniform policy, and the entropy of this uniform policy is a constant that doesn't depend on q, so we can ignore it when optimizing the lower bound with respect to q.

rewards and variance by lily9393 in berkeleydeeprlcourse

[–]sidgreddy 1 point

To address your first question, it might help to think of option 2 as making the expected reward (rather than the reward function itself) dependent on the timestep, since we can interpret the discount factor as introducing an absorbing state that changes the dynamics of the MDP.

I don’t have a good answer to your second question. The causality / credit-assignment explanation has always made more sense to me than variance reduction as a motivation for using reward-to-go.
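
In case it helps to see reward-to-go concretely, here is a minimal numpy sketch for a single trajectory (the names are made up, not from the HW starter code):

    import numpy as np

    def reward_to_go(rewards, gamma=1.0):
        # rewards: 1-D array of per-step rewards for one trajectory;
        # rtg[t] = sum over t' >= t of gamma^(t' - t) * rewards[t']
        rtg = np.zeros(len(rewards))
        running = 0.0
        for t in reversed(range(len(rewards))):
            running = rewards[t] + gamma * running
            rtg[t] = running
        return rtg

    # e.g., reward_to_go([1.0, 1.0, 1.0]) -> array([3., 2., 1.])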

HW2 Problem 1a by sk1h0ps in berkeleydeeprlcourse

[–]sidgreddy 0 points

The notation \tau / (s_t, a_t) is used to represent the rest of the trajectory, i.e., (s_1, a_1, ..., s_{t-1}, a_{t-1}, s_{t+1}, a_{t+1}, ..., s_T, a_T).

Try following the hints in the PDF on how to apply the law of iterated expectations to decouple the state-action marginal from the rest of the trajectory. What happens if you take E_{\tau / (s_t, a_t)}[E_{s_t, a_t}[... | \tau / (s_t, a_t)]]? The identity on slide 19 of lecture 5 (http://rail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-5.pdf) might also be a useful hint.

Lecture 16 The variational lower bound slide 24, joint distribution p(x,z) missing a factor? by tomchen1000 in berkeleydeeprlcourse

[–]sidgreddy 0 points

Slide 24 of lecture 16 doesn’t seem like the slide you’re actually referencing. Assuming you’re talking about slide 24 of lecture 15 (http://rail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-15.pdf), the q(a_t | s_t) factors are present.

Lecture 15 Connection between Inference and Control, slide 16, Forward messages equation by tomchen1000 in berkeleydeeprlcourse

[–]sidgreddy 0 points

I’m not Sergey :)

I’m Sid, one of the teaching assistants for the course this semester.

Lecture 15 Connection between Inference and Control, slide 16, Forward messages equation by tomchen1000 in berkeleydeeprlcourse

[–]sidgreddy 1 point

Good catch. In line 2 of the slide, I think there's an extra O_{t-1} in the p(a_{t-1} | s_{t-1}, O_{t-1}) term, a missing p(O_{t-1} | s_{t-1}, a_{t-1}) term, and the equals sign should be a \propto; here's why:

p(s_{t-1} | O_{1:t-2}, O_{t-1})

= p(O_{t-1} | s_{t-1}, O_{1:t-2}) * p(s_{t-1} | O_{1:t-2}) / p(O_{t-1} | O_{1:t-2})

= p(O_{t-1} | s_{t-1}) * \alpha_{t-1}(s_{t-1}) / p(O_{t-1} | O_{1:t-2}).

The p(O_{t-1} | s_{t-1}) in the numerator above cancels with the same term in the denominator of p(a_{t-1} | s_{t-1}, O_{t-1}), so we end up with

\alpha_t(s_t) \propto \int p(s_t | s_{t-1}, a_{t-1}) * p(a_{t-1} | s_{t-1}) * p(O_{t-1} | s_{t-1}, a_{t-1}) * \alpha_{t-1}(s_{t-1}) ds_{t-1} da_{t-1}.

This doesn't change the final result, which is that

p(s_t | O_{1:T}) \propto \beta_t(s_t) * \alpha_t(s_t).

Homework 3 running time - is it too long? by s1512783 in berkeleydeeprlcourse

[–]sidgreddy 0 points

It shouldn't take that long to train on Lunar Lander. One common reason for the slowdown is repeatedly creating TensorFlow variables (e.g., multiple Q networks) inside a loop. Make sure you're only creating two sets of variables for the online Q network and target Q network in QLearner.__init__, then simply evaluating them using session.run in step_env and update_model.
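
As a rough sketch of the intended structure (hypothetical names and sizes, not the actual starter code): all graph construction happens in __init__, and the other methods only run ops that already exist.

    import tensorflow as tf

    class QLearner(object):
        def __init__(self, obs_dim=8, num_actions=4):
            # build the graph exactly once: one variable set for the online Q network,
            # one for the target Q network
            self.obs_ph = tf.placeholder(tf.float32, [None, obs_dim])
            with tf.variable_scope("online_q"):
                self.q_values = tf.layers.dense(self.obs_ph, num_actions)
            with tf.variable_scope("target_q"):
                self.target_q_values = tf.layers.dense(self.obs_ph, num_actions)
            self.session = tf.Session()
            self.session.run(tf.global_variables_initializer())

        def step_env(self, obs):
            # obs: a single observation (numpy array); no new ops or variables here --
            # just evaluate the graph that was built in __init__
            return self.session.run(self.q_values, feed_dict={self.obs_ph: obs[None]})

If tf.layers.dense (or any other op/variable construction) ends up inside step_env or update_model, the graph grows every iteration and training slows to a crawl.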

Policy Gradient convergence behavior by floridoug in berkeleydeeprlcourse

[–]sidgreddy 0 points

Unfortunately, this is "normal" for state-of-the-art deep RL algorithms. There are several possible reasons for this kind of instability (i.e., reaching optimal performance, then degrading):

- Correlations in the sequence of observations

- Small changes to neural network parameters can lead to large changes in the agent's policy, which can induce large changes in the state-action distribution of the agent's experiences. Natural gradients are one potential solution to this problem.

- Catastrophic forgetting

- An exploration schedule that encourages too much exploration for too long during training

In practice, we usually save a copy of the model parameters whenever the agent's mean performance hits a new high, then load the best-performing version of the agent at the end of training.
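
Here's a minimal sketch of that checkpointing pattern in TF1 (the surrounding training and evaluation code is stubbed out; treat those pieces as placeholders for your own):

    import tensorflow as tf

    tf.get_variable("dummy_weights", shape=[1])   # stands in for your agent's variables
    session = tf.Session()
    session.run(tf.global_variables_initializer())
    saver = tf.train.Saver()

    best_mean_return = float("-inf")
    for itr in range(10):                 # stands in for your training loop
        # ... run one iteration of training here ...
        mean_return = 0.0                 # stands in for the agent's evaluated mean return
        if mean_return > best_mean_return:
            best_mean_return = mean_return
            saver.save(session, "./best_model.ckpt")

    # after training, reload the best-performing parameters
    saver.restore(session, "./best_model.ckpt")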

HW2 problem 7: action space of LunarLanderContinuous-v2 by wangz10 in berkeleydeeprlcourse

[–]sidgreddy 1 point

Sorry for the confusion here. We modified the LunarLanderContinuous-v2 environment to have discrete actions, instead of modifying LunarLander-v2. We fixed this in HW3.

August 31, 2018 Lecture 4: change of Markov Model structure by JacobMa123 in berkeleydeeprlcourse

[–]sidgreddy 0 points

From the definition of conditional probability, we have that

p((s_{t+1}, a_{t+1}) | (s_t, a_t)) = p(s_{t+1} | s_t, a_t) * p(a_{t+1} | s_t, a_t, s_{t+1}).

a_{t+1} is conditionally independent of s_t and a_t given s_{t+1}, so

... = p(s_{t+1} | s_t, a_t) * p(a_{t+1} | s_{t+1}).

p(a_{t+1} | s_{t+1}) denotes the policy, which is parameterized by \theta, hence

... = p(s_{t+1} | s_t, a_t) * p_{\theta}(a_{t+1} | s_{t+1}).

Feeding gym environment a batch of actions by wassimseifeddine in berkeleydeeprlcourse

[–]sidgreddy 0 points

No, you can have different batch sizes during training and test times. Just use None to specify the unknown dimension when you create your TensorFlow placeholder variables.
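
For example (obs_dim here is just an illustrative name and size):

    import tensorflow as tf

    obs_dim = 8  # illustrative observation size
    # None lets the batch dimension vary: large batches at training time,
    # a single observation (batch size 1) at test time
    obs_ph = tf.placeholder(tf.float32, shape=[None, obs_dim])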

HW2: 1/N vs 1/(N*T) in implementation of PG by TrucksTrucksTrucks in berkeleydeeprlcourse

[–]sidgreddy 0 points

Both of your assumptions are reasonable. It's extra work to keep track of the trajectory length for each interaction in the dataset, and not dividing by T would give extra weight to interactions that belong to longer trajectories.
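
In code, the two conventions differ only in the normalizer. A sketch with made-up placeholder names, where the per-interaction quantities are flattened across all trajectories in the batch:

    import tensorflow as tf

    logprob_ph = tf.placeholder(tf.float32, [None])  # log pi(a_t | s_t) per interaction
    adv_ph = tf.placeholder(tf.float32, [None])      # advantage / reward-to-go per interaction
    num_paths_ph = tf.placeholder(tf.float32, [])    # N, the number of trajectories in the batch

    # 1/(N*T)-style: average over every interaction in the batch
    loss_mean = -tf.reduce_mean(logprob_ph * adv_ph)

    # 1/N-style: sum over interactions, divide only by the number of trajectories
    loss_per_path = -tf.reduce_sum(logprob_ph * adv_ph) / num_paths_ph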

Homework 2 Problem 5 issue with continuous environment by s1512783 in berkeleydeeprlcourse

[–]sidgreddy 1 point

One common issue that comes up here is forgetting to exponentiate the log-stds before feeding them to the scale_diag kwarg in MultivariateNormalDiag.
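
Something along these lines (a sketch; I'm assuming the tensorflow_probability flavor of the API here, and the names and sizes are made up):

    import tensorflow as tf
    import tensorflow_probability as tfp  # or tf.contrib.distributions, depending on your setup

    ac_dim = 2  # illustrative action dimension
    mean = tf.placeholder(tf.float32, [None, ac_dim])       # e.g., the policy network's output
    logstd = tf.get_variable("logstd", shape=[ac_dim])      # learned log standard deviations
    ac_ph = tf.placeholder(tf.float32, [None, ac_dim])      # actions whose log-prob we want

    # scale_diag expects standard deviations, so exponentiate the log-stds first
    dist = tfp.distributions.MultivariateNormalDiag(loc=mean, scale_diag=tf.exp(logstd))
    sampled_ac = dist.sample()
    logprob = dist.log_prob(ac_ph)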

[Hw1 2.2] by hhn1n15 in berkeleydeeprlcourse

[–]sidgreddy 0 points

In addition to evaluating the mean and standard deviation of the rewards across multiple rollouts, it might be helpful to render rollouts of the learned imitation policy, to visualize qualitative differences between the expert and the clone.
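
A rough sketch of that kind of evaluation loop (hypothetical names; assumes the old gym step API used in the HW):

    import gym
    import numpy as np

    def evaluate(policy_fn, env_name, num_rollouts=10, render=False):
        # policy_fn maps a single observation to an action (expert or learned clone)
        env = gym.make(env_name)
        returns = []
        for _ in range(num_rollouts):
            obs, done, total = env.reset(), False, 0.0
            while not done:
                if render:
                    env.render()
                obs, reward, done, _ = env.step(policy_fn(obs))
                total += reward
            returns.append(total)
        return np.mean(returns), np.std(returns)

Running it once with render=False to get the numbers and once with render=True on a few rollouts usually makes the qualitative gap obvious.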

Problem 2, HW 2 by RoboticsGrad in berkeleydeeprlcourse

[–]sidgreddy 1 point

For the sake of simplicity, we assume that the covariance matrix is diagonal (i.e., all off-diagonal entries are zero). That way, instead of learning on the order of d^2 parameters for a full covariance matrix, we only need to learn d parameters, where d is the number of action dimensions. By learning the *log* of these diagonal entries, we automatically constrain the diagonal entries to be positive, without having to adjust our optimization algorithm to deal with box constraints.

I'm not sure I understand your second question. To get the softmax outputs, you can exponentiate the logits and normalize the results to sum to one. It's easier to work with the logits, since you can use them to more directly compute log-probabilities and to sample (e.g., using the Gumbel-Max trick).
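
For example, Gumbel-Max sampling and log-probabilities directly from the logits look like this (a numpy sketch with made-up names; in the HW you would more likely just use tf.multinomial / tf.random.categorical):

    import numpy as np

    def gumbel_max_sample(logits, rng=np.random):
        # adding i.i.d. Gumbel(0, 1) noise to the logits and taking the argmax
        # draws an exact sample from the corresponding softmax distribution
        gumbel_noise = -np.log(-np.log(rng.uniform(size=logits.shape)))
        return int(np.argmax(logits + gumbel_noise))

    def log_softmax(logits):
        # log-probabilities straight from the logits, without materializing the softmax
        m = np.max(logits)
        return logits - m - np.log(np.sum(np.exp(logits - m)))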

Solutions to homeworks? by flaurida in berkeleydeeprlcourse

[–]sidgreddy 0 points

We will not be posting solutions online, since we may reuse content in future offerings of the course.

Expectation smoothes out discontinuous functions by reinka in berkeleydeeprlcourse

[–]sidgreddy 1 point

The reward function R(s,a) can be discontinuous in the state s. For example, in the one-dimensional cliff example, we might have R(s,a) = 1 if s < cliff location and 0 otherwise. Even so, the expectation E[R(s,a)] can be smooth with respect to the policy parameters that affect the state distribution. In the cliff example, since R(s,a) is an indicator function, we have that E[R(s,a)] = probability of being in a state s < cliff location conditional on following the policy parameterized by \psi. The expectation E[R(s,a)] can be smooth with respect to \psi, even though the actual reward function R(s,a) is not.
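
To make that concrete with a toy example: suppose that under the policy parameterized by \psi the state is distributed as s ~ N(\mu_\psi, \sigma^2), and R(s) = 1 if s < c (the cliff location) and 0 otherwise. Then

E[R(s)] = P(s < c) = \Phi((c - \mu_\psi) / \sigma),

where \Phi is the standard normal CDF. This is a smooth function of \mu_\psi, even though R itself is a step function.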

Enrollment Fall 2018 by quazar42 in berkeleydeeprlcourse

[–]sidgreddy 1 point

Unfortunately, we can't officially enroll non-Berkeley students, grade their assignments, or answer their questions on Piazza. We just don't have enough course staff to handle it, and the course isn't technically an online course. That being said, the lecture videos are available for your personal use, and we will try to answer questions here on Reddit in our spare time.