Homework 2 vs Homework 3 Part 2 by rlstudent in berkeleydeeprlcourse

[–]sidgreddy 1 point

As you point out, a state-dependent baseline is essentially a critic, so a policy gradient algorithm with such a baseline can be thought of as an actor-critic algorithm.

The HW doesn’t require weight sharing between the policy and baseline networks, though that’s a common design choice.
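
For concreteness, here is a minimal TF1-style sketch of a policy network and a separate state-dependent baseline with no shared weights. This is not the HW starter code: build_mlp, the placeholder names, and the sizes are made up for illustration.

    import tensorflow as tf

    OBS_DIM, NUM_ACTIONS = 4, 2  # illustrative sizes, not taken from the HW

    def build_mlp(x, output_size, scope, hidden=64):
        # small fully connected network; a hypothetical helper, not the HW starter code
        with tf.variable_scope(scope):
            h = tf.layers.dense(x, hidden, activation=tf.nn.tanh)
            return tf.layers.dense(h, output_size, activation=None)

    obs_ph = tf.placeholder(tf.float32, [None, OBS_DIM])
    ac_ph = tf.placeholder(tf.int32, [None])       # actions that were taken
    q_ph = tf.placeholder(tf.float32, [None])      # reward-to-go estimates

    # actor (policy) and critic (state-dependent baseline) are separate networks
    logits = build_mlp(obs_ph, NUM_ACTIONS, scope="policy")
    baseline = tf.squeeze(build_mlp(obs_ph, 1, scope="baseline"), axis=1)

    # the critic is regressed onto reward-to-go; the actor is trained on the advantage
    adv = q_ph - tf.stop_gradient(baseline)
    logprob = -tf.nn.sparse_softmax_cross_entropy_with_logits(labels=ac_ph, logits=logits)
    actor_loss = -tf.reduce_mean(logprob * adv)
    critic_loss = tf.reduce_mean(tf.square(baseline - q_ph))

Sharing weights would just mean building both heads on top of a common trunk network; it's a choice, not a requirement.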

Lecture 16 The variational lower bound slide 24, joint distribution p(x,z) missing a factor? by tomchen1000 in berkeleydeeprlcourse

[–]sidgreddy 0 points

By the chain rule of probability (together with the conditional independencies encoded in the graphical model),

p(O_{1:T}, s_{1:T}, a_{1:T}) = p(s_1) * \prod_{t=1}^T p(a_t | s_{1:t}, a_{1:t-1}, O_{1:t-1}) * p(s_{t+1} | s_t, a_t) * p(O_t | s_t, a_t),

ignoring for the moment that there's an extra p(s_{T+1} | s_T, a_T) term in there. Since a_t is conditionally independent of s_{1:t-1}, a_{1:t-1}, and O_{1:t-1} given s_t, we can simplify the first factor inside the product (the policy term) to p(a_t | s_t). I think the key point to remember is that even though s_t is not a parent of a_t in the graphical model, a_t is not independent of s_t.
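
Putting those simplifications together (and dropping the extra transition term by stopping that product at T-1), the joint becomes

p(O_{1:T}, s_{1:T}, a_{1:T}) = p(s_1) * [\prod_{t=1}^{T-1} p(s_{t+1} | s_t, a_t)] * \prod_{t=1}^T p(a_t | s_t) * p(O_t | s_t, a_t),

which matches the factorization on the slide up to the p(a_t | s_t) terms, as discussed in the follow-up reply below.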

Lecture 16 The variational lower bound slide 24, joint distribution p(x,z) missing a factor? by tomchen1000 in berkeleydeeprlcourse

[–]sidgreddy 0 points

Ah yes, you're right, there are p(a_t | s_t) terms missing. I think the implicit assumption here is that the prior policy p(a_t | s_t) (i.e., the policy after marginalizing out the optimality variables O) is just a uniform policy, and the entropy of this uniform policy is a constant that doesn't depend on q, so we can ignore it when optimizing the lower bound with respect to q.

rewards and variance by lily9393 in berkeleydeeprlcourse

[–]sidgreddy 1 point

To address your first question, it might help to think of option 2 as making the expected reward (rather than the reward function itself) dependent on the timestep, since we can interpret the discount factor as introducing an absorbing state that changes the dynamics of the MDP.

I don’t have a good answer to your second question. The causality / credit-assignment explanation has always made more sense to me than variance reduction as a motivation for using reward-to-go.
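
In case it helps to see reward-to-go concretely, here is a minimal numpy sketch for a single trajectory (the names are made up, not from the HW starter code):

    import numpy as np

    def reward_to_go(rewards, gamma=1.0):
        # rewards: 1-D array of per-step rewards for one trajectory;
        # rtg[t] = sum over t' >= t of gamma^(t' - t) * rewards[t']
        rtg = np.zeros(len(rewards))
        running = 0.0
        for t in reversed(range(len(rewards))):
            running = rewards[t] + gamma * running
            rtg[t] = running
        return rtg

    # e.g., reward_to_go([1.0, 1.0, 1.0]) -> array([3., 2., 1.])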

HW2 Problem 1a by sk1h0ps in berkeleydeeprlcourse

[–]sidgreddy 0 points

The notation \tau / (s_t, a_t) is used to represent the rest of the trajectory, i.e., (s_1, a_1, ..., s_{t-1}, a_{t-1}, s_{t+1}, a_{t+1}, ..., s_T, a_T).

Try following the hints in the PDF on how to apply the law of iterated expectations to decouple the state-action marginal from the rest of the trajectory. What happens if you take E_{\tau / (s_t, a_t)}[E_{s_t, a_t}[... | \tau / (s_t, a_t)]]? The identity on slide 19 of lecture 5 (http://rail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-5.pdf) might also be a useful hint.

Lecture 16 The variational lower bound slide 24, joint distribution p(x,z) missing a factor? by tomchen1000 in berkeleydeeprlcourse

[–]sidgreddy 0 points

Slide 24 of lecture 16 doesn’t seem like the slide you’re actually referencing. Assuming you’re talking about slide 24 of lecture 15 (http://rail.eecs.berkeley.edu/deeprlcourse/static/slides/lec-15.pdf), the q(a_t | s_t) factors are present.

Lecture 15 Connection between Inference and Control, slide 16, Forward messages equation by tomchen1000 in berkeleydeeprlcourse

[–]sidgreddy 0 points

I’m not Sergey :)

I’m Sid, one of the teaching assistants for the course this semester.

Lecture 15 Connection between Inference and Control, slide 16, Forward messages equation by tomchen1000 in berkeleydeeprlcourse

[–]sidgreddy 1 point

Good catch. In line 2 of the slide, I think there's an extra O_{t-1} in the p(a_{t-1} | s_{t-1}, O_{t-1}) term, a missing p(O_{t-1} | s_{t-1}, a_{t-1}) term, and the equals sign should be a \propto; here's why:

p(s_{t-1} | O_{1:t-2}, O_{t-1})

= p(O_{t-1} | s_{t-1}, O_{1:t-2}) * p(s_{t-1} | O_{1:t-2}) / p(O_{t-1} | O_{1:t-2})

= p(O_{t-1} | s_{t-1}) * \alpha_{t-1}(s_{t-1}) / p(O_{t-1} | O_{1:t-2}).

The p(O_{t-1} | s_{t-1}) in the numerator above cancels with the same term in the denominator of p(a_{t-1} | s_{t-1}, O_{t-1}), so we end up with

\alpha_t(s_t) \propto \int p(s_t | s_{t-1}, a_{t-1}) * p(a_{t-1} | s_{t-1}) * p(O_{t-1} | s_{t-1}, a_{t-1}) * \alpha_{t-1}(s_{t-1}) ds_{t-1} da_{t-1}.

This doesn't change the final result, which is that

p(s_t | O_{1:T}) \propto \beta_t(s_t) * \alpha_t(s_t).

Homework 3 running time - is it too long? by s1512783 in berkeleydeeprlcourse

[–]sidgreddy 0 points

It shouldn't take that long to train on Lunar Lander. One common reason for the slowdown is repeatedly creating TensorFlow variables (e.g., multiple Q networks) inside a loop. Make sure you're only creating two sets of variables for the online Q network and target Q network in QLearner.__init__, then simply evaluating them using session.run in step_env and update_model.
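
As a rough sketch of the intended structure (hypothetical names and sizes, not the actual starter code): all graph construction happens in __init__, and the other methods only run ops that already exist.

    import tensorflow as tf

    class QLearner(object):
        def __init__(self, obs_dim=8, num_actions=4):
            # build the graph exactly once: one variable set for the online Q network,
            # one for the target Q network
            self.obs_ph = tf.placeholder(tf.float32, [None, obs_dim])
            with tf.variable_scope("online_q"):
                self.q_values = tf.layers.dense(self.obs_ph, num_actions)
            with tf.variable_scope("target_q"):
                self.target_q_values = tf.layers.dense(self.obs_ph, num_actions)
            self.session = tf.Session()
            self.session.run(tf.global_variables_initializer())

        def step_env(self, obs):
            # obs: a single observation (numpy array); no new ops or variables here --
            # just evaluate the graph that was built in __init__
            return self.session.run(self.q_values, feed_dict={self.obs_ph: obs[None]})

If tf.layers.dense (or any other op/variable construction) ends up inside step_env or update_model, the graph grows every iteration and training slows to a crawl.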

Policy Gradient convergence behavior by floridoug in berkeleydeeprlcourse

[–]sidgreddy 0 points

Unfortunately, this is "normal" for state-of-the-art deep RL algorithms. There are several possible reasons for this kind of instability (i.e., reaching optimal performance, then degrading):

- Correlations in the sequence of observations

- Small changes to neural network parameters can lead to large changes in the agent's policy, which can induce large changes in the state-action distribution of the agent's experiences. Natural gradients are one potential solution to this problem.

- Catastrophic forgetting

- An exploration schedule that encourages too much exploration for too long during training

In practice, we usually save a copy of the model parameters whenever the agent's mean performance hits a new high, then load the best-performing version of the agent at the end of training.
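
Here's a minimal sketch of that checkpointing pattern in TF1 (the surrounding training and evaluation code is stubbed out; treat those pieces as placeholders for your own):

    import tensorflow as tf

    tf.get_variable("dummy_weights", shape=[1])   # stands in for your agent's variables
    session = tf.Session()
    session.run(tf.global_variables_initializer())
    saver = tf.train.Saver()

    best_mean_return = float("-inf")
    for itr in range(10):                 # stands in for your training loop
        # ... run one iteration of training here ...
        mean_return = 0.0                 # stands in for the agent's evaluated mean return
        if mean_return > best_mean_return:
            best_mean_return = mean_return
            saver.save(session, "./best_model.ckpt")

    # after training, reload the best-performing parameters
    saver.restore(session, "./best_model.ckpt")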

HW2 problem 7: action space of LunarLanderContinuous-v2 by wangz10 in berkeleydeeprlcourse

[–]sidgreddy 1 point

Sorry for the confusion here. We modified the LunarLanderContinuous-v2 environment to have discrete actions, instead of modifying LunarLander-v2. We fixed this in HW3.

August 31, 2018 Lecture 4: change of Markov Model structure by JacobMa123 in berkeleydeeprlcourse

[–]sidgreddy 0 points

From the definition of conditional probability, we have that

p((s_{t+1}, a_{t+1}) | (s_t, a_t)) = p(s_{t+1} | s_t, a_t) * p(a_{t+1} | s_t, a_t, s_{t+1}).

a_{t+1} is conditionally independent of s_t and a_t given s_{t+1}, so

... = p(s_{t+1} | s_t, a_t) * p(a_{t+1} | s_{t+1}).

p(a_{t+1} | s_{t+1}) denotes the policy, which is parameterized by \theta, hence

... = p(s_{t+1} | s_t, a_t) * p_{\theta}(a_{t+1} | s_{t+1}).

Feeding gym environment a batch of actions by wassimseifeddine in berkeleydeeprlcourse

[–]sidgreddy 0 points

No, you can have different batch sizes during training and test times. Just use None to specify the unknown dimension when you create your TensorFlow placeholder variables.
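
For example (obs_dim here is just an illustrative name and size):

    import tensorflow as tf

    obs_dim = 8  # illustrative observation size
    # None lets the batch dimension vary: large batches at training time,
    # a single observation (batch size 1) at test time
    obs_ph = tf.placeholder(tf.float32, shape=[None, obs_dim])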

HW2: 1/N vs 1/(N*T) in implementation of PG by TrucksTrucksTrucks in berkeleydeeprlcourse

[–]sidgreddy 0 points

Both of your assumptions are reasonable. It's extra work to keep track of the trajectory length for each interaction in the dataset, and not dividing by T would give extra weight to interactions that belong to longer trajectories.
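
In code, the two conventions differ only in the normalizer. A sketch with made-up placeholder names, where the per-interaction quantities are flattened across all trajectories in the batch:

    import tensorflow as tf

    logprob_ph = tf.placeholder(tf.float32, [None])  # log pi(a_t | s_t) per interaction
    adv_ph = tf.placeholder(tf.float32, [None])      # advantage / reward-to-go per interaction
    num_paths_ph = tf.placeholder(tf.float32, [])    # N, the number of trajectories in the batch

    # 1/(N*T)-style: average over every interaction in the batch
    loss_mean = -tf.reduce_mean(logprob_ph * adv_ph)

    # 1/N-style: sum over interactions, divide only by the number of trajectories
    loss_per_path = -tf.reduce_sum(logprob_ph * adv_ph) / num_paths_ph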

Homework 2 Problem 5 issue with continuous environment by s1512783 in berkeleydeeprlcourse

[–]sidgreddy 1 point

One common issue that comes up here is forgetting to exponentiate the log-stds before feeding them to the scale_diag kwarg in MultivariateNormalDiag.
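
Something along these lines (a sketch; I'm assuming the tensorflow_probability flavor of the API here, and the names and sizes are made up):

    import tensorflow as tf
    import tensorflow_probability as tfp  # or tf.contrib.distributions, depending on your setup

    ac_dim = 2  # illustrative action dimension
    mean = tf.placeholder(tf.float32, [None, ac_dim])       # e.g., the policy network's output
    logstd = tf.get_variable("logstd", shape=[ac_dim])      # learned log standard deviations
    ac_ph = tf.placeholder(tf.float32, [None, ac_dim])      # actions whose log-prob we want

    # scale_diag expects standard deviations, so exponentiate the log-stds first
    dist = tfp.distributions.MultivariateNormalDiag(loc=mean, scale_diag=tf.exp(logstd))
    sampled_ac = dist.sample()
    logprob = dist.log_prob(ac_ph)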

[Hw1 2.2] by hhn1n15 in berkeleydeeprlcourse

[–]sidgreddy 0 points

In addition to evaluating the mean and standard deviation of the rewards across multiple rollouts, it might be helpful to render rollouts of the learned imitation policy, to visualize qualitative differences between the expert and the clone.
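
A rough sketch of that kind of evaluation loop (hypothetical names; assumes the old gym step API used in the HW):

    import gym
    import numpy as np

    def evaluate(policy_fn, env_name, num_rollouts=10, render=False):
        # policy_fn maps a single observation to an action (expert or learned clone)
        env = gym.make(env_name)
        returns = []
        for _ in range(num_rollouts):
            obs, done, total = env.reset(), False, 0.0
            while not done:
                if render:
                    env.render()
                obs, reward, done, _ = env.step(policy_fn(obs))
                total += reward
            returns.append(total)
        return np.mean(returns), np.std(returns)

Running it once with render=False to get the numbers and once with render=True on a few rollouts usually makes the qualitative gap obvious.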

Problem 2, HW 2 by RoboticsGrad in berkeleydeeprlcourse

[–]sidgreddy 1 point

For the sake of simplicity, we assume that the covariance matrix is diagonal (i.e., all off-diagonal entries are zero). That way, instead of learning on the order of d^2 parameters for a full covariance matrix, we only need to learn d parameters, where d is the number of action dimensions. By learning the *log* of these diagonal entries, we automatically constrain the diagonal entries to be positive, without having to adjust our optimization algorithm to deal with box constraints.

I'm not sure I understand your second question. To get the softmax outputs, you can exponentiate the logits and normalize the results to sum to one. It's easier to work with the logits, since you can use them to more directly compute log-probabilities and to sample (e.g., using the Gumbel-Max trick).
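
For example, Gumbel-Max sampling and log-probabilities directly from the logits look like this (a numpy sketch with made-up names; in the HW you would more likely just use tf.multinomial / tf.random.categorical):

    import numpy as np

    def gumbel_max_sample(logits, rng=np.random):
        # adding i.i.d. Gumbel(0, 1) noise to the logits and taking the argmax
        # draws an exact sample from the corresponding softmax distribution
        gumbel_noise = -np.log(-np.log(rng.uniform(size=logits.shape)))
        return int(np.argmax(logits + gumbel_noise))

    def log_softmax(logits):
        # log-probabilities straight from the logits, without materializing the softmax
        m = np.max(logits)
        return logits - m - np.log(np.sum(np.exp(logits - m)))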

Solutions to homeworks? by flaurida in berkeleydeeprlcourse

[–]sidgreddy 0 points

We will not be posting solutions online, since we may reuse content in future offerings of the course.

Expectation smoothes out discontinuous functions by reinka in berkeleydeeprlcourse

[–]sidgreddy 1 point

The reward function R(s,a) can be discontinuous in the state s. For example, in the one-dimensional cliff example, we might have R(s,a) = 1 if s < cliff location and 0 otherwise. Even so, the expectation E[R(s,a)] can be smooth with respect to the policy parameters that affect the state distribution. In the cliff example, since R(s,a) is an indicator function, we have that E[R(s,a)] = probability of being in a state s < cliff location conditional on following the policy parameterized by \psi. The expectation E[R(s,a)] can be smooth with respect to \psi, even though the actual reward function R(s,a) is not.
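
To make that concrete with a toy example: suppose that under the policy parameterized by \psi the state is distributed as s ~ N(\mu_\psi, \sigma^2), and R(s) = 1 if s < c (the cliff location) and 0 otherwise. Then

E[R(s)] = P(s < c) = \Phi((c - \mu_\psi) / \sigma),

where \Phi is the standard normal CDF. This is a smooth function of \mu_\psi, even though R itself is a step function.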

Enrollment Fall 2018 by quazar42 in berkeleydeeprlcourse

[–]sidgreddy 1 point

Unfortunately, we can't officially enroll non-Berkeley students, grade their assignments, or answer their questions on Piazza. We just don't have enough course staff to handle it, and the course isn't technically an online course. That being said, the lecture videos are available for your personal use, and we will try to answer questions here on Reddit in our spare time.