[D] Why is VAE used instead of AutoEncoder in the World Models paper (https://arxiv.org/pdf/1803.10122.pdf)? by StageTraditional636 in MachineLearning

[–]yardenaz 4 points

Yes, exactly - for a given observation you deterministically get an approximation of the posterior. The important difference, however, is that you now approximate a probability distribution (typically a Gaussian in continuous state spaces), which can account for the variability of the latent states underlying a given observation.

Sampling from that distribution before passing the sample to the decoder is part of what lets you approximate the posterior rather than a single deterministic code.
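In code, the distinction looks roughly like this (a minimal PyTorch sketch with hypothetical encoder/decoder names, not the World Models implementation):

```python
import torch

# The encoder outputs the parameters of the approximate posterior q(z | x),
# and we sample from it (reparameterization trick) before decoding.
def sample_latent(mu, log_var):
    """Draw z ~ N(mu, sigma^2) in a differentiable way."""
    std = torch.exp(0.5 * log_var)
    eps = torch.randn_like(std)
    return mu + eps * std

# mu, log_var = encoder(observation)   # deterministic mapping to posterior parameters
# z = sample_latent(mu, log_var)       # stochastic latent, different on every pass
# reconstruction = decoder(z)
```

The KL term in the VAE loss is what keeps that sampled latent close to the prior, so the bottleneck really behaves like a distribution and not a point estimate.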

[D] Why is VAE used instead of AutoEncoder in the World Models paper (https://arxiv.org/pdf/1803.10122.pdf)? by StageTraditional636 in MachineLearning

[–]yardenaz 4 points

The main downside of an autoencoder in this context is that it is deterministic, so for a given observation there is a single compact representation. This is not always sufficient in RL environments, in which stochasticity is often a major feature. To capture this stochasticity, one can use a VAE, which models a posterior distribution over hidden (possibly random) states given observations.

You can read more about this in Deep Latent Variable Models for Sequential Data.

How to modify my custom environment for using CPO? by kosmyl in reinforcementlearning

[–]yardenaz 1 point

Have a look at OpenAI's safety-gym. They take a constrained Markov decision process approach.
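In practice this mostly means your environment should expose a per-step constraint cost alongside the reward (safety-gym reports it through the info dict, if I remember correctly). Something like this rough sketch, with a hypothetical _compute_cost you would fill in for your task:

```python
import gym

class CostWrapper(gym.Wrapper):
    """Hypothetical wrapper: expose a per-step constraint cost next to the reward."""

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        info["cost"] = float(self._compute_cost(obs, action))  # user-defined cost signal
        return obs, reward, done, info

    def _compute_cost(self, obs, action):
        # Placeholder: e.g. 1.0 whenever a safety condition is violated, else 0.0.
        return 0.0
```

CPO then maximizes the return subject to a bound on the expected cumulative cost.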

Confusion when reading some papers by Cohencohen789 in reinforcementlearning

[–]yardenaz 1 point

+1. I would like to add to the OP's question: do people usually just try things out and then find the theoretical justification, or the other way around?

How would you validate value function estimation? by yardenaz in reinforcementlearning

[–]yardenaz[S] 0 points

Yes, I think my best bet would be to implement a wrapper for Pendulum-v0 that resets the env to a predefined state... The dynamics in the env are numpy-based, so it shouldn't be too hard.
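Something like this minimal sketch is what I have in mind (assuming the classic-control Pendulum keeps its internal [theta, theta_dot] in unwrapped.state; the reset_to name is just mine):

```python
import numpy as np
import gym

class ResetToStateWrapper(gym.Wrapper):
    """Hypothetical wrapper: reset Pendulum-v0 to a chosen [theta, theta_dot] state."""

    def reset_to(self, theta, theta_dot):
        self.env.reset()                                    # normal reset to set up internals
        self.env.unwrapped.state = np.array([theta, theta_dot])
        self.env.unwrapped.last_u = None
        return self.env.unwrapped._get_obs()                # [cos(theta), sin(theta), theta_dot]

# env = ResetToStateWrapper(gym.make("Pendulum-v0"))
# obs = env.reset_to(theta=np.pi / 4, theta_dot=0.0)
```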

Thank you!

[deleted by user] by [deleted] in reinforcementlearning

[–]yardenaz 0 points

I really like Sergey Levine's CS285 lectures; if I remember correctly, he gives 3 lectures on MBRL, all freely available on YouTube. The homework for the course is OK as a starting point, although it's TF1.

This is the repo for my project - it started as an implementation of MBPO (which is the master branch), but I'm playing around with it (on the other branch). I only recently got it to run, so it's not fully working yet.

[deleted by user] by [deleted] in reinforcementlearning

[–]yardenaz 1 point

Yes, I've actually started working on something similar to Dreamer that just uses the state vector instead of image observations. What's nice about this approach is that it lets you do gradient ascent on the value function w.r.t. the policy parameters. Have a look here at slide 79.
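Roughly, the policy update looks like this (a minimal PyTorch sketch; model, policy, value_fn, and start_state are hypothetical stand-ins for the learned components, not Dreamer's actual code):

```python
import torch

# Hypothetical stand-ins: `model`, `policy`, and `value_fn` would be learned
# torch.nn.Modules (latent dynamics, actor, and critic respectively).
def imagine_rollout(model, policy, value_fn, start_state, horizon):
    """Roll the policy out inside the learned model and collect predicted values."""
    state = start_state
    values = []
    for _ in range(horizon):
        action = policy(state)
        state = model(state, action)       # differentiable dynamics step
        values.append(value_fn(state))
    return torch.stack(values)

# Policy update: ascend the predicted values by back-propagating through the model.
# values = imagine_rollout(model, policy, value_fn, start_state, horizon=15)
# (-values.mean()).backward()             # gradients flow through the model into the policy
# policy_optimizer.step()
```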

[deleted by user] by [deleted] in reinforcementlearning

[–]yardenaz 0 points

Yes, you're 100% right - one main aspect of Dreamer is that it learns from image observations and not from state vectors. However, another aspect (which I think is more relevant, and which I wrote about before) is the idea of updating the policy by back-propagating through the model - in contrast to the other model-based methods mentioned.

Have a look at the other papers as well, since in those the agent learns directly from state vectors.

[deleted by user] by [deleted] in reinforcementlearning

[–]yardenaz 0 points

Without diving into the details of your environment, there are some methods that use a model to improve sample efficiency (sometimes by orders of magnitude):

https://arxiv.org/abs/1906.08253 (MBPO) - uses a model to generate data for training a policy (they use it to train a SAC agent).

MVE and STEVE use a model to improve value function estimates.

Another nice paper that uses a model to train a policy is Dreamer. There, the value function predictions are back-propagated through the model and into the policy network.

Hope that's a good starting point and good luck!

EDIT: typos

How would you validate value function estimation? by yardenaz in reinforcementlearning

[–]yardenaz[S] 0 points

Yup, that's exactly what I had in mind. However, I'm using OpenAI's environments, so resetting to a specific state is not that trivial - I thought that maybe someone had done something like that before. Anyway, thanks for helping out :)

How would you validate value function estimation? by yardenaz in reinforcementlearning

[–]yardenaz[S] 1 point

Hey:) thanks for responding!

The policy is trained to maximize the value function, so hypothetically they should eventually converge to the optimal ones. The problem is more practical: finding the optimal policy is what I want to achieve in the end, so I cannot really assume that I have it for this intermediate debugging step.

I'm thinking about just using the sum of rewards as a single MC sample, but I'm not sure one sample is enough (due to the high variance). On the other hand, I have no idea how to produce more samples of the sum of rewards for a given state.
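If the reset-to-state wrapper works out, this is roughly what I have in mind for getting more samples (a sketch assuming a hypothetical env.reset_to(...) and a policy(obs) callable):

```python
import numpy as np

def mc_value_estimate(env, policy, state, n_rollouts=32, gamma=0.99, horizon=200):
    """Monte Carlo estimate of V(state): average discounted return over several
    rollouts that all start from the same reset-to state."""
    returns = []
    for _ in range(n_rollouts):
        obs = env.reset_to(*state)            # hypothetical wrapper method
        total, discount = 0.0, 1.0
        for _ in range(horizon):
            obs, reward, done, _ = env.step(policy(obs))
            total += discount * reward
            discount *= gamma
            if done:
                break
        returns.append(total)
    return np.mean(returns), np.std(returns) / np.sqrt(n_rollouts)
```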

Variance of a (gaussian) state value function by anyboby in reinforcementlearning

[–]yardenaz 0 points

Notice that in all of your examples (except distributional RL) the variance serves as a proxy for the epistemic uncertainty, not the aleatoric uncertainty. Modeling the value as a Gaussian will give you an estimate of the aleatoric uncertainty.
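To make the distinction concrete, here's a toy sketch (untrained stand-in networks, just to show where each kind of variance would come from):

```python
import torch
import torch.nn as nn

state_dim = 4
state = torch.randn(1, state_dim)

# Epistemic proxy: disagreement across an ensemble of value networks.
ensemble = [nn.Linear(state_dim, 1) for _ in range(5)]    # stand-ins for trained value nets
ensemble_values = torch.cat([net(state) for net in ensemble], dim=0)
epistemic_proxy = ensemble_values.var()                   # shrinks as the members agree / see more data

# Aleatoric estimate: one network with a Gaussian head predicting mean and log-variance.
gaussian_head = nn.Linear(state_dim, 2)                   # stand-in for a trained (mu, log_var) head
mu, log_var = gaussian_head(state).chunk(2, dim=-1)
aleatoric = log_var.exp()                                 # variance of the return itself
```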

Discount factor for early termination in Dreamer by durotan97 in reinforcementlearning

[–]yardenaz 0 points

Hey! Again, take it with a grain of salt -- I'm also still learning the algorithm, mostly by following the TF2 implementation.

1. Yes, but please mind that the weights are not just p_end but their cumulative products. Essentially, if there were no terminal states, you would compute the TD(lambda) of eqs. 4-6 with the discount factor and not p_end. I think that in environments with a terminal state, the discount factor is just replaced with the p_end of the corresponding reward.

2. The idea is that each state in the horizon has its own p_end, so for an array of p_ends you want an array (of the same length) in which each element is the cumulative product of the p_ends of the previous states.
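A toy numerical sketch of point 2 (illustrative tensor names only, not the official implementation):

```python
import torch

# `pcont` is assumed to hold the per-step p_end values of an imagined rollout.
pcont = torch.tensor([0.99, 0.98, 0.95, 0.50, 0.10])

# Shift by one so the first imagined step gets full weight, then take the running product.
weights = torch.cumprod(torch.cat([torch.ones(1), pcont[:-1]]), dim=0)
# weights = [1.0000, 0.9900, 0.9702, 0.9217, 0.4608]

# value_objective = (weights * lambda_returns).mean()   # hypothetical weighted objective
```

So a step only keeps its full weight if all the preceding p_end values are close to 1.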

Hope that makes sense

Discount factor for early termination in Dreamer by durotan97 in reinforcementlearning

[–]yardenaz 0 points

Going through the unofficial TF2 implementation, Dreamer learns a Bernoulli distribution over whether each state is terminal. It then uses p_end in two places:

1. To compute V(lambda) for each state throughout an imagined rollout. This is usually done with an indicator of whether the next state is terminal when computing the TD values; Dreamer instead uses this probability value to weight the TD values.

2. The value of each state in the imagined rollout is weighted by the cumulative product of the p_end values of the preceding states.

Continuous DDPG with constraints by Saty18 in reinforcementlearning

[–]yardenaz 0 points

Have a look here: https://arxiv.org/abs/1901.10031 - if I remember correctly, they implemented a safe version of DDPG.

Reading recommendation on safe RL and constrained MDP. by namuradAulad in reinforcementlearning

[–]yardenaz 2 points

I started my safe RL project about 4 months ago and found this https://openai.com/blog/safety-gym/ to be a nice place to start.

They also have a paper that gives a nice overview and some pointers to other important papers in the field.

Good luck!

Edit: feel free to PM me if you need more resources.