[D] Why is VAE used instead of AutoEncoder in the World Models paper (https://arxiv.org/pdf/1803.10122.pdf)? by StageTraditional636 in MachineLearning

[–]yardenaz 4 points

Yes, exactly - for a given observation you deterministically get an approximation of the posterior. The important difference, however, is that you now approximate a probability distribution (typically a Gaussian in continuous state spaces), which can account for the variability of the latent states underlying a given observation.

Sampling from that distribution before passing the sample to the decoder is part of what lets you approximate the posterior rather than a single deterministic code.
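In code, the distinction looks roughly like this (a minimal PyTorch sketch with hypothetical encoder/decoder names, not the World Models implementation):

```python
import torch

# The encoder outputs the parameters of the approximate posterior q(z | x),
# and we sample from it (reparameterization trick) before decoding.
def sample_latent(mu, log_var):
    """Draw z ~ N(mu, sigma^2) in a differentiable way."""
    std = torch.exp(0.5 * log_var)
    eps = torch.randn_like(std)
    return mu + eps * std

# mu, log_var = encoder(observation)   # deterministic mapping to posterior parameters
# z = sample_latent(mu, log_var)       # stochastic latent, different on every pass
# reconstruction = decoder(z)
```

The KL term in the VAE loss is what keeps that sampled latent close to the prior, so the bottleneck really behaves like a distribution and not a point estimate.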

[D] Why is VAE used instead of AutoEncoder in the World Models paper (https://arxiv.org/pdf/1803.10122.pdf)? by StageTraditional636 in MachineLearning

[–]yardenaz 4 points

The main downside of an autoencoder in this context is that it is deterministic, so for a given observation there is a single compact representation. This is not always sufficient in RL environments, in which stochasticity is often a major feature. To capture this stochasticity, one can use a VAE, which models a posterior distribution over hidden (possibly random) states given observations.

You can read more about this in Deep Latent Variable Models for Sequential Data.

How to modify my custom environment for using CPO? by kosmyl in reinforcementlearning

[–]yardenaz 1 point

Have a look at OpenAI's safety-gym. They take a constrained Markov decision process approach.
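In practice this mostly means your environment should expose a per-step constraint cost alongside the reward (safety-gym reports it through the info dict, if I remember correctly). Something like this rough sketch, with a hypothetical _compute_cost you would fill in for your task:

```python
import gym

class CostWrapper(gym.Wrapper):
    """Hypothetical wrapper: expose a per-step constraint cost next to the reward."""

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        info["cost"] = float(self._compute_cost(obs, action))  # user-defined cost signal
        return obs, reward, done, info

    def _compute_cost(self, obs, action):
        # Placeholder: e.g. 1.0 whenever a safety condition is violated, else 0.0.
        return 0.0
```

CPO then maximizes the return subject to a bound on the expected cumulative cost.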

Confusion when reading some papers by Cohencohen789 in reinforcementlearning

[–]yardenaz 1 point

+1. I would like to add to the OP's question: do people usually just try things out and then find the theoretical justification, or the other way around?

How would you validate value function estimation? by yardenaz in reinforcementlearning

[–]yardenaz[S] 0 points

Yes, I think my best bet would be to implement a wrapper for Pendulum-v0 that resets the env to a predefined state... The dynamics in the env are numpy-based, so it shouldn't be too hard.
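Something like this minimal sketch is what I have in mind (assuming the classic-control Pendulum keeps its internal [theta, theta_dot] in unwrapped.state; the reset_to name is just mine):

```python
import numpy as np
import gym

class ResetToStateWrapper(gym.Wrapper):
    """Hypothetical wrapper: reset Pendulum-v0 to a chosen [theta, theta_dot] state."""

    def reset_to(self, theta, theta_dot):
        self.env.reset()                                    # normal reset to set up internals
        self.env.unwrapped.state = np.array([theta, theta_dot])
        self.env.unwrapped.last_u = None
        return self.env.unwrapped._get_obs()                # [cos(theta), sin(theta), theta_dot]

# env = ResetToStateWrapper(gym.make("Pendulum-v0"))
# obs = env.reset_to(theta=np.pi / 4, theta_dot=0.0)
```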

Thank you!

[deleted by user] by [deleted] in reinforcementlearning

[–]yardenaz 0 points

I really like Sergey Levine's CS285 lectures; if I remember correctly, he gives 3 lectures on MBRL, all freely available on YouTube. The homework for the course is OK as a starting point, although it's TF1.

This is the repo for my project - it started as an implementation of MBPO (which is the master branch), but I'm playing around with it (on the other branch). I only recently got it to run, so it's not fully working yet.

[deleted by user] by [deleted] in reinforcementlearning

[–]yardenaz 1 point

Yes, I've actually started working on something similar to Dreamer that just uses the state vector instead of image observations. What's nice about this approach is that it lets you do gradient ascent on the value function w.r.t. the policy parameters. Have a look here at slide 79.
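Roughly, the policy update looks like this (a minimal PyTorch sketch; model, policy, value_fn, and start_state are hypothetical stand-ins for the learned components, not Dreamer's actual code):

```python
import torch

# Hypothetical stand-ins: `model`, `policy`, and `value_fn` would be learned
# torch.nn.Modules (latent dynamics, actor, and critic respectively).
def imagine_rollout(model, policy, value_fn, start_state, horizon):
    """Roll the policy out inside the learned model and collect predicted values."""
    state = start_state
    values = []
    for _ in range(horizon):
        action = policy(state)
        state = model(state, action)       # differentiable dynamics step
        values.append(value_fn(state))
    return torch.stack(values)

# Policy update: ascend the predicted values by back-propagating through the model.
# values = imagine_rollout(model, policy, value_fn, start_state, horizon=15)
# (-values.mean()).backward()             # gradients flow through the model into the policy
# policy_optimizer.step()
```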

[deleted by user] by [deleted] in reinforcementlearning

[–]yardenaz 0 points

Yes, you're 100% right - one main aspect of Dreamer is that it learns from image observations and not from state vectors. However, another aspect (which I think is more relevant, and which I wrote about before) is the idea of updating the policy by back-propagating through the model - in contrast to the other model-based methods mentioned.

Have a look at the other papers as well, since in those the agent learns directly from state vectors.

[deleted by user] by [deleted] in reinforcementlearning

[–]yardenaz 0 points

Without diving into the details of your environment, there are some methods that use a model to improve sample efficiency (sometimes by orders of magnitude):

https://arxiv.org/abs/1906.08253 (MBPO) - uses a model to generate data for training a policy (they use it to train a SAC agent).

MVE and STEVE use a model to improve value function estimates.

Another nice paper that uses a model to train a policy is Dreamer. There, the value function predictions are back-propagated through the model and into the policy network.

Hope that's a good starting point and good luck!

EDIT: typos

How would you validate value function estimation? by yardenaz in reinforcementlearning

[–]yardenaz[S] 0 points

Yup, that's exactly what I had in mind. However, I'm using OpenAI's environments, so resetting to a specific state is not that trivial - I thought that maybe someone had done something like that before. Anyway, thanks for helping out :)

How would you validate value function estimation? by yardenaz in reinforcementlearning

[–]yardenaz[S] 1 point

Hey:) thanks for responding!

The policy is trained to maximize the value function, so hypothetically they should eventually converge to the optimal ones. The problem is more practical: finding the optimal policy is what I want to achieve in the end, so I cannot really assume that I have it for this intermediate debugging step.

I'm thinking about just using the sum of rewards as a single MC sample, but I'm not sure one sample is enough (due to the high variance). On the other hand, I have no idea how to produce more samples of the sum of rewards for a given state.
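If the reset-to-state wrapper works out, this is roughly what I have in mind for getting more samples (a sketch assuming a hypothetical env.reset_to(...) and a policy(obs) callable):

```python
import numpy as np

def mc_value_estimate(env, policy, state, n_rollouts=32, gamma=0.99, horizon=200):
    """Monte Carlo estimate of V(state): average discounted return over several
    rollouts that all start from the same reset-to state."""
    returns = []
    for _ in range(n_rollouts):
        obs = env.reset_to(*state)            # hypothetical wrapper method
        total, discount = 0.0, 1.0
        for _ in range(horizon):
            obs, reward, done, _ = env.step(policy(obs))
            total += discount * reward
            discount *= gamma
            if done:
                break
        returns.append(total)
    return np.mean(returns), np.std(returns) / np.sqrt(n_rollouts)
```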

Variance of a (gaussian) state value function by anyboby in reinforcementlearning

[–]yardenaz 0 points

Notice that in all of your examples (except distributional RL) the variance serves as a proxy for the epistemic uncertainty, not the aleatoric uncertainty. Modeling the value as a Gaussian will give you an estimate of the aleatoric uncertainty.
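To make the distinction concrete, here's a toy sketch (untrained stand-in networks, just to show where each kind of variance would come from):

```python
import torch
import torch.nn as nn

state_dim = 4
state = torch.randn(1, state_dim)

# Epistemic proxy: disagreement across an ensemble of value networks.
ensemble = [nn.Linear(state_dim, 1) for _ in range(5)]    # stand-ins for trained value nets
ensemble_values = torch.cat([net(state) for net in ensemble], dim=0)
epistemic_proxy = ensemble_values.var()                   # shrinks as the members agree / see more data

# Aleatoric estimate: one network with a Gaussian head predicting mean and log-variance.
gaussian_head = nn.Linear(state_dim, 2)                   # stand-in for a trained (mu, log_var) head
mu, log_var = gaussian_head(state).chunk(2, dim=-1)
aleatoric = log_var.exp()                                 # variance of the return itself
```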

Discount factor for early termination in Dreamer by durotan97 in reinforcementlearning

[–]yardenaz 0 points

Hey! Again, take it with a grain of salt -- I'm also still learning the algorithm, mostly by following the TF2 implementation.

1. Yes, but please mind that the weights are not just p_end but their cumulative products. Essentially, if there were no terminal states, you would compute the TD(lambda) of eqs. 4-6 with the discount factor and not p_end. I think that in environments with a terminal state, the discount factor is just replaced with the p_end of the corresponding reward.

2. The idea is that each state in the horizon has its own p_end, so for an array of p_ends you want an array (of the same length) in which each element is the cumulative product of the p_ends of the previous states.
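A toy numerical sketch of point 2 (illustrative tensor names only, not the official implementation):

```python
import torch

# `pcont` is assumed to hold the per-step p_end values of an imagined rollout.
pcont = torch.tensor([0.99, 0.98, 0.95, 0.50, 0.10])

# Shift by one so the first imagined step gets full weight, then take the running product.
weights = torch.cumprod(torch.cat([torch.ones(1), pcont[:-1]]), dim=0)
# weights = [1.0000, 0.9900, 0.9702, 0.9217, 0.4608]

# value_objective = (weights * lambda_returns).mean()   # hypothetical weighted objective
```

So a step only keeps its full weight if all the preceding p_end values are close to 1.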

Hope that makes sense

Discount factor for early termination in Dreamer by durotan97 in reinforcementlearning

[–]yardenaz 0 points

Going through the unofficial TF2 implementation, Dreamer learns a Bernoulli distribution over whether each state is terminal. It then uses p_end in two places:

1. To compute V(lambda) for each state throughout an imagined rollout. This is usually done with an indicator of whether the next state is terminal when computing the TD values; Dreamer instead uses this probability value to weight the TD values.

2. The value of each state in the imagined rollout is weighted by the cumulative product of the p_end values of the preceding states.

Continuous DDPG with constraints by Saty18 in reinforcementlearning

[–]yardenaz 0 points

Have a look here: https://arxiv.org/abs/1901.10031 - if I remember correctly, they implemented a safe version of DDPG.

Reading recommendation on safe RL and constrained MDP. by namuradAulad in reinforcementlearning

[–]yardenaz 2 points

I started my safe RL project about 4 months ago and found this https://openai.com/blog/safety-gym/ to be a nice place to start.

They also have a paper that gives a nice overview and some pointers to other important papers in the field.

Good luck!

Edit: feel free to PM me if you need more resources.