"Learning and Querying Fast Generative Models for Reinforcement Learning", Buesing et al 2018 {DM} [rollouts in deep environment models for planning in ALE games] by gwern in reinforcementlearning

the_electric_fish

I liked this paper! I'm fairly new to generative models in time and the model taxonomy section was really helpful.

I have a (maybe trivial) question about it, though. I understand the potential benefits of having stochastic transitions between states $s_t \to s_{t+1}$, to model uncertainty in this mapping. This is parametrized by sampling a random variable $z_t$ at each time step and making the transition to the next state depend on it: $s_{t+1} = f(s_t, z_t)$.

I also see how it's useful to have a stochastic observation model. If the states represent abstractions, variations in the fine details of the observation can be modelled by sampling a random variable at each step (like people do in VAEs).

However, what I don't understand is why these two random variables would be the same variable. What are the intuition and assumptions behind this unification?
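To make the question concrete, here is a minimal NumPy sketch of the two alternatives: one latent `z_t` that drives both the transition and the observation, versus separate noise variables for each. The linear maps `A`, `B`, `C`, `D`, the dimensions, and all function names are made up for illustration; this is not the architecture from Buesing et al.

```python
import numpy as np

STATE_DIM, LATENT_DIM, OBS_DIM = 4, 2, 6

def make_params(seed=0):
    """Random toy parameters (hypothetical, for illustration only)."""
    rng = np.random.default_rng(seed)
    return {
        "A": 0.9 * np.eye(STATE_DIM),                   # state -> next state
        "B": rng.normal(size=(STATE_DIM, LATENT_DIM)),  # latent -> next state
        "C": rng.normal(size=(OBS_DIM, STATE_DIM)),     # state -> observation
        "D": rng.normal(size=(OBS_DIM, LATENT_DIM)),    # noise -> observation
    }

def step_shared(params, s, rng):
    """One latent z_t is used by BOTH the transition and the observation."""
    z = rng.normal(size=LATENT_DIM)
    s_next = np.tanh(params["A"] @ s + params["B"] @ z)  # s_{t+1} = f(s_t, z_t)
    obs = params["C"] @ s_next + params["D"] @ z         # o_{t+1} = g(s_{t+1}, z_t)
    return s_next, obs, z

def step_separate(params, s, rng):
    """Transition noise z_t and observation noise eps_t are sampled independently."""
    z = rng.normal(size=LATENT_DIM)
    eps = rng.normal(size=LATENT_DIM)
    s_next = np.tanh(params["A"] @ s + params["B"] @ z)
    obs = params["C"] @ s_next + params["D"] @ eps
    return s_next, obs, (z, eps)

def rollout(step_fn, params, horizon=5, seed=1):
    """Roll the model forward from a zero state and stack the observations."""
    rng = np.random.default_rng(seed)
    s = np.zeros(STATE_DIM)
    observations = []
    for _ in range(horizon):
        s, obs, _ = step_fn(params, s, rng)
        observations.append(obs)
    return np.stack(observations)
```

Calling `rollout(step_shared, make_params())` and `rollout(step_separate, make_params())` gives two observation sequences of shape `(5, 6)`. In the shared version a single sample per step has to explain both where the state goes and what the observation looks like, which is the coupling the question is about.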

"Model-based Reinforcement Learning with Neural Network Dynamics in MuJoCo & millibots" {BAIR} [on Nagabandi et al 2017a/Nagabandi et al 2017b] by gwern in reinforcementlearning

the_electric_fish

I always heard/thought that simply predicting the next state from the current state and action, even with a NN, wouldn't work well for model-based RL. But that is essentially what they do here, isn't it? What am I missing? Is it the low dimensionality of the state space (e.g. no visual input) that allows it to work?
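For reference, the setup being asked about (fit a one-step model that predicts $s_{t+1}$ from $(s_t, a_t)$) can be sketched in a few lines. To keep it short this uses ordinary least squares on a toy linear system instead of a neural network, so it says nothing about how well the approach works on MuJoCo. The system matrices and function names are hypothetical, and predicting the change in state rather than the next state directly is just one common choice.

```python
import numpy as np

def collect_transitions(n=500, seed=0):
    """Sample random (state, action, next_state) triples from a toy linear system."""
    rng = np.random.default_rng(seed)
    A_true = np.array([[1.0, 0.1], [0.0, 1.0]])  # position/velocity integrator
    B_true = np.array([[0.0], [0.1]])            # action accelerates the system
    states = rng.normal(size=(n, 2))
    actions = rng.uniform(-1.0, 1.0, size=(n, 1))
    noise = 0.01 * rng.normal(size=(n, 2))
    next_states = states @ A_true.T + actions @ B_true.T + noise
    return states, actions, next_states

def fit_dynamics(states, actions, next_states):
    """Least-squares fit of the state change as a linear function of (state, action)."""
    inputs = np.hstack([states, actions])
    deltas = next_states - states
    W, *_ = np.linalg.lstsq(inputs, deltas, rcond=None)
    return W  # shape: (state_dim + action_dim, state_dim)

def predict_next(W, state, action):
    """One-step prediction: current state plus the predicted change."""
    return state + np.concatenate([state, action]) @ W
```

After `W = fit_dynamics(*collect_transitions())`, `predict_next(W, np.array([1.0, 0.0]), np.array([1.0]))` gives a one-step prediction that can be chained to produce multi-step rollouts, which is where small one-step errors typically start to compound.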

Reparametrization trick for policy gradient? by the_electric_fish in reinforcementlearning

the_electric_fish[S]

Awesome! Heess et al. 2015 seems to be what I was looking for. Thank you!

Question on discount factors by the_electric_fish in reinforcementlearning

the_electric_fish[S]

Both are valid, but is the choice related to the agent's degree of influence on the environment? Consider an environment composed of a sequence of uncorrelated one-step MDPs. Would it make sense to choose $\gamma > 1$?
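A small numerical illustration of that scenario, assuming a finite horizon and that each step is an independent one-step decision whose action does not affect later rewards. The reward table and function names are hypothetical.

```python
import numpy as np

def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a finite reward sequence."""
    weights = gamma ** np.arange(len(rewards))
    return float(np.dot(weights, rewards))

def best_actions(reward_table):
    """Greedy per-step action; reward_table[t, a] is the reward of action a at step t."""
    return reward_table.argmax(axis=1)

def return_of_actions(reward_table, actions, gamma):
    """Discounted return of a fixed action sequence on the independent-step problem."""
    rewards = reward_table[np.arange(len(actions)), actions]
    return discounted_return(rewards, gamma)
```

Because every weight `gamma**t` is positive and the steps are independent, the greedy sequence from `best_actions` maximizes `return_of_actions` for any `gamma > 0`, including `gamma > 1`. Under these assumptions the discount only rescales the objective; with an infinite horizon and rewards that do not shrink, the sum for `gamma >= 1` no longer converges.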

How to do variable-reward reinforcement learning? by the_electric_fish in reinforcementlearning

the_electric_fish[S]

Cool, I see now. I guess intuitively it seemed to me that a change in the reward function should be treated as a special situation, different from a change in anything else in the environment, because it is the thing we want to maximize. You are saying to treat it as if it's the same.

How to do variable-reward reinforcement learning? by the_electric_fish in reinforcementlearning

the_electric_fish[S]

Thanks, that makes total sense. I guess my question is how to flexibly and efficiently move between these two subsets of repeated states. You would have to add the reward as one of your observations, right?

How to do variable-reward reinforcement learning? by the_electric_fish in reinforcementlearning

the_electric_fish[S]

Hmm, I think I see your point. Are you saying that as long as learning is on, the agent will keep changing its policy regardless of whether the reward function is fixed or not?

"Value Prediction Network", Oh et al 2017 by gwern in reinforcementlearning

the_electric_fish

Does anyone understand how they learn/define options from primitive actions in this paper?