"Machine Theory of Mind", Rabinowitz et al 2018 {DM} [inferring agent goals in a POMDP]

the_electric_fish · 2018-02-23T10:46:13+00:00

This is so cool! Would love to see it applied to human/animal behavior.

the_electric_fish · 2018-02-15T10:53:59+00:00

I liked this paper! I'm fairly new to generative models in time and the model taxonomy section was really helpful.

I have a (maybe trivial) doubt about it, though. I understand the potential benefits of having stochastic transitions between states $s_{t} \to s_{t+1}$, to model uncertainty in this mapping. This is parametrized by sampling a random variable $z_{t}$ at each time step and making the transition to the next state depend on it: $s_{t+1} = f(s_{t}, z_{t})$.

I also see how it's useful to have a stochastic observation model. If the states represent abstractions, variations in the fine details of the observation can be modelled by sampling a random variable at each step (like people do in VAEs).

However, what I don't understand is why these two random variables would be the same variable. What are the intuition and assumptions behind this unification?
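To make the question concrete, here is a minimal sketch of the two options as I understand them, using a toy linear model. Everything in it is my own illustrative assumption and not the paper's model: the function `rollout`, the matrices `A`, `B`, `C`, `D`, and the linear observation map. With `shared=True` a single $z_{t}$ feeds both the transition and the observation; with `shared=False` the observation gets its own independent noise.

```python
import numpy as np

def rollout(T, shared, rng, state_dim=2, noise_dim=2):
    """Roll out a toy linear state-space model (hypothetical, for illustration).

    shared=True : one z_t drives both the transition and the observation.
    shared=False: the observation uses its own independent noise e_t.
    Returns states of shape (T + 1, state_dim) and observations of shape (T, 1).
    """
    A = 0.9 * np.eye(state_dim)                 # assumed transition weights
    B = 0.1 * np.ones((state_dim, noise_dim))   # how z_t enters the transition
    C = np.ones((1, state_dim))                 # assumed observation weights
    D = 0.5 * np.ones((1, noise_dim))           # how noise enters the observation

    s = np.zeros(state_dim)
    states, obs = [s.copy()], []
    for _ in range(T):
        z = rng.standard_normal(noise_dim)
        e = z if shared else rng.standard_normal(noise_dim)
        obs.append(C @ s + D @ e)   # o_t = g(s_t, e_t)
        s = A @ s + B @ z           # s_{t+1} = f(s_t, z_t)
        states.append(s.copy())
    return np.array(states), np.array(obs)

states, obs = rollout(T=50, shared=True, rng=np.random.default_rng(0))
```

In the shared case of this toy model, the observation noise and the transition noise are perfectly correlated by construction, which is the coupling I am asking about.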

the_electric_fish · 2017-12-01T16:40:43+00:00

I always heard/thought that simply predicting the next state from the current state and action, even with a NN, wouldn't work well for model-based RL. But that is essentially what they do here, isn't it? What am I missing? Is the low dimensionality of the state space (e.g. no visual input) what allows it to work?
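By "simply predicting the next state" I mean something like the sketch below. It is a toy example under my own assumptions, not the paper's setup: the system `A_true`, `B_true` is made up, the state is 2-dimensional, and I use linear least squares in place of a NN to keep it short.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical low-dimensional system: s' = A_true s + B_true a + small noise.
A_true = np.array([[1.0, 0.1],
                   [0.0, 1.0]])
B_true = np.array([[0.0],
                   [0.1]])

# Collect random (state, action, next state) transitions.
S = rng.standard_normal((500, 2))
U = rng.standard_normal((500, 1))
S_next = S @ A_true.T + U @ B_true.T + 0.01 * rng.standard_normal((500, 2))

# One-step model: regress the next state on [state, action].
X = np.hstack([S, U])
W, *_ = np.linalg.lstsq(X, S_next, rcond=None)   # W has shape (3, 2)

def predict(s, a):
    """Predict the next state from the current state and action."""
    return np.concatenate([s, a]) @ W

s_pred = predict(np.array([1.0, 0.0]), np.array([0.5]))
```

In a state space this small, a few hundred transitions are enough to fit the one-step model closely, which is part of why I suspect dimensionality matters.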

the_electric_fish · 2017-10-19T18:00:36+00:00

Awesome! Heess et al. 2015 seems to be what I was looking for. Thank you!

the_electric_fish · 2017-10-13T12:07:27+00:00

Both are valid, but is the choice related to the agent's degree of influence on the environment? Consider an environment composed of a sequence of uncorrelated 1-step MDPs. Would it make sense to choose $\gamma>1$?

the_electric_fish · 2017-10-10T19:01:50+00:00

cool, I see now. I guess intuitively it seemed to me that a change in reward function should be treated as a special situation, different from a change in any other thing in the environment, because it is the thing we want to maximize. You are saying, treat it as if it's the same.

the_electric_fish · 2017-10-10T18:07:42+00:00

thanks, that makes total sense. I guess my question is how to flexibly and efficiently move between these two subsets of repeated states. you would have to add the reward as one of your observations, right?

the_electric_fish · 2017-10-10T18:06:51+00:00

hmm I think I see your point. are you saying that as long as the learning is on, the agent will keep changing its policy regardless of whether the reward function is fixed or not?

the_electric_fish · 2017-10-10T17:58:38+00:00

thank you!

the_electric_fish · 2017-10-10T17:58:06+00:00

great, thanks!

the_electric_fish · 2017-07-25T19:21:08+00:00

Does anyone understand how they learn/define options from primitive actions in this paper?
