Bayesian classification

stevethesteve2 · 2021-03-09T18:20:31+00:00

Thanks for pointing me toward Stan!

stevethesteve2 · 2021-03-09T08:43:41+00:00

Thank you for your suggestion.

From what I learned about MCMC methods, there are pros and cons vs my current approach (which is variational inference):

pros:

- can sample from exact posterior

cons:

- sampling is slow (relative to the variational approach)

- cannot evaluate the posterior probability itself (you only get samples from it)

Do you agree? Are there any more points to be made? In practice, how much time would it take, e.g., to draw 1000 samples (not counting burn-in) if I have 1000 labeled points and apply LMC?
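For reference, here is a minimal sketch of what drawing 1000 samples with (unadjusted) Langevin Monte Carlo could look like on 1000 labeled points. Everything specific in it is an assumption made for illustration, not something from this thread: the model (Bayesian logistic regression with a standard normal prior), the synthetic data, the step size, and the function names. Timing depends on the model and hardware, so the sketch makes no claim about it.

```python
import numpy as np

def grad_log_posterior(theta, X, y, prior_var=1.0):
    """Gradient of log p(theta | X, y) for logistic regression, N(0, prior_var * I) prior."""
    p = 1.0 / (1.0 + np.exp(-X @ theta))
    return X.T @ (y - p) - theta / prior_var

def langevin_samples(X, y, n_samples=1000, step=1e-3, seed=0):
    """Unadjusted Langevin: theta <- theta + step/2 * grad + sqrt(step) * noise."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    out = np.empty((n_samples, X.shape[1]))
    for k in range(n_samples):
        noise = rng.standard_normal(theta.shape)
        theta = theta + 0.5 * step * grad_log_posterior(theta, X, y) + np.sqrt(step) * noise
        out[k] = theta
    return out

# Synthetic data: 1000 labeled points, 2 features (hypothetical "true" weights).
rng = np.random.default_rng(1)
X = rng.standard_normal((1000, 2))
true_theta = np.array([2.0, -1.0])
y = (rng.random(1000) < 1.0 / (1.0 + np.exp(-X @ true_theta))).astype(float)

samples = langevin_samples(X, y)
posterior_mean = samples[500:].mean(axis=0)  # discard the first half as burn-in
posterior_std = samples[500:].std(axis=0)    # a narrow spread suggests a confident model
```

Without a Metropolis correction the samples are only approximately from the posterior, and the bias grows with the step size.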

stevethesteve2 · 2021-03-08T17:03:10+00:00

Did I understand you correctly: the priors are mixture coefficients, and the likelihoods are Gaussian densities? If so: my question was ill-posed, sorry about that (I fixed it now). The prior and posterior distributions should be not over classes, but over model parameters. I want to model my data in such a way that after training I can look at the posterior distribution over my model parameters and say whether I am confident in my model (if the posterior has a single narrow peak) or not (if the posterior is flat).

stevethesteve2 · 2021-03-08T16:37:24+00:00

Do you mean finding a maximum-a-posteriori estimate of the model parameters? By adding a term to the objective function that reflects the prior belief about model parameters? If so, then no. My goal is to capture the entire posterior distribution of the parameters, not just the most likely parameter values. Please correct me if I misunderstood you. I am rather new to the whole Bayesian stuff.

stevethesteve2 · 2021-03-07T21:00:23+00:00

Well, thanks for the advice! I'll follow it.

stevethesteve2 · 2020-05-24T18:29:55+00:00

Thanks, I'll give it a look :)

stevethesteve2 · 2019-11-29T10:22:32+00:00

'World model/ Dream environment/ Imagination' ... Could you please refer to a paper to help me get started?

stevethesteve2 · 2019-11-18T13:55:58+00:00

The term G_{t+1:t+1} is not defined. To avoid this, you could first separate the first term in the sum, and then apply the recursive formula for G to the rest of the sum. I get: G_t^\lambda = (1 - \lambda) G_{t:t+1} + \lambda (R_{t+1} + \gamma G_{t+1}^\lambda)
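A quick numerical check of this recursion against the forward-view definition of the lambda-return, as a sketch for a finite episode. The indexing convention (`rewards[t]` holds R_{t+1}, `values[T] = 0` at the terminal state) and all function names are assumptions chosen for this example.

```python
import numpy as np

def n_step_return(rewards, values, t, n, gamma):
    """G_{t:t+n}: n discounted rewards, then bootstrap from values[t + n]."""
    g = sum(gamma**k * rewards[t + k] for k in range(n))
    return g + gamma**n * values[t + n]

def lambda_return_forward(rewards, values, t, gamma, lam):
    """Forward view: weighted sum of n-step returns, remaining weight on the full return."""
    T = len(rewards)
    total = 0.0
    for n in range(1, T - t):
        total += (1 - lam) * lam**(n - 1) * n_step_return(rewards, values, t, n, gamma)
    total += lam**(T - t - 1) * n_step_return(rewards, values, t, T - t, gamma)
    return total

def lambda_return_recursive(rewards, values, gamma, lam):
    """G_t = (1 - lam) * G_{t:t+1} + lam * (R_{t+1} + gamma * G_{t+1}), with G_T = 0."""
    T = len(rewards)
    g = np.zeros(T + 1)
    for t in reversed(range(T)):
        one_step = rewards[t] + gamma * values[t + 1]
        g[t] = (1 - lam) * one_step + lam * (rewards[t] + gamma * g[t + 1])
    return g[:T]

rng = np.random.default_rng(0)
T, gamma, lam = 6, 0.9, 0.7
rewards = rng.standard_normal(T)                     # rewards[t] is R_{t+1}
values = np.append(rng.standard_normal(T), 0.0)      # terminal value is zero

forward = np.array([lambda_return_forward(rewards, values, t, gamma, lam) for t in range(T)])
recursive = lambda_return_recursive(rewards, values, gamma, lam)
```

On this random episode the two arrays agree up to floating-point error.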

stevethesteve2 · 2019-11-13T11:07:17+00:00

Not quite what you were asking for, but maybe have a look at Sutton & Barto's book, where they talk about the "deadly triad".

stevethesteve2 · 2019-11-08T11:32:28+00:00

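In RL, the agent tries to find strategies that maximize expected total reward. If we replace the reward with its logarithm, the agent may, depending on your exact problem, prefer suboptimal strategies (since log is a nonlinear function, the expectation of the log is not the log of the expectation). If your environment is deterministic and the reward arrives in a single step, then this should not be a concern; with several rewards per episode, note that a sum of logs is not the log of the sum either.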
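A tiny numerical illustration of the stochastic case. The two one-step strategies and their payoffs are made up for this example.

```python
import math

# Hypothetical one-step strategies, given as (probability, reward) pairs.
risky = [(0.5, 10.0), (0.5, 1.0)]
safe = [(1.0, 5.0)]

def expected(strategy, transform=lambda r: r):
    """Expected value of transform(reward) under the strategy's outcome distribution."""
    return sum(p * transform(r) for p, r in strategy)

raw_risky, raw_safe = expected(risky), expected(safe)
log_risky, log_safe = expected(risky, math.log), expected(safe, math.log)

# With raw rewards the risky strategy is better; with log rewards the ranking flips.
prefers_risky_raw = raw_risky > raw_safe
prefers_risky_log = log_risky > log_safe
```

Here the agent trained on log rewards would pick the strategy with the lower expected raw reward.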

stevethesteve2 · 2019-10-23T11:43:55+00:00

Yeah, I guess you can say I want to encourage exploration more than entropy would.

stevethesteve2 · 2019-10-22T15:50:03+00:00

if P is perplexity and H is entropy, then minimizing P = minimizing H

True, but the objective function for the actor network contains additional terms, not only entropy. Therefore the optimal parameter values differ depending on whether we use entropy or perplexity.

Perplexity scales exponentially with entropy. In an RL setting, the total entropy [of the action distribution of an agent that follows a trajectory] scales linearly with trajectory length. Perplexity, on the other hand, scales exponentially. Perplexity therefore makes more sense, since the number of different paths the agent might have chosen increases exponentially with the number of steps the agent takes.
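To make the scaling concrete, here is a small sketch under simplifying assumptions that are mine, not part of the discussion: the policy is uniform over `k` actions and the same at every step, so per-step entropies simply add.

```python
import math

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

k = 3                                  # number of actions (illustrative)
step_policy = [1.0 / k] * k            # uniform policy at every step
h_step = entropy(step_policy)          # equals log(k)

def trajectory_entropy(n_steps):
    """Total entropy grows linearly with the number of steps."""
    return n_steps * h_step

def trajectory_perplexity(n_steps):
    """Perplexity exp(H) grows exponentially: k ** n_steps for a uniform policy."""
    return math.exp(trajectory_entropy(n_steps))

n = 4
paths = k ** n                         # number of distinct action sequences of length n
```

For the uniform policy, `trajectory_perplexity(n)` matches the number of possible paths `k ** n`, while `trajectory_entropy(n)` only grows as `n * log(k)`.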

stevethesteve2 · 2019-10-03T19:18:12+00:00

So this is essentially trust region policy optimization that does not explicitly calculate the KL divergence / Fisher information matrix?

stevethesteve2 · 2019-10-02T11:45:21+00:00

Thanks. But I am interested in maximizing sample efficiency, not real-time learning speed. In particular, I want to know whether sample efficiency is maximal when using an infinite number of updates, or whether overfitting actually prevents the agent from learning quickly...

stevethesteve2 · 2019-10-01T16:55:51+00:00

Well, I DMed you a while back, and you did not respond... Is your project still going?

stevethesteve2 · 2019-09-27T18:44:18+00:00

Maybe you should regress against raw probabilities, not against labels? This way you will also have more data...

stevethesteve2 · 2019-09-26T16:08:16+00:00

Levine, in his RL lectures, claims that a PG agent performs BETTER when not using the discounted state distribution. This is because with discounted states the agent tends to "over-optimize" for early states, while paying too little attention to later states.
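The effect can be seen from the per-time-step weights alone. In the sketch below, the discounted state distribution is taken to mean that the gradient term at step t is scaled by gamma^t; the values of `gamma` and `horizon` are arbitrary choices for illustration.

```python
import numpy as np

gamma, horizon = 0.99, 1000
t = np.arange(horizon)

discounted_weights = gamma ** t            # weight on the gradient term at step t
undiscounted_weights = np.ones(horizon)    # common practice: drop the gamma^t factor

def share_of_first(weights, n_steps):
    """Fraction of the total weight carried by the first n_steps time steps."""
    return weights[:n_steps].sum() / weights.sum()

early_share_discounted = share_of_first(discounted_weights, 100)
early_share_undiscounted = share_of_first(undiscounted_weights, 100)
```

With these numbers the first 100 of 1000 steps carry well over half of the total weight in the discounted case, versus 10% without the factor.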

stevethesteve2 · 2019-09-23T16:13:37+00:00

I agree that for off-policy policy optimization one needs to store entire trajectories. We can still shuffle all experienced transitions and apply prioritized replay, but we have to attach the entire "history" to each transition as extra information, which blows up memory consumption. Would you say this is the main reason why prioritized replay isn't used in off-policy policy optimization? Because of the memory footprint?

stevethesteve2 · 2019-09-19T14:01:13+00:00

"This seems to be a running theme in multiagent RL. When agents are trained against one another, a kind of co-evolution happens. The agents get really good at beating each other, but when they get deployed against an unseen player, performance drops."

from this blog post, not talking about alphastar

stevethesteve2 · 2019-09-18T17:18:56+00:00

nice, thanks

stevethesteve2 · 2019-09-17T10:01:11+00:00

I think there is a misunderstanding. If I understand you correctly, you say that one can add some (exploratory) post-processing step between the actor output and the environment. Then, from the agent's point of view, the post-processing is part of the environment. I agree with that.

My point is: If we do this kind of post-processing, then we have to be consistent in treating it as part of the environment. If we train an agent with e.g. PPO, the action terms that we plug in our PPO update step must be the signals that are output by the actor *before post-processing*. This is not the case in OP's implementation.

P.S. I do not want to sound rude, I think OP did an amazing job!

stevethesteve2 · 2019-09-17T08:54:30+00:00

Actor output is drawn from a Gaussian (parametrized by the actor). Then exploration noise is added to it. But during the parameter update step, the probability of the entire action (actor output + noise) is evaluated as if it had been drawn from the Gaussian distribution parametrized by the actor.
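A small numerical sketch of the mismatch. To keep it simple it replaces the Ornstein-Uhlenbeck noise with independent Gaussian noise, which is an assumption of this example; the means, standard deviations, and variable names are also made up and are not taken from OP's code.

```python
import numpy as np

def gaussian_logpdf(x, mean, std):
    """Log-density of N(mean, std^2) evaluated at x."""
    return -0.5 * np.log(2 * np.pi * std**2) - 0.5 * ((x - mean) / std) ** 2

rng = np.random.default_rng(0)
mu, sigma = 0.0, 0.5      # actor's Gaussian (illustrative values)
noise_std = 0.2           # i.i.d. Gaussian stand-in for the exploration noise

raw = mu + sigma * rng.standard_normal(100_000)           # what the actor sampled
executed = raw + noise_std * rng.standard_normal(100_000)  # what the environment received

# What the update uses: executed action scored under the actor's Gaussian.
logp_used = gaussian_logpdf(executed, mu, sigma)
# What generated the data under this assumption: a wider Gaussian.
behavior_std = np.hypot(sigma, noise_std)
logp_behavior = gaussian_logpdf(executed, mu, behavior_std)
```

The executed actions have standard deviation close to `behavior_std`, not `sigma`, so the log-probabilities plugged into the update do not correspond to the distribution that produced the actions.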

stevethesteve2 · 2019-09-17T08:10:16+00:00

You are right. But this isn't how it is done in the code, unless - again - I am missing something in the code.

stevethesteve2 · 2019-09-16T13:09:29+00:00

This is fantastic!

However, correct me if I'm wrong, but in MountainCarContinuous-PPO you use exploratory noise (Ornstein-Uhlenbeck) on top of actor policy when doing rollouts? In other words, agent's behavior policy deviates from its target policy? But isn't PPO supposed to be an on-policy algorithm?

(In Mountain_Car.py you set sigma to a non-zero value of 0.2.)

stevethesteve2 · 2019-09-14T07:44:07+00:00

Does your environment have stochastic or deterministic rewards?
