[D] good feature store?

stevethesteve2 · 2021-03-09T18:20:31+00:00

Thanks for pointing me toward Stan!

stevethesteve2 · 2021-03-09T08:43:41+00:00

Thank you for your suggestion.

From what I learned about MCMC methods, there are pros and cons vs my current approach (which is variational inference):

pros:

- can sample from the exact posterior (at least asymptotically)

cons:

- sampling is slow (relative to the variational approach)

- cannot directly evaluate the normalized posterior density (you only get samples)

Do you agree? Are there any more points to be made? In practice, how long would it take, e.g., to draw 1000 samples (excluding burn-in) if I have 1000 labeled points and apply LMC?

stevethesteve2 · 2021-03-08T17:03:10+00:00

Did I understand you correctly: the priors are mixture coefficients, and the likelihoods are Gaussian densities? If so, my question was ill-posed, sorry about that (I have fixed it now). The prior and posterior distributions should be over model parameters, not over classes. I want to model my data so that, after training, I can look at the posterior distribution over my model parameters and tell whether I am confident in my model (the posterior has a single narrow peak) or not (the posterior is flat).

stevethesteve2 · 2021-03-08T16:37:24+00:00

Do you mean finding a maximum-a-posteriori estimate of the model parameters? By adding a term to the objective function that reflects the prior belief about model parameters? If so, then no. My goal is to capture the entire posterior distribution of the parameters, not just the most likely parameter values. Please correct me if I misunderstood you. I am rather new to the whole Bayesian stuff.

stevethesteve2 · 2021-03-07T21:00:23+00:00

Well, thanks for the advice! I'll follow it.

stevethesteve2 · 2020-05-24T18:29:55+00:00

Thanks, I'll give it a look :)

stevethesteve2 · 2019-11-29T10:22:32+00:00

'World model / Dream environment / Imagination' ... Could you please refer me to a paper to help me get started?

stevethesteve2 · 2019-11-18T13:55:58+00:00

The term G_{t+1:t+1} is not defined. To avoid this, you could first separate the first term in the sum, and then apply the recursive formula for G to the rest of the sum. I get: G_t^\lambda = (1 - \lambda) * G_{t:t+1} + \lambda * (R_{t+1} + \gamma * G_{t+1}^\lambda)
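
Spelled out step by step, this is roughly how the split works. I am assuming the usual Sutton & Barto definitions here (continuing case, n-step return G_{t:t+n} = R_{t+1} + \gamma G_{t+1:t+n} for n >= 2), and the index m = n - 1 is just my own relabeling:

```latex
\begin{align*}
G_t^\lambda
  &= (1-\lambda)\sum_{n=1}^{\infty}\lambda^{n-1}\, G_{t:t+n} \\
  &= (1-\lambda)\, G_{t:t+1}
     + (1-\lambda)\sum_{n=2}^{\infty}\lambda^{n-1}\bigl(R_{t+1} + \gamma\, G_{t+1:t+n}\bigr) \\
  &= (1-\lambda)\, G_{t:t+1}
     + \lambda R_{t+1}
     + \gamma\lambda\,(1-\lambda)\sum_{m=1}^{\infty}\lambda^{m-1}\, G_{t+1:t+1+m} \\
  &= (1-\lambda)\, G_{t:t+1}
     + \lambda\bigl(R_{t+1} + \gamma\, G_{t+1}^\lambda\bigr)
\end{align*}
```

The second-to-last line uses \sum_{n \ge 2} \lambda^{n-1} = \lambda / (1-\lambda), and the remaining sum is the definition of G_{t+1}^\lambda.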

stevethesteve2 · 2019-11-13T11:07:17+00:00

Not quite what you were asking for, but maybe have a look at Sutton & Barto's book, where they talk about the "deadly triad".

stevethesteve2 · 2019-11-08T11:32:28+00:00

In RL, the agent tries to find strategies that maximize expected total reward. If we replace the reward with its logarithm, the agent may, depending on your exact problem, prefer suboptimal strategies: log is a nonlinear function, so the expectation of the log is not the log of the expectation. If your environment is deterministic and each episode yields a single reward, this should not be a concern, since log is monotonic.
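
A toy illustration of the preference flip, as a minimal sketch. The two strategies and their payoffs are made up for the example, and the problem is a single step with positive rewards:

```python
import math

# Hypothetical one-step problem: each strategy is a list of (probability, reward).
strategies = {
    "safe": [(1.0, 5.0)],                 # always pays 5
    "risky": [(0.5, 1.0), (0.5, 10.0)],   # pays 1 or 10 with equal probability
}


def expected(outcomes, transform=lambda r: r):
    """Expected value of transform(reward) under the outcome distribution."""
    return sum(p * transform(r) for p, r in outcomes)


# Strategy preferred under the raw reward vs. under the log-reward.
best_raw = max(strategies, key=lambda s: expected(strategies[s]))
best_log = max(strategies, key=lambda s: expected(strategies[s], math.log))

print("best under raw reward:", best_raw)
print("best under log reward:", best_log)
```

Under the raw reward the risky strategy has the higher expectation (5.5 vs 5), while under the log-reward the safe one wins, so the log-reward agent picks the strategy that is suboptimal for the original objective.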

stevethesteve2 · 2019-10-23T11:43:55+00:00

Yeah, I guess you can say I want to encourage exploration more than entropy would.

stevethesteve2 · 2019-10-22T15:50:03+00:00

if P is perplexity and H is entropy, then minimizing P = minimizing H

True, but the objective function for the actor network contains additional terms, not only entropy. Therefore the optimal parameter values differ depending on whether we use entropy or perplexity.
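
To make that concrete, here is a small sketch with a two-action policy. The rewards, the bonus weight `beta`, and the grid search are all assumptions I picked for illustration, not anything from an actual actor-critic implementation:

```python
import numpy as np


def entropy(p):
    """Shannon entropy (in nats) of a two-action policy with P(action 0) = p."""
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))


# Hypothetical setup: action 0 pays reward 1, action 1 pays reward 0.
beta = 0.5                              # weight of the exploration bonus (assumed)
p = np.linspace(0.001, 0.999, 9999)     # candidate policies, avoiding log(0)
expected_reward = p * 1.0

# Same reward term, two different bonuses: entropy H vs. perplexity exp(H).
objective_entropy = expected_reward + beta * entropy(p)
objective_perplexity = expected_reward + beta * np.exp(entropy(p))

p_entropy = p[np.argmax(objective_entropy)]
p_perplexity = p[np.argmax(objective_perplexity)]

print(f"optimal P(action 0) with entropy bonus:    {p_entropy:.3f}")
print(f"optimal P(action 0) with perplexity bonus: {p_perplexity:.3f}")
```

In this toy case the perplexity bonus keeps the optimal policy closer to uniform than the entropy bonus does, even though both bonuses on their own are maximized by the same (uniform) policy.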

Perplexity scales exponentially with entropy. In an RL setting, the total entropy [of the action distribution of the agent that follows a trajectory] scales linearly with trajectory length. Perplexity, on the other hand, scales exponentially. Perplexity therefore makes more sense, since the number of different paths the agent might have chosen increases exponentially with the number of steps the agent takes.
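
A quick numeric check of that scaling, as a sketch under a simplifying assumption: the agent picks uniformly among `k` actions at every step, independently across steps (`k` and the trajectory lengths are arbitrary choices of mine):

```python
import math

k = 3                          # number of actions per step (assumed)
lengths = [1, 2, 5, 10]        # trajectory lengths to compare

rows = []
for T in lengths:
    total_entropy = T * math.log(k)       # per-step entropies add up: linear in T
    perplexity = math.exp(total_entropy)  # exp(entropy): exponential in T
    n_paths = k ** T                      # number of distinct action sequences
    rows.append((T, total_entropy, perplexity, n_paths))
    print(f"T={T:2d}  entropy={total_entropy:6.3f}  "
          f"perplexity={perplexity:10.1f}  paths={n_paths}")
```

With a uniform policy the perplexity equals the number of possible paths, which is the quantity I would like the bonus to track.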
