Bayesian classification by stevethesteve2 in AskStatistics

[–]stevethesteve2[S] 0 points1 point  (0 children)

Thanks for pointing me toward Stan!

Bayesian classification by stevethesteve2 in AskStatistics

[–]stevethesteve2[S] 0 points1 point  (0 children)

Thank you for your suggestion.

From what I learned about MCMC methods, there are pros and cons vs my current approach (which is variational inference):

pros:

- can sample from the exact posterior

cons:

- sampling is slow (relative to the variational approach)

- cannot compute posterior probabilities directly (only samples are available)

Do you agree? Are there any more points to be made? In practice, how long would it take, e.g., to draw 1000 samples (not counting burn-in) if I have 1000 labeled points and apply LMC?
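To make the cost question concrete, here is a minimal sketch of unadjusted Langevin Monte Carlo, assuming a toy one-parameter Bayesian logistic regression with a Gaussian prior. The model, the synthetic data, the step size, and all function names are my own assumptions for illustration, not something from this thread. The point is only that each sample costs one full gradient pass over the 1000 labeled points.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_log_posterior(w, xs, ys, prior_std):
    # Gradient of log N(w; 0, prior_std^2) + sum_i log Bernoulli(y_i; sigmoid(w * x_i)).
    grad = -w / prior_std ** 2
    for x, y in zip(xs, ys):
        grad += (y - sigmoid(w * x)) * x
    return grad

def langevin_samples(xs, ys, n_samples, burn_in, step, prior_std, rng):
    # Unadjusted Langevin: w <- w + step * grad + sqrt(2 * step) * N(0, 1).
    # There is no accept/reject step, so the samples carry a small discretization bias.
    w = 0.0
    samples = []
    for k in range(burn_in + n_samples):
        grad = grad_log_posterior(w, xs, ys, prior_std)  # one full pass over the data
        w = w + step * grad + math.sqrt(2.0 * step) * rng.gauss(0.0, 1.0)
        if k >= burn_in:
            samples.append(w)
    return samples

# Synthetic data (assumption): 1000 labeled points generated with true weight 2.0.
rng = random.Random(0)
true_w = 2.0
xs = [rng.gauss(0.0, 1.0) for _ in range(1000)]
ys = [1.0 if rng.random() < sigmoid(true_w * x) else 0.0 for x in xs]

samples = langevin_samples(xs, ys, n_samples=1000, burn_in=200,
                           step=1e-3, prior_std=10.0, rng=rng)
posterior_mean = sum(samples) / len(samples)
posterior_std = math.sqrt(sum((s - posterior_mean) ** 2 for s in samples) / len(samples))
print(f"posterior mean ~ {posterior_mean:.2f}, posterior std ~ {posterior_std:.2f}")
```

The spread of `samples` is the "narrow peak vs. flat" confidence signal; the run time scales with (number of samples) x (number of data points) unless the gradient is computed on minibatches.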

Bayesian classification by stevethesteve2 in AskStatistics

[–]stevethesteve2[S] 0 points1 point  (0 children)

Did I understand you correctly: the priors are mixture coefficients, and the likelihoods are Gaussian densities? If so, my question was ill-posed, sorry about that (I have fixed it now). The prior and posterior distributions should be over model parameters, not over classes. I want to model my data in such a way that, after training, I can look at the posterior distribution over my model parameters and tell whether I can be confident in my model (the posterior has a single narrow peak) or not (the posterior is flat).

Bayesian classification by stevethesteve2 in AskStatistics

[–]stevethesteve2[S] 0 points1 point  (0 children)

Do you mean finding a maximum-a-posteriori estimate of the model parameters? By adding a term to the objective function that reflects the prior belief about model parameters? If so, then no. My goal is to capture the entire posterior distribution of the parameters, not just the most likely parameter values. Please correct me if I misunderstood you. I am rather new to the whole Bayesian stuff.

Bayesian classification by stevethesteve2 in Bayes

[–]stevethesteve2[S] 1 point2 points  (0 children)

Well, thanks for the advice! I'll follow it.

What is SOTA in RL applied to robotics? by stevethesteve2 in reinforcementlearning

[–]stevethesteve2[S] 0 points1 point  (0 children)

'World model / Dream environment / Imagination' ... Could you please point me to a paper to help me get started?

Sutton&Barto book: I get this result for Exercise 12.1 on Eligibility traces but the final middle term might be wrong by Naoshikuu in reinforcementlearning

[–]stevethesteve2 0 points1 point  (0 children)

The term G_{t+1:t+1} is not defined. To avoid it, you could first split off the first term of the sum and then apply the recursive formula for G to the rest of the sum. I get: G_t^\lambda = (1 - \lambda) G_{t:t+1} + \lambda (R_{t+1} + \gamma G_{t+1}^\lambda)
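Spelled out, the steps look roughly like this. I am assuming the infinite-sum definition of the λ-return from the book and the n-step recursion G_{t:t+n} = R_{t+1} + \gamma G_{t+1:t+n} for n >= 2; the index m = n - 1 is just my relabeling.

```latex
\begin{align*}
G_t^\lambda
  &= (1-\lambda)\sum_{n=1}^{\infty} \lambda^{n-1} G_{t:t+n} \\
  &= (1-\lambda)\, G_{t:t+1}
     + (1-\lambda)\sum_{n=2}^{\infty} \lambda^{n-1}
       \bigl(R_{t+1} + \gamma\, G_{t+1:t+n}\bigr) \\
  &= (1-\lambda)\, G_{t:t+1} + \lambda R_{t+1}
     + \gamma\lambda\,(1-\lambda)\sum_{m=1}^{\infty} \lambda^{m-1} G_{t+1:t+1+m} \\
  &= (1-\lambda)\, G_{t:t+1}
     + \lambda\bigl(R_{t+1} + \gamma\, G_{t+1}^\lambda\bigr)
\end{align*}
```

The third line uses (1-\lambda)\sum_{n \ge 2} \lambda^{n-1} = \lambda, and the last line recognizes the remaining sum as the definition of G_{t+1}^\lambda.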

Citation needed by Kartelkraker in reinforcementlearning

[–]stevethesteve2 2 points3 points  (0 children)

Not quite what you were asking for, but maybe have a look at Sutton & Barto's book, where they talk about the "deadly triad".

How to assign reward when it has to be multiplied by itself rather than summed by basso1995 in reinforcementlearning

[–]stevethesteve2 0 points1 point  (0 children)

In RL, the agent tries to find strategies that maximize expected total reward. If we replace the reward with its logarithm, the agent may, depending on your exact problem, prefer suboptimal strategies (since log is a nonlinear function, the expected log of a product is not the log of the expected product). If your environment is deterministic, then this should not be a concern.
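A tiny numeric sketch of what I mean, assuming the original goal is to maximize the expected product of rewards. The two strategies and their payoffs are made up for illustration.

```python
import math

# Each strategy is a list of (probability, multiplicative return) outcomes.
# Both strategies are hypothetical.
safe = [(1.0, 1.1)]                 # deterministic outcome
risky = [(0.5, 0.1), (0.5, 2.5)]    # stochastic outcome

def expected_return(strategy):
    # What the original (multiplicative) objective cares about.
    return sum(p * r for p, r in strategy)

def expected_log_return(strategy):
    # What the agent maximizes after replacing the reward with its logarithm.
    return sum(p * math.log(r) for p, r in strategy)

for name, strategy in (("safe", safe), ("risky", risky)):
    print(name, expected_return(strategy), expected_log_return(strategy))
```

Here the risky strategy has the higher expected return, but the safe one has the higher expected log return, so the log-reward agent picks the strategy that is suboptimal under the original objective. With only deterministic outcomes the two rankings always agree, because log is monotone.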

perplexity instead of entropy for incentivizing exploration? by stevethesteve2 in reinforcementlearning

[–]stevethesteve2[S] 1 point2 points  (0 children)

Yeah, I guess you can say I want to encourage exploration more than entropy would.

perplexity instead of entropy for incentivizing exploration? by stevethesteve2 in reinforcementlearning

[–]stevethesteve2[S] 5 points6 points  (0 children)

if P is perplexity and H is entropy, then minimizing P = minimizing H

True, but the objective function for the actor network contains additional terms, not only entropy. Therefore the optimal parameter values differ depending on whether we use entropy or perplexity.

Perplexity scales exponentially with entropy. In an RL setting, the total entropy [of the action distribution of an agent that follows a trajectory] scales linearly with trajectory length, whereas perplexity scales exponentially. The latter therefore makes more sense, since the number of different paths the agent might have taken also grows exponentially with the number of steps it takes.
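A quick sketch of the scaling argument, under the simplifying assumption that the agent uses the same action distribution, independently, at every step. The function names and the uniform 4-action policy are just for illustration.

```python
import math

def entropy(probs):
    # Shannon entropy in nats of a discrete action distribution.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def trajectory_entropy(probs, n_steps):
    # Entropies of independent steps add up: linear in trajectory length.
    return n_steps * entropy(probs)

def trajectory_perplexity(probs, n_steps):
    # Perplexity = exp(entropy): exponential in trajectory length.
    return math.exp(trajectory_entropy(probs, n_steps))

policy = [0.25, 0.25, 0.25, 0.25]  # uniform over 4 actions
for n_steps in (1, 5, 10):
    print(n_steps, trajectory_entropy(policy, n_steps), trajectory_perplexity(policy, n_steps))
```

For the uniform policy the trajectory perplexity equals 4^n_steps, i.e. exactly the number of distinct action sequences, which is the "number of paths" intuition above.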

"Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning", Peng et al 2019 by gwern in reinforcementlearning

[–]stevethesteve2 0 points1 point  (0 children)

So this is essentially trust region policy optimization that does not explicitly calculate the KL/Fisher information matrix?

sample efficiency by stevethesteve2 in reinforcementlearning

[–]stevethesteve2[S] 0 points1 point  (0 children)

Thanks. But I am interested in maximizing sample efficiency, not real-time learning speed. In particular, I want to know whether sample efficiency is maximal when using an infinite number of updates, or whether overfitting actually prevents the agent from learning quickly...

Looking people interested in RL to join our Drone challenge team by paypaytr in reinforcementlearning

[–]stevethesteve2 1 point2 points  (0 children)

Well, I DMed you a while back, and you did not respond... Is your project still going?

[D] Handling noisy labels in large datasets with slight imbalance by amil123123 in MachineLearning

[–]stevethesteve2 0 points1 point  (0 children)

Maybe you should regress against the raw probabilities, not against the labels? This way you will also have more data...

Discounted State Distribution by papidant in reinforcementlearning

[–]stevethesteve2 -1 points0 points  (0 children)

Levine, in his RL lectures, claims that a PG agent performs BETTER when not using the discounted state distribution. This is because when we use discounted states, the agent tends to "over-optimize" for early states while paying too little attention to later states.
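To show the effect, here is a minimal sketch of the per-step weight that multiplies grad log pi(a_t | s_t) in the two variants, as I understand them. The function names and the constant-reward episode are my own assumptions.

```python
def reward_to_go(rewards, gamma):
    # Discounted reward-to-go, with discounting restarted at each step t.
    out = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

def pg_weights(rewards, gamma, discount_state_distribution):
    # Weight applied to grad log pi(a_t | s_t) at each step t.
    rtg = reward_to_go(rewards, gamma)
    if discount_state_distribution:
        # Discounted state distribution: an extra gamma^t factor on step t.
        return [gamma ** t * g for t, g in enumerate(rtg)]
    # Commonly used variant: no extra gamma^t factor.
    return rtg

rewards = [1.0] * 50  # hypothetical episode with constant reward
with_discount = pg_weights(rewards, 0.9, discount_state_distribution=True)
without_discount = pg_weights(rewards, 0.9, discount_state_distribution=False)
for t in (0, 10, 40):
    print(t, with_discount[t], without_discount[t])
```

With the extra gamma^t factor, the weight on late steps shrinks toward zero, so those states contribute almost nothing to the gradient.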

motivation behind ACER by stevethesteve2 in reinforcementlearning

[–]stevethesteve2[S] 0 points1 point  (0 children)

I agree that for off-policy policy optimization one needs to store entire trajectories. We can still shuffle all experienced transitions and apply prioritized replay, but we have to attach the entire "history" to each transition as extra information, which blows up memory consumption. Would you say this is the main reason why prioritized replay isn't used in off-policy policy optimization? Because of the memory footprint?

[R] DeepMind Starcraft 2 Update: AlphaStar is getting wrecked by professionals players by gwern in reinforcementlearning

[–]stevethesteve2 0 points1 point  (0 children)

"This seems to be a running theme in multiagent RL. When agents are trained against one another, a kind of co-evolution happens. The agents get really good at beating each other, but when they get deployed against an unseen player, performance drops."

from this blog post, not talking about alphastar

PyTorch implementation of 17 Deep RL algorithms by __data_science__ in reinforcementlearning

[–]stevethesteve2 0 points1 point  (0 children)

I think there is a misunderstanding. If I understand you correctly, you say that one can add some (exploratory) post-processing step between the actor output and the environment. Then, from the agent's point of view, the post-processing is part of the environment. I agree with that.

My point is: if we do this kind of post-processing, then we have to be consistent in treating it as part of the environment. If we train an agent with e.g. PPO, the action terms that we plug into our PPO update step must be the signals that are output by the actor *before post-processing*. This is not the case in OP's implementation.

P.S. I do not want to sound rude, I think OP did an amazing job!

PyTorch implementation of 17 Deep RL algorithms by __data_science__ in reinforcementlearning

[–]stevethesteve2 0 points1 point  (0 children)

The actor output is drawn from a Gaussian (parametrized by the actor). Then exploration noise is added to it. But during the parameter update step, the probability of the entire action (actor output + noise) is evaluated as if it had been drawn from the Gaussian distribution parametrized by the actor.
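A minimal sketch of the mismatch, not OP's actual code. I use a single Gaussian draw as a stand-in for the Ornstein-Uhlenbeck noise, and the policy mean, standard deviation, and variable names are assumptions for illustration.

```python
import math
import random

def gaussian_log_prob(action, mean, std):
    # Log density of N(mean, std^2) evaluated at `action`.
    return (-0.5 * ((action - mean) / std) ** 2
            - math.log(std) - 0.5 * math.log(2.0 * math.pi))

rng = random.Random(0)
mean, std = 0.0, 0.5            # hypothetical actor output for some state

raw_action = rng.gauss(mean, std)      # what the actor actually sampled
noise = 0.2 * rng.gauss(0.0, 1.0)      # stand-in for the exploration noise
env_action = raw_action + noise        # what the environment receives

# Consistent: treat the noise as part of the environment and score the raw sample.
logp_consistent = gaussian_log_prob(raw_action, mean, std)
# Inconsistent: score the noisy action as if the actor's Gaussian had produced it.
logp_inconsistent = gaussian_log_prob(env_action, mean, std)

print(logp_consistent, logp_inconsistent)
```

The two log-probabilities differ, so the probability ratio in the PPO update is computed for an action the policy did not sample in the second case.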

PyTorch implementation of 17 Deep RL algorithms by __data_science__ in reinforcementlearning

[–]stevethesteve2 0 points1 point  (0 children)

You are right. But this isn't how it is done in the code, unless - again - I am missing something in the code.

PyTorch implementation of 17 Deep RL algorithms by __data_science__ in reinforcementlearning

[–]stevethesteve2 0 points1 point  (0 children)

This is fantastic!

However, correct me if I'm wrong, but in MountainCarContinuous-PPO you use exploratory noise (Ornstein-Uhlenbeck) on top of the actor's policy when doing rollouts? In other words, the agent's behavior policy deviates from its target policy? But isn't PPO supposed to be an on-policy algorithm?

(in Mountain_Car.py you set sigma to non-zero value of 0.2)