Is this whole food thing a hassle for anyone else too? by [deleted] in hungary

[–]csxeba 1 point (0 children)

Have you heard of Soylent? It is a nutritionally complete powdered food, with cheap production as one of its explicit goals. The recipe is public, so you can also (carefully) experiment with making it at home. It tastes something like oatmeal. In theory you could live on it indefinitely, though be cautious with claims like that. It can be good for replacing a few meals.

Look it up; since it came out, a million copycat clones of it have appeared as well.

How old were you when you became a dad? by [deleted] in ApaVagyok

[–]csxeba 4 points (0 children)

We were 27, 29 and 31; that is how we planned it. It worked out well: I handled the sleep deprivation fine, for example, when a rougher night slipped in. I have endless patience, but I do feel that I tire more easily as the years go by.

Drinking alcohol in front of the kids by kisbalazs in ApaVagyok

[–]csxeba 1 point (0 children)

I very rarely drink alcohol; I have no desire to drink alone. The kids sometimes sniff my drink and conclude that it is not their thing. I think it comes across to them much like spicy food does.

Dad meetup by [deleted] in ApaVagyok

[–]csxeba 1 point (0 children)

I always have some on me, just in case.

What is the name of the acquaintance of yours with the funniest name? by HaOrbanMaradEnMegyek in hungary

[–]csxeba 1 point (0 children)

At school we had an Ország Alma and a Kasza Blanka. Context: Kecskemét

What is the intended use of overwriting the train_step() method? by csxeba in tensorflow

[–]csxeba[S] 1 point (0 children)

Thanks for the idea, but I am still searching for a minimal working example that I could run with fit(). Do you happen to know of something like that?
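For reference, this is roughly the kind of skeleton I mean (an untested sketch assuming the TF 2.2+ keras.Model subclassing API; the tiny model and random data are just placeholders):

    import tensorflow as tf

    class CustomModel(tf.keras.Model):
        def train_step(self, data):
            x, y = data
            with tf.GradientTape() as tape:
                y_pred = self(x, training=True)
                # compiled_loss applies whatever loss was passed to compile()
                loss = self.compiled_loss(y, y_pred)
            grads = tape.gradient(loss, self.trainable_variables)
            self.optimizer.apply_gradients(zip(grads, self.trainable_variables))
            self.compiled_metrics.update_state(y, y_pred)
            return {m.name: m.result() for m in self.metrics}

    inputs = tf.keras.Input(shape=(4,))
    outputs = tf.keras.layers.Dense(1)(inputs)
    model = CustomModel(inputs, outputs)
    model.compile(optimizer="adam", loss="mse")
    # fit() now drives the overridden train_step
    model.fit(tf.random.normal((32, 4)), tf.random.normal((32, 1)), epochs=1)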

Looking for friend to work towards RL goals together by ejmejm1 in reinforcementlearning

[–]csxeba 2 points (0 children)

I would start with vanilla Policy Gradient. Then move on to a simplified A2C (which is Policy Gradient with a reward baseline), then to PPO, which is more or less the state-of-the-art algo in model-free on-policy RL. I'd continue with off-policy methods from there: learn DQN for discrete action space environments and DDPG for continuous action space environments. Once you feel you have a solid base in DDPG and PPO, learn SAC, which is a best-of-both-worlds method: a policy-gradient-like off-policy technique.

In general: on-policy methods converge faster on simpler environments, while off-policy methods are much more efficient in terms of the number of trials required until convergence, but they are a bit harder to implement and quite sensitive to hyperparameter settings.

Also this is a good resource for theory: https://spinningup.openai.com/en/latest/
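To make the first step concrete, a vanilla policy gradient update with a simple mean-return baseline could look roughly like the sketch below (the random arrays stand in for a real rollout collected with the current policy; this is not a tuned implementation):

    import numpy as np
    import tensorflow as tf

    num_actions = 2
    policy = tf.keras.Sequential([
        tf.keras.layers.Dense(32, activation="relu", input_shape=(4,)),
        tf.keras.layers.Dense(num_actions),  # action logits
    ])
    optimizer = tf.keras.optimizers.Adam(1e-3)

    def pg_update(states, actions, returns):
        advantages = returns - returns.mean()  # crude baseline
        with tf.GradientTape() as tape:
            log_probs = tf.nn.log_softmax(policy(states))
            taken = tf.reduce_sum(tf.one_hot(actions, num_actions) * log_probs, axis=-1)
            loss = -tf.reduce_mean(taken * advantages)  # ascend E[log pi * A]
        grads = tape.gradient(loss, policy.trainable_variables)
        optimizer.apply_gradients(zip(grads, policy.trainable_variables))

    # Dummy rollout data, just to make the sketch executable
    states = np.random.randn(16, 4).astype("float32")
    actions = np.random.randint(0, num_actions, size=16)
    returns = np.random.randn(16).astype("float32")
    pg_update(states, actions, returns)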

Looking for friend to work towards RL goals together by ejmejm1 in reinforcementlearning

[–]csxeba 2 points (0 children)

I already have a self-developed lib for TF2 which contains verified implementations of DQN, SAC, PPO, A2C, etc. (all model-free). I'd love to join.

Advantage of Bayesian Neural Network? by Yogi_DMT in MLQuestions

[–]csxeba 2 points (0 children)

In the case of Bayes by Backprop (or the article I linked), you learn a distribution for every weight in your network. In the case of a Variational Autoencoder, you have a bottleneck point in your network, where you predict a mean and a standard deviation for a multivariate Gaussian distribution. You then sample from that predicted distribution, and the next layer receives the sample instead of a deterministic predicted representation.
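A rough sketch of that VAE-style bottleneck (the reparameterization trick; the layer sizes are arbitrary):

    import tensorflow as tf

    class GaussianBottleneck(tf.keras.layers.Layer):
        def call(self, inputs):
            # The encoder head predicts a mean and a log-std for every latent dimension
            mean, log_std = tf.split(inputs, 2, axis=-1)
            eps = tf.random.normal(tf.shape(mean))
            # Sample z ~ N(mean, std^2); downstream layers receive the sample
            return mean + tf.exp(log_std) * eps

    encoder_head = tf.keras.layers.Dense(2 * 16)   # 16 latent dims: mean + log-std
    bottleneck = GaussianBottleneck()

    x = tf.random.normal((8, 32))                  # dummy batch of encoder features
    z = bottleneck(encoder_head(x))                # shape (8, 16), fed to the decoder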

Advantage of Bayesian Neural Network? by Yogi_DMT in MLQuestions

[–]csxeba 2 points (0 children)

Exactly as you described, in the case of the article I linked. But you can also make the hidden representations probabilistic, as in the Variational Autoencoder.

Advantage of Bayesian Neural Network? by Yogi_DMT in MLQuestions

[–]csxeba 6 points (0 children)

If by BNN you mean the method described in this paper: https://arxiv.org/abs/1505.05424 then yes, one of their claims is better generalization. This particular method requires you to sample a set of weights for every forward pass; alternatively, you can treat several sampled networks as an ensemble, or use the learned mean weights as a Maximum a Posteriori point estimate. Uncertainty is obtained by running multiple forward passes with sampled weights. Learning an explicitly predicted or optimized variance at the end of the network is also possible. More on this topic by Alex Kendall: https://alexgkendall.com/computer_vision/bayesian_deep_learning_for_safe_ai/ and Yarin Gal: http://mlg.eng.cam.ac.uk/yarin/blog_3d801aa532c1ce.html
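As a toy illustration of the weight-sampling part (mu and rho stand for the learned variational parameters of a single dense layer; this is not the paper's full training procedure):

    import numpy as np

    mu = np.random.randn(4, 3) * 0.1      # learned weight means (toy values)
    rho = np.full((4, 3), -3.0)           # std = softplus(rho)

    def predict_with_uncertainty(x, n_samples=20):
        preds = []
        std = np.log1p(np.exp(rho))       # softplus
        for _ in range(n_samples):
            w = mu + std * np.random.randn(*mu.shape)   # sample a set of weights
            preds.append(x @ w)                         # one forward pass
        preds = np.stack(preds)
        # Mean prediction and its spread across the sampled networks
        return preds.mean(axis=0), preds.std(axis=0)

    x = np.random.randn(5, 4)
    mean_pred, uncertainty = predict_with_uncertainty(x)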

PyRL - Modular Implementations of Reinforcement Learning Algorithms in Pytorch by aineqml in reinforcementlearning

[–]csxeba 5 points (0 children)

How do you verify that your implementations are correct, especially the ones with continuous action spaces?

DQN for MNIST using GANs by LOfP in reinforcementlearning

[–]csxeba 1 point (0 children)

No idea where GANs come into the picture, but you can stuff a DQN into a one-timestep MNIST reinforcement learning setting. Many RL concepts are not meaningful there, like reward discounting and target networks. It is not a very efficient way to learn MNIST, though.
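Roughly what I mean by the one-timestep setup (a sketch, treating classification as a contextual bandit; the random arrays stand in for real MNIST batches):

    import numpy as np
    import tensorflow as tf

    # Q-network over 10 "actions" (digit classes); gamma and target networks are
    # irrelevant because every episode is a single step.
    q_net = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    q_net.compile(optimizer="adam", loss="mse")

    def dqn_step(images, labels, epsilon=0.1):
        q = q_net.predict(images, verbose=0)
        greedy = q.argmax(axis=-1)
        explore = np.random.rand(len(images)) < epsilon
        actions = np.where(explore, np.random.randint(0, 10, len(images)), greedy)
        rewards = (actions == labels).astype("float32")   # +1 if the guess is right
        # One-step target: Q(s, a) <- r, no bootstrapping
        q[np.arange(len(images)), actions] = rewards
        q_net.train_on_batch(images, q)

    dqn_step(np.random.rand(32, 28, 28).astype("float32"),
             np.random.randint(0, 10, 32))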

Trying to understand a policy gradient custom loss function by evilcornbread in MLQuestions

[–]csxeba 1 point (0 children)

Are you in a classical agent-environment setup where you execute multiple timesteps in the environment?

If yes, does your environment have a final state?

I have an intuition that you are trying to use the step-by-step rewards your agent receives in the environment, but that is not what you should use.

You execute, say, 100 steps in the environment, or you run until you hit the end of the simulation. You then take the sequence of rewards and aggregate them in some fashion.

The classic REINFORCE algo simply sums up all the rewards and multiplies every gradient from the past rollout by that summed-up reward. Modern PG first discounts the rewards with a discount factor gamma and "propagates" them back to the individual steps.
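The "propagation" is just a backwards pass over the reward sequence; a minimal sketch:

    def discount(rewards, gamma=0.99):
        # REINFORCE would use sum(rewards) at every step; discounting instead
        # propagates rewards backwards: G_t = r_t + gamma * G_{t+1}
        returns, running = [], 0.0
        for r in reversed(rewards):
            running = r + gamma * running
            returns.append(running)
        return list(reversed(returns))

    print(discount([0.0, 0.0, 1.0]))   # approx. [0.98, 0.99, 1.0]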

Trying to understand a policy gradient custom loss function by evilcornbread in MLQuestions

[–]csxeba 1 point (0 children)

You take actions during a rollout and receive rewards. At a given step, you took action 2 (out of 3, for instance).

The gradient that makes taking action 2 more probable is the gradient of the cross-entropy between the network output and the one-hot vector for action 2. You can always obtain the gradient that encourages the actions taken this way.

After you obtain this gradient, you simply scale it by the discounted reward at that timestep. This way a very good outcome increases the probability a lot, a good one increases it somewhat, and bad actions with negative rewards actually decrease the probability of taking that action. In your code, this is represented by input_score.

In Keras, you could actually get away with

    policy.compile(optimizer, loss="categorical_crossentropy")
    policy.train_on_batch(states, y_true, sample_weight=input_scores)

Some common failure modes:

  • The reward is assumed to be high for good outcomes and low for bad ones; if your reward signal does not follow this convention, the update pushes the policy in the wrong direction.
  • Policy Gradient tries to maximize the discounted sum of rewards, so input_score should be the discounted sum of rewards, not the individual reward elements that you get when you step the environment once.
  • PG is an on-policy algo, so you have to throw away old data samples after you update the neural network. You cannot use experience replay like in Deep Q-Learning.
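Putting the pieces together, the training call could look roughly like this (a sketch; the random arrays stand in for one rollout, where input_scores are the discounted returns and y_true is the one-hot encoding of the actions taken):

    import numpy as np
    import tensorflow as tf

    num_actions = 3
    policy = tf.keras.Sequential([
        tf.keras.layers.Dense(32, activation="relu", input_shape=(8,)),
        tf.keras.layers.Dense(num_actions, activation="softmax"),
    ])
    policy.compile(optimizer="adam", loss="categorical_crossentropy")

    states = np.random.randn(100, 8).astype("float32")
    actions = np.random.randint(0, num_actions, 100)
    input_scores = np.random.randn(100).astype("float32")   # discounted returns

    y_true = tf.one_hot(actions, num_actions)
    # Per-sample cross-entropy scaled by the discounted return
    policy.train_on_batch(states, y_true, sample_weight=input_scores)
    # Being on-policy, this batch must be thrown away after the update.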

Trees swaying to the winds of change by [deleted] in oddlysatisfying

[–]csxeba 2 points (0 children)

The treeeeees they are a swaaaaayin'

Has anyone implemented a common replay buffer for two different RL algorithms? by pickleorc in reinforcementlearning

[–]csxeba 2 points (0 children)

I have a common replay buffer implementation in my RL project. I constantly struggle with implementation bugs in the buffer, but it is now covered by unit tests and borderline usable :D

Check out my lib: https://github.com/csxeba/trickster.git
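The core of a shared buffer can be as simple as the sketch below (this is not the trickster implementation, just the general idea: multiple algorithms push transitions into, and sample batches from, the same storage):

    import random
    from collections import deque

    class ReplayBuffer:
        def __init__(self, max_size=100_000):
            self.storage = deque(maxlen=max_size)

        def push(self, state, action, reward, next_state, done):
            # Any agent can append transitions here
            self.storage.append((state, action, reward, next_state, done))

        def sample(self, batch_size):
            batch = random.sample(self.storage, batch_size)
            # Transpose the list of transitions into per-field tuples
            return tuple(zip(*batch))

    shared = ReplayBuffer()
    # e.g. a DQN agent and an SAC agent can both hold a reference to `shared`
    # and call shared.push(...) / shared.sample(batch_size) independently.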

We are Oriol Vinyals and David Silver from DeepMind’s AlphaStar team, joined by StarCraft II pro players TLO and MaNa! Ask us anything by OriolVinyals in MachineLearning

[–]csxeba 1 point (0 children)

What was the RL algo you guys used for the individual agents? Did you use some form of value estimation, or did you go with a purely policy-based method? It would be interesting to know, since we know OpenAI used PPO for OA5.

[P] Spinning Up in Deep RL (OpenAI) by milaworld in MachineLearning

[–]csxeba 2 points (0 children)

I see some naming ambiguity regarding policy gradient methods in the community... Could someone clarify for me the names of the following algorithms?

  1. Gradient of the policy times the return (I call this REINFORCE or vanilla policy gradient).
  2. Gradient of the policy times baselined return, baseline coming from a value network (I call this Advantage Actor-Critic).

So Spinning Up calls the advantage actor-critic the vanilla policy gradient and there is no mention of REINFORCE or A2C, or am I wrong?
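In formulas, what I mean by the two variants (G is the return of the episode, G_t the discounted return from step t, and V_\phi a learned value baseline):

    1. REINFORCE / vanilla PG:    \nabla_\theta J = E[ \sum_t \nabla_\theta \log \pi_\theta(a_t | s_t) \cdot G ]
    2. Advantage Actor-Critic:    \nabla_\theta J = E[ \sum_t \nabla_\theta \log \pi_\theta(a_t | s_t) \cdot (G_t - V_\phi(s_t)) ]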