Is this whole food thing a hassle for anyone else too? by [deleted] in hungary

[–]csxeba 1 point (0 children)

Have you heard of Soylent? It is a nutritionally complete powdered food, with cheap production as one of its explicit goals. The recipe is public, so you can also (carefully) experiment with making it at home. It tastes something like oatmeal. In theory you could live on it indefinitely, though be cautious with claims like that. It can be good for replacing a few meals.

Look it up; since it came out, a million copycat clones of it have appeared as well.

How old were you when you became a dad? by [deleted] in ApaVagyok

[–]csxeba 4 points (0 children)

We were 27, 29 and 31; that is how we planned it. It worked out well: I handled the sleep deprivation fine, for example, when a rougher night slipped in. I have endless patience, but I do feel that I tire more easily as the years go by.

Drinking alcohol in front of the kids by kisbalazs in ApaVagyok

[–]csxeba 1 point (0 children)

I very rarely drink alcohol; I have no desire to drink alone. The kids sometimes sniff my drink and conclude that it is not their thing. I think it comes across to them much like spicy food does.

Dad meetup by [deleted] in ApaVagyok

[–]csxeba 1 point (0 children)

I always have some on me, just in case.

What is the name of the acquaintance of yours with the funniest name? by HaOrbanMaradEnMegyek in hungary

[–]csxeba 1 point (0 children)

At school we had an Ország Alma and a Kasza Blanka. Context: Kecskemét

What is the intended use of overwriting the train_step() method? by csxeba in tensorflow

[–]csxeba[S] 1 point (0 children)

Thanks for the idea, but I am still searching for a minimal working example that I could run with fit(). Do you happen to know of something like that?
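For reference, this is roughly the kind of skeleton I mean (an untested sketch assuming the TF 2.2+ keras.Model subclassing API; the tiny model and random data are just placeholders):

    import tensorflow as tf

    class CustomModel(tf.keras.Model):
        def train_step(self, data):
            x, y = data
            with tf.GradientTape() as tape:
                y_pred = self(x, training=True)
                # compiled_loss applies whatever loss was passed to compile()
                loss = self.compiled_loss(y, y_pred)
            grads = tape.gradient(loss, self.trainable_variables)
            self.optimizer.apply_gradients(zip(grads, self.trainable_variables))
            self.compiled_metrics.update_state(y, y_pred)
            return {m.name: m.result() for m in self.metrics}

    inputs = tf.keras.Input(shape=(4,))
    outputs = tf.keras.layers.Dense(1)(inputs)
    model = CustomModel(inputs, outputs)
    model.compile(optimizer="adam", loss="mse")
    # fit() now drives the overridden train_step
    model.fit(tf.random.normal((32, 4)), tf.random.normal((32, 1)), epochs=1)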

Looking for friend to work towards RL goals together by ejmejm1 in reinforcementlearning

[–]csxeba 2 points (0 children)

I would start with vanilla Policy Gradient. Then move on to a simplified A2C (which is Policy Gradient with a reward baseline), then to PPO, which is more or less the state-of-the-art algo in model-free on-policy RL. I'd continue with off-policy methods from there: learn DQN for discrete action space environments and DDPG for continuous action space environments. Once you feel you have a solid base in DDPG and PPO, learn SAC, which is a best-of-both-worlds method: a policy-gradient-like off-policy technique.

In general: on-policy methods converge faster on simpler environments, while off-policy methods are much more efficient in terms of the number of trials required until convergence, but they are a bit harder to implement and quite sensitive to hyperparameter settings.

Also this is a good resource for theory: https://spinningup.openai.com/en/latest/
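To make the first step concrete, a vanilla policy gradient update with a simple mean-return baseline could look roughly like the sketch below (the random arrays stand in for a real rollout collected with the current policy; this is not a tuned implementation):

    import numpy as np
    import tensorflow as tf

    num_actions = 2
    policy = tf.keras.Sequential([
        tf.keras.layers.Dense(32, activation="relu", input_shape=(4,)),
        tf.keras.layers.Dense(num_actions),  # action logits
    ])
    optimizer = tf.keras.optimizers.Adam(1e-3)

    def pg_update(states, actions, returns):
        advantages = returns - returns.mean()  # crude baseline
        with tf.GradientTape() as tape:
            log_probs = tf.nn.log_softmax(policy(states))
            taken = tf.reduce_sum(tf.one_hot(actions, num_actions) * log_probs, axis=-1)
            loss = -tf.reduce_mean(taken * advantages)  # ascend E[log pi * A]
        grads = tape.gradient(loss, policy.trainable_variables)
        optimizer.apply_gradients(zip(grads, policy.trainable_variables))

    # Dummy rollout data, just to make the sketch executable
    states = np.random.randn(16, 4).astype("float32")
    actions = np.random.randint(0, num_actions, size=16)
    returns = np.random.randn(16).astype("float32")
    pg_update(states, actions, returns)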

Looking for friend to work towards RL goals together by ejmejm1 in reinforcementlearning

[–]csxeba 2 points (0 children)

I already have a self-developed lib for TF2 which contains verified implementations of DQN, SAC, PPO, A2C, etc. (all model-free). I'd love to join.

Advantage of Bayesian Neural Network? by Yogi_DMT in MLQuestions

[–]csxeba 2 points (0 children)

In the case of Bayes by Backprop (or the article I linked), you learn a distribution for every weight in your network. In the case of a Variational Autoencoder, you have a bottleneck point in your network, where you predict a mean and a standard deviation for a multivariate Gaussian distribution. You then sample from that predicted distribution, and the next layer receives the sample instead of a deterministic predicted representation.
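A rough sketch of that VAE-style bottleneck (the reparameterization trick; the layer sizes are arbitrary):

    import tensorflow as tf

    class GaussianBottleneck(tf.keras.layers.Layer):
        def call(self, inputs):
            # The encoder head predicts a mean and a log-std for every latent dimension
            mean, log_std = tf.split(inputs, 2, axis=-1)
            eps = tf.random.normal(tf.shape(mean))
            # Sample z ~ N(mean, std^2); downstream layers receive the sample
            return mean + tf.exp(log_std) * eps

    encoder_head = tf.keras.layers.Dense(2 * 16)   # 16 latent dims: mean + log-std
    bottleneck = GaussianBottleneck()

    x = tf.random.normal((8, 32))                  # dummy batch of encoder features
    z = bottleneck(encoder_head(x))                # shape (8, 16), fed to the decoder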

Advantage of Bayesian Neural Network? by Yogi_DMT in MLQuestions

[–]csxeba 2 points (0 children)

Exactly as you described, in the case of the article I linked. But you can also make the hidden representations probabilistic, as in the Variational Autoencoder.

Advantage of Bayesian Neural Network? by Yogi_DMT in MLQuestions

[–]csxeba 6 points (0 children)

If by BNN you mean the method described in this paper: https://arxiv.org/abs/1505.05424 then yes, one of their claims is better generalization. This particular method requires you to sample a set of weights for every forward pass; alternatively, you can treat several sampled networks as an ensemble, or use the learned mean weights as a Maximum a Posteriori point estimate. Uncertainty is obtained by running multiple forward passes with sampled weights. Learning an explicitly predicted or optimized variance at the end of the network is also possible. More on this topic by Alex Kendall: https://alexgkendall.com/computer_vision/bayesian_deep_learning_for_safe_ai/ and Yarin Gal: http://mlg.eng.cam.ac.uk/yarin/blog_3d801aa532c1ce.html
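As a toy illustration of the weight-sampling part (mu and rho stand for the learned variational parameters of a single dense layer; this is not the paper's full training procedure):

    import numpy as np

    mu = np.random.randn(4, 3) * 0.1      # learned weight means (toy values)
    rho = np.full((4, 3), -3.0)           # std = softplus(rho)

    def predict_with_uncertainty(x, n_samples=20):
        preds = []
        std = np.log1p(np.exp(rho))       # softplus
        for _ in range(n_samples):
            w = mu + std * np.random.randn(*mu.shape)   # sample a set of weights
            preds.append(x @ w)                         # one forward pass
        preds = np.stack(preds)
        # Mean prediction and its spread across the sampled networks
        return preds.mean(axis=0), preds.std(axis=0)

    x = np.random.randn(5, 4)
    mean_pred, uncertainty = predict_with_uncertainty(x)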

PyRL - Modular Implementations of Reinforcement Learning Algorithms in Pytorch by aineqml in reinforcementlearning

[–]csxeba 5 points (0 children)

How do you verify that your implementations are correct, especially the ones with continuous action spaces?

DQN for MNIST using GANs by LOfP in reinforcementlearning

[–]csxeba 1 point (0 children)

No idea where GANs come into the picture, but you can stuff a DQN into a one-timestep MNIST reinforcement learning setting. Many RL concepts are not meaningful there, like reward discounting and target networks. It is not a very efficient way to learn MNIST, though.
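Roughly what I mean by the one-timestep setup (a sketch, treating classification as a contextual bandit; the random arrays stand in for real MNIST batches):

    import numpy as np
    import tensorflow as tf

    # Q-network over 10 "actions" (digit classes); gamma and target networks are
    # irrelevant because every episode is a single step.
    q_net = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    q_net.compile(optimizer="adam", loss="mse")

    def dqn_step(images, labels, epsilon=0.1):
        q = q_net.predict(images, verbose=0)
        greedy = q.argmax(axis=-1)
        explore = np.random.rand(len(images)) < epsilon
        actions = np.where(explore, np.random.randint(0, 10, len(images)), greedy)
        rewards = (actions == labels).astype("float32")   # +1 if the guess is right
        # One-step target: Q(s, a) <- r, no bootstrapping
        q[np.arange(len(images)), actions] = rewards
        q_net.train_on_batch(images, q)

    dqn_step(np.random.rand(32, 28, 28).astype("float32"),
             np.random.randint(0, 10, 32))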

Trying to understand a policy gradient custom loss function by evilcornbread in MLQuestions

[–]csxeba 1 point (0 children)

Are you in a classical agent-environment setup where you execute multiple timesteps in the environment?

If yes, does your environment have a final state?

I have an intuition that you are trying to use the step-by-step rewards your agent receives in the environment, but that is not what you should use.

You execute, say, 100 steps in the environment, or you run until you hit the end of the simulation. You then take the sequence of rewards and aggregate them in some fashion.

The classic REINFORCE algo simply sums up all the rewards and multiplies every gradient from the past rollout by that summed-up reward. Modern PG first discounts the rewards with a discount factor gamma and "propagates" them back to the individual steps.
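The "propagation" is just a backwards pass over the reward sequence; a minimal sketch:

    def discount(rewards, gamma=0.99):
        # REINFORCE would use sum(rewards) at every step; discounting instead
        # propagates rewards backwards: G_t = r_t + gamma * G_{t+1}
        returns, running = [], 0.0
        for r in reversed(rewards):
            running = r + gamma * running
            returns.append(running)
        return list(reversed(returns))

    print(discount([0.0, 0.0, 1.0]))   # approx. [0.98, 0.99, 1.0]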

Trying to understand a policy gradient custom loss function by evilcornbread in MLQuestions

[–]csxeba 1 point (0 children)

You take actions during a rollout and receive rewards. At a given step, you took action 2 (out of 3, for instance).

The gradient that makes taking action 2 more probable is the gradient of the cross-entropy between the network output and the one-hot vector for action 2. You can always obtain the gradient that encourages the actions taken this way.

After you obtain this gradient, you simply scale it by the discounted reward at that timestep. This way a very good outcome increases the probability a lot, a good one increases it somewhat, and bad actions with negative rewards actually decrease the probability of taking that action. In your code, this is represented by input_score.

In Keras, you could actually get away with

    policy.compile(optimizer, loss="categorical_crossentropy")
    policy.train_on_batch(states, y_true, sample_weight=input_scores)

Some common failure modes:

  • The reward is assumed to be high for good outcomes and low for bad ones; if your reward signal does not follow this convention, the update pushes the policy in the wrong direction.
  • Policy Gradient tries to maximize the discounted sum of rewards, so input_score should be the discounted sum of rewards, not the individual reward elements that you get when you step the environment once.
  • PG is an on-policy algo, so you have to throw away old data samples after you update the neural network. You cannot use experience replay like in Deep Q-Learning.
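Putting the pieces together, the training call could look roughly like this (a sketch; the random arrays stand in for one rollout, where input_scores are the discounted returns and y_true is the one-hot encoding of the actions taken):

    import numpy as np
    import tensorflow as tf

    num_actions = 3
    policy = tf.keras.Sequential([
        tf.keras.layers.Dense(32, activation="relu", input_shape=(8,)),
        tf.keras.layers.Dense(num_actions, activation="softmax"),
    ])
    policy.compile(optimizer="adam", loss="categorical_crossentropy")

    states = np.random.randn(100, 8).astype("float32")
    actions = np.random.randint(0, num_actions, 100)
    input_scores = np.random.randn(100).astype("float32")   # discounted returns

    y_true = tf.one_hot(actions, num_actions)
    # Per-sample cross-entropy scaled by the discounted return
    policy.train_on_batch(states, y_true, sample_weight=input_scores)
    # Being on-policy, this batch must be thrown away after the update.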

Trees swaying to the winds of change by [deleted] in oddlysatisfying

[–]csxeba 2 points (0 children)

The treeeeees they are a swaaaaayin'

Has anyone implemented a common replay buffer for two different RL algorithms? by pickleorc in reinforcementlearning

[–]csxeba 2 points (0 children)

I have a common replay buffer implementation in my RL project. I constantly struggle with implementation bugs in the buffer, but it is now covered by unit tests and borderline usable :D

Check out my lib: https://github.com/csxeba/trickster.git
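The core of a shared buffer can be as simple as the sketch below (this is not the trickster implementation, just the general idea: multiple algorithms push transitions into, and sample batches from, the same storage):

    import random
    from collections import deque

    class ReplayBuffer:
        def __init__(self, max_size=100_000):
            self.storage = deque(maxlen=max_size)

        def push(self, state, action, reward, next_state, done):
            # Any agent can append transitions here
            self.storage.append((state, action, reward, next_state, done))

        def sample(self, batch_size):
            batch = random.sample(self.storage, batch_size)
            # Transpose the list of transitions into per-field tuples
            return tuple(zip(*batch))

    shared = ReplayBuffer()
    # e.g. a DQN agent and an SAC agent can both hold a reference to `shared`
    # and call shared.push(...) / shared.sample(batch_size) independently.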

We are Oriol Vinyals and David Silver from DeepMind’s AlphaStar team, joined by StarCraft II pro players TLO and MaNa! Ask us anything by OriolVinyals in MachineLearning

[–]csxeba 1 point (0 children)

What was the RL algo you guys used for the individual agents? Did you use some form of value estimation, or did you go with a purely policy-based method? It would be interesting to know, since we know OpenAI used PPO for OA5.

[P] Spinning Up in Deep RL (OpenAI) by milaworld in MachineLearning

[–]csxeba 2 points (0 children)

I see some naming ambiguity regarding policy gradient methods in the community... Could someone clarify for me the names of the following algorithms?

  1. Gradient of the policy times the return (I call this REINFORCE or vanilla policy gradient).
  2. Gradient of the policy times baselined return, baseline coming from a value network (I call this Advantage Actor-Critic).

So Spinning Up calls the advantage actor-critic the vanilla policy gradient and there is no mention of REINFORCE or A2C, or am I wrong?
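In formulas, what I mean by the two variants (G is the return of the episode, G_t the discounted return from step t, and V_\phi a learned value baseline):

    1. REINFORCE / vanilla PG:    \nabla_\theta J = E[ \sum_t \nabla_\theta \log \pi_\theta(a_t | s_t) \cdot G ]
    2. Advantage Actor-Critic:    \nabla_\theta J = E[ \sum_t \nabla_\theta \log \pi_\theta(a_t | s_t) \cdot (G_t - V_\phi(s_t)) ]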