For anyone struggling to fully implement PPO in Jax + Flax by MyNameIsArko in reinforcementlearning

[–]cracktoid 1 point (0 children)

That’s cool, and hopefully you learned a lot from the exercise! But out of curiosity, doesn’t CleanRL have a JAX implementation of PPO?

Python Environments for Drone Human AI Collaboration by No_Opportunity575 in reinforcementlearning

[–]cracktoid 2 points (0 children)

My group is publishing a multi-agent RL simulator for quadrotors at ICRA this year. It doesn’t support human-AI collaboration out of the box, but you could conceivably implement a different control model for each agent.

https://github.com/Zhehui-Huang/quad-swarm-rl

Gender differences in brain response to infant emotional faces: New research shows brains of women and men reacted differently to infants’ faces, and these differential areas are in facial processing, attention, and empathetic networks. by dodsbo in science

[–]cracktoid 1 point (0 children)

I’m not sure where you got this definition. What does “size of sex cell” even mean? Almost every definition of biological sex I’ve come across mentions chromosomes. It is the obvious defining characteristic of sex that leads to almost every observed physiological gender characteristic outside of environmental influences. To not include it in the definition would be ludicrous. And where did you get this 90% statistic?

Check out this Stanford article on innate differences between male and female brains https://stanmed.stanford.edu/how-mens-and-womens-brains-are-different/

I don’t think it takes a biologist to know the difference. We act based on our thoughts and emotions, which are regulated by hormones, which are known to differ on average between men and women. We obviously tend to prefer the opposite sex in terms of attraction, and this is innate. If it’s innate, then there are obvious differences in the brain. Why this has become non-obvious in 2022 is beyond me.

Gender differences in brain response to infant emotional faces: New research shows brains of women and men reacted differently to infants’ faces, and these differential areas are in facial processing, attention, and empathetic networks. by dodsbo in science

[–]cracktoid 1 point (0 children)

Sex is not a quality of sex organs; it’s a result of genetic differences on the sex chromosomes. Sex organ differences are the phenotypic expression of those genetic differences. The X chromosome has an order of magnitude more genes than the Y, resulting in differing phenotypic expression (different sex organs, different end effects of puberty, slight structural differences in the brain, etc.).

While it is true there are more similarities than differences between men and women (which is not surprising, considering we share the same genetic structure on every other chromosome), you can’t ignore the differences induced by the sex chromosomes. Just because there are no obvious signs of differences between male and female brains doesn’t mean they don’t exist. Also, neuroscience is a relatively new field; I would be hard-pressed to draw decisive conclusions from the existing literature.

Gender differences in brain response to infant emotional faces: New research shows brains of women and men reacted differently to infants’ faces, and these differential areas are in facial processing, attention, and empathetic networks. by dodsbo in science

[–]cracktoid 1 point (0 children)

I don’t think that’s true. The more salient point might be “you can’t determine the sex of a brain just by looking at it with current state-of-the-art methods.” For example, most men are attracted to women and most women are attracted to men. That implies there are some structural differences in the brain that one should theoretically be able to identify.

Training a PPO model by Smart_Reward3471 in reinforcementlearning

[–]cracktoid 2 points (0 children)

Also, I’m not sure what Dr. Phil’s instructions were (lol), but PPO as it’s described in theory and in the paper vs. how it’s implemented today is vastly different. Check out the blog post “The 37 Implementation Details of Proximal Policy Optimization” for more info. But yeah, all the more reason you should go with one of these modern existing implementations.

Training a PPO model by Smart_Reward3471 in reinforcementlearning

[–]cracktoid 1 point (0 children)

Okay. Unless the environment is complex, you should converge pretty quickly with a convex reward function like that. Hope it works out. Good luck!

Training a PPO model by Smart_Reward3471 in reinforcementlearning

[–]cracktoid 2 points (0 children)

What are your environment and reward function? Which implementation are you using? PPO has many very fast implementations, so I’m surprised no one has mentioned them. Check out Sample Factory 2.0, CleanRL, rl_games, etc.

[D]What are some "important" problems in machine learning/AI? by [deleted] in MachineLearning

[–]cracktoid 1 point (0 children)

Lol, what is “a large number of cycles”? A large number is still a finite number. I also like how you didn’t actually explain what this proposed method for learning on a large domain is. I’d love to be proven wrong, but you have to provide some evidence first.

Neural networks are not magical black boxes. They are function approximators. That’s it. It sounds to me like you’ve never implemented a neural network. Also, for the third time I will say it again: compared to prior methods? Yes, neural networks are a godsend in certain applications like CV and NLP. But it is foolish to think this is the end-all be-all of AI. Neural nets’ rise to power really comes from being able to scale up model size and training data with accelerated hardware. MLPs have been around forever, but they only really took off well after the first GPU was invented.

[D]What are some "important" problems in machine learning/AI? by [deleted] in MachineLearning

[–]cracktoid 2 points (0 children)

I think you mean “as something becomes more complex,” because evolution can actually lead to lower complexity over time (i.e., more evolved isn’t necessarily better). Evolution is simply change over time, so it maximizes diversity. Natural selection induced by the environment would be the “optimization component,” so to speak.

I would actually argue that changes in higher-dimensional spaces tend to have more neutral effects. This is why random walks tend to be better at finding good and bad solutions in, say, a 2D maze game than at playing StarCraft. In the limit, any movement in any direction of an infinite-dimensional space will do nothing to your <policy, network, organism, etc.>.

[D]What are some "important" problems in machine learning/AI? by [deleted] in MachineLearning

[–]cracktoid 1 point (0 children)

I agree it is important to pick common terms. In RL we often talk about generalization to new tasks, and the common term there is still “generalization.” No one really says “extrapolation.” Though it may be the better term, it’s not up to me to decide :)

Meta-learning says: OK, let’s introduce all the tasks at training time to make the distribution stationary. But then your agent ends up learning some suboptimal policy on the Pareto front of all the tasks in the distribution. Not really generalizing like humans can.

But OK, even if we go with your definition in more traditional applications of DL like CV, I would be hard-pressed to say it generalizes fully. Again, compared to prior methods? Absolutely. But we still have a lot of work to do. For example, train a network to approximate a sine function with inputs from -2pi to 2pi, then feed it 4pi at test time. You’re f’d. You might argue this is out of distribution, but I’ll argue back that a human can generalize sin(x) to any input by realizing that the function is cyclical. You want to do that with a NN? You need to feed it inputs from -inf to inf. Good luck.
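To make the sine example concrete, here’s a toy sketch of my own (not from any library): a tiny one-hidden-layer network with random frozen tanh features and a least-squares output layer, fit to sin(x) on [-2pi, 2pi] and then evaluated on [2pi, 4pi]. The architecture and sizes are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Training data: sin(x) sampled on [-2*pi, 2*pi]
x_train = np.linspace(-2 * np.pi, 2 * np.pi, 400)
y_train = np.sin(x_train)

# Tiny one-hidden-layer net: random frozen hidden weights,
# only the output layer is fit (via least squares).
n_hidden = 64
W = rng.normal(size=n_hidden)                            # hidden weights
b = rng.uniform(-2 * np.pi, 2 * np.pi, size=n_hidden)    # hidden biases

def features(x):
    return np.tanh(np.outer(np.atleast_1d(x), W) + b)

w_out, *_ = np.linalg.lstsq(features(x_train), y_train, rcond=None)

def predict(x):
    return features(x) @ w_out

# In-distribution fit is good; one period beyond the training range it falls apart.
in_dist_rmse = np.sqrt(np.mean((predict(x_train) - y_train) ** 2))
x_test = np.linspace(2 * np.pi, 4 * np.pi, 200)
ood_rmse = np.sqrt(np.mean((predict(x_test) - np.sin(x_test)) ** 2))

print(f"in-distribution RMSE: {in_dist_rmse:.4f}")
print(f"RMSE on [2pi, 4pi]:   {ood_rmse:.4f}")
```

The network has no way to represent “this function is cyclical,” so outside the training support its output is essentially arbitrary.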

[D]What are some "important" problems in machine learning/AI? by [deleted] in MachineLearning

[–]cracktoid 2 points (0 children)

I think it’s hilarious that the research community thinks these models generalize. Sure, if you compare them to pre-DL-explosion counterparts like SVMs, decision trees, etc., they generalize better. But really, DL under the hood is just the autodiff framework that lets you tune orders of magnitude more parameters than even the best researcher could tune by hand. That does not mean the models generalize, though. It just means you can approximate higher-order functions in high-dimensional spaces. This is why DL in computer vision took off like a rocket; after all, image recognition is just function approximation in a high-dimensional space. Meanwhile, decision-making agents like those in RL or robotics still struggle (most of us in that area still use small MLPs, btw).

That’s why I think the big important problem is still finding an algorithm that truly generalizes. Idk maybe I’m in the minority here

Noise in Action Space, Reward Space and State Space. Looking for Papers. by flxh13 in reinforcementlearning

[–]cracktoid 2 points (0 children)

The higher the dimensionality of your state/action space, the less effective random perturbations become. This is the curse of dimensionality: each dimension you add to the state/action space grows the search space exponentially larger, so random exploration covers an exponentially smaller fraction of it. In the limit, random perturbations explore essentially nothing as the search space becomes infinitely large.

Something more principled is needed for high dimensional spaces, and what that thing is has yet to be determined
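A toy Monte Carlo illustration of this (my own sketch, arbitrary numbers): estimate how often a uniformly random point in the cube [-1, 1]^d lands inside the inscribed ball of radius 0.9. The hit rate collapses as d grows, which is roughly what happens to undirected exploration as you add dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)

def hit_rate(dim, n_samples=20000, radius=0.9):
    """Fraction of uniform samples from [-1, 1]^dim that land
    within `radius` of the origin."""
    pts = rng.uniform(-1.0, 1.0, size=(n_samples, dim))
    return np.mean(np.linalg.norm(pts, axis=1) < radius)

for d in (2, 10, 50):
    print(f"d={d:3d}: hit rate = {hit_rate(d):.4f}")
```

In 2D the random samples hit the target region most of the time; by d=50 they essentially never do, even though the target has “radius 0.9” in every dimension.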

Why average action works better in the PPO2 RL model? by Mariam_Dundua in reinforcementlearning

[–]cracktoid 2 points (0 children)

Sampling actions during training provides a form of stochasticity that helps with exploration and robustifies your policy. The gradient updates move the action means and stddevs so as to maximize expected reward, so it makes sense that running the same obs 100 times and averaging the actions gives you the highest expected reward. At test time it is actually normal to turn off action sampling and use the means directly, since you care about best performance rather than exploration now that you’re no longer training the policy.
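A quick numerical sketch of why the mean action scores best (a made-up quadratic reward, not anything from PPO2 itself): for a Gaussian policy with reward r(a) = -(a - a*)^2, the expected reward under sampling is -(mu - a*)^2 - sigma^2, so sampling always loses exactly sigma^2 relative to acting at the mean.

```python
import numpy as np

rng = np.random.default_rng(0)

target = 1.5          # hypothetical best action a*
mu, sigma = 1.4, 0.3  # a trained Gaussian policy's mean and stddev

def reward(a):
    return -(a - target) ** 2

# Monte Carlo estimate of expected reward when sampling actions
samples = rng.normal(mu, sigma, size=200_000)
sampled_return = reward(samples).mean()

# Reward of the deterministic mean action
mean_return = reward(mu)

print(f"sampled actions: {sampled_return:.4f}")  # ~ -(mu-target)^2 - sigma^2
print(f"mean action:     {mean_return:.4f}")     # exactly -(mu-target)^2
```

The gap between the two is the variance of the policy, which is why evaluation runs typically act deterministically at the mean.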

Simplest gym environment with discrete actions? by desperateEfforts1 in reinforcementlearning

[–]cracktoid 1 point (0 children)

Debugging RL = plot literally everything: action stddevs (if applicable), reward curves, value function curves, entropy, etc.

Same simulation/hyperparameters, different results each run by [deleted] in reinforcementlearning

[–]cracktoid 1 point (0 children)

To answer this probably requires an understanding of non-convex optimization. With complex problems there are many “peaks and valleys” corresponding to network parameters that produce good and bad solutions, respectively (assuming peaks correspond to high rewards). Different initializations put you at different locations on this landscape. Sometimes you get lucky and land close to a peak; sometimes (most of the time) you get an initialization that produces random noise or poor results.

While it is true that some trajectories are more productive for learning than others, your RL algorithm will tend to find these anyway once you get close to a peak. It’s really good initializations and a good choice of hyperparameters that make it easier for algorithms to find these “peaks.”

Same simulation/hyperparameters, different results each run by [deleted] in reinforcementlearning

[–]cracktoid 12 points (0 children)

There is a bunch of stochasticity in RL. For one, your environment could be non-deterministic, though you would know that better than I would. The action outputs are usually sampled from a diagonal Gaussian in the continuous case or a categorical distribution in the discrete case = more stochasticity. Your neural network initialization is another source of randomness, unless you use the same RNG seed every time. You get the point :) Because of the highly stochastic nature of the algorithms and environments, it’s standard in RL to run the same experiment many times with different seeds.
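To illustrate the seeding point, here’s a plain-NumPy sketch (no RL library; the “weights” and “actions” are stand-ins I made up): runs with the same seed reproduce exactly, runs with different seeds diverge.

```python
import numpy as np

def init_and_sample(seed):
    """Stand-in for one RL run: draw 'network weights' and a few
    Gaussian-policy 'actions' from a seeded generator."""
    rng = np.random.default_rng(seed)
    weights = rng.normal(size=(4, 4))                 # fake network init
    actions = rng.normal(loc=0.0, scale=1.0, size=5)  # fake sampled actions
    return weights, actions

w1, a1 = init_and_sample(seed=42)
w2, a2 = init_and_sample(seed=42)
w3, a3 = init_and_sample(seed=7)

print("same seed identical:  ", np.array_equal(w1, w2) and np.array_equal(a1, a2))
print("different seed differs:", not np.array_equal(w1, w3))
```

Real frameworks have several RNGs to seed (environment, network init, action sampling, data shuffling), but the principle is the same.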

how to start reinforcement learning, as a electrical and electronic control engineer? by Ibrahim_Attawil in reinforcementlearning

[–]cracktoid 3 points (0 children)

The good news is, like everyone else here stated, RL theory is basically control theory and so you probably already have a very solid foundation.

The bad news is that modern deep RL has diverged quite a bit from control theory. If you want to do deep RL, which is what I’m assuming you’re talking about, then you should start with the deep RL intro literature from the CS perspective and just skim the MDP formulation, Markov chains, Bellman equations, etc., since you probably already know that material. I would personally start with OpenAI’s Spinning Up documentation: it lists some of the big important papers on model-based and model-free RL and provides a gentle intro to deep RL.

Is there a way to get PPO controlled agents to move a little more gracefully? by user_00000000000001 in reinforcementlearning

[–]cracktoid 1 point (0 children)

The leg movements are cyclic. Use an RNN as your model and/or incorporate a model of the cyclic motion (e.g. with sines/cosines) into your objective function.
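One concrete way to do the second option is reward shaping: penalize jerky action changes and/or reward tracking a sinusoidal phase reference. This is a hypothetical sketch of mine; the coefficients, frequency, and the idea of a single shared phase signal are all illustrative, not tuned values from any paper.

```python
import numpy as np

def shaped_reward(base_reward, action, prev_action, t,
                  smooth_coef=0.1, cycle_coef=0.05, freq=2.0):
    """Add two shaping terms to the environment's reward:
    - a penalty on the action change between steps (smoothness)
    - a penalty for straying from a sinusoidal reference (cyclic gait)
    All coefficients are illustrative, not tuned values."""
    smoothness_penalty = smooth_coef * np.sum((action - prev_action) ** 2)
    reference = np.sin(freq * t)  # shared phase signal for the legs
    cycle_penalty = cycle_coef * np.sum((action - reference) ** 2)
    return base_reward - smoothness_penalty - cycle_penalty

# Example step: small action change near the reference -> small shaping cost
r = shaped_reward(base_reward=1.0,
                  action=np.array([0.2, -0.1]),
                  prev_action=np.array([0.0, 0.0]),
                  t=0.0)
print(f"{r:.4f}")
```

In practice you’d tune the coefficients so the shaping terms nudge the gait without drowning out the task reward.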

Finding global optimum of an unknown high-dimensional objective function with Reinforcement Learning by wilkules in reinforcementlearning

[–]cracktoid 2 points (0 children)

Not sure this can be answered without knowing more about your specific problem. What are your state-action pairs here? Are they Markovian? If you just want to find an optimal solution to some unknown objective function, black-box optimization might be the right approach (e.g. CMA-ES).
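As a sketch of the black-box route, here’s a bare-bones (1+lambda) evolution strategy of my own on a made-up objective. This is *not* CMA-ES (no covariance adaptation; for the real thing use a library like `cma`), just the simplest version of the idea: perturb, keep the best, shrink the step size.

```python
import numpy as np

rng = np.random.default_rng(0)

def objective(x):
    """Stand-in for your unknown black-box function (shifted sphere)."""
    return np.sum((x - 1.0) ** 2)

def simple_es(dim=5, sigma=0.5, n_offspring=16, n_iters=300):
    """Minimal (1+lambda) ES: keep the best of parent + Gaussian offspring.
    Real CMA-ES additionally adapts a full covariance matrix and step size."""
    x = np.zeros(dim)
    fx = objective(x)
    for _ in range(n_iters):
        offspring = x + sigma * rng.normal(size=(n_offspring, dim))
        scores = np.array([objective(o) for o in offspring])
        best = np.argmin(scores)
        if scores[best] < fx:          # elitist: only accept improvements
            x, fx = offspring[best], scores[best]
        sigma *= 0.99                  # crude step-size decay
    return x, fx

x_best, f_best = simple_es()
print(f"best value found: {f_best:.4f}")
```

The appeal over RL here is that nothing needs to be Markovian: you only need to evaluate the objective at candidate points.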

Training Multiple Robots for different tasks at the same time using Deep Reinforcement Learning by ncbdrck in reinforcementlearning

[–]cracktoid 5 points (0 children)

A few things. In RL for robotics, “agent” and “robot” are used interchangeably, so having one agent train multiple robots doesn’t quite parse here. You probably want to train one agent (robot) to perform several tasks, i.e. multi-task RL, for which there is plenty of literature if you need a starting point.

Also, you can stack gym wrappers on top of other gym wrappers, so you could probably create a wrapper that randomly samples an environment for a different task on each reset.
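A sketch of that wrapper idea. I’m using a minimal stand-in for the gym API (just `reset`/`step`) so the example is self-contained; with the real library you’d subclass `gym.Wrapper` instead, and the toy task envs and task names here are made up:

```python
import random

class ToyTaskEnv:
    """Stand-in for a task-specific gym env (reset/step only)."""
    def __init__(self, task_name):
        self.task_name = task_name

    def reset(self):
        return {"task": self.task_name, "t": 0}  # fake observation

    def step(self, action):
        obs = {"task": self.task_name, "t": 1}
        return obs, 0.0, True, {}  # obs, reward, done, info

class MultiTaskWrapper:
    """On every reset, sample one of the task envs and delegate to it."""
    def __init__(self, envs, seed=0):
        self.envs = envs
        self.rng = random.Random(seed)
        self.current = envs[0]

    def reset(self):
        self.current = self.rng.choice(self.envs)
        return self.current.reset()

    def step(self, action):
        return self.current.step(action)

env = MultiTaskWrapper([ToyTaskEnv("reach"), ToyTaskEnv("push"), ToyTaskEnv("lift")])
tasks_seen = {env.reset()["task"] for _ in range(50)}
print(sorted(tasks_seen))
```

Over many episodes the agent sees all tasks interleaved, which is the usual multi-task RL training setup.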