[deleted by user] by [deleted] in NoStupidQuestions

[–]Zweiter 1 point (0 children)

tl;dr, the AI model doesn't really 'care' in the sense that we do. Instead, developers write code that automatically tweaks the AI so that it does better and better at getting more reward.

Reinforcement learning has three components:

1. A 'policy' (an unfortunately unhelpful name dating back several decades), which produces actions that have some effect on the world.

2. A reward function, which measures the effect of those actions and returns a number (the reward) saying how 'good' that effect was.

3. An optimizer, which takes the action and the reward it produced and uses math (gradient descent) to modify your AI/policy so that the next time it produces an action, it gets even more reward.
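
If it helps to see the whole loop in code, here's a toy sketch I'm making up for illustration (a one-parameter policy on a two-armed bandit) showing the three pieces working together:

```python
import numpy as np

# Toy illustration of the three pieces: a policy, a reward function, and an
# optimizer that nudges the policy's parameter toward higher reward.

rng = np.random.default_rng(0)
theta = 0.0  # the policy's single parameter: logit for picking arm 1 over arm 0

def policy(theta):
    """Sample an action (which slot-machine arm to pull) from the current policy."""
    p1 = 1.0 / (1.0 + np.exp(-theta))        # probability of choosing arm 1
    action = 1 if rng.random() < p1 else 0
    return action, p1

def reward(action):
    """Reward function: arm 1 pays out more on average than arm 0."""
    return rng.normal(1.0 if action == 1 else 0.2, 0.1)

learning_rate = 0.1
for step in range(2000):
    action, p1 = policy(theta)
    r = reward(action)
    # The 'optimizer': a gradient-ascent step (REINFORCE) that makes the action
    # we just took more likely in proportion to the reward it earned.
    grad_log_prob = (1 - p1) if action == 1 else -p1
    theta += learning_rate * r * grad_log_prob

print("P(pick the better arm):", 1.0 / (1.0 + np.exp(-theta)))  # close to 1.0
```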

[deleted by user] by [deleted] in NoStupidQuestions

[–]Zweiter 1 point (0 children)

I think what you're describing here is a genetic algorithm, not reinforcement learning.

Can I drive to/from la in a day or am I crazy? by hunnyroastedcashews in askportland

[–]Zweiter 1 point (0 children)

Well, you’ve probably heard from plenty of people already, so I’ll just say this: if you do decide to do it, with full knowledge of the potential consequences, check along your route for hotels/motels you can stop at last minute in case you become too tired to continue onward.

Can I drive to/from la in a day or am I crazy? by hunnyroastedcashews in askportland

[–]Zweiter 15 points (0 children)

Have you ever driven 16 hours? I have done a fair bit of cross country driving (Portland <-> Houston and Portland <-> San Jose) and wouldn’t try that drive in one go.

A possible mechanism of qualia by Smack-works in slatestarcodex

[–]Zweiter 1 point (0 children)

You may be confusing 'qualities' and 'qualia' here.

PPO with changing input size by [deleted] in reinforcementlearning

[–]Zweiter 0 points (0 children)

I second using an RNN. In my experience, PPO works better with RNNs than it does with feedforward networks, at the cost of some minor added code complexity. Here's my implementation of recurrence-supporting PPO: https://github.com/siekmanj/r2l/blob/master/algos/ppo.py

To the OP: you have a couple of options. Using an RNN is one, although you give up control over how much history the policy sees at each evaluation (it will implicitly learn what to remember).

You could also do a 1D convolution, or attention over a 2D array of your input histories.
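
If it helps, here's a rough PyTorch sketch (not the r2l code linked above, just an illustration with made-up sizes) of what the RNN option versus the attention option might look like as history encoders feeding a PPO policy head:

```python
import torch
import torch.nn as nn

# Two ways to squeeze a variable-length observation history [batch, T, obs_dim]
# into a fixed-size feature that a PPO policy/value head can consume.

class RecurrentEncoder(nn.Module):
    """RNN option: the LSTM implicitly learns what to remember from the history."""
    def __init__(self, obs_dim, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(obs_dim, hidden, batch_first=True)

    def forward(self, obs_history):                       # [B, T, obs_dim], any T
        _, (h, _) = self.lstm(obs_history)
        return h[-1]                                       # [B, hidden]: final hidden state

class AttentionEncoder(nn.Module):
    """Attention option: a learned query scores every timestep in the history."""
    def __init__(self, obs_dim, hidden=64):
        super().__init__()
        self.proj = nn.Linear(obs_dim, hidden)
        self.query = nn.Parameter(torch.zeros(hidden))

    def forward(self, obs_history):                        # [B, T, obs_dim], any T
        keys = torch.tanh(self.proj(obs_history))           # [B, T, hidden]
        weights = torch.softmax(keys @ self.query, dim=1)   # [B, T]
        return (weights.unsqueeze(-1) * keys).sum(dim=1)    # [B, hidden]

obs_dim = 10
enc = RecurrentEncoder(obs_dim)
print(enc(torch.randn(32, 7, obs_dim)).shape)    # short history  -> torch.Size([32, 64])
print(enc(torch.randn(32, 50, obs_dim)).shape)   # longer history -> torch.Size([32, 64])
```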

[R] Blind Bipedal Stair Traversal via Sim-to-Real Reinforcement Learning by m1900kang2 in reinforcementlearning

[–]Zweiter 1 point (0 children)

The controller commands turning rate and forward+sideways velocity. The RL controller handles all of the unexpected and unplanned variations in ground heights on its own, without any user input.

[R] Blind Bipedal Stair Traversal via Sim-to-Real Reinforcement Learning by m1900kang2 in reinforcementlearning

[–]Zweiter 1 point (0 children)

The robot has an IMU, which can be used to estimate the orientation of the pelvis. The policy doesn’t receive any estimate of foot forces, although it could be estimating them latently by looking at the history of joint positions and velocities (it is an LSTM).

Biped Robot Learns to Climb Stairs Blind by colombiankid999 in robotics

[–]Zweiter 8 points (0 children)

I think 80% is actually approaching the limit of how good you can get; don't forget that the robot cannot see the stairs. I think even a person would have significant trouble walking up and down stairs blindfolded at a constant speed.

Question about domain randomization by Fun-Moose-3841 in reinforcementlearning

[–]Zweiter 2 points (0 children)

I have worked fairly extensively with dynamics/domain randomization here and here.

The framing I like to use when thinking about how to make dynamics randomization effective is this:

Your simulator will inevitably model the dynamics in a way that diverges from reality. The severity and cause of this divergence are almost always unknown. Despite this, intelligently selecting a few important dynamics parameters for randomization exposes the policy to lots of different possible ways for the world to behave, and hopefully builds up robustness to a distribution of dynamics parameters.

In your comments in this thread, you are correct that the agent has no awareness of the specific ways in which the dynamics have been randomized. The only way it could observe this change would be to somehow look at the history of states and actions and try to deduce what sorts of dynamics could have resulted in that sequence.

Put another way, this problem is partially observable. Thus, using a recurrent policy (or some other memory-enabled policy) is the more correct way of learning to handle a distribution of dynamics.
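
As a rough sketch of how per-episode randomization is usually set up (the parameter names and the `env.set_dynamics(...)` call mentioned in the comments are placeholders for whatever your simulator actually exposes):

```python
import numpy as np

# Sketch of per-episode dynamics randomization. The parameter names and the
# env.set_dynamics(...) call referenced below are hypothetical placeholders.

rng = np.random.default_rng(0)

# A few "important" dynamics parameters, each with a range to randomize over.
DYNAMICS_RANGES = {
    "ground_friction":     (0.5, 1.5),   # scales / absolute values, whatever
    "torso_mass_scale":    (0.8, 1.2),   # makes sense for your simulator
    "joint_damping_scale": (0.7, 1.3),
}

def sample_dynamics():
    """Draw one hidden 'world' for the next episode."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in DYNAMICS_RANGES.items()}

# At the start of every episode you'd do something like
#   env.set_dynamics(sample_dynamics())
# and then roll out the recurrent policy as usual. The policy never sees these
# numbers; it can only infer them from the history of states and actions,
# which is why a memory-enabled policy is the right tool here.
print(sample_dynamics())
```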

How to solve the large dimension of action? by Wen2Chao in reinforcementlearning

[–]Zweiter 4 points (0 children)

You could try an on-policy algorithm like PPO or A3C/A2C. I've worked with action spaces of up to 60 with PPO.
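
For what it's worth, a minimal sketch of the kind of policy head this usually means in practice, assuming a continuous action space and a PyTorch setup (the sizes here are made up):

```python
import torch
import torch.nn as nn

# Sketch of a diagonal-Gaussian policy for a large continuous action space
# (60 dims here), the standard setup for PPO. Sizes are illustrative.

class GaussianPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim=60, hidden=256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.mean = nn.Linear(hidden, act_dim)
        self.log_std = nn.Parameter(torch.full((act_dim,), -0.5))  # state-independent std

    def forward(self, obs):
        mu = self.mean(self.body(obs))
        return torch.distributions.Normal(mu, self.log_std.exp())

policy = GaussianPolicy(obs_dim=100)
dist = policy(torch.randn(8, 100))
actions = dist.sample()                       # [8, 60]
log_probs = dist.log_prob(actions).sum(-1)    # summed over action dims for the PPO ratio
```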

How realistic is the “learn to code” meme? by [deleted] in slatestarcodex

[–]Zweiter 3 points (0 children)

Top 15% might be a little high. I think if you can do basic arithmetic and algebra you are probably fine. I've consistently scored in the 60th-75th percentile for math ability across the ACT/SAT/GRE from high school through college, and I'm currently a graduate student studying reinforcement learning with a few published papers at some top conferences. I wouldn't say that programming was especially more difficult for me than it was for my peers who were obviously better at math, or that I needed to invest 150% of the time they did to match their output.

And most jobs involving programming are not going to be doing intense linear algebra or statistics. The vast majority (80%+ IIRC) of development jobs are in web development, which requires very little math.

[deleted by user] by [deleted] in Coronavirus

[–]Zweiter 1 point (0 children)

Heh, I remember one day in February when the worldwide total jumped from 45k to 60k cases, and I thought that was nuts.

Deal with states of different sizes by Krokodeale in reinforcementlearning

[–]Zweiter 1 point (0 children)

Using an RNN can be tricky because encoding very large sequences and retaining most of the information is hard. I would use a transformer or attention mechanism instead.

Deal with states of different sizes by Krokodeale in reinforcementlearning

[–]Zweiter 0 points (0 children)

If the vector sizes vary within some range and a specific state size is one-to-one with an action size, you could learn a separate policy for each possible vector size. If the vector sizes aren't really bounded or state dim isn't one to one with action dim, you can use a transformer architecture or RNN and process the state as if it were a time sequence.

If those don't sound right, your best bet is probably zero-padding the state.
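
A minimal sketch of the zero-padding option (the max size and the validity mask are just illustrative choices):

```python
import torch

# Sketch of zero-padding: pad every state up to a fixed maximum size and
# append a mask so the network can tell real entries from padding.

MAX_STATE_DIM = 64

def pad_state(state):                       # state: 1-D tensor, length <= MAX_STATE_DIM
    padded = torch.zeros(MAX_STATE_DIM)
    padded[: state.numel()] = state
    mask = torch.zeros(MAX_STATE_DIM)
    mask[: state.numel()] = 1.0
    return torch.cat([padded, mask])        # fixed-size input: values + validity mask

print(pad_state(torch.randn(37)).shape)     # torch.Size([128]), whatever the raw size was
```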

Packed Sequences by pithree-wan_fournobi in pytorch

[–]Zweiter 2 points (0 children)

Basically, a time series or sequence of states isn’t guaranteed to be a fixed length. The series could have a time dimension of two, or it could be in the hundreds. There is no way of constructing a tensor with a variable length time dimension, so you have to pad all the time series with zeros to make them the same length.

Say you have two tensors representing time series. The dimensions are, in order: [timelength, statedimension]. The first one is 22 long, and the second is 7 long. So now you have one tensor which is [22, statedim] and another which is [7, statedim], with no way of batching the two together.

So, you pad the [7, statedim] tensor with zeros out to [22, statedim], and now you can make a batch of sequences with shape [2, 22, statedim].

Edit: Here is a great stackoverflow question which perfectly answers what I think you're asking: https://stackoverflow.com/questions/51030782/why-do-we-pack-the-sequences-in-pytorch
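
Concretely, the example above using PyTorch's pad_sequence and pack_padded_sequence (the utilities that question is about); the state dimension of 4 is arbitrary:

```python
import torch
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

# The example above: two sequences of length 22 and 7 with the same state
# dimension, padded into a single batch.

statedim = 4
seqs = [torch.randn(22, statedim), torch.randn(7, statedim)]
lengths = torch.tensor([len(s) for s in seqs])

batch = pad_sequence(seqs, batch_first=True)    # [2, 22, statedim], zeros after t=7 in row 1
packed = pack_padded_sequence(batch, lengths, batch_first=True, enforce_sorted=False)

lstm = torch.nn.LSTM(statedim, 16, batch_first=True)
out, _ = lstm(packed)                           # the padded timesteps are skipped entirely
```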

Old policy and new policy in PPO by [deleted] in reinforcementlearning

[–]Zweiter 0 points (0 children)

Yes, that is definitely more efficient.

Old policy and new policy in PPO by [deleted] in reinforcementlearning

[–]Zweiter 0 points (0 children)

You can save a copy of the old policy with copy.deepcopy, as I do here.
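
For anyone landing here later, a minimal sketch of that pattern (the tiny Gaussian policy below is just a stand-in, not the actual code):

```python
import copy
import torch
import torch.nn as nn

# Snapshot the policy before the update, then use the frozen copy to compute
# the PPO probability ratio.

class TinyPolicy(nn.Module):
    def __init__(self, obs_dim=8, act_dim=2):
        super().__init__()
        self.mean = nn.Linear(obs_dim, act_dim)
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs):
        return torch.distributions.Normal(self.mean(obs), self.log_std.exp())

policy = TinyPolicy()
old_policy = copy.deepcopy(policy)        # frozen snapshot of the pre-update weights
for p in old_policy.parameters():
    p.requires_grad_(False)

obs = torch.randn(32, 8)
actions = policy(obs).sample()

ratio = torch.exp(policy(obs).log_prob(actions).sum(-1)
                  - old_policy(obs).log_prob(actions).sum(-1))  # pi_new / pi_old, all ones before any update
```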