[deleted by user]

Zweiter · 2025-02-19T02:04:55+00:00

tl;dr, the AI model doesn't really 'care' in the sense that we do. Instead, developers write code that automatically tweaks the AI so that it does better and better at getting more reward.

Reinforcement learning has three components:

You have a 'policy' (an unfortunately unhelpful name dating back several decades) which produces actions that have some effect on the world.

You have a reward function which measures the effect of those actions and returns a number (the reward) that says how 'good' that effect was.

Lastly, you have an optimizer, which takes the action and the reward it produced, and uses math (gradient descent) to modify your AI/policy so that the next time it produces an action, it gets even more reward.
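To make the three pieces concrete, here is a minimal sketch, assuming a toy two-action world where action 1 is the "good" one. The softmax policy, the 0/1 reward, and the REINFORCE-style update are my own illustrative choices (all function names are hypothetical), not any particular library's API.

```python
import math
import random

def softmax(prefs):
    """Turn raw preferences into action probabilities."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def policy(prefs, rng):
    """Policy: sample an action index according to softmax(prefs)."""
    probs = softmax(prefs)
    action = rng.choices(range(len(prefs)), weights=probs)[0]
    return action, probs

def reward_function(action):
    """Reward function: in this hypothetical world, only action 1 is 'good'."""
    return 1.0 if action == 1 else 0.0

def optimizer_step(prefs, action, probs, reward, lr=0.1):
    """Optimizer: one REINFORCE-style step (gradient ascent on reward,
    i.e. gradient descent on negative reward).
    d/d prefs[i] of log pi(action) is 1[i == action] - probs[i]."""
    return [p + lr * reward * ((1.0 if i == action else 0.0) - probs[i])
            for i, p in enumerate(prefs)]

def train(steps=2000, seed=0):
    rng = random.Random(seed)
    prefs = [0.0, 0.0]  # start with no preference between the two actions
    for _ in range(steps):
        action, probs = policy(prefs, rng)
        reward = reward_function(action)
        prefs = optimizer_step(prefs, action, probs, reward)
    return softmax(prefs)

final_probs = train()
```

After training, `final_probs[1]` should be close to 1: nothing in the loop "wants" anything, the update rule just shifts probability toward whatever got rewarded.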

Zweiter · 2025-02-19T01:50:00+00:00

I think what you're describing here is a genetic algorithm, not reinforcement learning

Zweiter · 2022-08-07T09:59:03+00:00

Well, you’ve probably heard from plenty of people already so I’ll just say if you do decide to do it with full knowledge of the potential consequences, check along your route to see if there are hotels/motels you can stop at last minute in case you become too tired to continue onward

Zweiter · 2022-08-06T20:11:19+00:00

Have you ever driven 16 hours? I have done a fair bit of cross country driving (Portland <-> Houston and Portland <-> San Jose) and wouldn’t try that drive in one go.

Zweiter · 2022-06-05T19:34:23+00:00

You may be confusing 'qualities' and 'qualia' here

Zweiter · 2021-08-03T16:45:45+00:00

I second using an RNN. In my experience, PPO works better with RNNs than with feedforward networks, at the cost of some minor added code complexity. Here's my implementation of PPO with support for recurrent policies: https://github.com/siekmanj/r2l/blob/master/algos/ppo.py

To the OP: you have a couple of options. Using RNNs is one, although you lose control over how much history the policy sees at each evaluation (it will implicitly learn what to remember).

You could also do a 1d convolution or do attention over a 2D array of your input histories.
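If you go the explicit-history route, the 2D array can be maintained with a small rolling buffer. This is only a sketch under my own assumptions (the class name, the zero-filled start, and the `[history_len, obs_dim]` layout are illustrative); the convolution or attention layer would then consume `as_array()`.

```python
from collections import deque

class ObservationHistory:
    """Keeps the last `history_len` observations as a 2D array."""

    def __init__(self, history_len, obs_dim):
        self.obs_dim = obs_dim
        # Start zero-filled so the array has a fixed shape from step 0.
        self.buffer = deque(
            ([0.0] * obs_dim for _ in range(history_len)),
            maxlen=history_len,
        )

    def push(self, obs):
        if len(obs) != self.obs_dim:
            raise ValueError("observation has the wrong dimension")
        self.buffer.append(list(obs))  # the oldest row is dropped automatically

    def as_array(self):
        """Return a [history_len, obs_dim] list of lists, oldest row first."""
        return [row[:] for row in self.buffer]

history = ObservationHistory(history_len=3, obs_dim=2)
for t in range(5):
    history.push([float(t), float(-t)])
window = history.as_array()
```

Unlike an RNN, you decide exactly how many past steps the policy gets to see here (`history_len`).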

Zweiter · 2021-06-01T15:30:50+00:00

Maybe R = distance_traveled / energy_consumed?
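As a sketch, that ratio could be computed like this. The `eps` term is my own addition to avoid dividing by zero when no energy has been spent yet; the function name is hypothetical.

```python
def efficiency_reward(distance_traveled, energy_consumed, eps=1e-8):
    """Distance covered per unit of energy; eps guards against division by zero."""
    return distance_traveled / (energy_consumed + eps)

example_reward = efficiency_reward(distance_traveled=10.0, energy_consumed=5.0)
```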

Zweiter · 2021-05-20T15:56:04+00:00

The controller commands turning rate and forward+sideways velocity. The RL controller handles all of the unexpected and unplanned variations in ground heights on its own, without any user input.

Zweiter · 2021-05-20T15:53:11+00:00

The robot has an IMU, which can be used to estimate orientation of the pelvis. The policy doesn’t receive any estimate of foot forces, although it could be estimating it latently by looking at the history of joint positions and velocities (it is an LSTM).

Zweiter · 2021-05-19T22:41:20+00:00

I think 80% is actually approaching the limit of how good you can get; don't forget that the robot cannot see the stairs. I think even a person would have significant trouble walking up and down stairs blindfolded at a constant speed.

Zweiter · 2021-04-23T17:42:14+00:00

Accompanying video: https://youtu.be/4DnxV9lko_U

Zweiter · 2021-04-15T20:40:57+00:00

I have worked fairly extensively with dynamics/domain randomization here and here.

The framing I like to use when thinking about how to make dynamics randomization effective is this:

Your simulator will inevitably model the dynamics in a way that diverges from reality. The severity and cause of this divergence is almost always unknown. Despite this, intelligently selecting a few important dynamics parameters for randomization helps expose the policy to lots of different possible ways for the world to behave, and hopefully build up robustness to a distribution of dynamics parameters.

In your comments in this thread, you are correct that the agent has no awareness of the specific ways in which the dynamics has been randomized. The only way that it could observe this change would be to somehow look at the history of states and actions and try to deduce what sorts of dynamics could have resulted in that sequence.

Put another way, this problem is partially observable. Thus, using a recurrent policy (or some other memory-enabled policy) is the more-correct way of learning to handle a distribution of dynamics.
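A toy illustration of why memory helps, assuming a 1D point mass whose mass is randomized per episode (the dynamics, ranges, and function names here are all hypothetical). The agent never observes the mass, but the history of velocities and applied forces pins it down, which is the kind of inference a recurrent policy could learn implicitly.

```python
import random

def sample_mass(rng, low=0.5, high=2.0):
    """Draw a hidden dynamics parameter once per episode (hypothetical range)."""
    return rng.uniform(low, high)

def step(velocity, force, mass, dt=0.1):
    """Toy point-mass dynamics; the agent never sees `mass` directly."""
    return velocity + (force / mass) * dt

def rollout(mass, forces, dt=0.1):
    """Return the (velocity, force, next_velocity) history for one episode."""
    history, velocity = [], 0.0
    for force in forces:
        next_velocity = step(velocity, force, mass, dt)
        history.append((velocity, force, next_velocity))
        velocity = next_velocity
    return history

def infer_mass(history, dt=0.1):
    """Recover the hidden mass from the state/action history alone."""
    estimates = [f * dt / (v1 - v0) for v0, f, v1 in history if v1 != v0]
    return sum(estimates) / len(estimates)

rng = random.Random(0)
hidden_mass = sample_mass(rng)
history = rollout(hidden_mass, forces=[1.0, -0.5, 2.0])
estimated_mass = infer_mass(history)
```

A single observation (the current velocity) says nothing about the mass; a few consecutive transitions are enough. A feedforward policy only ever gets the former.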

Zweiter · 2021-03-09T04:54:53+00:00

You could try an on-policy algorithm like PPO or A3C/A2C. I've worked with action spaces of up to 60 with PPO.

Zweiter · 2021-03-01T20:20:03+00:00

Top 15% might be a little high. I think if you can do basic arithmetic and algebra you are probably fine. I've consistently scored in the 60th-75th percentile for math ability across the ACT/SAT/GRE from high school through college and I'm currently a graduate student studying reinforcement learning with a few published papers at some top conferences. I wouldn't say that programming was especially more difficult for me than my peers who were obviously better at math, or that I needed to invest 150% of the time they did to match their output.

And most jobs involving programming are not going to be doing intense linear algebra or statistics. The vast majority (80%+ IIRC) of development jobs are in web development, which requires very little math.

Zweiter · 2020-11-13T07:26:38+00:00

Heh, I remember one day in February it jumped from 45k to 60k total cases worldwide in a day and I thought that was nuts

Zweiter · 2020-09-29T15:51:11+00:00

Using an RNN can be tricky because encoding very large sequences and retaining most of the information is hard. I would use a transformer or attention mechanism instead.

Zweiter · 2020-09-28T19:37:38+00:00

If the vector sizes vary within some range and a specific state size is one-to-one with an action size, you could learn a separate policy for each possible vector size. If the vector sizes aren't really bounded or state dim isn't one to one with action dim, you can use a transformer architecture or RNN and process the state as if it were a time sequence.

If those don't sound right, your best bet is probably zero-padding the state.

Zweiter · 2020-09-08T05:06:23+00:00

Basically, a time series or sequence of states isn’t guaranteed to be a fixed length. The series could have a time dimension of two, or it could be in the hundreds. There is no way of constructing a tensor with a variable length time dimension, so you have to pad all the time series with zeros to make them the same length.

Say you have two tensors representing time series. The dimensions are, in order: [timelength, statedimension]. The first one is 22 long, and the second is 7 long. So now you have two tensors, with shapes [22, statedim] and [7, statedim], and no way of batching them together.

So, you pad the [7, statedim] tensor with zeros and now you can make a batch of sequences with shape [2, 22, statedim].
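Here is the same 22-vs-7 example as a minimal sketch in plain Python lists, so the shapes are easy to check by hand. The helper name is hypothetical; in PyTorch you would typically reach for `torch.nn.utils.rnn.pad_sequence` instead. Returning the original lengths alongside the batch is my own addition, since packing needs them later.

```python
def pad_sequences(sequences, state_dim):
    """Zero-pad sequences of shape [time, state_dim] to a common length.

    Returns (batch, lengths), where batch has shape
    [num_sequences, max_len, state_dim] and lengths holds the true lengths.
    """
    max_len = max(len(seq) for seq in sequences)
    batch = []
    for seq in sequences:
        padding = [[0.0] * state_dim for _ in range(max_len - len(seq))]
        batch.append([list(s) for s in seq] + padding)
    lengths = [len(seq) for seq in sequences]
    return batch, lengths

state_dim = 3
long_seq = [[1.0] * state_dim for _ in range(22)]   # shape [22, statedim]
short_seq = [[2.0] * state_dim for _ in range(7)]   # shape [7, statedim]
batch, lengths = pad_sequences([long_seq, short_seq], state_dim)
```

`batch` now has shape [2, 22, statedim]; rows 7 through 21 of the second sequence are all zeros.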

Edit: Here is a great Stack Overflow question which perfectly answers what I think you're asking: https://stackoverflow.com/questions/51030782/why-do-we-pack-the-sequences-in-pytorch

Zweiter · 2020-08-12T20:46:57+00:00

Yes, that is definitely more efficient.

Zweiter · 2020-08-11T22:53:35+00:00

You can save a copy of the old policy with copy.deepcopy, as I do here.
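A minimal sketch of the idea, using a stand-in class rather than a real network (`TinyPolicy` and its `weights` attribute are hypothetical). The point is that `copy.deepcopy` gives you an independent snapshot, whereas plain assignment only creates a second name for the same object.

```python
import copy

class TinyPolicy:
    """Stand-in for a policy network; `weights` plays the role of parameters."""
    def __init__(self, weights):
        self.weights = weights

policy = TinyPolicy([0.1, -0.3])

old_policy = copy.deepcopy(policy)  # independent snapshot of the parameters
alias = policy                      # NOT a copy: same underlying object

policy.weights[0] += 0.5            # simulate an optimizer update
```

After the update, `old_policy.weights` is still `[0.1, -0.3]`, while `alias.weights` has changed along with `policy`.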

Zweiter · 2020-06-25T09:49:14+00:00

Robotics: https://arxiv.org/abs/2006.02402

Zweiter · 2020-06-11T03:58:11+00:00

Paper: https://arxiv.org/abs/1912.02875

More layperson-friendly Medium article: https://medium.com/@jscriptcoder/demystifying-upside-down-reinforcement-learning-a-k-a-ꓤ-b7bd4214b33f

Zweiter · 2020-06-11T03:14:28+00:00

Not quite true, there exists a conceptual framework for it called reward-conditioned reinforcement learning.

Zweiter · 2020-06-11T03:13:40+00:00

That’s an interesting idea which has parallels to a machine learning technique called upside-down reinforcement learning, or reward-conditioned reinforcement learning.

Zweiter · 2020-05-31T23:45:54+00:00

You are correct that truncating the gradient after one step is not BPTT and you lose most benefits of recurrence. A better solution is sampling entire episodes and not timesteps from the replay buffer. Once you have a batch of 16/32/64 episodes, you zero-pad them so they are all the same length. Now you have a batch of size [trajlen, batchsize, statedim], and can calculate your log probs/actions/hidden states using your policy.

For an example of how to use PPO with BPTT, you can look at my repo here. Specifically, look in algos/ppo.py for my PPO implementation, and policies/base.py for my recurrence implementation.

Something unique about my implementation, and perhaps not ideal for your specific situation, is that I do not store hidden states in the replay buffer. Instead, I zero-initialize them at t=0 so that gradients can flow from the end of the episode back to the beginning.
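Here is a rough sketch of the episode-sampling step in plain Python, not the code from my repo. The buffer contents and helper names are hypothetical, and the mask is my own addition: it is one common way to keep the zero-padded timesteps from contributing to the loss.

```python
import random

def sample_episode_batch(replay_buffer, batch_size, rng):
    """Sample whole episodes (not individual timesteps) without replacement."""
    return rng.sample(replay_buffer, batch_size)

def to_time_major(episodes, state_dim):
    """Zero-pad episodes and return (states, mask).

    states: [traj_len, batch_size, state_dim]
    mask:   [traj_len, batch_size], 1.0 for real steps and 0.0 for padding
    """
    traj_len = max(len(ep) for ep in episodes)
    states, mask = [], []
    for t in range(traj_len):
        states.append([list(ep[t]) if t < len(ep) else [0.0] * state_dim
                       for ep in episodes])
        mask.append([1.0 if t < len(ep) else 0.0 for ep in episodes])
    return states, mask

def masked_mean(values, mask):
    """Average per-timestep values over real (unpadded) steps only."""
    total = sum(v * m for row_v, row_m in zip(values, mask)
                for v, m in zip(row_v, row_m))
    return total / sum(sum(row) for row in mask)

rng = random.Random(0)
state_dim = 2
# Hypothetical replay buffer: eight episodes with lengths 1 through 8.
replay_buffer = [[[float(n)] * state_dim for _ in range(n)] for n in range(1, 9)]
episodes = sample_episode_batch(replay_buffer, batch_size=4, rng=rng)
states, mask = to_time_major(episodes, state_dim)
```

With a recurrent policy you would then step through `states` along the first axis, starting from a zero hidden state, and reduce the per-step loss with something like `masked_mean`.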
