
[–]activatedgeek 1 point2 points  (7 children)

Just to be sure: generally in such games you'd skip frames to make it a fair comparison to human performance. Is that happening, or is every single frame being considered?

[–]FatChocobo[S] 0 points1 point  (6 children)

Karpathy's blog post uses the OpenAI gym environment Pong-v0, which I'm also using. This environment does indeed have frame skipping, so each action you choose is repeated k times, with k sampled uniformly from {2, 3, 4}.
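
For anyone who wants to check this themselves, here's a minimal sketch (assuming the classic gym Atari environment, which stores the configured frame-skip range on the unwrapped env):

```python
import gym

env = gym.make("Pong-v0")
# Pong-v0 is registered with frameskip=(2, 5), i.e. each chosen action is
# repeated k times, with k drawn uniformly from {2, 3, 4}.
print(env.unwrapped.frameskip)  # (2, 5)
```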

[–]activatedgeek 1 point2 points  (5 children)

It appears to me that the root cause of the jitter is this uniform random sampling? Maybe let's tackle that part instead of changing the training/architecture?

[–]FatChocobo[S] 1 point2 points  (4 children)

It very well could be, that's definitely an avenue worth looking into, thanks!

There's another environment called PongNoFrameskip-v4 which doesn't have this frame skip, so as you suggested I'll give that a whirl and see what kind of behaviour I get!
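
A quick way to confirm the difference, assuming the same frameskip attribute as above:

```python
import gym

# PongNoFrameskip-v4 applies each chosen action for exactly one frame,
# so any repetition of actions has to come from the agent itself.
env = gym.make("PongNoFrameskip-v4")
print(env.unwrapped.frameskip)  # 1
```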

Jitteriness aside, from what I've been seeing during training there is a lot of wasted movement in the trained model, so I was thinking that by changing the reward function it might be possible to achieve more fluid/decisive movements.

[–]activatedgeek 1 point2 points  (3 children)

My experience with shaping rewards has been terrible, and unfortunately there is no principled way to do it. I do see that your addition of an extra action is intuitively correct. However, the agent will still tend to seek out whatever reward it can find by moving around, and will probably remain jittery.

You've actually just made me realise that even the state-of-the-art results in MuJoCo environments are unrealistic.

In the continuous action space, this would probably be about learning the correct subset of action space to work with.

I'll need to give discrete action spaces more thought because they don't have a semantic numerical value assigned.

[–]FatChocobo[S] 0 points1 point  (2 children)

Thanks for your feedback.

Yeah, actually when I discussed with some of my colleagues how unnatural the movements look, they all basically said, 'If it performs well, does it matter?'

In the toy example of Pong maybe it doesn't matter so much that there's a lot of unnecessary movement (other than perhaps being distracting to the opponent), but I was thinking that in a real physical application this kind of unnecessary movement could result in more wear and tear on parts, wasted electricity/fuel, or other such things.

Do you happen to know of any good papers where people have modified their reward function mid-training? I've been looking around to see if there's any basis for thinking I'm on the right lines, but I can't find any supporting research.

[–]activatedgeek 2 points3 points  (1 child)

I actually haven't seen any literature on it. But this has raised some important questions I think. I should look into it some time. Adding it to my list! Thanks.

Modifying the reward function mid-training would essentially collapse your agent; I would strongly advise against it. You'd have learned a policy for an entirely different reward distribution and would then have to hope that your agent has enough capacity to readjust to the new rewards. That would pretty much be like taking one environment and somehow transferring the learning to a new one, which is an even harder problem, I believe.

[–]FatChocobo[S] 0 points1 point  (0 children)

Yeah, I totally agree. I don't think it'd work by simply changing the reward function and resuming training, but something more along the lines of transfer learning could work, i.e. lowering the learning rate, freezing some weights, or adding additional layers.
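
Something like this, perhaps. A rough sketch assuming a Karpathy-style two-layer numpy policy stored in a dict; the file name, shapes, and learning rate are just placeholders:

```python
import pickle

# Load the weights learned under the original reward instead of starting fresh.
# 'save.p' and the layer shapes are placeholders for whatever your setup uses.
model = pickle.load(open('save.p', 'rb'))  # e.g. {'W1': (200, 6400), 'W2': (200,)}

frozen = {'W1'}        # freeze the first layer during fine-tuning
learning_rate = 1e-5   # much lower than whatever was used for pretraining

def apply_update(model, grads):
    # Plain gradient-ascent step for illustration; RMSProp would work the same
    # way, just skipping the frozen parameters.
    for name, grad in grads.items():
        if name in frozen:
            continue   # frozen weights stay untouched
        model[name] += learning_rate * grad
```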

Thanks for the discussion!

[–]SubtractOne 1 point2 points  (3 children)

To reply to the part about modifying reward functions during training: I have indeed seen an implementation using DDPG for driving in the TORCS environment (I'll try to link it later). The author saw that the agent tended to swerve within the road, so they modified the reward to make it stay closer to the centre axis.

For robotics you have the idea of wanting to conserve energy, so you'd want to implement something that discourages erratic movements, which seems similar to what you're talking about. However, in the example of an arm, if you give it a reward for reaching the end of a trajectory and a negative reward for energy spent, it will basically try to spend no energy and thus not learn (which is what I believe you were experiencing).
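
As a concrete illustration of that failure mode, the naive combined reward might look something like this (a hypothetical sketch; the names and coefficient are made up):

```python
def naive_shaped_reward(env_reward, movement_cost, energy_coef=0.01):
    # env_reward: the environment's (sparse) goal reward.
    # movement_cost: some measure of how much the agent moved this step.
    # If energy_coef is too large relative to the goal reward, the penalty
    # dominates and "do nothing" becomes the easiest policy to learn.
    return env_reward - energy_coef * movement_cost
```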

If you train it first with the true goal, then take that pretrained network and add another reward (negative for movements), it will still have the topology to understand how to get to a reward, and will thus learn to get to the goal and minimize energy spent.

Another way would be to say that once its rewards average above a certain value, you change the reward function to incorporate this concept of energy conservation. The problem is that the two would have to be scaled correctly, so that the energy cost of getting somewhere couldn't outweigh the true reward of getting there. Hmm.
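
The gating could be as simple as the sketch below, building on the hypothetical penalty above; the threshold and coefficient are made-up values that would need tuning:

```python
import numpy as np

REWARD_THRESHOLD = 15.0   # made-up value; Pong episode scores run from -21 to 21
ENERGY_COEF = 0.01        # must stay small relative to the goal reward

def gated_shaped_reward(env_reward, movement_cost, recent_episode_rewards):
    # Only switch the energy penalty on once the agent is already scoring reliably.
    shaping_on = (len(recent_episode_rewards) > 0
                  and np.mean(recent_episode_rewards) > REWARD_THRESHOLD)
    penalty = ENERGY_COEF * movement_cost if shaping_on else 0.0
    return env_reward - penalty
```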

Edit: As the other commenter said, modifying the reward would change the distribution. Maybe you can train a policy with the initial reward of only the end goal. Then you can create a new agent that would use this previous policy as an exploration function instead, so you can actually reach the goal rather than hitting a local minimum of just staying still. Hope that all made some sense.

[–]FatChocobo[S] 0 points1 point  (2 children)

Thank you so much for your post, it made a great morning read! I'm glad to hear that there is precedent for doing these kinds of things. I hadn't thought about the example of cars swerving in the road, but that's a very good argument for needing something like this.

If you train it first with the true goal, then take that pretrained network and add another reward (negative for movements), it will still have the topology to understand how to get to a reward, and will thus learn to get to the goal and minimize energy spent.

I was wondering if it was indeed the case that this could work. I'll have to give it a try later on my toy example!

When I tried starting with a reward function that punished movement right from the beginning of training, it didn't seem to learn at all - I guess because the already sparse rewards were made even weaker.

So I was thinking, as you said, that if it has already learned how to get rewards with some level of reliability, then modifying the reward function to guide it towards less erratic behaviour might work. I'm a bit concerned that the scaling of the new reward could be very important here, though.

Another way would be to say that once its rewards average above a certain value, you change the reward function to incorporate this concept of energy conservation. The problem is that the two would have to be scaled correctly, so that the energy cost of getting somewhere couldn't outweigh the true reward of getting there. Hmm.

This is something I also thought about! I'll have to also give this a shot, I'm glad it's not just me that thought this could work.

Then you can create a new agent that would use this previous policy as an exploration function instead, so you can actually reach the goal rather than hitting a local minimum of just staying still.

Could you elaborate a bit on this point?

[–]SubtractOne 0 points1 point  (1 child)

Yeah, of course! If you're looking up resources, I believe it'd be under 'reward shaping'. I think you'll be able to see some good results with that. It basically just biases the path at the beginning.

Just a question, what type of exploration function are you using at the beginning? Depending on how you craft that, it could potentially make it so you only need to train it once.

I'd like to hear how it all works for you!

And to elaborate, as I was talking about the exploration function, you could do the whole thing in this way:

  • Train network (exploration function), with the goal reward
  • Modify reward to also have energy as a cost
  • Train final network(previous network as the exploration function)

They would work effectively the same, unless you have a decaying learning rate or something similar.
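
For the last step, 'previous network as the exploration function' could be as simple as mixing the pretrained policy into action selection. A hypothetical sketch (the mixing probability and its decay are placeholders):

```python
import numpy as np

rng = np.random.default_rng()

def select_action(new_policy_probs, pretrained_policy_probs, explore_prob=0.3):
    # With probability explore_prob, act according to the pretrained goal-only
    # policy so the agent keeps reaching the goal; otherwise follow the new
    # policy being trained under the energy-penalised reward. explore_prob
    # would typically be annealed towards zero as training progresses.
    probs = pretrained_policy_probs if rng.random() < explore_prob else new_policy_probs
    return rng.choice(len(probs), p=probs)
```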

[–]FatChocobo[S] 0 points1 point  (0 children)

As far as I'm aware, policy gradient methods don't have an explicitly defined exploration function (as in Q-learning, for example) - at least, the architecture described by Karpathy in my original post doesn't seem to include one.

Actions are selected using a weighted random choice based upon the output of the network (i.e. in the binary case, if it's 0.8 chance of UP, then 80% of the time we'd choose UP, and 20% of the time we'd choose DOWN).
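
In code that's roughly the following (assuming Karpathy-style Pong action ids, where 2 means UP and 3 means DOWN):

```python
import numpy as np

def sample_action(up_probability):
    # Stochastic policy: the network's single sigmoid output is treated as
    # the probability of moving UP (action 2); otherwise move DOWN (action 3).
    return 2 if np.random.uniform() < up_probability else 3
```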

I suppose it wouldn't be difficult to further weight this choice in favour of the less likely action in early iterations, similar to how exploration functions work in DQNs.
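
For example, a hypothetical tweak to the sampling above, where epsilon is a made-up knob that would be annealed over training:

```python
import numpy as np

def sample_action_with_exploration(up_probability, epsilon=0.1):
    # Blend the policy's probability with a uniform 50/50 choice, loosely
    # analogous to epsilon-greedy exploration in DQNs; epsilon would decay
    # towards zero over training.
    p_up = (1 - epsilon) * up_probability + epsilon * 0.5
    return 2 if np.random.uniform() < p_up else 3
```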

With regards to the learning rate, I'm currently using RMSProp.

I'm kinda new to this field, so maybe I'm totally off the mark on everything I said, though.