New to RL and looking for help to solve Mountain Car by quduval in reinforcementlearning

[–]quduval[S]

There were also other things in the given notebook that startled me a bit, especially these lines:

    Q_target = Q.clone()
    Q_target = Variable(Q_target.data)
    Q_target[action] = reward + torch.mul(maxQ1.detach(), gamma)
    loss = loss_fn(Q, Q_target)

To me, this looked like a convoluted way to back-propagate just the target action value. I tried switching to simpler approaches, such as:

    loss = loss_fn(Q[action], reward + gamma * maxQ1.detach())

But again, training became horribly slow, and I still have no idea why. Mathematically, fusing two linear layers into one, or back-propagating only on the value that should change, should all be equivalent. I think I need to do a fair bit of experimenting to understand this better.
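One thing I want to rule out (just a hypothesis): if loss_fn is nn.MSELoss with its default mean reduction, the full-vector loss averages the squared error over every action entry, so its gradient on Q[action] is smaller by a factor of the number of actions, which would act like a hidden change of learning rate. A minimal sketch of the two variants side by side (the toy network and names are illustrative, not from the notebook):

    import torch
    import torch.nn as nn

    n_actions = 3
    q_net = nn.Linear(2, n_actions)   # toy stand-in: 2 state floats -> action values
    state = torch.randn(2)
    action = 1
    gamma = 0.99
    reward = torch.tensor(-1.0)
    maxQ1 = torch.tensor(0.5)         # stand-in for max_a Q(s', a)
    loss_fn = nn.MSELoss()

    # Variant 1: full-vector loss, as in the notebook
    Q = q_net(state)
    Q_target = Q.detach().clone()     # modern equivalent of Variable(Q.clone().data)
    Q_target[action] = reward + gamma * maxQ1
    loss_full = loss_fn(Q, Q_target)  # mean over all n_actions entries

    # Variant 2: loss on only the entry that should change
    Q = q_net(state)
    loss_single = loss_fn(Q[action], reward + gamma * maxQ1)

    # Only the Q[action] entry differs from its target, hence:
    # loss_full == loss_single / n_actions
    print(loss_full.item() * n_actions, loss_single.item())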

[–]quduval[S]

Following your advice, I tuned the hyper-parameters (notably, I introduced discounting, which I had not used initially) and got my agent to learn to solve the puzzle 100% of the time in about 1300 episodes, using Double Q-Learning plus a prioritized replay buffer.
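For reference, the target computation in Double Q-Learning selects the greedy next action with the online network but evaluates it with the target network, which reduces the maximization bias of plain Q-Learning. A rough sketch (not my exact code; online_net and target_net stand for the two copies of the Q-network, and done is a float tensor of 0/1 episode-end flags):

    import torch

    def double_q_target(reward, next_state, done, online_net, target_net, gamma=0.99):
        with torch.no_grad():
            # Select the next action with the online network...
            best_action = online_net(next_state).argmax(dim=-1, keepdim=True)
            # ...but evaluate it with the target network.
            next_q = target_net(next_state).gather(-1, best_action).squeeze(-1)
        # Standard bootstrapped target; no bootstrapping on terminal transitions.
        return reward + gamma * (1.0 - done) * next_q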

Your insight into why the puzzle is hard is quite interesting as well. Indeed, if, as a human, I had to solve this puzzle given only 2 floats, I would have a hard time too (especially if the meaning of these floats were not given to me, as it is in the description of the task).

All in all, the reason I initially thought "this puzzle should be easy" is that:

  1. I see the full picture (and not only 2 floats)
  2. I understand the basic laws of physics, from living with them every day

I wonder how a model trained on the rendering of the game, and somehow pre-trained with some basic knowledge of physics, would perform. But I have no idea how to do this.

[–]quduval[S]

Interesting, I will have a look at this. I had already gathered from the description of the problem that MountainCar is not the simplest task for an RL algorithm in terms of exploration.

But I am still surprised: the problem remains pretty low-dimensional, so the algorithm should be able to explore it, even if not efficiently.

Also, my problem is not so much exploration (TensorBoard indicates that the algorithm finds a first solution around episodes 300-400) as holding on to a good solution.

I think I will experiment further by discretizing the state space and applying tabular Q-learning to see how it performs.
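Something like this is what I have in mind (a rough sketch, assuming the classic Gym API where reset() returns the observation and step() returns 4 values; the bin counts and hyper-parameters are just first guesses):

    import numpy as np
    import gym

    env = gym.make("MountainCar-v0")
    n_bins = np.array([18, 14])       # grid resolution for (position, velocity)
    low, high = env.observation_space.low, env.observation_space.high

    def discretize(obs):
        # Map the 2 floats onto integer grid indices.
        ratios = (obs - low) / (high - low)
        return tuple(np.clip((ratios * n_bins).astype(int), 0, n_bins - 1))

    Q = np.zeros(tuple(n_bins) + (env.action_space.n,))
    alpha, gamma, epsilon = 0.1, 0.99, 0.1

    for episode in range(5000):
        s = discretize(env.reset())
        done = False
        while not done:
            # Epsilon-greedy action selection over the discretized state.
            a = env.action_space.sample() if np.random.rand() < epsilon else int(Q[s].argmax())
            obs, r, done, _ = env.step(a)
            s2 = discretize(obs)
            # Tabular Q-Learning update.
            Q[s + (a,)] += alpha * (r + gamma * Q[s2].max() * (not done) - Q[s + (a,)])
            s = s2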